This is a short and sweet post to remind you of who I am, what to expect from this blog, and test to see if everything is still working.
I'm Jeffrey Sherman, and this blog is about software performance and scaling. Along the way I'll also talk a lot about development team management, which is often far more important and more effective than writing code.
My current goal is to publish one article per week, mostly on Wednesdays.
If that sounds good you should absolutely subscribe to get this blog as an email. For those who have already subscribed, thank you for letting me into your inbox!
To paraphrase Charles Babbage: Racism In, Racism Out. When your inputs are racist, your outputs will be racist. Even if you didn’t do anything racist.
Years ago, I worked for one of the largest residential mortgage brokers in the US and often tried to impress upon my fellow developers the need to be aware of the past so that we could try to reduce the racism flowing through our code.
The conversations usually fell flat. They would thank me for the interesting history lesson and walk away assured that since they weren't racist, and weren't coding anything racist, there was no racism in the software.
Housing in the US has a long and sordid history of racism.
Here are two quick examples of how race and racism leak into mortgage software. The first is pretty blunt, the second is subtle.
Colorblind Code Not Allowed
For residential mortgages, the US Government requires asking borrowers' race and gender. The government uses the data to find racism (and sexism) in lending. This data is how we know that Black borrowers pay higher rates and get rejected for loans more often. The data paints a depressing picture: racism is prevalent in the mortgage industry. You have to add race to your code, and it is a good bet that some of your users will exploit that data to discriminate.
Pricing Is Based Off Racist History
As a part of the appraisal process, the appraiser will find “comparable” houses nearby to validate the price. In areas where the value of homes has been depressed by racist history like redlining, comparables act as a racist anchor. Using comparables is like saying “houses in this neighborhood are worth less than other neighborhoods because 60 years ago racists decided that predominantly Black neighborhoods are worth less, and we have decided to continue the process.”
Many states ban asking about salary history because it reinforces past discrimination. There are companies out there pushing back against the use of comps. As a developer you won’t be able to choose the company’s risk models, but with a little work, you can code up alternative models and make better data available.
Don’t be Passive
You have an obligation to understand your inputs. You may not be able to sanitize them, but understanding is a vital first step. Google the history of your industry to find where racism has come from in the past, and think about how your code makes it easier or harder for history to repeat itself.
When designing a system, what are the differences between bidirectional links and tags?
Tags Build Value Asynchronously
The most obvious difference is that tags are asynchronous and become more useful as more items are tagged. A tag can return results with as few as one member, and grows over time. Links require two items, can't be created until both items exist, and will never contain more than those two items.
Links are static, while Tags are living documents. Links will be the same when you come back to them, tags will be different over time.
Tags are Clusters, Links are Paths
Tags can quickly provide a cluster of related items, but don't offer guidance on what to read next. Links point to specific related items, branching outward. Readers can swim in a pool of tags, or follow a path of links.
Tags are great if you want more on the topic, links help you find the next topic.
Links are Curated
To create a link, you have to know that the other item exists. Your ability to create links will always be constrained by your knowledge of previously published items, or your time and desire to search out related content. Tags are a shot in the dark. They work regardless of your knowledge of the past. As a result, links are a more curated source.
Tags are Context
Tag names are context. If you add a "business" tag to a bunch of articles, someone else will know why you grouped the articles together. Links do not retain any context; later users (including yourself in 6 months) will need to examine both items to know why you linked them.
Bidirectional Links are more DB Intensive
Assuming your database is set up something like this:
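Here's a minimal sqlite sketch of the shape I have in mind; all table and column names are my own placeholders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Bidirectional links: one row per direction.
    CREATE TABLE links (
        from_item INTEGER NOT NULL,
        to_item   INTEGER NOT NULL
    );

    -- Tags: one row for the tag itself...
    CREATE TABLE tags (
        tag_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL UNIQUE
    );

    -- ...plus one row per tagged item.
    CREATE TABLE item_tags (
        tag_id  INTEGER NOT NULL REFERENCES tags (tag_id),
        item_id INTEGER NOT NULL
    );
""")

# Linking items 1 and 2 costs two inserts, one for each direction.
conn.execute("INSERT INTO links VALUES (1, 2)")
conn.execute("INSERT INTO links VALUES (2, 1)")

# Tagging item 1 costs one insert, plus a one-time insert for the tag itself.
conn.execute("INSERT INTO tags (name) VALUES ('business')")
conn.execute("INSERT INTO item_tags VALUES (1, 1)")
```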
Bidirectional links require 2 rows for each entry.
Tags require 1 row per entry plus 1 row for the tag.
Big O says that 2N and N + 1 are both O(n). Anyone who has worked on an overwhelmed database will tell you that 2 insertions can be way more than twice as expensive as 1.
Conclusion
As a practical matter, tags are more friendly to casual content creation and casual users.
Links are better when created by subject matter experts and consumed by people trying to learn an entire topic.
The Strangler is an extremely effective technique for phasing out legacy systems over time. Instead of spending months getting a new system up to parity with the current system so that clients can use it, you place The Strangler between the original system and the clients.
The Strangler passes any request that the new system can’t handle on to the legacy system. Over time, the new system handles more and more, the legacy system does less and less. Most importantly, your clients won’t have to do any migration work and won’t notice a thing as the legacy system fades away.
A common objection to setting up a Strangler is that it is Yet Another Thing that your overloaded team needs to build. A request proxy on top of rewriting the original system! Who has time?
Except, AWS customers already have a fully featured Strangler on their shelf. The Elastic Load Balancer (ELB) is a tool that takes incoming requests and forwards them on to another server.
The only requirement is that your clients access your application through a DNS name.
With an afternoon’s worth of effort you can set up a Strangler for your legacy application.
You no longer need to get the new system up to feature parity for clients to start using it! Instead, new features get routed to the new server, while old ones stay with the legacy system. When you do have time or a business reason to replace an existing feature, the release is nothing more than a config change.
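If your ELB is an Application Load Balancer, routing by path is one rule away. Here's a boto3 sketch; the ARNs and the /reports/* path are placeholders for your own listener, target group, and first migrated feature:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs -- substitute your own listener and target group.
LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-app/abc/def"
NEW_SYSTEM_TG = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/new-system/123"

# Requests matching /reports/* go to the new system. Everything else falls
# through to the listener's default action, which still targets the legacy system.
elbv2.create_rule(
    ListenerArn=LISTENER_ARN,
    Priority=10,
    Conditions=[{"Field": "path-pattern", "Values": ["/reports/*"]}],
    Actions=[{"Type": "forward", "TargetGroupArn": NEW_SYSTEM_TG}],
)
```

Migrating the next feature is another rule; rolling one back is deleting a rule.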
Getting a new system up to parity with the legacy system is a long process with little business value. The Strangler lets your new system leverage the legacy system, and you don’t even have to let your clients know about the migration. The Strangler is your Best Alternative to a Total Rewrite!
Do they know the frustrating pain of needing features that can't be delivered? The maddening pain of redoing work destroyed by a bug? The teeth-grinding pain of slow systems that steal minutes when deadlines are hours away?
Which pains are they addressing, and how much of the system do they need to rewrite to give clients some relief?
Pain provides focus and a metric for success. Without a specific pain goal, the project will metastasize and grow into “replace everything”.
How will you support current users during the rewrite?
Feature freeze is the most common answer, but it should not be acceptable. Neither should patching only critical bugs.
One solution that has worked well for me: the senior proposing the rewrite gets to architect and oversee the new solution, but all the coding will be done by the juniors. The senior will be responsible for fixing bugs and implementing new features on the legacy system.
With that much skin in the game, and no juniors doing junior things, you'll be amazed at how quickly the legacy system stabilizes. I have seen this be so successful that the rewrite gets canceled, and the work written off as a morale-boosting learning experience.
What lessons have you learned from the original system, and how will you prevent repeating the same mistakes?
This is really a question of ownership. Are your developers ready to acknowledge and own their mistakes?
Are their expectations for the future realistic, or do they expect the shiny new technology to prevent them from making the same mistakes?
Going from monoliths to microservices, microservices to serverless, or serverless back to a monolith won't help if your developers are repeatedly implementing a fundamentally flawed design.
Rewrites can destroy your company
Companies have been destroyed by endlessly stalled rewrites.
Work with your developers to answer these questions. It’s likely that you’ll find a much less risky path than the full rewrite. If you do agree to the rewrite, asking questions will give you confidence that your developers have a realistic plan that they can deliver.
Here are six questions to ask yourself, or a developer, before dancing to the automation music:
How often is the task likely to change?
Weekly Business Intelligence reports change monthly, monthly ones change every quarter, and quarterly ones change every year. They are never stable enough to be worth automating by an outside developer. This is why BI tools that let non-technical users semi-automate reports are a 5 billion dollar industry.
On the other hand, regulatory and compliance reports are likely to be stable for years and make great targets.
If a task won’t be executed at least 10 times between changes, it probably won’t be worth automating.
How long is the task likely to continue?
Some tasks are likely to continue "forever". Archiving old client data, scrubbing databases, and other client onboarding/offboarding tasks fall into this category.
Some tasks are never going to come up again.
If a task won’t be executed at least 5 more times, it probably won’t be worth automating.
How much human effort does the task consume, and for whom?
You can automate turning on the office lights in the morning with a motion detector, but it won’t pay off in terms of time saved from flipping a switch.
How much of an interruption is doing the task? Turning on the lights on your way in the door isn’t an interruption for anyone. Phone support manually resetting a user password isn’t an interruption, but having the CFO process refunds for clients is a giant interruption.
Even if the reset and refund are both a single button click that takes 15 seconds, pulling the CFO away is a much bigger deal. Also the context switch for the CFO will be measured in minutes because she’s not processing refunds all day long.
Use a sliding scale based on time and title. For entry level, don’t automate until the task takes more than an hour per person per day. For the C-Suite anything over 5 minutes per day is a good target.
How much lag can automation save your clients?
Clients don’t care how long the task takes; they care about the lag between asking and receiving. It doesn’t matter that processing a refund only takes 5 minutes if your team only processes refunds once a week.
If the client lag is more than a day, consider automating.
Is the task a real process, or are you cleaning up the effects of a bug?
Software bugs can do all sorts of terrible things to your data and process, but after the first couple of times, the damage becomes predictable and you’ll get better at fixing it.
Automating the fix is one way of fixing the bug. That’s how bugs become features.
If you don’t want to make the bug an official part of your software, don’t automate the fix.
How common and expensive are mistakes?
Mistakes are inevitable when humans are manually performing routine tasks.
Mistakes are inevitable when developers first automate a routine task. Assume that developer mistakes will equal one instance of a manual mistake.
For an automation to save money you have to expect to prevent at least 2 manual errors.
As an equation:
[Cost to Automate] + [Cost of a mistake] < [Cost of a mistake] * [Frequency of mistakes]
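A quick way to sanity-check that inequality, with made-up numbers:

```python
def worth_automating(cost_to_automate, cost_per_mistake, mistakes_prevented):
    # The automation effort is assumed to introduce one mistake of its own.
    return cost_to_automate + cost_per_mistake < cost_per_mistake * mistakes_prevented

# Made-up numbers: a $500 mistake, $400 of developer time to automate.
print(worth_automating(400, 500, 1))  # False: preventing one mistake never pays
print(worth_automating(400, 500, 2))  # True: $900 < $1,000
```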
Because the cost of mistakes is relatively easy to quantify, tasks with expensive mistakes are usually automated early on.
Conclusion
Developers always want to automate things, sometimes it pays off, sometimes it’s a mistake.
If you ask these six questions before automating you’re much more likely to make the right choice:
How often is the task likely to change?
How long is the task likely to continue?
How much human effort does the task consume, and for whom?
How much lag can automation save your clients?
Is the task a real process, or are you cleaning up the effects of a bug?
How common and expensive are mistakes?
By the time a company gets to midsize, there will be a pool of people to perform ad hoc, automatable tasks. Most of these tasks involve manually generating reports for someone else. To someone who writes software, this can seem infuriatingly inefficient and wasteful. A sign of tech debt and bad software!
Counterintuitively, it is extremely difficult to automate reports and actually save the company time or money.
Imagine Alice, the CEO, has Bob, in accounting, make a weekly report by hand pulling numbers from 3 different systems and plugging the data into a spreadsheet. Alice doesn’t care how Bob generates the report, so Carol, a developer, decides to help Bob by automating the report.
To keep the numbers simple, Bob earns 75k/year (~$36/hr) and Carol earns 150k/yr (~$72/hr). Let’s say it takes Bob 1 hour/week to generate the report, making it cost $1,872/year.
All the data is in databases, and Bob is doing specific steps in a specific order. He’s miserable because making the report is dull, error prone, repetitive and endless. Worst of all, the report is meaningless to Bob! He’s generating it for Alice, because he’s capable, understands what the numbers mean, and has access. Carol is Bob’s friend and wants to help!
Carol’s time is more valuable than Bob’s, so she can only spend 3 days automating the report for the company to break even. That includes testing, formatting, and deployment.
Let’s assume Carol is able to automate the report on time. The report shows up in Alice’s inbox every week, same as before, but Bob is able to work on other things. Break out the champagne!
Except, Alice needs changes to the report twice a year. When Bob was generating the report she could just tell him, and it would take maybe 30 minutes to change the process because Bob works in accounting and understands all of the numbers. There was no lag between the ask and the update; Bob would make the updated report the next time.
Now that Carol has automated the report, Alice will have to talk to Bob, and Bob will have to talk to Carol. Alice can’t ask Carol directly because Carol doesn’t work in accounting and doesn’t know what the numbers mean. Additionally, Carol doesn’t spend her days working on reports so she’ll have to get up to speed, talk to Bob and make the change. Let’s say it takes one full 8hr work day.
Alice is the CEO, so her changes are important, but probably not important enough to pull Carol out of her current work. It usually takes 2 report cycles for a change to get to the top of Carol's list.
Now the automated report costs $1,152/yr to maintain, and it's not the exact report that the CEO wants for 4 weeks, or about 7% of the time. Four out-of-date reports easily outweigh the $720 in savings, and Alice will probably have Bob manually generate the report while the code is out of date. Finally, the sunk cost of building the software will result in Carol having to maintain it going forward, even though it generates no real value.
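Spelled out, using only the numbers from the scenario above:

```python
BOB_RATE, CAROL_RATE = 36, 72          # dollars per hour
WEEKS_PER_YEAR = 52

manual_cost = 1 * BOB_RATE * WEEKS_PER_YEAR   # $1,872/yr: Bob's weekly hour
maintenance_cost = 2 * 8 * CAROL_RATE         # $1,152/yr: two 8-hour changes by Carol
savings = manual_cost - maintenance_cost      # $720/yr, before counting stale reports

break_even_hours = manual_cost / CAROL_RATE   # 26 hours, roughly 3 days of Carol's time
stale_weeks = 2 * 2                           # 2 changes/yr * 2 report cycles of lag

print(manual_cost, maintenance_cost, savings, break_even_hours, stale_weeks)
```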
The conflict between "can be automated" and "can't be done profitably" often results in DBAs becoming a report generation team. It's also the impetus behind business intelligence tools: find a way to get developer-style automation without using a developer.
I’ll go over a developer’s “should I automate this task” framework next time.
It was the best of times, it was the worst of times. One company hired contractors to do the maintenance work so that the full-time developers could write a new system; another hired new developers to write the new system and stuck the original developers with the maintenance work.
When software has been bankrupted by tech debt, a common strategy is to start over. Just like financial bankruptcy, the declaration is a way to buy time while continuing to run the business. The software can’t be turned off, someone has to keep the old system running while the new system is built. Depending on your faith and trust in your current employees you will be tempted to either hire “short term” contractors to keep things running, or hire a new team to write the new system.
The two tactics play out differently, but the end result is the same: the new system will fail.
Why Maintenance Contractors Fail
Contractors fail because they won’t have the historical context for the code. They may know how it works today, but not how it got there. Without deep business knowledge their actions are limited to superficial bug fixes and tactical features. The job is to keep the lights on, not to push back on requests or ask Who, What, Where, When, Why.
Since the contractors are keeping the lights on for you while current employees build a new system, they aren't going to care about the long-term quality of the system. It's already bad, it's already going away, so why spend the extra time refactoring?
Contractors won’t be able to stabilize the system and buy your employees enough time to build the new system. Equally bad, sometimes the contractors *will* stabilize the old system, which becomes an excuse to cancel the new system and not restructure your debts.
Why Hiring a New Team Fails Too
The flip side is to leave the current team in place and hire a new team to write the new system. This is super attractive when you suspect that the original team will recreate the same mistakes that led to disaster.
This tactic fails because the new team won't have the historical context for the code. Documentation and details will be superficial and lack all of the critical edge cases, all of which delays the project. Meanwhile, your original team will become extremely demoralized.
If they couldn’t keep things running before, wait until you tell them their whole job is to keep things running. The most capable members will quickly leave, and you’ll end up pulling developers from the new team onto the old system, praying they don’t see it as a bait-and-switch.
Don’t split on Old vs New
Both tactics fail because the teams are divided between old vs new. In the end it doesn’t matter who is on which team, because you need old *and* new to be successful.
Keeping the team together ensures that everyone has context into how the system works and has skin in the game.
Instead of splitting old vs new, find a way to split the system in half. Have each team own half the old system responsibilities, and one of the two new systems. This gives everyone a stake in keeping the old system alive, a chance to work on the new system, and reduces the business risk of either team failing.
Once you find a way to split the responsibilities in half, you might even find a way to make iterative improvements; it's your Best Alternative to a Total Rewrite!
I often talk about the Best Alternative To A Total Rewrite: the idea that you should know the alternatives to a rewrite before making a decision.
Even when there is no alternative to a rewrite you still need to work through the common problems that cause the project to fail.
These open ended questions are designed to guide your preparations:
How will a rewrite solve your users’ pain?
Where is the current system failing your users? How will a rewrite fix those problems? You and your fellow developers may hate the current codebase, but that isn’t a compelling argument.
Now that you know what user pain you are trying to solve, is there really no way to resolve the problem in the current codebase?
If the system is slow, is there really nothing you can do to make it faster? Maybe a rewrite would make things so fast that users won’t notice any lag. But if a refactor would reduce frustrating lag by 50% in a month, you should refactor.
Once you've focused on your client's problems, reducing pain today is preferable to fixing the problem 6 months from now.
Do you have to recreate the existing system before adding new features?
After bugs and latency, the number one issue with old codebases is that it’s hard to add new features. The original code was designed to expand in one way, and now it needs to expand differently. A rewrite will let you design a system to meet those new requirements.
What about a new system that is nothing but new features? Can you find a way to make a new system that is a client of the old system?
A rewrite requires you to recreate all the features of the existing system before adding the new features that create new business value.
How will you ensure that the new system doesn’t have the same problems as the original system?
The forces that caused the original system to go off the rails are going to work against the rewrite too.
Management didn't want to spend time on unit tests? They still won’t.
Constantly changing business requirements? Just wait until the demands aren’t constrained by the existing code.
Inadequate resources, logging, metrics and other tools? A rewrite won’t make any of that better.
After a brief honeymoon a rewrite will face more pressure than the original system. You need a plan to keep all of those external forces at bay.
What’s the release plan?
Replacement is not a release plan. How is the rewrite getting into production? Can you find a way to run the two systems in parallel? Can you replace things a piece at a time?
You will have to release the new system eventually. "When it's done" takes forever.
Conclusion
When you propose a rewrite, you need to be able to answer these questions to your teammates and managers. Bringing the questions and answers up in your initial pitch will show that you’ve thought things through.
Don’t proceed until you can answer all five questions! You just might happen to discover an alternative to a total rewrite. Even if you don’t use it, you can move more confidently once you can speak to the alternatives.
Pixel Tracking is a common Marketing SaaS activity used to track page loads. Today I am going to tie several earlier posts together and show how to evolve a frustrating Pixel Tracking architecture into one that can survive database outages.
In the obvious design, every page load writes an event straight to the database, so the whole system is governed by database performance. As the load ramps up, users are going to notice lagging page loads. Worse, each event recorded will have to be processed, tripling the database load.
Put a queue between the page load and the database, and users are completely insulated from scale and processing issues.
Dead Database Design
There is no database in the final design because it is no longer relevant to the users’ interactions with your services. The performance is the same whether your database is at 0% or 100% load.
The performance is the same if your database falls over and you have to switch to a hot standby or even restore from a backup.
With a bit of effort your SaaS could have a database fall over on the run up to Black Friday and recover without data loss or clients noticing. If you are using SNS/SQS on AWS, the queue defaults are over 100,000 events! It may take a while to chew through the queues, but the data won't disappear.
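Here's a sketch of the idea using Flask and SQS; the queue URL is a placeholder, and the particular web framework doesn't matter:

```python
import base64
import json
import time

import boto3
from flask import Flask, Response, request

app = Flask(__name__)
sqs = boto3.client("sqs")

# Placeholder queue URL -- substitute your own.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pixel-events"

# A transparent 1x1 GIF, served straight from memory.
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

@app.route("/pixel.gif")
def pixel():
    # Enqueue the event and return immediately. There is no database in the
    # request path, so response time is the same whether the database is
    # healthy, overloaded, or down.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            "page": request.args.get("page"),
            "ua": request.headers.get("User-Agent"),
            "ts": time.time(),
        }),
    )
    return Response(PIXEL, mimetype="image/gif")
```

A separate worker drains the queue and does the database writes and processing at whatever pace the database can handle.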
When your Pixel Tracking is causing your users headaches, going asynchronous is your Best Alternative to a Total Rewrite.