The Roadmap to Hell Is Written In Features

The road to hell is paved with good intentions; the roadmap is written in features.

The Road To Hell Is Paved With Good Intentions points to the common disconnect between someone’s intent and the results of their actions.  Good intentions, and good actions, do not guarantee good outcomes.

Roadmapping features has the same disconnect between intent and results.  Good intentions, and good features, do not guarantee good outcomes.

Each feature makes sense, and is good by itself.

The problem is that the goal isn’t a feature, a whole series of features, or even projects.  The goal is the outcome.  Roadmaps full of features don’t track the progress towards an outcome, they track progress on features.

Roadmaps based on features present a linear path.  Each feature unlocks the next.  This, then that, and then that, until the project is complete.

Customer feedback isn’t baked into the process.  Will all of the good features lead to good results?  Probably not!

The opposite of roadmapping features is roadmapping outcomes.  This requires getting clear on where things are, where they need to end up, and the mechanisms that drive change.

The only feature that comes out of an outcome based roadmap is the first one.  The first attempt to change the system.  After that?  Depends on the outcome of the feature!

Outcome based roadmaps are built on mechanisms that can drive change, and feedback loops.  The features are unknown because the impact of each feature is unknown.  

A roadmap built of features is like a road to hell, paved with good intentions.

How I Will Talk You Out Of A Rewrite

If you come to me saying “we need a rewrite”, I will run you through a lot of “why” questions to discover if you actually need a rewrite.  I am basically trying to answer these three questions:

  1. What will the new system do differently?
  2. Why can’t you do that in the current system?
  3. Why can’t you build the new things separately?

Answering these questions yourself will help you think through whether you really need to rewrite your existing system.

Let’s examine each more closely.

What Will The New System Do Differently

What will be different about the new system?

For this exercise you have to ignore code quality, bugs, and developer happiness.  Those are all important, but corporate reality hasn’t changed.  The same forces that resulted in low quality code, endless bugs, and developer unhappiness are still there.  A rewrite might get you temporary relief, but over the long term, the same negative forces will be at work on the new system.

So, what is different about the new system?  The differences could be technical, like a new programming language, framework, or architecture.  Maybe you need to support a new line of business and the existing system can’t stretch to cover the needs.

Get clear on what will be different, and write it down.  These are your New Things.

Why Can’t You Do It In The Current System

Now that you are clear on what New Things you need, why can’t you build the New Things into the existing system?

We are still putting aside issues like the existing code quality, bugs, and developer happiness.  If those forces are all that is stopping you from doing the new work in the existing system, I have bad news: those forces will wreck the new system as well.  You don’t need new software, you need to change how you write software.  Stop now, while you only have one impossible system to maintain.

Other than the forces that cause you to write bad software, why can’t you use the current system?  Get clear.  Write it down.  These are your Forcing Functions.

Why Can’t You Build The New Things Separately?

At this point we know what the New Things are and we know the Forcing Functions that prevent you from extending the current system.  Why can’t the New Things live alongside the existing system?

A rewrite requires rewriting all of your existing system.  Building New Things in a new system because of Forcing Functions only requires building the New Things.  Why can’t you do that?

By this point office politics are out.  Office politics can’t overcome Forcing Functions.

Quality and bugs are also out, because there is no existing code to consider.

Get clear.  Write it down.  This is your Reasoning.

Now, take your Reasoning, backed by the Forcing Functions, and you have explained how getting the New Things requires a rewrite.  If your Reasoning can convince your coworkers, then I’m sorry, you do actually need a rewrite.  

If not, it is time to talk about alternatives.

What Happens After I Talk You Out Of A Rewrite

Most rewrites are driven by development culture issues, not the software itself.  This brings us back to code quality, bugs, and developer happiness.  A rewrite won’t fix any of those issues.

The good news is that you can fix all of them without a rewrite.  Even better news is that fixing them will only take about as much effort as you think a rewrite would take.  The bad news is that your culture is pushing against making the fixes.

Take it one step at a time, and keep delivering.

Latency, Throughput, And Spherical Cows

My post about latency and throughput featured an extremely simplistic model to demonstrate that Latency and Throughput are independent.  An astute reader called it a spherical cow, a model so oversimplified that it is a bit ridiculous.

So, let’s deflate the cow, just a bit, and see how things hold up.  I hope you like tables and cow jokes!

(Keenan Crane; GIF by username:Nepluno, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons)


Chewing The Cud

The original model was a streaming system that receives 1 million messages a second.  Perfectly spherical.

There were two systems, one with 5s latency, one with 2s latency.

We will leave our processors completely spherical - they each process 100,000 events simultaneously.  Our pipelines then look like this:

5s Latency

| Time | New Events/s | Process Instances | Events Being Processed | Throughput | Extra Capacity |
|------|--------------|-------------------|------------------------|------------|----------------|
| 1 | 1,000,000 | 50 | 1,000,000 | 0 | 4,000,000 |
| 2 | 1,000,000 | 50 | 2,000,000 | 0 | 3,000,000 |
| 3 | 1,000,000 | 50 | 3,000,000 | 0 | 2,000,000 |
| 4 | 1,000,000 | 50 | 4,000,000 | 0 | 1,000,000 |
| 5 | 1,000,000 | 50 | 5,000,000 | 1,000,000 | 0 |
| 6 | 1,000,000 | 50 | 5,000,000 | 1,000,000 | 0 |
| 7 | 1,000,000 | 50 | 5,000,000 | 1,000,000 | 0 |
| 8 | 1,000,000 | 50 | 5,000,000 | 1,000,000 | 0 |

2s Latency

| Time | New Events/s | Process Instances | Events Being Processed | Throughput | Extra Capacity |
|------|--------------|-------------------|------------------------|------------|----------------|
| 1 | 1,000,000 | 20 | 1,000,000 | 0 | 1,000,000 |
| 2 | 1,000,000 | 20 | 2,000,000 | 1,000,000 | 0 |
| 3 | 1,000,000 | 20 | 2,000,000 | 1,000,000 | 0 |
| 4 | 1,000,000 | 20 | 2,000,000 | 1,000,000 | 0 |
| 5 | 1,000,000 | 20 | 2,000,000 | 1,000,000 | 0 |
| 6 | 1,000,000 | 20 | 2,000,000 | 1,000,000 | 0 |
| 7 | 1,000,000 | 20 | 2,000,000 | 1,000,000 | 0 |
| 8 | 1,000,000 | 20 | 2,000,000 | 1,000,000 | 0 |

Conclusion: Same Throughput

The Throughput of the two systems is the same.

The first system, with 5s of latency, takes longer to warm up and needs 2.5x more instances, but it still produces the same throughput, just 3 seconds later.
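The spherical model is simple enough to check in a few lines of code.  This sketch is my own, with my own names; it models a fixed-size pipeline where each instance holds 100,000 in-flight events, and shows both systems settling at the same 1,000,000 events/s:

```python
# Sketch of the spherical-cow pipeline: fixed instance counts, each
# instance holds 100,000 in-flight events, 1M new events arrive per second.

def simulate(latency_s, instances, seconds=8,
             input_rate=1_000_000, per_instance=100_000):
    """Return the throughput achieved in each second."""
    in_flight = []   # batches of (finish_second, event_count)
    throughput = []
    capacity = instances * per_instance
    for t in range(1, seconds + 1):
        # batches whose latency elapses this second complete now
        done = sum(e for ft, e in in_flight if ft == t)
        in_flight = [(ft, e) for ft, e in in_flight if ft > t]
        throughput.append(done)
        # accept new events up to the remaining capacity
        used = sum(e for _, e in in_flight)
        accepted = min(input_rate, capacity - used)
        in_flight.append((t + latency_s - 1, accepted))
    return throughput

five_s = simulate(latency_s=5, instances=50)
two_s = simulate(latency_s=2, instances=20)
print(five_s)  # [0, 0, 0, 0, 1000000, 1000000, 1000000, 1000000]
print(two_s)   # [0, 1000000, 1000000, 1000000, 1000000, 1000000, 1000000, 1000000]
```

Same steady state, different warm-up, exactly as in the tables.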

What Happens If You Add Scaling?

Maybe that model is too simple.  Let’s deflate the cow a little bit, vary the input and add auto-scaling.

Let’s make it an average of 1 million messages a second, with peaks and valleys between 500,000 and 1.5 million per second.  20 second period, so it changes +/- 100,000 messages every second.  But, we’re only deflating the cow a little bit, so the changes will be step changes at the end of the second.

We will leave our processors completely spherical - they each process 100,000 events simultaneously.  It takes 1 second to start a processor, and 1 second to shut down.  The only difference between the two is that one takes 2s to process a message and the other takes 5s.

Now our input looks like this:

5s Latency

| Time | New Events/s | Process Instances | Events Being Processed | Events Waiting to be Processed | Throughput | Extra Capacity |
|------|--------------|-------------------|------------------------|--------------------------------|------------|----------------|
| 1 | 1,000,000 | 0 | 0 | 1,000,000 | 0 | 0 |
| 2 | 1,100,000 | 10 | 1,000,000 | 1,100,000 | 0 | 0 |
| 3 | 1,200,000 | 21 | 2,100,000 | 1,200,000 | 0 | 0 |
| 4 | 1,300,000 | 33 | 3,300,000 | 1,300,000 | 0 | 0 |
| 5 | 1,400,000 | 46 | 4,600,000 | 1,400,000 | 0 | 0 |
| 6 | 1,500,000 | 60 | 6,000,000 | 1,500,000 | 1,000,000 | 0 |
| 7 | 1,400,000 | 65 | 6,500,000 | 300,000 | 1,100,000 | 0 |
| 8 | 1,300,000 | 68 | 6,800,000 | 200,000 | 1,200,000 | 0 |
| 9 | 1,200,000 | 70 | 7,000,000 | 0 | 1,300,000 | 0 |
| 10 | 1,100,000 | 70 | 6,900,000 | 0 | 1,400,000 | 1 |
| 11 | 1,000,000 | 69 | 6,500,000 | 0 | 1,500,000 | 4 |
| 12 | 900,000 | 65 | 5,900,000 | 0 | 1,400,000 | 6 |
| 13 | 800,000 | 59 | 5,300,000 | 0 | 1,300,000 | 6 |
| 14 | 700,000 | 53 | 4,700,000 | 0 | 1,200,000 | 6 |
| 15 | 600,000 | 47 | 4,100,000 | 0 | 1,100,000 | 6 |
| 16 | 500,000 | 41 | 3,500,000 | 0 | 1,000,000 | 6 |
| 17 | 600,000 | 35 | 3,100,000 | 0 | 900,000 | 4 |
| 18 | 700,000 | 31 | 2,900,000 | 0 | 800,000 | 2 |
| 19 | 800,000 | 29 | 2,900,000 | 0 | 700,000 | 0 |
| 20 | 900,000 | 29 | 2,900,000 | 200,000 | 600,000 | 0 |
| 21 | 1,000,000 | 31 | 3,100,000 | 400,000 | 500,000 | 0 |
2s Latency

| Time | New Events/s | Process Instances | Events Being Processed | Events Waiting to be Processed | Throughput | Extra Capacity |
|------|--------------|-------------------|------------------------|--------------------------------|------------|----------------|
| 1 | 1,000,000 | 0 | 0 | 1,000,000 | 0 | 0 |
| 2 | 1,100,000 | 10 | 1,000,000 | 1,100,000 | 0 | 0 |
| 3 | 1,200,000 | 21 | 2,100,000 | 1,200,000 | 1,000,000 | 0 |
| 4 | 1,300,000 | 23 | 2,300,000 | 1,300,000 | 1,100,000 | 0 |
| 5 | 1,400,000 | 25 | 2,500,000 | 1,400,000 | 1,200,000 | 0 |
| 6 | 1,500,000 | 27 | 2,700,000 | 1,500,000 | 1,300,000 | 0 |
| 7 | 1,400,000 | 29 | 2,900,000 | 1,400,000 | 1,400,000 | 0 |
| 8 | 1,300,000 | 29 | 2,900,000 | 1,300,000 | 1,500,000 | 0 |
| 9 | 1,200,000 | 29 | 2,700,000 | 0 | 1,400,000 | 2 |
| 10 | 1,100,000 | 27 | 2,500,000 | 0 | 1,300,000 | 2 |
| 11 | 1,000,000 | 25 | 2,300,000 | 0 | 1,200,000 | 2 |
| 12 | 900,000 | 23 | 2,100,000 | 0 | 1,100,000 | 2 |
| 13 | 800,000 | 21 | 1,900,000 | 0 | 1,000,000 | 2 |
| 14 | 700,000 | 19 | 1,700,000 | 0 | 900,000 | 2 |
| 15 | 600,000 | 17 | 1,500,000 | 0 | 800,000 | 2 |
| 16 | 500,000 | 15 | 1,300,000 | 0 | 700,000 | 2 |
| 17 | 600,000 | 13 | 1,100,000 | 0 | 600,000 | 2 |
| 18 | 700,000 | 11 | 1,100,000 | 100,000 | 500,000 | 0 |
| 19 | 800,000 | 12 | 1,200,000 | 300,000 | 600,000 | 0 |
| 20 | 900,000 | 15 | 1,500,000 | 300,000 | 700,000 | 0 |
| 21 | 1,000,000 | 18 | 1,800,000 | 300,000 | 800,000 | 0 |

Result - Latency Does Not Impact Throughput

Our slightly less spherical model with perfect step changes produced the same fundamental result:

You can’t increase the throughput of a streaming system to be higher than the input.

Latency has a huge impact on the amount of resources required!  The first system, with 5s latency, fluctuated between 29 and 70 instances.  The second system, with 2s latency, fluctuated between 11 and 29.

The second system’s maximum scale out was equal in size to the first system’s minimum.

And yet, neither system was able to get above 1.5 million events/s.

No matter how non-spherical the cow may be, you can’t sustain a throughput faster than the inputs.
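The auto-scaling version can be sketched in code too.  The scaling rule below is my own simplification, not the exact policy behind the tables, but the conclusion survives: the totals can never exceed what arrived.

```python
# Sketch of the auto-scaling model: stepped input between 500k and 1.5M
# events/s, instances that take one second to come online, and 100k
# in-flight events per instance.

def simulate(latency_s, rates, per_instance=100_000):
    """Queue plus a 1-second-lag autoscaler; returns per-second throughput."""
    waiting, instances, pending = 0, 0, 0
    in_flight = []   # batches of (finish_second, event_count)
    throughput = []
    for t, rate in enumerate(rates, start=1):
        instances = max(instances + pending, 0)   # last second's scaling lands
        done = sum(e for ft, e in in_flight if ft == t)
        in_flight = [(ft, e) for ft, e in in_flight if ft > t]
        throughput.append(done)
        waiting += rate                            # this second's arrivals
        used = sum(e for _, e in in_flight)
        free = instances * per_instance - used
        started = min(waiting, max(free, 0))
        if started:
            in_flight.append((t + latency_s - 1, started))
            waiting -= started
        # ask for enough instances to cover the backlog plus next second's load
        want = -(-(waiting + rate) // per_instance)   # ceiling division
        pending = want - instances
    return throughput

# 20-second sawtooth: 1M average, 500k-1.5M peaks, +/-100k steps, run twice
steps = [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5, -4, -3, -2, -1]
rates = [1_000_000 + 100_000 * s for s in steps] * 2

slow = simulate(5, rates)
fast = simulate(2, rates)
# Individual seconds can spike while a backlog drains, but the totals
# can't exceed what arrived: sustained throughput is capped by the input.
assert sum(slow) <= sum(rates) and sum(fast) <= sum(rates)
```

The 5s system needs far more instances than the 2s system to keep up, yet neither can beat the arrival rate over any sustained window.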

You Can’t Change Your Answers

Without showing your work. 

When you have customer reports, you will eventually want to change how and what you measure.  That’s fine!  You have to explain the differences. 

Especially if the change is because the old system’s measurements were wrong.

Your customers calibrated their business decisions around the old system. Changing it, even for the better, throws off their calculations. 

Always improve your systems, and when your calculations change, you need to overshare.

Remember the golden rule of SaaS: Do unto others as you would have AWS do unto you.

Rewrite Anti-Patterns: The Writeback

In my post, The Strangler-Fig Pattern Has An Implementation Order; Outputs First, I mentioned The Writeback Anti-Pattern.  It turns out that this anti-pattern hasn’t been named and described before, so I get to be first!

The Writeback Anti-Pattern is when a new Source Of Truth has to write data back to the legacy Source Of Truth because consumers are still getting data from the legacy source.

The Anti-Pattern allows you to pretend that the new system is indeed The Source Of Truth, and the legacy system has become an adapter.  This is a lie that lets teams declare success and get the new system into production.

The reality is that the new system is really just another bolt-on to the old one.  The new system now needs to transform all the inputs into the old format, creating tons of technical debt.  You have the data model you want, the data model you don’t want, and all the business logic in between.

Another way to explain it is that the new system tried to do a strangler-fig backwards.  Instead of intercepting the old system’s outputs first, the new system intercepted the inputs.  For the strangler-fig pattern to work it needs to replace the outputs first, or both the inputs and outputs simultaneously.  Redirecting the inputs first leads to The Writeback.
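A toy sketch makes the debt concrete.  Everything here is illustrative, with hypothetical names and stores; the point is the translation burden on every write:

```python
# The Writeback in miniature: the new Source Of Truth has to translate
# every write back into the legacy schema, because consumers still read
# from the legacy store.

new_store = {}     # the data model you want
legacy_store = {}  # the data model you don't want

def save_order(order_id, customer, total_cents):
    # Write to the new system's model...
    new_store[order_id] = {"customer": customer, "total_cents": total_cents}
    # ...then translate and write back so legacy consumers keep working.
    legacy_store[order_id] = {
        "cust_name": customer,
        "total": total_cents / 100.0,   # legacy stored dollars as floats
    }

save_order("o-1", "Acme", 1999)
print(legacy_store["o-1"])  # {'cust_name': 'Acme', 'total': 19.99}
```

Every new write now carries the legacy baggage, and the new model never actually replaces the old one.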

The Strangler-Fig Pattern Has An Implementation Order; Outputs First

The Strangler-Fig is a critical refactoring tool.  The implementation sounds easy: wrap the existing code with the strangler, and replace the references over time.  This glosses over an important implementation detail - there’s an order to wrapping code: The outputs have to go before the inputs.

Strangler-Fig Examples Are Single Step

Oftentimes you can fully wrap the strangler around the existing code in a single step.  For example, Amazon’s Elastic Load Balancer can be used as a Strangler.  It instantly proxies all requests to a service; and makes it easy to migrate routes to a new service over time.
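In code terms, the single-step strangler is just a router in front of two backends.  This toy sketch uses hypothetical in-process handlers rather than ELB configuration, but the migration mechanism is the same:

```python
# Route-based strangling in miniature: proxy every request, and move
# routes to the new service one at a time.

MIGRATED = {"/orders", "/invoices"}   # routes already moved to the new service

def legacy_app(path):
    return f"legacy handled {path}"

def new_app(path):
    return f"new handled {path}"

def strangler(path):
    """The proxy: migrated paths go to the new service, the rest stay."""
    if path in MIGRATED:
        return new_app(path)
    return legacy_app(path)

print(strangler("/orders"))   # new handled /orders
print(strangler("/reports"))  # legacy handled /reports
```

Migrating another route is a one-line change to the routing table, which is what makes the pattern so low-risk.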

Systems with Multiple Outputs Need Multiple Stranglers

That works great for simple architectures with a single entrance and exit.  But what about something more complicated?  What if your system has a RESTful interface for data inputs and then publishes events onto Kafka?

Because the system has one input and two outputs, a single strangler like ELB is insufficient.

This system requires two different strangler-fig implementations: an ELB, or web proxy, to handle the RESTful communication, and a Kafka proxy to handle the messages the service publishes to Kafka.

Even in an ideal world, where you can put the proxies into place with just simple config changes, this is a two step process.  More realistically, there will be weeks or months between setting up the first strangler, and the second.

The Partial Strangle Limits Options

Remember, the goal of the Strangler-Fig pattern is to squeeze out the original system over time.

A partial strangle limits your options.  

Doing ELB first lets you redirect RESTful inputs and outputs, but won’t populate the Kafka stream.  This will break your system if you attempt to migrate any endpoints.  The Writeback anti-pattern, where the new system writes to the original system for the benefit of downstream listeners, is a common solution.

Putting the first strangler around the Kafka connection won’t break anything, and creates limited opportunity for migration.  Any messages that can be generated outside of the REST inputs can be migrated in this state.
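One way to picture the output-side strangler is a forwarder that owns the topic consumers read from.  This in-memory sketch is illustrative; the bus and topic names are stand-ins, not a real Kafka API:

```python
# Output-first strangling in miniature: downstream consumers read a stable
# "public" topic, and a forwarder republishes the legacy service's
# messages onto it.

from collections import defaultdict

bus = defaultdict(list)  # topic name -> list of messages

def publish(topic, message):
    bus[topic].append(message)

def forward(src, dst):
    """The strangler: everything the legacy service emits is republished."""
    for message in bus[src]:
        publish(dst, message)

# The legacy service still publishes to its private topic...
publish("legacy-events", {"order": 1})
forward("legacy-events", "public-events")

# ...and the new system can publish directly, with no consumer changes.
publish("public-events", {"order": 2})
print(bus["public-events"])  # [{'order': 1}, {'order': 2}]
```

Once consumers only know about the public topic, messages can migrate to the new producer one type at a time without anyone downstream noticing.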

For Complex Systems, Strangle The Outputs First

The Strangler-Fig pattern should be in every developer’s toolbox; there is nothing better for reducing risk during refactors.  If you have a complicated system, with more than one type of output, you will need multiple stranglers, and the order of application is critical.  Apply the stranglers to pure outputs first, input/output groups second, and pure inputs last.

Applying the stranglers in the wrong order will lead you to implement anti-patterns like The Writeback.  Far from making the transition easier and lower risk, the anti-patterns will make your work harder and riskier.

Real Life Testing In Production – Hotel Opening Disaster

This past week I stayed at a brand new resort hotel during opening week.  Opening week was a major disaster for the hotel, culminating with them giving away free alcoholic drinks to all guests.  For a week.

This post isn’t about all of the problems, it is about testing in production, and how much worse things could have been.  You see, the resort will eventually be 9 buildings, and they opened with 3.  As a firm believer in iteration I commend them because their opening would have been so much worse if they waited.

Eventually, the resort will be nine 4-story buildings in a U shape around a water park.  3 buildings were complete and receiving guests, the exteriors of 3 were complete, and 3 were being constructed.  Too late to learn any architectural lessons from the early guests, but not too late to learn a lot about the hotel’s operations.

First up - you need maintenance staff on site, because our first room had no hot water.  Maybe we were the first guests, maybe the system was installed wrong and broke.  Whatever the reason, there was no hot water, and no one who could diagnose and fix the issue.

You could have someone walk through with a checklist to test for things like hot water.  Or you could let your customers do it!  You could have maintenance staff on site, or you could literally file a ticket and wait for someone to show up and fix the problem.

The hotel wasn’t full, so we were moved to another room.  Another new room!

This one had no cold water in the bathroom sink.  Fortunately I could debug this issue!  The shutoff valve was closed; I opened it.  Yes, I flipped a switch to turn on a feature.

Another opening week oops - the hotel had purchased cone coffee makers and basket coffee filters.

But the ultimate config issue wasn't no hot water, or no cold water, or the wrong shaped coffee filters.  The ultimate config issue was no liquor license.  At a resort hotel.  For over a week.  They had a fully stocked bar, two bartenders, and four wait staff.  But no liquor license means no alcohol sales.  They resolved the “alcohol permission issue” by giving it away.

All of these issues would have been so much worse with 3 times more guests.  They could have caught the quality issues with better testing.  They could have avoided giving away thousands in alcohol by waiting until they were ready.  There were many, many ways the hotel could have opened better.

But at least they iterated!  By opening with ⅓ of their rooms available, they were able to limit the fallout from testing in production.

SaaS Code Does Not Have a Final State

I was making a diplomatic comment about some software I was refactoring, “Regardless of the original programmer’s vision, the original code doesn’t work with the final state.”  I realized that my version of the code won’t be the final state either.  When you work in SaaS, code doesn’t have a final state.

SaaS code has history; it was written to solve a problem.  It may have evolved to solve the problem better, it may have evolved because the problem changed.


SaaS code has a present; it is what it is.  It solves a problem for some customers.  The code might be amazing, but frustrates customers because the problem has changed.  The code could be terrible but delights customers because it perfectly solves a static problem.

The code may have a future.  The problem can change, the implementation can change.

What SaaS code doesn’t have is a final state.  Until you delete it, you can never look at a piece of code and say that won’t be changed again.

Don’t make the mistake of thinking that once you refactor some code, the code will be in its final state.  The Service in SaaS will change over time, your code will change with it.

Sherman’s Law of Doomed Software Projects

Sherman’s Law of Doomed Software Projects - Software projects with the word “Next” or “New” in the name are doomed.

Sherman’s Law only applies to the internal name used by the people working on the software itself.  Names used in marketing or external communication don’t count.

Projects with the word “Next” in the title are doomed because no one wants the next version of your software.  Paying customers aren’t paying today because of the promise of a Next generation of the software.  Internal customers want software that helps them do their job today.  They are using the software today, because it solves a problem today.

Could there be future customers out there who have the problem your current software solves, but who will only buy if you release the next generation?  Sure, but if they’re willing to wait the problem isn’t that pressing and they probably won’t buy the next generation either.

Projects with the word “New” in the title are doomed because new is temporary and muddles project goals.  The new version starts off with clear goals and business value, but time is the enemy.  Whatever gets released is new, regardless of the goals and value in the final product.

If “next” and “new” are names for doomed projects, what names can you use?

You’ve got an existing product or service, and you need to do major work on it.  At the end, you’ll have the “next” generation, or “new” version of the product or service.  Speak to the reason you need to do the work.

If you discover that your software is fundamentally insecure, you don’t need a “Next Generation” project, you need “Maximum Security”.  If your system is slow, don’t start a “New” version that is faster, you need project “Lightspeed”.

I once wrote a piece of software named Polaris.  When the time came for major work, which name would focus the team and drive alignment better - Next Generation Polaris, or Maximum Security Polaris?  New Polaris, or Lightspeed Polaris?

Don’t doom your project, don’t use “Next” or “New”!

Rewrites Have Two Teams – Team Rewrite and Team Maintenance.  Join Team Maintenance

When management agrees to rewrite software, they inevitably split the existing team in two - Team Rewrite and Team Maintenance.  Team Rewrite is in charge of creating a brand new system that recreates everything useful and good about the legacy system.  Team Maintenance is in charge of maintaining the legacy system until Team Rewrite completes the rewrite.  Everyone wants to be on Team Rewrite and no one wants to stay on Team Maintenance; everyone is wrong.

The benefits of Team Rewrite are obvious, you get to write new code without all the horrors of the legacy system.  You’ll use the latest technology!  You’ll do things the right way!  Your work won’t be in production, so you won’t have production incidents!  No angry customers!  The list goes on.

The benefits of Team Maintenance aren’t clear.  You get to work in the horrible legacy system.  The system that is so bad, so unfixable, that management has agreed to a rewrite.  Plus you’ll be responsible for incidents and outages!  Why would anyone join Team Maintenance?

Because Team Maintenance is both a flight to safety and an opportunity. 

It’s not glamorous, but the people maintaining the legacy system are critical.  The members of Team Rewrite contribute to theoretical future value, while Team Maintenance’s work supports customers today.  If things go badly with the rewrite, as they usually do, the entire rewrite team can be fired.  The maintenance team remains critical so long as the legacy system lives.

Team Maintenance also offers two kinds of opportunity.

First is the opportunity to clean up the legacy system.  The legacy system doesn’t have to continue to be terrible.  It’s likely that most or all of the people who said it couldn’t be fixed are now on Team Rewrite.  Everyone left is committed to working on it for the duration.  It’s a great time to clean things up; not to save the system, but for your own sakes.  Very quickly the legacy system won’t be horrible.  It may never be great, but you can work on it without sobbing.

The second opportunity is new features.

Customer needs don't stop just because there’s a rewrite going on.  Management will limit the new features so that the rewrite isn’t trying to hit a moving target.  You’ll only be working on the most important, most critical, and most impactful features.  And you’ll get to work on the critical features, because you’re Team Maintenance.

Does this mean you join Team Maintenance rooting for Team Rewrite to fail?  Not at all!  If Team Rewrite succeeds that’s also great for you.  You’ve shown that you’re a selfless team player - you took on the work that no one else wanted!  Seize the opportunities that come along and you’ll show that you can be trusted to improve your team and deliver critical features.

When a rewrite comes along, your best move is to join the maintenance team.  Volunteering for Team Maintenance is safer, comes with more opportunity, and brings you to management’s attention.  Rewrites usually fail and Team Rewrite leaves the company.  Join Team Maintenance for career growth!
