Three Ways To Refactor a Legacy System – A Cheesy Analogy

Software is immortal, but systems age. They reach maximum capacity and can't scale to support additional clients. They get twisted into knots as your business evolves in ways the system wasn't designed to support.

Without constant vigilance you end up with a system that your developers hate to work on and your clients find frustrating. You realize the current system is holding your business back and ask for options.

The most common answer, unfortunately, is the "6 month rewrite", also known as "a big bang." Just give your developers 6 months and they will produce a new system that does all of the good things from the old system, and none of the bad.

The "6 month rewrite" almost never works and often leaves your company in a worse situation because of all the wasted time and resources. I'm going to try and explain why with a very cheesy analogy, and suggest 2 much more effective strategies.

A very cheesy analogy

Imagine that this piece of string cheese is your system:

"A 6 month rewrite" or "big bang" is the idea that your developers are going to shove the whole thing in their mouths and chew the whole log.

You won't really see any progress during the system mastication, but you'll be able to see the developers' jaws chewing furiously.

6 months is a long time to have developers working on one and only one thing. Especially when the chewing takes longer than expected and you reach the 9, 12, 18 month point. If you stop you'll be left with this:

Your original system. Worse for wear, but fundamentally still the original system that is restricting your business.

It's the worst of all worlds: you get no value unless the whole cheese is chewed, and you lose all the potential value if you stop!

Cut it up into small pieces

A great strategy when your system is failing due to scaling issues is to cut it up and refactor small pieces. Scaling issues include a system that isn't fast enough, can't handle enough clients, or can't handle your largest clients.

You can analyze which of these pieces are responsible for the bottlenecks in your system and tackle just those pieces:

And if you have to stop work on a single piece?

Your potential losses are much smaller.

Steel threads

When your system has been tied in knots by changing requirements, replacing individual pieces won't help. Instead, try peeling off small end-to-end slices, creating standalone pieces that work the way your business works now:

This is the "steel thread" or "tracer bullet" model for refactoring a system.

It lets you try small, quick ways to build a new system. Each thread adds immediate value as it is completed. You don't run the risk of carrying a large body of work that isn't helping your clients.

Like the "small pieces" strategy, you can stop and start without much loss.

Conclusion

6 month rewrites are risky: they are likely to fail and leave you with nothing to show for your investment of time and resources. Small-piece and steel-thread strategies offer ways to get incremental value into your clients' hands quickly, and they greatly reduce the risk of wasted work. They're your best alternative to a total rewrite!

Your Database is not a queue – A Live Example

A while ago I wrote an article, Your Database is not a Queue, where I talked about this common SaaS scaling anti-pattern. At the time I said:

Using a database as a queue is a natural and organic part of any growing system.  It’s an expedient use of the tools you have on hand. It’s also a subtle mistake that will consume hundreds of thousands of dollars in developer time and countless headaches for the rest of your business.  Let’s walk down the easy path into this mess, and how to carve a way out.

Today I have a live example of a SaaS company, layerci.com, proudly embracing the anti-pattern. In this article I will compare my descriptions with theirs, and point out expensive and time consuming problems they will face down the road.

None of this is to hate on layerci.com. An expedient solution that gets your product to market is worth infinitely more than a philosophically correct solution that delays giving value to your clients. My goal is to understand how SaaS companies get themselves into this situation, and to offer paths out of the hole.

What's the same

In my article I described a system evolving out of reporting; here is layerci's version of the same problem:

We hit it quickly at LayerCI - we needed to keep the viewers of a test run's page and the github API notified about a run as it progressed.

I described an accidental queue, while layerci is building one explicitly:

CREATE TYPE ci_job_status AS ENUM ('new', 'initializing', 'initialized', 'running', 'success', 'error');

CREATE TABLE ci_jobs(
	id SERIAL, 
	repository varchar(256), 
	status ci_job_status, 
	status_change_time timestamp
);

/*on API call*/
INSERT INTO ci_jobs(repository, status, status_change_time) VALUES ('https://github.com/colinchartier/layerci-color-test', 'new', NOW());

I suggested that after you have an explicit, atomic queue, your next scaling problem is failures. Layerci punts on this point:

As a database, Postgres has very good persistence guarantees - It's easy to query "dead" jobs with, e.g., SELECT * FROM ci_jobs WHERE status='initializing' AND NOW() - status_change_time > '1 hour'::interval to handle workers crashing or hanging.

What's different

There are a couple of differences between the two scenarios. They aren't material to my point, so I'll give them a quick summary:

  • My system imagines multiple job types; layerci is sticking to a single process type
  • layerci is doing some slick leveraging of PostgreSQL to alleviate the need for a Process Manager. This greatly reduces the amount of work needed to make the system work (see the sketch below).
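
For context, here is a minimal sketch of the kind of PostgreSQL-as-dispatcher trick I'm talking about. I'm assuming a pattern like SELECT ... FOR UPDATE SKIP LOCKED, which lets multiple workers pull from the ci_jobs table without a separate process manager; layerci's actual implementation may differ, and run_ci_job is a hypothetical stand-in for the real work.

# A hedged sketch of a worker claiming jobs straight from the ci_jobs table.
# Assumes PostgreSQL 9.5+ and the psycopg2 driver; not layerci's actual code.
import time
import psycopg2

conn = psycopg2.connect("dbname=ci")

def claim_next_job():
    with conn:  # one transaction per claim; commits or rolls back automatically
        with conn.cursor() as cur:
            cur.execute("""
                UPDATE ci_jobs
                   SET status = 'initializing', status_change_time = NOW()
                 WHERE id = (SELECT id FROM ci_jobs
                              WHERE status = 'new'
                              ORDER BY id
                              LIMIT 1
                              FOR UPDATE SKIP LOCKED)  -- other workers skip this row
                RETURNING id, repository;
            """)
            return cur.fetchone()  # None when the "queue" is empty

while True:
    job = claim_next_job()
    if job is None:
        time.sleep(1)      # simple polling; LISTEN/NOTIFY could avoid the sleep
        continue
    run_ci_job(job)        # hypothetical function that does the real CI work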

What's the problem?

The main problem with layerci's solution is the amount of developer time spent designing it. As a small startup, the time and effort invested in their homegrown solution would almost certainly have been better spent developing new features or talking with clients.

It's the failures

From a technical perspective, the biggest problem is the lack of failure handling. layerci punts on retries:

As a database, Postgres has very good persistence guarantees - It's easy to query "dead" jobs with, e.g., SELECT * FROM ci_jobs WHERE status='initializing' AND NOW() - status_change_time > '1 hour'::interval to handle workers crashing or hanging.

Handling failures is a lot of work, and something you get for free as part of a queue.

Without retries and poison queue handling, these failures will immediately impact layerci's clients and require manual human intervention. You can add failure support, but that's throwing good developer time after bad. Queues give you great support out of the box.
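
To make that concrete, here is a rough sketch of what even minimal retry and poison-job handling looks like if you build it yourself on top of ci_jobs. The attempts column, the MAX_ATTEMPTS threshold, and the alert_on_call hook are my own hypothetical additions, not layerci's schema; they are exactly the plumbing a real queue hands you out of the box.

# Hypothetical retry/poison-job sweep bolted onto ci_jobs; illustration only.
# Assumes an extra column: ALTER TABLE ci_jobs ADD COLUMN attempts int NOT NULL DEFAULT 0;
import psycopg2

MAX_ATTEMPTS = 3
conn = psycopg2.connect("dbname=ci")

def recover_dead_jobs():
    with conn, conn.cursor() as cur:
        # Re-queue jobs whose worker crashed or hung, up to MAX_ATTEMPTS tries.
        cur.execute("""
            UPDATE ci_jobs
               SET status = 'new', attempts = attempts + 1, status_change_time = NOW()
             WHERE status = 'initializing'
               AND NOW() - status_change_time > interval '1 hour'
               AND attempts < %s;
        """, (MAX_ATTEMPTS,))
        # Anything past the limit is a poison job: park it and wake up a human.
        cur.execute("""
            UPDATE ci_jobs
               SET status = 'error', status_change_time = NOW()
             WHERE status = 'initializing'
               AND NOW() - status_change_time > interval '1 hour'
               AND attempts >= %s
            RETURNING id, repository;
        """, (MAX_ATTEMPTS,))
        for job_id, repo in cur.fetchall():
            alert_on_call(job_id, repo)  # hypothetical paging/alerting hook

And that sketch still ignores back-off, per-job error messages, and replaying a poison job once the bug is fixed.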

Monitoring should not be an afterthought

In addition to not handling failures, layerci's solution doesn't handle monitoring either:

Since jobs are defined in SQL, it's easy to generate graphql and protobuf representations of them (i.e., to provide APIs that checks the run status.)

This means that initially you'll be running blind on a solution with no retries. This is the "Our customers will tell us when there's a problem" school of monitoring. That's betting your client relationships on perfect software with no hiccups. I don't like those odds.
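
The bare minimum you want is the backlog depth and the age of the oldest waiting job, pushed somewhere that can alert before your clients notice. Here is a hedged sketch against the ci_jobs table; the thresholds and the emit_metric/alert_on_call helpers are made-up stand-ins for whatever metrics stack you use.

# Minimal health check for a database-as-queue; thresholds are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=ci")

def check_queue_health():
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT COUNT(*) FILTER (WHERE status = 'new') AS backlog,
                   COALESCE(EXTRACT(EPOCH FROM
                       NOW() - MIN(status_change_time) FILTER (WHERE status = 'new')), 0)
              FROM ci_jobs;
        """)
        backlog, oldest_wait_s = cur.fetchone()
    emit_metric("ci_jobs.backlog", backlog)              # hypothetical metrics call
    emit_metric("ci_jobs.oldest_wait_seconds", oldest_wait_s)
    if backlog > 500 or oldest_wait_s > 300:             # made-up thresholds
        alert_on_call("ci_jobs backlog is growing")      # hypothetical alert hook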

Scaling databases is expensive

The design uses a single, ever-growing jobs table, ci_jobs, which will store a row for every job forever. The article points out PostgreSQL's amazing ability to scale, which could keep you ahead of the curve forever, but database scaling is the most expensive piece of any cloud application stack.

Why pay to scale a database to support quick inserts, updates, and triggers on a million-row table? The database is your permanent record; a queue is ephemeral.

Conclusion

No judgement if you build a queue into your database to get your product to market. layerci has a clever solution, but it is incomplete, and by the time you get it working at scale in production you will have squandered a ton of developer resources on a system that is more expensive to run than out-of-the-box solutions.

Do you have a queue in your database? Read my original article for suggestions on how to get out of the hole without doing a total rewrite.

Is your situation unique? I'd love to hear more about it!

What does Go Asynchronous mean?

In an earlier post I suggested Asynchronous Processing as a way to buy time to handle scaling bugs.  With my friend's comment “assume I have a hammer, a screwdriver, and a database” in mind, today’s post will explain synchronous versus asynchronous processing and discuss how asynchronous processing helps your software scale.

Processing: Synchronous versus Asynchronous

Synchronous Explained

Synchronous processing means that each step starts, does some action, and then starts the next step.  Eventually the last action completes and returns, and each step returns in turn, all the way back to the start.

A basic synchronous web request looks like this:

A user clicks save and the browser tells the server to save the data.  The server tells the database. The database returns OK, then the server returns OK, and the browser shows a Save Successful message.

Simple to understand, but when you are having scaling problems, that save time can go from 100ms to 10s.  It’s a horrible user experience and an unnecessary wait!
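
In code, the synchronous version is the one everyone writes first. A minimal sketch, with Flask and a save_to_database helper standing in for whatever framework and storage you actually use; the request holds the connection open until the database finishes.

# Synchronous save: the browser waits on the database.
# Flask and save_to_database() are illustrative stand-ins, not a prescription.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/save", methods=["POST"])
def save():
    record = request.get_json()
    save_to_database(record)              # 100ms on a good day, 10s when you're hurting
    return jsonify({"status": "saved"})   # the user stares at a spinner until this line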

Asynchronous Explained

Asynchronous Processing gives a superior user experience by returning to the browser immediately. The actual save will be processed later. This makes things more complex because the request has been decoupled from the processing.

The user is now insulated from scaling issues.  It doesn’t matter if the save takes 100ms or 10s, the user gets a consistent experience.

In an asynchronous model, the user doesn’t get notified that the save was successful.  For most cases this is fine: the user shouldn’t have to worry about whether their actions are succeeding, and the client should be able to assume success.

The client being able to assume success does not mean your system can assume success!  Your system still needs to handle failures, exceptions and retries! You just don’t need to drag the user into it. Since you no longer have a direct path from request through processing, asynchronous operations can be harder to reason about and debug.

For instances where “blind” asynchronous processing isn’t acceptable, you need a polling mechanism so that the user can check on the status.
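
Here is a sketch of the same endpoint gone asynchronous. Python’s built-in queue module stands in for a real broker (RabbitMQ, SQS, and so on), and save_to_database is still hypothetical; the point is the shape: the request enqueues and returns instantly, a worker does the slow save, and a status endpoint covers the cases where “blind” asynchronous isn’t acceptable.

# Asynchronous save: acknowledge immediately, process later, poll for status.
# queue.Queue stands in for a real broker; statuses live in a dict for brevity.
import queue, threading, uuid
from flask import Flask, request, jsonify

app = Flask(__name__)
work_queue = queue.Queue()
job_status = {}                           # job_id -> "pending" | "done" | "error"

@app.route("/save", methods=["POST"])
def save():
    job_id = str(uuid.uuid4())
    job_status[job_id] = "pending"
    work_queue.put((job_id, request.get_json()))
    return jsonify({"status": "accepted", "job_id": job_id}), 202   # instant ack

@app.route("/save/<job_id>", methods=["GET"])
def save_status(job_id):                  # polling endpoint for non-"blind" cases
    return jsonify({"status": job_status.get(job_id, "unknown")})

def worker():
    while True:
        job_id, record = work_queue.get()
        try:
            save_to_database(record)      # hypothetical slow save; retries belong here
            job_status[job_id] = "done"
        except Exception:
            job_status[job_id] = "error"

threading.Thread(target=worker, daemon=True).start()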

How Asynchronous Processing Helps Systems to Scale

With synchronous processing your system must process all of the incoming activity and events as they occur, or your clients will experience random, intermittent failures.

Synchronous scaling results in numerous business problems:

  • It runs up infrastructure costs. The only way to protect service level agreements is to greatly over-provision your system so that there is significant excess capacity.
  • It creates reputational problems. Clients can easily impact each other with cyclical behavior.  Morning email blasts, hourly advertising spending rates, and Black Friday are some examples.
  • You never know how much improvement you’ll get out of the next fix.  As your system scales you will always be rate-limited by a single bottleneck.  If your system is limited to 100 events/s because your database can only handle 100 events/s, doubling the hardware might get you to 200 events/s, or you might discover that your servers can only handle 120 events/s. 
  • You don’t have control over your system’s load.  The processing rate is set by your clients instead of your architecture. There is no way to relieve pressure on your system without a failure.

Asynchronous processing gives you options:

  • You can protect your service level agreements by pushing incoming events onto queues and acknowledging the event instantly.  Whether it takes 100ms, 1s, or 10 minutes to complete processing, your system is living up to its service level agreements.
  • After quickly acknowledging the event, you can control the rate at which the queued events are processed at a client level.  This makes it difficult for your large clients to starve out the small ones (see the sketch after this list).
  • Asynchronous architecture forces you to loosely couple your system’s components. Each piece becomes easy to load test in isolation, so you’ll have a pretty good idea of how much a fix will actually help. It also makes small iterations much more effective.  Instead of spending 2x to double your databases when your servers can only support another 20%, you can increase spending 20% to match your servers’ max capacity. Loosely coupled components can also be worked on by different teams at the same time, making it much easier to scale your system.
  • You regain control over system load.  Instead of everything, all at once, you can set expectations.  If clients want faster processing guarantees, you can now not only provide them, but charge accordingly.
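
As promised above, a hedged sketch of client-level rate control: one queue per client, drained round-robin, so a big client can’t starve out the small ones. The in-memory queues and the process handler are illustrative; a real system would use a broker with per-tenant queues and a tunable drain rate.

# Per-client queues drained round-robin; illustration only, not production code.
import collections, queue, threading, time

client_queues = collections.defaultdict(queue.Queue)    # client_id -> its own queue

def enqueue(client_id, event):
    client_queues[client_id].put(event)                  # instant acknowledgement

def worker(events_per_second=100):
    while True:
        processed = False
        for client_id, q in list(client_queues.items()): # visit every client each pass
            try:
                event = q.get_nowait()
            except queue.Empty:
                continue
            process(event)                               # hypothetical event handler
            processed = True
            time.sleep(1 / events_per_second)            # cap the overall drain rate
        if not processed:
            time.sleep(0.1)                              # idle: nothing to drain

threading.Thread(target=worker, daemon=True).start()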

Conclusion

Shifting from synchronous to asynchronous processing will require some refactoring of your current system, but it’s one of the most effective ways to overcome scaling problems.  You can be highly tactical with your implementation efforts and apply asynchronous techniques at your current bottlenecks to rapidly give your system breathing room.  

If your developers are ready to give up on your current system, propose one or two spots to make asynchronous. You will get your clients some relief while rebuilding your team's confidence and ability to iterate. It’s your best alternative to a total rewrite!

The Five Ws for Developers

You may remember learning the 5 Ws: Who, What, Where, When and Why, way back in kindergarten.  Today I want to talk about the 5 Ws for Developers and how they will transform you from a developer who is told what features to build into one who is asked how features should be built.

  • Who am I building this feature for?
  • What is this feature supposed to accomplish?
  • Where will I see impact if the feature is a success?
  • When will I see the impact?
  • Why build this feature now?

These questions overlap and build upon each other to give you the full context around a request. Ask these questions for every new feature you get assigned.  Especially if you think you know the answer!

Who am I building this feature for?

Does it interface with humans or other computers?  
Is it for internal or client use? 
Which clients?

The answers to these questions will tell you a lot about performance requirements, whether things should be synchronous or asynchronous, and even if you need to build a UI.

What is this feature supposed to accomplish?

This is probably the most omitted piece of data in feature requests.

I have seen countless stories that boil down to “Build a CRUD app to collect and maintain this set of data” without any context about what the data is supposed to accomplish.  That results in countless followup stories to “Add these missing data points to the CRUD app”, because the original ask couldn’t accomplish the goal.

As a developer you are responsible for making sure the code accomplishes the goal.  If you don’t know what you’re trying to achieve, you’re going to end up rewriting the feature.  Repeatedly.

Where will I see the impact of the feature?

There are 2 goals for this question: ensure that everyone has agreed upon what success looks like, and ensure that you write your code to make tracking success easy.

If you get pushback to the tune of “the impact can’t be measured”, that is a sign that the asker is just a note taker and doesn’t understand the request.  Most features exist to get the Who to do more of the What.  If you can’t find where that will manifest, then the feature request is broken.

When will I see the impact?

Performance and efficiency features should have instantaneous impact.  Improvements in a sales pipeline could take 6 months to manifest. Know when you should start seeing movement, and try to find ways to make it as early as possible.

Why build this feature now?

For developers, there is always more work than time.  You need to know why this feature’s time is now.

Is there an upcoming regulatory change?
Is the current performance causing your clients pain?
Maybe the system is overloaded and you need it to scale?
Will the new feature bring in a new client or keep one from leaving?

How these questions tie together

When you know the answers to the 5 Ws, a magical thing happens.

You will see how to deliver a feature that achieves the What for the Who earlier and with less effort.  Most of the time you’ll never end up implementing the original feature, and soon people will start asking you how to implement features instead of telling you!

Replacement is not a release plan

Replacement is not a release plan; it’s a sign that you are solving developer pain instead of client pain.

Deployment gets glossed over in the pitch: first we will mimic the existing functionality, then we will turn off the old system.

Since the plan is to re-implement the current functionality, your developers can start immediately!  No need to talk to the clients, since they won’t notice any difference until we show them all the wonderful improvements!

Developers get super excited about these kinds of rewrites because it is all about them and their pain.  The plan fails because the client cares about client pain, not developer pain.

Don’t assume the client wants what you are giving them!  Don’t assume they would love for you to give them more features, better code, or anything else that excites your developers.  A more common situation is that someone has a full-time job doing manual data extractions, transformations, and other manipulations that software could do in seconds and your developers could write in a week.

Find your client’s pain.  Appeal to your developers’ sense of empathy.  If they hate dealing with the system, have them imagine the low-level person being kept in a pointless job.  It’s a good bet that once your developers find out how their software is being used, they’ll find that there’s no need for a rewrite; the clients need new tools, not replacements.
