Three Ways To Refactor a Legacy System – A Cheesy Analogy

Software is immortal, but systems age. They reach maximum capacity and can't scale to support additional clients. They get twisted into knots as your business evolves in ways the system wasn't designed to support.

Without constant vigilance you end up with a system that your developers hate to work on and your clients find frustrating. You realize the current system is holding your business back and ask for options.

The most common answer, unfortunately, is the "6 month rewrite", also known as "a big bang." Just give your developers 6 months and they will produce a new system that does all of the good things from the old system, and none of the bad.

The "6 month rewrite" almost never works and often leaves your company in a worse situation because of all the wasted time and resources. I'm going to try and explain why with a very cheesy analogy, and suggest 2 much more effective strategies.

A very cheesy analogy

Imagine that this piece of string cheese is your system:

"A 6 month rewrite" or "big bang" is the idea that your developers are going to shove the whole thing in their mouths and chew the whole log.

You won't really see any progress during the system mastication, but you'll be able to see the developers' jaws chewing furiously.

6 months is a long time to have developers working on one and only one thing. Especially when the chewing takes longer than expected and you reach the 9, 12, 18 month point. If you stop you'll be left with this:

Your original system. Worse for the wear, but fundamentally still the original system that is restricting your business.

It's the worst of all worlds: you get no value unless the whole cheese is chewed, and you lose all the potential value if you stop!

Cut it up into small pieces

A great strategy when your system is failing due to scaling issues is to cut it up and refactor small pieces. Scaling issues include your system not being fast enough, unable to handle enough clients, or unable to handle large clients.

You can analyze which of these pieces are responsible for the bottlenecks in your system and tackle just those pieces:

And if you have to stop work on a single piece?

Your potential losses are much smaller.

Steel threads

When your system has been tied in knots due to changing requirements, replacing individual pieces won't help. Instead, try peeling off small end-to-end slices, creating standalone pieces that work the way your business works now:

This is the "steel thread" or "tracer bullet" model for refactoring a system.

It allows you to try small, quick, ways to build a new system. Each thread adds immediate value as it is completed. You don't run the risk of having a large body of work that isn't helping your clients.

Like the "small pieces" strategy, you can stop and start without much loss.

Conclusion

6 month rewrites are risky: they are likely to fail and leave you with nothing of value from your investment of time and resources. The small pieces and steel threads strategies offer ways to quickly get incremental value into your clients' hands, and they greatly reduce the risk of wasted work. They're your best alternative to a total rewrite!

Your Database is not a queue – A Live Example

A while ago I wrote an article, Your Database is not a Queue, where I talked about this common SaaS scaling anti-pattern. At the time I said:

Using a database as a queue is a natural and organic part of any growing system.  It’s an expedient use of the tools you have on hand. It’s also a subtle mistake that will consume hundreds of thousands of dollars in developer time and countless headaches for the rest of your business.  Let’s walk down the easy path into this mess, and how to carve a way out.

Today I have a live example of a SaaS company, layerci.com, proudly embracing the anti-pattern. In this article I will compare my descriptions with theirs, and point out the expensive and time-consuming problems they will face down the road.

None of this is to hate on layerci.com. An expedient solution that gets your product to market is worth infinitely more than a philosophically correct solution that delays giving value to your clients. My goal is to understand how SaaS companies get themselves into this situation, and offer paths out of the hole.

What's the same

In my article I described a system evolving out of reporting; here is layerci's version of the problem:

We hit it quickly at LayerCI - we needed to keep the viewers of a test run's page and the github API notified about a run as it progressed.

I described an accidental queue, while layerci is building one explicitly:

CREATE TYPE ci_job_status AS ENUM ('new', 'initializing', 'initialized', 'running', 'success', 'error');

CREATE TABLE ci_jobs(
	id SERIAL, 
	repository varchar(256), 
	status ci_job_status, 
	status_change_time timestamp
);

/*on API call*/
INSERT INTO ci_jobs(repository, status, status_change_time) VALUES ('https://github.com/colinchartier/layerci-color-test', 'new', NOW());

I suggested that after you have an explicit, atomic, queue your next scaling problem is with failures. Layerci punts on this point:

As a database, Postgres has very good persistence guarantees - It's easy to query "dead" jobs with, e.g., SELECT * FROM ci_jobs WHERE status='initializing' AND NOW() - status_change_time > '1 hour'::interval to handle workers crashing or hanging.

What's different

There are a couple of differences between the two scenarios. They aren't material to my point, so I'll give them a quick summary:

  • My system imagines multiple job types, while layerci is sticking to a single process type.
  • layerci is doing some slick leveraging of PostgreSQL to alleviate the need for a process manager. This greatly reduces the amount of work needed to get the system running.

What's the problem?

The main problem with layerci's solution is the amount of developer time spent designing it. As a small startup, the time and effort invested in their home-grown solution would almost certainly have been better spent developing new features or talking with clients.

It's the failures

From a technical perspective, the biggest problem is the lack of failure handling. layerci punts on retries:

As a database, Postgres has very good persistence guarantees - It's easy to query "dead" jobs with, e.g., SELECT * FROM ci_jobs WHERE status='initializing' AND NOW() - status_change_time > '1 hour'::interval to handle workers crashing or hanging.

Handling failures is a lot of work, and something you get for free as part of a queue.

Without retries and poison queue handling, these failures will immediately impact layerci's clients and require manual human intervention. You can add failure support, but that's throwing good developer time after bad. Queues give you great support out of the box.
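
To make that concrete, here is a minimal sketch of the failure handling a managed queue gives you essentially for free. It assumes boto3 and made-up queue names; it illustrates SQS's redrive/dead-letter feature, not layerci's actual setup.

import json
import boto3

sqs = boto3.client("sqs")

# A dead letter queue to hold jobs that keep failing (the "poison" jobs).
dlq_url = sqs.create_queue(QueueName="ci-jobs-dead-letter")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# The work queue: a message that fails 5 times is moved to the dead letter
# queue instead of blocking everything else, and a worker that crashes simply
# lets its message reappear after the visibility timeout.
sqs.create_queue(
    QueueName="ci-jobs",
    Attributes={
        "VisibilityTimeout": "3600",
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        ),
    },
)

Five failed receives and the bad job moves aside while the rest of the work keeps flowing, no human intervention required.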

Monitoring should not be an afterthought

In addition to not handling failure, layerci's solution doesn't handle monitoring either:

Since jobs are defined in SQL, it's easy to generate graphql and protobuf representations of them (i.e., to provide APIs that checks the run status.)

This means that initially you'll be running blind on a solution with no retries. This is the "Our customers will tell us when there's a problem" school of monitoring. That's betting your client relationships on perfect software with no hiccups. I don't like those odds.
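
By contrast, queue depth is a metric a managed queue reports automatically. As a rough sketch (assuming boto3, a hypothetical queue name, and an existing alert topic), this is roughly all it takes to get paged when jobs start backing up:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="ci-jobs-backlog",
    Namespace="AWS/SQS",                              # SQS publishes these metrics automatically
    MetricName="ApproximateNumberOfMessagesVisible",  # jobs waiting in the queue
    Dimensions=[{"Name": "QueueName", "Value": "ci-jobs"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=3,        # 15 minutes of sustained backlog before alerting
    Threshold=1000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical alert topic
)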

Scaling databases is expensive

The design uses a single, ever-growing jobs table, ci_jobs, which will store a row for every job forever. The article points out PostgreSQL's amazing ability to scale, which could keep you ahead of the curve indefinitely. But database scaling is the most expensive piece in any cloud application stack.

Why pay to scale a database to support quick inserts, updates, and triggers on a million-row table? The database is your permanent record; a queue is ephemeral.

Conclusion

No judgment if you build a queue into your database to get your product to market. layerci has a clever solution, but it is incomplete, and by the time you get it working at scale in production you will have squandered tons of developer resources on a system that is more expensive to run than out-of-the-box solutions.

Do you have a queue in your database? Read my original article for suggestions on how to get out of the hole without doing a total rewrite.

Is your situation unique? I'd love to hear more about it!

What does Go Asynchronous mean?

In an earlier post I suggested Asynchronous Processing as a way to buy time to handle scaling bugs.  Remembering my friend and his comment “assume I have a hammer, a screwdriver, and a database”, today’s post will explain Synchronous versus Asynchronous processing and discuss how asynchronous processing will help your software scale.

Processing: Synchronous versus Asynchronous

Synchronous Explained

Synchronous processing means that each step starts, does some action, and then starts the next step.  Eventually the last action completes and returns, and the results flow back up the chain.

A basic synchronous web request looks like this:

A user clicks save and the browser tells the server to save the data.  The server tells the database. The database returns OK, then the server returns OK, and the browser shows a Save Successful message.
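
In code, the synchronous flow looks roughly like this. This is an illustrative sketch (Flask and a stand-in save_to_database function, not code from any particular system): the handler cannot return until the database write finishes.

from flask import Flask, request, jsonify

app = Flask(__name__)

def save_to_database(payload):
    # Stand-in for the real write; imagine it taking 100ms today and 10s once
    # the database is struggling.
    ...

@app.route("/save", methods=["POST"])
def save():
    save_to_database(request.get_json())   # the browser is waiting for this whole call
    return jsonify({"status": "saved"})    # "Save Successful" only after the database returns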

Simple to understand, but when you are having scaling problems, sometimes that save time can go from 100ms to 10s.  It’s a horrible user experience and an unnecessary wait!

Asynchronous Explained

Asynchronous Processing gives a superior user experience by returning to the browser immediately. The actual save will be processed later. This makes things more complex because the request has been decoupled from the processing.

The user is now insulated from scaling issues.  It doesn’t matter if the save takes 100ms or 10s, the user gets a consistent experience.

In an asynchronous model, the user doesn’t get notified that the save was successful.  For most cases this is fine: the user shouldn’t have to worry about whether their actions are succeeding, and the client should be able to assume success.

The client being able to assume success does not mean your system can assume success!  Your system still needs to handle failures, exceptions and retries! You just don’t need to drag the user into it. Since you no longer have a direct path from request through processing, asynchronous operations can be harder to reason about and debug.

For instances where “blind” asynchronous isn’t acceptable you need a polling mechanism so that the user can check on the status.
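
Here is a sketch of the same endpoint made asynchronous, including a polling endpoint for status checks. The queue URL, job id scheme, and status lookup are assumptions for illustration, not a prescribed design.

import json
import uuid
import boto3
from flask import Flask, request, jsonify

app = Flask(__name__)
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/save-requests"  # hypothetical queue

def get_job_status(job_id):
    # Stand-in: however the worker records progress (a status table, a cache, ...).
    ...

@app.route("/save", methods=["POST"])
def save():
    job_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id, "payload": request.get_json()}),
    )
    # Whether the real save takes 100ms or 10s, the user gets the same instant response.
    return jsonify({"status": "accepted", "job_id": job_id}), 202

@app.route("/save/<job_id>", methods=["GET"])
def save_status(job_id):
    # Polling endpoint for the cases where "blind" asynchronous isn't acceptable.
    return jsonify({"job_id": job_id, "status": get_job_status(job_id)})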

How Asynchronous Processing Helps Systems to Scale

With synchronous processing your system must process all of the incoming activity and events as they occur, or your clients will experience random, intermittent failures.

Synchronous scaling results in numerous business problems:

  • It runs up infrastructure costs. The only way to protect service level agreements is by greatly over provisioning your system so that there is significant excess capacity.
  • It creates reputational problems. Clients can easily impact each other with cyclical behavior.  Morning email blasts, hourly advertising spending rates, and Black Friday are some examples.
  • You never know how much improvement you’ll get out of the next fix.  As your system scales you will always be rate-limited by a single bottleneck.  If your system is limited to 100 events/s because your database can only handle 100 events/s, doubling the hardware might get you to 200 events/s, or you might discover that your servers can only handle 120 events/s. 
  • You don’t have control over your system’s load.  The processing rate is set by your clients instead of your architecture. There is no way to relieve pressure on your system without a failure.

Asynchronous processing gives you options:

  • You can protect your service level agreements by pushing incoming events onto queues and acknowledging the event instantly.  Whether it takes 100ms, 1s, or 10 minutes to complete processing, your system is living up to its service level agreements.
  • After quickly acknowledging the event, you can control the rate at which the queued events are processed at a client level.  This makes it difficult for your large clients to starve out the small ones.
  • Asynchronous architecture forces you to loosely couple your system’s components. Each piece becomes easy to load test in isolation, so you’ll have a pretty good idea of how much a fix will actually help. It also makes small iterations much more effective.  Instead of spending 2x to double your databases when your servers can only support another 20%, you can increase spending 20% to match your servers’ max capacity. Loosely coupled components can also be worked on by different teams at the same time, making it much easier to scale your system.
  • You regain control over system load.  Instead of everything, all at once, you can set expectations.  If clients want faster processing guarantees, you can now not only provide them, but charge accordingly.
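
For completeness, here is a minimal sketch of the worker side of this architecture (boto3 and a hypothetical queue URL; the process function is a stand-in). The key point is that the processing rate is set by how many of these workers you run, not by how fast clients send events.

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/incoming-events"  # hypothetical queue

def process(event):
    ...  # the actual work: write to the database, render the report, call the API

def run_worker():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling: wait cheaply when the queue is empty
        )
        for message in resp.get("Messages", []):
            process(json.loads(message["Body"]))
            # Acknowledge only after success; a crash means the message
            # reappears and is retried instead of being lost.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])

if __name__ == "__main__":
    run_worker()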

Conclusion

Shifting from synchronous to asynchronous processing will require some refactoring of your current system, but it’s one of the most effective ways to overcome scaling problems.  You can be highly tactical with your implementation efforts and apply asynchronous techniques at your current bottlenecks to rapidly give your system breathing room.  

If your developers are ready to give up on your current system, propose one or two spots to make asynchronous. You will get your clients some relief while rebuilding your team's confidence and ability to iterate. It’s your best alternative to a total rewrite!

Four ways Scaling Bugs are Different

Scaling bugs don’t really exist; you will never find “unable to scale” in your logs.  Scaling bugs are timing, concurrency, and reliability bugs that emerge as your system scales.  Today I’m going to show you 4 signs that your system is being plagued by scaling bugs, and 4 things you can do to buy time and minimize your clients’ pain.

Scaling bugs boil down to “Something that used to be reliable is no longer reliable and your code doesn’t handle the failure gracefully”.  This means that they are going to appear in the oldest parts of your codebase, be inconsistent, bursty, and hit your most valuable clients the hardest.

Scaling Bugs appear in older, stable, parts of your codebase

The oldest parts of your codebase are typically the most stable; that’s how they managed to get old.  But that code was also written with lower performance needs and higher reliability expectations.

Reliability bugs can lay dormant for years, emerging where you least expect it.  I once spent an entire week finding a bug deep in code that hadn't changed in 10 years.  As long as there were no problems, everything was fine, but a database connection hiccup in one specific function would cause a cascading failure on a distributed task being processed on over 30 servers.

Database connectivity is ridiculously stable these days; you can have hundreds of servers and go weeks without an issue.  Unless your databases are overloaded, and that’s exactly when the bug struck.

Scaling Bugs Are Inconsistent

Sometimes the system has trouble, sometimes things are fine.  Even more perplexing is that they occur regardless of multi-threading or the statefulness of your code.

This makes scaling bugs difficult to find, since you’ll never be able to reproduce them locally.  They won’t appear for a single test execution, only when you have hundreds or thousands of events happening simultaneously.

Even if your code is single threaded and stateless, your system is multi-process and has state.  A serverless design still has scaling bottlenecks at the persistence layer.

Scaling Bugs Are Bursty

Bursty means that the bugs appear in clusters, usually in ever increasing numbers after ever shorter intervals.  Initially the error crops up once every few weeks and does minimal damage, so it gets documented as low priority and never worked on.  As your platform scales though, the error starts popping up 5 at a time every few days, then dozens of times a day. Eventually the low priority, low impact bug becomes an extremely expensive support problem.

Scaling Bugs Hit Your Most Valuable Clients Hardest

Which are the clients with the most contacts in a CRM?  Which are the ones with the most emails? The most traffic and activity?

The same ones paying the most for the privilege of pushing your platform to the limit.

The impact of scaling bugs falls mostly on your most valuable clients, which makes the potential impact high in dollar terms.

Four ways to buy time

These tactics aren’t solutions; they are ways to buy time to transform your system into one that operates at scale.  I’ll cover some scaling tactics in a future post!

Throw money at the problem

There’s never a better time to throw money at a problem than the early stages of scaling problems!  More clients + larger clients = more dollars available.

Increase the number of servers, upgrade the databases, and increase your network throughput.  If you have a multi-tenant setup, add shards and decrease the number of customers running on the same hardware.

If throwing money at the problem helps, then you know you have scaling problems.  You can also get a rough estimate of the time-for-money runway. If the improved infrastructure doesn’t help you can downgrade everything and stop spending the extra money.

Keep your Error Rate Low

It’s common for the first time you notice a scaling bug to be when it causes a cascading system failure.  However, it’s rare for that to be the first time the bug manifested itself. Resolving those low priority rare bugs is key to keeping catastrophic scaling bugs at bay.

I once worked on a system that ran at over 1 million events per second (100 billion/day).  We had a saying: The nice thing about this system is that something that’s 1 in a million happens 60 times a minute.  The only known error we let stand: Servers would always fail to process the first event after a restart.

Retries

As load and scale increase, transient errors become more common. Take a design cue from RESTful systems and add retry logic.  Most modern databases support upsert operations, which go a long way towards making it safe to retry inserts.
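
A minimal sketch of what that can look like, assuming a PostgreSQL-style upsert and a hypothetical events table; the backoff numbers are arbitrary:

import random
import time

def with_retries(action, attempts=5, base_delay=0.5):
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # give up; let the caller (or a dead letter queue) deal with it
            # exponential backoff with jitter so retries don't arrive as another burst
            time.sleep(base_delay * (2 ** attempt) + random.random())

def record_event(conn, event_id, payload):
    # The upsert makes the retry safe: running it twice does not create duplicates.
    # Assumes a hypothetical events table with a unique constraint on event_id.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO events (event_id, payload) VALUES (%s, %s) "
            "ON CONFLICT (event_id) DO UPDATE SET payload = EXCLUDED.payload",
            (event_id, payload),
        )
    conn.commit()

# usage: with_retries(lambda: record_event(conn, event_id, payload))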

Asynchronous Processing

Most actions don’t need to be processed synchronously.  Switching to asynchronous processing makes many scaling bugs disappear for a while because the apparent processing speed greatly improves.  You still have to do the processing work, and the overall latency of your system may increase. Slowly and reliably processing everything successfully is greatly preferable to praying that everything processes quickly.

Congratulations!  You Have Scaling Problems!

Scaling bugs only hit systems that get used.  Take solace in the fact that you have something people want to use.

The techniques in this article will help you buy enough time to work up a plan to scale your system.  Analyze your scaling pain points to gain insight into which parts of your system are most useful to your clients and prioritize your refactoring accordingly.

Remember that there are always ways to scale your current system without resorting to a total rewrite!

The Five Ws for Developers

You may remember learning the 5 Ws: Who, What, Where, When and Why, way back in kindergarten.  Today I want to talk about the 5 Ws for Developers and how they will transform you from a developer that is told what features to build into one that is asked how features should be built.

  • Who am I building this feature for?
  • What is this feature supposed to accomplish?
  • Where will I see impact if the feature is a success?
  • When will I see the impact?
  • Why build this feature now?

These questions overlap and build upon each other to give you the full context around a request. Ask these questions for every new feature you get assigned.  Especially if you think you know the answer!

Who am I building this feature for?

Does it interface with humans or other computers?  
Is it for internal or client use? 
Which clients?

The answers to these questions will tell you a lot about performance requirements, whether things should be synchronous or asynchronous, and even if you need to build a UI.

What is this feature supposed to accomplish?

This is probably the most omitted piece of data in feature requests.

I have seen countless stories that boil down to “Build a CRUD app to collect and maintain this set of data” without any context about what the data is supposed to accomplish.  That has resulted in countless followup stories to “Add these missing data points to the CRUD app”, because the original ask couldn’t accomplish the goal.

As a developer you are responsible for making sure the code accomplishes the goal.  If you don’t know what you’re trying to achieve, you’re going to end up rewriting the feature.  Repeatedly.

Where will I see the impact of the feature?

There are 2 goals for this question: ensure that everyone has agreed upon what success looks like, and ensure that you write your code to make tracking success easy.

If you get pushback to the tune of “the impact can’t be measured”, that is a sign that the asker is just a note taker and doesn’t understand the goal.  Most features exist to get Who to do more of What.  If you can’t find Where that will manifest, then the feature request is broken.

When will I see the impact?

Performance and efficiency features should have instantaneous impact.  Improvements in a sales pipeline could take 6 months to manifest. Know when you should start seeing movement, and try to find ways to make it as early as possible.

Why build this feature now?

For developers, there is always more work than time.  You need to know why this feature’s time is now.

Is there an upcoming regulatory change?
Is the current performance causing your clients pain?
Maybe the system is overloaded and you need it to scale?
Will the new feature bring in a new client or keep one from leaving?

How these questions tie together

When you know the answers to the 5 Ws, a magical thing happens.

You will see how to deliver a feature that achieves What for Who earlier and with less effort.  Most of the time you won’t end up implementing the original feature as requested, and soon people will start asking you how to implement features instead of telling you!

Tactical API Obsolescence

For a SaaS company, the cost of sudden API changes isn’t measured in your developers’ time, but in your clients’ trust and goodwill.  Worse, the burden doesn’t fall on small clients at your lowest tier; it falls squarely on your most valuable enterprise customers, who have invested in custom API integrations.  API changes are unexpected hits to your clients’ timelines and budgets. Nothing will change your image from a trusted partner to a flake that needs to be replaced faster than a client developer spending a day on a hotfix because their API integration stopped working.

At the same time, you need to keep improving and expanding your service.  How do you keep moving forward? The answer is extremely low tech: set expectations and communicate your API Obsolescence plans.

By law, auto manufacturers have to make parts for 10 years after they stop selling a car.  Microsoft typically guarantees 5 years of support for Windows as part of the purchase price, with an option to pay for 5 additional years.

Unless you publish a sunset date for an API feature, your clients will expect you to keep supporting it for the life of the contract.  Most SaaS companies offer discounts for annual renewals; your API is locked in until a year from tomorrow!

Can you afford to support your current API for another year?  Can you afford to change it?

If you don’t have an Obsolescence strategy, here are 5 tips for getting started:

  • Notify early!  A year is ideal, but a week is better than nothing.  Your goal is to be a trustworthy and stable partner. Putting notifications on your website will give you clarity and your support reps something to bring to your clients.
  • The less lead time you can give, the more important it is to reach out and talk to your most important clients.  Make sure you have a way to communicate with your clients’ developers. Email lists, documentation, even a notice on login.  From your clients’ perspective, all outages from missed notifications are your fault, so you have to take responsibility for notifications.
  • Know which clients are using which features.  Metrics around which clients use which pieces of your API are critical for making prioritization and communication decisions.  These metrics don’t need to be real time; you can generate them nightly or weekly from your server log files (see the sketch after this list). But if you don’t know who uses what, you have to assume everyone uses everything.
  • API versioning makes everything easier.  If you aren’t using a versioning scheme, that should be your first change.
  • An endpoint is a contract, not code, and you should only communicate contract changes.  You can create a new endpoint that uses old code, or completely rewrite the code behind an existing endpoint.  Endpoints are a decoupling point, and they give you a bit of freedom and wiggle room.
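
Here is a rough sketch of the usage metrics mentioned above: a nightly job that counts API calls per client and endpoint from access logs. The log format, field order, and file name are assumptions; adapt them to whatever your servers actually write.

import csv
from collections import Counter

def usage_from_log(path):
    usage = Counter()
    with open(path, newline="") as handle:
        # assumed log format: timestamp, client_id, method, endpoint, status
        for timestamp, client_id, method, endpoint, status in csv.reader(handle):
            usage[(client_id, method, endpoint)] += 1
    return usage

if __name__ == "__main__":
    for (client_id, method, endpoint), calls in usage_from_log("api-access.log").most_common():
        print(f"{client_id}\t{method} {endpoint}\t{calls}")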

Taking control of your API’s lifecycle is key to managing your tech debt. Tactical obsolescence may be your Best Alternative to a Total Rewrite.

Creating Alternatives To Rewrites By Using Topics

I’ve been explaining what queues are and why using a database as a queue is a bad idea. Today I’m going to expand your toolkit, explain how Topics create alternatives to rewrites, and give a concrete example.

Topics allow multiple queues to register for incoming messages.  That means instead of publishing a message onto a single queue, you publish onto zero or more queues at once, with no impact on the publisher.  One consumer, no consumers, 100 consumers: you publish one message onto a topic.

All of these situations require the same effort and resources from your publisher.

For a SaaS company with services running off queues, Topics give your developers the ability to create new services that run side-by-side with your existing infrastructure.  New functionality off of your existing infrastructure, without doing a rewrite! How does that work?

Adding a new consumer means adding another Queue to the Topic. 

No code changes for any existing services.  This is extremely valuable when the existing services are poorly documented and difficult to test.
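
As a sketch of how little is involved (using AWS SNS and SQS via boto3, with made-up names), adding a consumer is just creating a queue and subscribing it to the existing topic:

import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# create_topic returns the existing topic if it is already there
topic_arn = sns.create_topic(Name="job-events")["TopicArn"]

# The new consumer gets its own queue...
queue_url = sqs.create_queue(QueueName="task-monitor")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# ...and subscribes to the existing topic. Publishers don't change at all.
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)

# Note: a real setup also needs a queue policy allowing the topic to deliver
# messages to the queue; it's omitted here to keep the sketch short.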

You can test new versions of your code through end-to-end tests.  

Since you can now create two sets of the data, you can run the new version in parallel with the current version and compare the results.  Rinse, repeat until you feel confident in sending your customers results from the new system.

It's not ideal, but you'll sleep a whole lot easier at night knowing that the original code and original output remains untouched.

New uses for your message flow have no impact on your existing services.  

Consuming data becomes “loosely coupled”.  Freed from potentially impacting the existing, difficult code, new reports, monitoring, and other ideas become feasible and exciting instead of dread-inducing.  New uses don’t even have to be in the same programming language!

A concrete example: how Topics can be used to create monitoring on a legacy system

I worked for a company that was processing jobs off of a queue.  This was an older system that had evolved over a decade and was a mess of spaghetti code.  It mostly worked, but was not designed for any kind of observability. Because jobs like hourly reports would run, rerun, and even retry, knowing whether a specific hourly report completed successfully was a major support headache.

When challenged to improve the situation the lead developer would shrug and say that nothing could be done with the current code.  Instead, he had a plan to do a full rewrite of the scheduler system with logging, tests, and observability baked in. The rewrite would take 6 months.  The flaws, bugs and angry customers weren’t quite enough to justify a developer spending 6 months developing a new system. Especially since the new system wouldn’t add value until it was complete.  The company didn’t have the resources for a rewrite, but it did have me.

The original system was using SQS on AWS as the queue.  We changed the scheduler code to use AWS’s Topic service, SNS, instead.  We had SNS write incoming messages to the original SQS queue, and called it a release.
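
The publisher-side change was about that small. As an illustrative sketch (names and message shape invented, not the company's actual code), the scheduler's send-to-queue call becomes a publish-to-topic call:

import json
import boto3

# Before: the scheduler wrote straight to the queue the job processor reads.
#   sqs = boto3.client("sqs")
#   sqs.send_message(QueueUrl=JOBS_QUEUE_URL, MessageBody=json.dumps(job))

# After: the scheduler publishes to a topic, and the original queue is
# subscribed to it, so the job processor keeps working unchanged.
sns = boto3.client("sns")
JOBS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:job-events"  # hypothetical topic

def publish_job(job):
    sns.publish(TopicArn=JOBS_TOPIC_ARN, Message=json.dumps(job))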

We now had the option and ability to add new services without any further disruption or risk to the original job processor.

We created a new service with the creative name Task Monitor, created a new SQS queue, and added it as a listener to the SNS topic.  Task Monitor maintained a list of active tasks: it would read messages off its queue and create an entry in an in-memory list.  Every 5 minutes it would iterate the list, check the status of each task against the database, and remove completed tasks.

Surviving tasks were added to an “Over 5 min” list, an “Over 10 min” list, and so on, and the data was exposed via a simple web API.  Anything over 45 minutes generated an alert.
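
A compressed sketch of Task Monitor's loop, with the queue listener, database check, and alerting left as stand-in functions; the aging buckets are the interesting part:

import time
from datetime import datetime, timedelta

active_tasks = {}  # task_id -> time the task was first seen

def read_new_task_ids_from_queue():
    return []  # stand-in for the SQS listener feeding Task Monitor

def task_is_complete_in_db(task_id):
    ...  # stand-in: ask the legacy database whether the task finished

def send_alert(task_id):
    ...  # stand-in: page support / post to the ops channel

def sweep():
    now = datetime.utcnow()
    buckets = {"over_5_min": [], "over_10_min": [], "over_45_min": []}
    for task_id, first_seen in list(active_tasks.items()):
        if task_is_complete_in_db(task_id):
            del active_tasks[task_id]
            continue
        age = now - first_seen
        if age > timedelta(minutes=45):
            buckets["over_45_min"].append(task_id)
            send_alert(task_id)  # anything over 45 minutes generates an alert
        elif age > timedelta(minutes=10):
            buckets["over_10_min"].append(task_id)
        elif age > timedelta(minutes=5):
            buckets["over_5_min"].append(task_id)
    return buckets  # exposed through the simple web API in the real service

def run():
    while True:
        for task_id in read_new_task_ids_from_queue():
            active_tasks.setdefault(task_id, datetime.utcnow())
        sweep()
        time.sleep(300)  # every 5 minutes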

We now had visibility into which tasks were slipping through the cracks, and with the pattern exposed we were quickly able to fix the bugs.  Client complaints ceased (about scheduled reports, anyway), which reduced the support load by about 60% of one developer. With almost 3 additional developer days per week we were able to start knocking out some long-delayed features and refactoring.

All of these changes were enabled by a simple swap of a call to SQS for a call to SNS.  We didn’t need to dive deep into the legacy system to add monitoring and instrumentation.

The additional cost and load of using Topics is negligible, but they create amazingly powerful opportunities, especially for legacy systems that are difficult to refactor.

When your developers say that there’s no way to improve a queue based system without rewriting it, look into Topics.  They’re your Best Alternative to a Total Rewrite.

Legacy System Seniors – The Lord vs The Reformer

Mission critical legacy systems tend to attract two flavors of Senior Developer: The Lord and The Reformer.  The Lord runs his team to ensure his usefulness to the company and his primacy in the developer hierarchy, while The Reformer focuses on rebuilding the team and business value.

If you are a manager taking on a new legacy team, or a developer on a legacy system getting a new senior developer, the good news is that you will quickly be able to tell which group your senior falls into.  The difference is stark and telling within 3-6 months if you’re watching for these 5 major signs:

  1. Both The Lord and The Reformer may point to a piece of code and say “I’m the only one who can understand this piece of code”.  The Lord will say it as a point of pride and evidence of his superiority. The Reformer says it sadly, knowing that if only the senior developer can understand the code, then the code is bad, even if it works correctly.  Over time The Reformer will shrink the inscrutable parts of the codebase through refactoring and tests. The Lord will both proclaim that nothing can be done, and complain loudly and often about being held back by the terrible state of the code.
  2. When a junior developer joins the team and asks about documentation, both may say “this is a legacy system and there is no documentation”.  The Lord will then suggest the new developer read the code to understand, and may even smirk and suggest that the new developer write some documentation during the journey to enlightenment.  The Reformer will point to the parts of the system that have been documented, discuss the best way to get familiar with the code, and offer to write additional documentation as questions arise.  The Reformer expands documentation with every hire; The Lord will complain that adding people is taking away from his development time.
  3. The Lord instinctively knows that critical developers can’t be fired and works to increase the company’s dependency on him.  The Reformer knows that if you can’t be fired, you also can’t be promoted. The Reformer works to ensure he can step out of his current team at a moment’s notice so that he is available to step into a bigger role.  This sign is a little subtle, but consider your bus factor. If the Senior got hit by a bus, how long would it take the rest of the team to handle support? The time for Reformers is measured in days, Lords in weeks.
  4. The Lord believes that the system couldn’t possibly function without him and speaks of his overnight heroics as a point of pride.  The Reformer views anything that requires his presence as a mistake that needs correcting. The Lord will rarely fix the underlying issues that result in heroics, while The Reformer makes them a priority.  As a result The Lord will spend a steady amount of time dealing with production, while The Reformer quickly oversees a very calm environment.
  5. The Lord will take all of the new feature work “to ensure it is architected correctly” and leave all of the maintenance work to the junior developers on the team.  The Reformer will take all the maintenance work and teach the rest of the team how to build new features. The Lord’s team will have an ever increasing amount of maintenance, and production issues, while The Reformer’s team will see bugs and maintenance work drop dramatically.

The good news is that you’ll know very quickly which Senior you have on your team.

The bad news is that The Reformer won’t be around long.  By improving the system and the team, The Reformer makes himself unnecessary, and you’ll soon lose him to a new disaster.  There is a silver lining: teams run by Reformers become attractive to other developers, and you will have volunteers looking to join the new hot team.

The worst news though, is that The Lord isn’t going anywhere.  His antics will actively drive away developers and managing him is your problem.

What is a Queue Anyway?

Several readers wrote in response to my article Your Database is not a Queue to tell me that they don’t know what a queue is, or why they would use one.  As one reader put it, “assume I have 3 tools, a hammer, a screwdriver and a database”. Fair enough, this week I want to talk about what a message queue is, what features it offers, and why it is a superior solution for batch processing.

To keep things as concrete as possible, I’ll use a real world example, Nightly Report Generation, and AWS technologies.

Your customers want to know how things are going.  If your SaaS integrates with a shopping cart, how many sales did you do?  If you do marketing, how many potential customers did you reach? How many leads were generated?  Emails sent? Whatever your service, you should be letting your customers know that you’re killing it for them.

Most SaaS companies have some form of RESTful API for customers, and initially you can ask clients to help themselves and generate reports on demand.  But as you grow, on demand reporting becomes too slow. Code that worked for a client with 200 customer shopping carts a day may be too slow at 2,000 or 20,000 carts.  These are great problems to have!

To meet your client’s needs, you need a system to generate reports overnight.  It’s not client facing, so it doesn’t need to be RESTful, and it’s not driven by client activity, so it won’t run itself.

Enter queues!

For this article we will use AWS’s SQS, which stands for Simple Queue Service.  There are lots of other wonderful open source solutions like RabbitMQ, but you are probably already using AWS, and SQS is cheap and fully managed.

What does a queue do?

  • A queue gives you a list of messages that can only be accessed in order.
  • A queue guarantees that one, and only one, consumer can see a message at a time.
  • A queue guarantees that messages do not get lost: if a consumer takes a message from the queue, it has a certain amount of time to tell the queue that the message is done, or the queue will assume something has gone wrong and make the message available again.
  • A queue offers a Dead Letter Queue: if something goes wrong too many times, the message is removed and placed on the Dead Letter Queue so that the rest of the list can be processed.

As a practical matter all that boils down to:

  • You can run multiple instances of the report generator without worrying about missing a client, or sending the same report twice.  You get to skip the early concurrency and scaling problems you’d encounter if you wrote your own code, or tried to use a database.
  • When your code has a bug in an edge case, you’ll still be able to generate all of the reports that don’t hit the edge case.
  • You can alert on failures, see which reports failed, and after you fix the bug, put them back into the queue to run again with the click of a button.
  • Because SQS is managed by AWS, you get monitoring and alerts out of the box.
  • Because SQS is managed by AWS you don’t have to worry about the queue’s scale or performance.
  • You can add API endpoints to generate “on demand” messages and put them on the queue.  This means that the “standard” report messages, and client generated “on demand” report requests follow the same process and you don’t need to build and maintain separate reporting pipelines for internal and external requests.
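
To illustrate that last point, here is a sketch of the nightly scheduler and an "on demand" endpoint feeding the same queue. The endpoint, queue URL, and message shape are assumptions for illustration.

import json
import boto3
from flask import Flask, jsonify

app = Flask(__name__)
sqs = boto3.client("sqs")
REPORTS_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/report-requests"  # hypothetical

def enqueue_report(client_id, requested_by):
    sqs.send_message(
        QueueUrl=REPORTS_QUEUE_URL,
        MessageBody=json.dumps({"client_id": client_id, "requested_by": requested_by}),
    )

def schedule_nightly(active_client_ids):
    # The overnight scheduler: one message per active client.
    for client_id in active_client_ids:
        enqueue_report(client_id, requested_by="nightly-schedule")

@app.route("/clients/<client_id>/reports", methods=["POST"])
def request_report(client_id):
    # The client-facing "on demand" request: the exact same message.
    enqueue_report(client_id, requested_by="api")
    return jsonify({"status": "queued"}), 202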

Everything is tradeoffs, and queues don’t solve all problems well.  What kinds of problems are you going to have with SQS?

  • Priority/Re-sorting existing messages - You have 10,000 messages on the queue and an important client needs a report NOW!  Prioritized messages, or changing the order of messages on the queue is not something SQS supports. You can have a “fast lane” queue that gets high priority messages, but it’s clunky.
  • Random Access - You have 10,000 messages and want to know where in the queue a specific client’s report is?  Queues let you add at the end and take from the front, that’s it. If you need to know what’s in the middle you’ll need to maintain that information in a separate system.
  • Random Deleting - Probably not important for reports, but for cases like bulk importing, if a client changes her mind and wants to cancel a job, you can’t reach into the middle of the queue and remove the message.
  • Process order is not guaranteed - If you have a 3 part task, and you cannot start Task 2 until Task 1 finishes, you cannot add all 3 tasks onto the queue at once.  It is highly likely that a second worker will come along and start Task 2 while Task 1 is still in process. Instead you will need to have Task 1 add Task 2 onto the queue when it finishes.

None of these problems will crop up in your early iterations, and they are great problems to have!  They are signs that your SaaS is growing to meet your clients’ needs; you and your clients are thriving!

To bring it full circle, what if you already have a home grown system for running reports?  How can you get on the path to queues? See my article about common ways into the DB-as-a-queue mess, and some suggestions on how to get out.

Your Database is not a Queue

SaaS Scaling anti-patterns: The database as a queue

Using a database as a queue is a natural and organic part of any growing system.  It’s an expedient use of the tools you have on hand. It’s also a subtle mistake that will consume hundreds of thousands of dollars in developer time and countless headaches for the rest of your business.  Let’s walk down the easy path into this mess, and how to carve a way out.

No matter what your business does on the backend, your client facing platform will be some kind of web front end, which means you have web servers and a database.  As your platform grows, you will have work that needs to be done but doesn’t make sense in an API/UI format. Daily sales reports and end-of-day reconciliation are common examples.

Simple Straight Through Processing

The initial developer probably didn’t realize he was building a queue.  The initial version would have been a single table called process which tracked client id, date, and completed status.  Your report generator would load a list of active client ids, iterate through them, and write “done” to the database.

Still pretty simple

Simple, stateful and it works.

For a while.

But some of your clients are bigger, and there are a lot of them, and the process was taking longer and longer, until it wasn’t finishing overnight.  So to gain concurrency, your developers added worker processes along with “not started” and “in process” states. Thanks to database concurrency guarantees and atomic updates, it only took a few releases to get everything working smoothly, with the end-to-end processing time dropping back to something manageable.

Databases are atomic, what could go wrong?

Now the database is a queue and preventing duplicate work.

There’s a list of work, a bunch of workers, and with only a few more days of developer time you can even monitor progress as they chew through the queue.

Except your devs haven’t implemented retry logic because failures are rare.  If the process dies and doesn’t generate a report, then someone, usually support fielding an angry customer call, will notice and ask your developers to stop what they’re doing and restart the process.  No problem, adding code to move “in process” back to “not started” after some amount of time is only a sprint’s worth of work.

Except, sometimes, for some reason, some tasks always fail.  So your developers add a counter for retries, and after 5 or so, they set the state to “skip” so that the bad jobs don’t keep sucking up system resources.

Just a few more sprints and we'll finally have time to add all kinds of new processes!

Congratulations!  For about $100,000 in precious developer time, your SaaS product has a buggy, inefficient, poor scaling implementation of database-as-a-queue.  Probably best not to even try to quantify the opportunity costs.

Solutions like SQS and RabbitMQ are available, effectively free, and take an afternoon to set up.

Instead of worrying about how you got here, a better question is how do you stop throwing good developer resources away and migrate?

Every instance is different, but I find it is easiest to work backwards.

You already have worker code to generate reports.  Have your developers extend the code to accept a job from a queue like SQS in addition to the DB.  In the first iteration, the developers can manually add failed jobs to the queue. Likely you already have a manual retries process; migrate that to use the queue.
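
A sketch of that first step, with the existing database logic left as stand-in functions: the worker checks the queue first and falls back to the process table, so both paths keep working during the migration.

import json
import boto3

sqs = boto3.client("sqs")
REPORT_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/report-jobs"  # hypothetical queue

def claim_job_from_db():
    ...  # existing logic: atomically flip a row from "not started" to "in process"

def mark_job_done_in_db(job):
    ...  # existing logic: set the row's status to "done"

def generate_report(job):
    ...  # the worker code you already have

def next_job():
    # Check the queue first; fall back to the database while both are in use.
    resp = sqs.receive_message(QueueUrl=REPORT_QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=5)
    messages = resp.get("Messages", [])
    if messages:
        return json.loads(messages[0]["Body"]), messages[0]["ReceiptHandle"]
    return claim_job_from_db(), None

def work_once():
    job, receipt = next_job()
    if job is None:
        return
    generate_report(job)
    if receipt is not None:
        sqs.delete_message(QueueUrl=REPORT_QUEUE_URL, ReceiptHandle=receipt)
    else:
        mark_job_done_in_db(job)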

Queue is integrated within the existing system

Once you have the code working smoothly with a queue, you can start having the job generator write to the queue instead of the database.  Something magical usually happens at this point. You’ll be amazed at how many new types of jobs your developers will want to implement once new functionality no longer requires a database migration.

Soon, you’ll be able to run your system off the db or a queue, but the db tables will be empty.

Only then do you refactor the db queues out of your codebase.

Queue runs the system

Adding a proper queue system gets your team out of the hole and scratches your developers’ itch for shiny new technology.  You get improved functionality after the very first sprint, and you aren’t rewriting your code from scratch.

That’s your best alternative to a total rewrite, start today!
