Four ways Scaling Bugs are Different

Scaling bugs don’t really exist; you will never find “unable to scale” in your logs.  Scaling bugs are timing, concurrency, and reliability bugs that emerge as your system scales.  Today I’m going to show you four signs that your system is being plagued by scaling bugs, and four things you can do to buy time and minimize your clients’ pain.

Scaling bugs boil down to “something that used to be reliable is no longer reliable, and your code doesn’t handle the failure gracefully.”  This means they are going to appear in the oldest parts of your codebase, show up inconsistently and in bursts, and hit your most valuable clients the hardest.

Scaling Bugs appear in older, stable parts of your codebase

The oldest parts of your codebase are typically the most stable; that’s how they managed to get old.  But that code was also written with lower performance needs and higher reliability expectations.

Reliability bugs can lie dormant for years, emerging where you least expect them.  I once spent an entire week finding a bug deep in code that hadn’t changed in 10 years.  As long as there were no problems, everything was fine, but a database connection hiccup in one specific function would cause a cascading failure across a distributed task being processed on over 30 servers.

Database connectivity is ridiculously stable these days; you can have hundreds of servers and go weeks without an issue.  Unless your databases are overloaded, which is exactly when the bug struck.

Scaling Bugs Are Inconsistent

Sometimes the system has trouble; sometimes things are fine.  Even more perplexing, scaling bugs occur regardless of whether your code is multi-threaded or stateful.

This makes scaling bugs difficult to find, since you’ll never be able to reproduce them locally.  They won’t appear for a single test execution, only when you have hundreds or thousands of events happening simultaneously.

Even if your code is single threaded and stateless, your system is multi-process and has state.  A serverless design still has scaling bottlenecks at the persistence layer.

Scaling Bugs Are Bursty

Bursty means that the bugs appear in clusters, usually in ever-increasing numbers at ever-shorter intervals.  Initially the error crops up once every few weeks and does minimal damage, so it gets documented as low priority and never worked on.  As your platform scales, though, the error starts popping up 5 at a time every few days, then dozens of times once a day. Eventually the low-priority, low-impact bug becomes an extremely expensive support problem.

Scaling Bugs Hit Your Most Valuable Clients Hardest

Which are the clients with the most contacts in a CRM?  Which are the ones with the most emails? The most traffic and activity?

The same ones paying the most for the privilege of pushing your platform to the limit.

The impact of scaling bugs falls mostly on your most valuable clients, which makes the potential damage high in dollar terms.

Four ways to buy time

These tactics aren’t solutions; they are ways to buy time while you transform your system into one that operates at scale.   I’ll cover some scaling tactics in a future post!

Throw money at the problem

There’s never a better time to throw money at a problem than the early stages of scaling problems!  More clients + larger clients = more dollars available.

Increase the number of servers, upgrade the databases, and increase your network throughput.  If you have a multi-tenant setup, add shards and decrease the number of customers running on the same hardware.

If throwing money at the problem helps, then you know you have scaling problems.  You can also get a rough estimate of your time-for-money runway. If the improved infrastructure doesn’t help, you can downgrade everything and stop spending the extra money.

Keep your Error Rate Low

It’s common for the first time you notice a scaling bug to be when it causes a cascading system failure.  However, it’s rare for that to be the first time the bug manifested itself. Resolving those rare, low-priority bugs is key to keeping catastrophic scaling bugs at bay.

I once worked on a system that ran at over 1 million events per second (100 billion/day).  We had a saying: The nice thing about this system is that something that’s 1 in a million happens 60 times a minute.  The only known error we let stand: Servers would always fail to process the first event after a restart.


As load and scale increase, transient errors become more common. Take a design cue from RESTful systems and add retry logic.  Most modern databases support upsert operations, which go a long way towards making it safe to retry inserts.
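To make that concrete, here’s a minimal retry-with-upsert sketch in Python. Everything in it is illustrative: db.execute stands in for your database client, the events table is hypothetical, and TransientDatabaseError is a placeholder for whatever transient error your driver actually raises.

```python
import time

class TransientDatabaseError(Exception):
    """Placeholder for the transient error type your database driver raises."""

def save_event(db, event, max_attempts=5):
    # Retry transient failures with exponential backoff; the upsert makes
    # each attempt idempotent, so a retried insert can't create duplicates.
    for attempt in range(1, max_attempts + 1):
        try:
            db.execute(
                "INSERT INTO events (event_id, payload) VALUES (%s, %s) "
                "ON CONFLICT (event_id) DO UPDATE SET payload = EXCLUDED.payload",
                (event["id"], event["payload"]),
            )
            return
        except TransientDatabaseError:
            if attempt == max_attempts:
                raise                        # out of attempts, surface the error
            time.sleep(0.1 * 2 ** attempt)   # back off before the next attempt
```

Because the upsert is idempotent, a retry after a timeout can’t create a duplicate row, which is exactly what makes the retry loop safe.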

Asynchronous Processing

Most actions don’t need to be processed synchronously.  Switching to asynchronous processing makes many scaling bugs disappear for a while because the apparent processing time greatly improves.  You still have to do the processing work, and the overall latency of your system may increase. Slowly and reliably processing everything successfully is greatly preferable to praying that everything processes quickly.
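As a rough sketch of the shift, here’s the pattern using Python’s standard-library queue and a worker thread; generate_report is a hypothetical stand-in for whatever slow work you currently do inline. In a real system the in-memory queue would be a durable one (SQS, RabbitMQ, and so on), but the shape of the change is the same.

```python
from queue import Queue
from threading import Thread

work_queue = Queue()

def handle_request(request):
    # The synchronous version would call generate_report(request) right here.
    # The asynchronous version just enqueues it and acknowledges immediately.
    work_queue.put(request)
    return {"status": "accepted"}

def worker():
    while True:
        request = work_queue.get()
        try:
            generate_report(request)   # hypothetical: the slow work, off the request path
        finally:
            work_queue.task_done()

Thread(target=worker, daemon=True).start()
```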

Congratulations!  You Have Scaling Problems!

Scaling bugs only hit systems that get used.  Take solace in the fact that you have something people want to use.

The techniques in this article will help you buy enough time to work up a plan to scale your system.  Analyze your scaling pain points to gain insight into which parts of your system are most useful to your clients and prioritize your refactoring accordingly.

Remember that there are always ways to scale your current system without resorting to a total rewrite!

Your Database is not a Queue

SaaS Scaling anti-patterns: The database as a queue

Using a database as a queue is a natural and organic part of any growing system.  It’s an expedient use of the tools you have on hand. It’s also a subtle mistake that will consume hundreds of thousands of dollars in developer time and cause countless headaches for the rest of your business.  Let’s walk down the easy path into this mess, and then carve a way out.

No matter what your business does on the backend, your client-facing platform will be some kind of web front end, which means you have web servers and a database.  As your platform grows, you will have work that needs to be done but doesn’t make sense in an API/UI format. Daily sales reports and end-of-day reconciliation are common examples.

Simple Straight Through Processing

The initial developer probably didn’t realize they were building a queue.  The initial version would have been a single table called process that tracked client id, date, and completed status.  Your report generator would load a list of active client ids, iterate through them, and mark each one done in the database.
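That first version tends to look something like the sketch below. The table, column names, the db helper, and generate_report are all illustrative, not a prescription:

```python
def run_nightly_reports(db, run_date):
    # Straight-through processing: one worker walks every active client and
    # records a completed row in the process table as it goes.
    for (client_id,) in db.query("SELECT id FROM clients WHERE active = TRUE"):
        generate_report(client_id, run_date)   # hypothetical report generator
        db.execute(
            "INSERT INTO process (client_id, run_date, completed) VALUES (%s, %s, TRUE)",
            (client_id, run_date),
        )
```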

Still pretty simple

Simple, stateful and it works.

For a while.

But some of your clients are bigger, and there are a lot of them, and the process was taking longer and longer, until it wasn’t finishing overnight.  So, to gain concurrency, your developers added worker processes along with “not started” and “in process” states. Thanks to database concurrency guarantees and atomic updates, it only took a few releases to get everything working smoothly, with end-to-end processing time dropping back to something manageable.
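The usual trick for claiming work safely looks something like this Postgres-flavored sketch, assuming for illustration that client_id identifies a row for the nightly run, that the table has gained status and started_at columns, and that db.query_one is a stand-in helper that returns a single row:

```python
def claim_next_job(db):
    # Atomically flip one "not started" row to "in process" and return it.
    # The single UPDATE leans on the database's locking, so two workers can
    # never claim the same row; SKIP LOCKED keeps them from waiting on each other.
    return db.query_one("""
        UPDATE process
           SET status = 'in process', started_at = now()
         WHERE client_id = (
                 SELECT client_id
                   FROM process
                  WHERE status = 'not started'
                  LIMIT 1
                  FOR UPDATE SKIP LOCKED
               )
        RETURNING client_id, run_date
    """)   # returns None when there's nothing left to claim
```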

Databases are atomic, what could go wrong?

Now the database is a queue that prevents duplicate work.

There’s a list of work, a bunch of workers, and with only a few more days of developer time you can even monitor progress as they chew through the queue.

Except your devs haven’t implemented retry logic, because failures are rare.  If the process dies and doesn’t generate a report, then someone, usually support fielding an angry customer call, will notice and ask your developers to stop what they’re doing and restart the process.  No problem: adding code to move “in process” back to “not started” after some amount of time is only a sprint’s worth of work.

Except, sometimes, for some reason, some tasks always fail.  So your developers add a retry counter, and after 5 or so attempts they set the state to “skip” so that the bad jobs don’t keep sucking up system resources.
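Those two patches usually end up as a cleanup pass along these lines, again a sketch with illustrative names and assuming a retries column has been added to the table:

```python
MAX_RETRIES = 5

def requeue_stale_jobs(db):
    # Anything stuck "in process" for over an hour goes back to "not started",
    # unless it has already failed too many times, in which case it's parked
    # as "skip" so it stops eating worker time.
    db.execute("""
        UPDATE process
           SET retries = retries + 1,
               status  = CASE WHEN retries + 1 >= %s THEN 'skip'
                              ELSE 'not started' END
         WHERE status = 'in process'
           AND started_at < now() - interval '1 hour'
    """, (MAX_RETRIES,))
```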

Just a few more sprints and we’ll finally have time to add all kinds of new processes!

Congratulations!  For about $100,000 in precious developer time, your SaaS product has a buggy, inefficient, poorly scaling implementation of database-as-a-queue.  Probably best not to even try to quantify the opportunity costs.

Solutions like SQS and RabbitMQ are available, effectively free, and take an afternoon to set up.

Instead of worrying about how you got here, a better question is: how do you stop throwing good developer resources away and migrate?

Every instance is different, but I find it is easiest to work backwards.

You already have worker code to generate reports.  Have your developers extend it to accept a job from a queue like SQS in addition to the DB.  In the first iteration, the developers can manually add failed jobs to the queue. Likely you already have a manual retry process; migrate that to use the queue.
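If the queue is SQS, that first iteration can be as small as a polling loop bolted onto the existing worker. The queue URL below is made up, and generate_report is the same hypothetical worker function from the earlier sketches:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/report-jobs"  # illustrative

def poll_queue_for_jobs():
    # First iteration: the queue only carries manually re-queued failures,
    # while the nightly DB-driven path keeps running unchanged alongside it.
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for message in response.get("Messages", []):
        job = json.loads(message["Body"])
        generate_report(job["client_id"], job["run_date"])   # same worker code as before
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```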

Queue is integrated within the existing system

Once you have the code working smoothly with a queue, you can start having the job generator write to the queue instead of the database.  Something magical usually happens at this point. You’ll be amazed at how many new types of jobs your developers will want to implement once new functionality no longer requires a database migration.
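In the SQS sketch above, that step is a one-function change to the job generator (still using the illustrative sqs client, QUEUE_URL, and db helper from the earlier sketches):

```python
def enqueue_nightly_jobs(db, run_date):
    # The generator now publishes jobs instead of inserting "not started" rows;
    # the workers pick them up from the queue rather than polling the table.
    for (client_id,) in db.query("SELECT id FROM clients WHERE active = TRUE"):
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"client_id": client_id, "run_date": str(run_date)}),
        )
```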

Soon you’ll be able to run your system off either the DB or the queue, but the DB tables will be empty.

Only then do you refactor the DB queues out of your codebase.

Queue runs the system

Adding a proper queue system gets your team out of the hole and scratches your developers’ itch for shiny new technology.  You get improved functionality after the very first sprint, and you aren’t rewriting your code from scratch.

That’s your best alternative to a total rewrite. Start today!