For SaaS with a pure Single Tenant model, infrastructure consolidation usually drives the first two steps towards a Multi-Tenant model: converting the front end servers to be Multi-Tenant and switching the client databases from physical to logical isolation. These two steps usually happen nearly simultaneously, because as a SaaS grows beyond a handful of clients, infrastructure costs skyrocket and operations become unmanageable.
Considering the 5 factors laid out in the introduction and addendum (complexity, security, scalability, consistent performance, and Synergy), this move greatly increases scalability at the cost of increased complexity, decreased security, and an open door to consistent performance problems. Synergy is not immediately impacted, but these changes make adding Synergy at a later date much easier.
Why is this such an early move when it has 3 negative factors and only 1 positive? Because pure Single Tenant designs have nearly insurmountable scalability problems, and these two changes are the fastest, most obvious and most cost effective solution.
Shifting from Single Tenant servers and databases to Multi-Tenant slightly increases software complexity in exchange for massively decreasing platform complexity.
The web servers need to be able to determine which client a request is for, usually through subdomains like client.mySaaS.com, and use that knowledge to validate the user and connect to the correct database to retrieve data.
The difficult and risky part here is making sure that valid sessions stay associated with the correct account.
Database server consolidation tends to be less tricky. Most database servers support multiple schemas with their own credentials and logical isolation. Logical separation provides unique connection settings for the web servers. Individual client logins are restricted to the client’s schema and the SaaS developers do not need to treat logical and physical separation any differently.
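As a sketch, tenant resolution from the request's host name might look like the following. The TENANT_DB_SETTINGS registry and both function names are hypothetical; in production the registry would live in a config store or a control-plane database rather than in code.

```python
# Hypothetical registry mapping subdomain -> schema-scoped credentials.
TENANT_DB_SETTINGS = {
    "acme":   {"schema": "acme",   "user": "acme_app"},
    "globex": {"schema": "globex", "user": "globex_app"},
}

def tenant_from_host(host):
    """Extract the tenant from a Host header like acme.mySaaS.com."""
    subdomain = host.split(".")[0].lower()
    if subdomain not in TENANT_DB_SETTINGS:
        raise LookupError(f"unknown tenant: {subdomain}")
    return subdomain

def connection_settings(host):
    """Return the schema-scoped connection settings for this tenant."""
    return TENANT_DB_SETTINGS[tenant_from_host(host)]
```

Because each tenant's credentials are restricted to its own schema, the application code stays the same whether the schemas live on one database server or many.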
Migrations and Versioning Become Expensive
The biggest database problems with a many-to-many design crop up during migrations. Inevitably, web and database changes will be incompatible between versions. Some SaaS models require all clients on the same version, which limits compatibility issues to the release window (which itself can take days), while other models allow clients to be on different versions for years.
The general solution to the problem of long lived versions is to stand up a pool of web and database servers on the new version, migrate clients to the new pool, and update request routing.
The biggest risk around these changes is database secret handling; every server can now connect to every database. Compromising a single server becomes a vector for exposing data from multiple clients. This risk can be limited by proxy layers that keep database connections away from public facing web servers. Still, a compromised server is now a risk to multiple clients.
Changing from physical to logical database separation is less risky. Each client will still be logically separated with their own schema, and permissioning should make it impossible to run queries across multiple clients.
Scalability is the goal of Multi-Tenant Infrastructure Consolidation.
In addition to helping the SaaS, the consolidation will also help clients. Shared server pools increase stability and uptime by providing access to a much larger group of active servers, and that extra slack makes it much easier for the SaaS to absorb bursts in client activity.
Likewise, running multiple clients on larger database clusters generally increases uptime and provides slack for bursts and spikes.
These changes only impact response times when the single tenant setup would have been overwhelmed. The minimum response times don’t change, but the maximum response times get lower and occur less frequently.
The flip side to the tenancy change is the introduction of the Noisy Neighbor problem. This mostly impacts the database layer and occurs when large clients overwhelm the database servers and drown out resources for smaller clients.
This can be especially frustrating to clients because it can happen at any time, last for an unknown period, and there’s no warning or notification. Things “get slow” and there are no guarantees about how often clients are impacted, notice, or complain.
There is no direct Synergy impact from changing the web and database servers.
A SaaS starting from a pure Single Tenant model is not pursuing Synergy, otherwise the initial model would have been Multi-Tenant.
Placing distinct client schemas onto a single server does open the door to future Synergy work. Working with data in SQL across different schemas on the same server is much easier than working across physical servers. The work would still require changing the security model and writing quite a bit of code. There is now a doorway if the SaaS has a reason to walk through.
As discussed in the introduction, a SaaS may begin with a purely Single Tenant model for several reasons. High infrastructure bills and poor resource utilization will quickly drive an Infrastructure Consolidation to Multi-Tenant servers and logically separated databases.
The exceptions to this rule are SaaS with a few very large clients, or with clients that have high security requirements. These SaaS will have to price and market themselves accordingly.
Infrastructure Consolidation is an early driver away from a pure Single Tenant model to Multi-Tenancy. The change is mostly positive for clients, but does add additional security and client satisfaction risks.
In Part 1 of Making Link Tracking Scale I showed how switching event recording from synchronous to asynchronous processing creates a superior, faster, and more consistent user experience. In Part 2, I will discuss how Link Tracking scaling issues are governed by Long Tails, and how to overcome the initial burst using edge caching and tools like Memcached and Redis.
The Long Tails of Link Tracking
When your client sends an email campaign or publishes new content, your link tracker will experience a giant burst of activity that then decays rapidly.
To illustrate with some numbers, imagine an email blast that results in 100,000 link tracking events. 80% of those will occur in the first hour.
In our original design from Part 1, that would mean 22 URL lookups and 22 inserts per second (80,000 events spread over 3,600 seconds).
For simplicity, pretend that inserts and selects produce similar db load. Your system would need to support 44 events/s to avoid slowdowns and frustrating your clients.
The asynchronous model reduces the load to 22 URL lookups/s and a controllable number of inserts. Again for simplicity, let's go with 8 inserts/s, for a total of 30 events/s. That's roughly a 1/3 reduction in load!
But, your system is still looking up the Original URL 22 times/s. That’s a lot of unnecessary db load.
Edge Caching The Original URL
The Original URL is static data that can be cached on the web server instead of being loaded from the database for each event. Each server retrieves the Original URL from the db once, stores it in memory, and reuses it as needed.
This effectively drops the lookup rate from 22 events/s to 0 events/s, reducing the db load to 8 events/s, a 73% drop from the asynchronous model! Combined with the asynchronous processing improvements from Part 1, that's an 80% reduction in max database load.
Edge Caching on the servers works for a while, but as your clients expand the number of URLs you’ll need to keep track of won’t fit in server memory. At that point you’ll need to add in tools like Memcached or Redis. Like web servers, these tools are a lot cheaper than scaling your database.
Consistent Load on the Database
The great thing about this design is that you can keep the db load consistent, regardless of the incoming traffic. Whether the load is 44 events/s or 100 events/s, you control the rate of asynchronous processing. So long as you have room on your servers for an internal queue, or you use an external queue like RabbitMQ or SQS, you can delay processing the events.
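A sketch of that rate control. An in-memory queue.Queue stands in for RabbitMQ or SQS, and MAX_INSERTS_PER_TICK is the made-up insert budget (8/s, matching the earlier example) that keeps the database load flat during bursts:

```python
import queue

events = queue.Queue()        # stands in for an internal or external queue
MAX_INSERTS_PER_TICK = 8      # the db load we choose to allow per tick

def handle_click(event):
    """Hot path: enqueue the event and redirect the user immediately."""
    events.put(event)

def drain_once(insert):
    """One scheduler tick: insert at most MAX_INSERTS_PER_TICK events."""
    done = 0
    while done < MAX_INSERTS_PER_TICK:
        try:
            insert(events.get_nowait())
        except queue.Empty:
            break
        done += 1
    return done
```

A burst of clicks fills the queue instantly, but the database only ever sees the fixed drain rate; the queue depth, not the db, absorbs the spike.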
Scaling questions become discussions about cost and how quickly your clients need to see results.
Caching static data is a great way to reduce database load. You can use prebuilt libraries like Guava for Java, cacheout for Python, or dozens of others. You can also leverage distributed cache systems like Memcached and Redis. While there’s no such thing as a free lunch, web servers and distributed caches are much much cheaper to scale than databases.
You’ll save money and deliver a superior experience to your clients and their users!
Scaling Bugs don't really exist; you will never find "unable to scale" in your logs. Scaling bugs are timing, concurrency, and reliability bugs that emerge as your system scales. Today I'm going to show you 4 signs that your system is being plagued by scaling bugs, and 4 things you can do to buy time and minimize your clients' pain.
Scaling bugs boil down to “Something that used to be reliable is no longer reliable and your code doesn’t handle the failure gracefully”. This means that they are going to appear in the oldest parts of your codebase, be inconsistent, bursty, and hit your most valuable clients the hardest.
Scaling Bugs appear in older, stable, parts of your codebase
The oldest parts of your codebase are typically the most stable; that's how they managed to get old. But the code was also written with lower performance needs and higher reliability expectations.
Reliability bugs can lie dormant for years, emerging where you least expect them. I once spent an entire week finding a bug deep in code that hadn't changed in 10 years. As long as there were no problems, everything was fine, but a database connection hiccup in one specific function would cause a cascading failure in a distributed task running on over 30 servers.
Database connectivity is ridiculously stable these days; you can have hundreds of servers and go weeks without an issue. Unless your databases are overloaded, which is exactly when the bug struck.
Scaling Bugs Are Inconsistent
Sometimes the system has trouble, sometimes things are fine. Even more perplexing is that they occur regardless of multi-threading or the statefulness of your code.
This makes scaling bugs difficult to find, since you’ll never be able to reproduce them locally. They won’t appear for a single test execution, only when you have hundreds or thousands of events happening simultaneously.
Even if your code is single threaded and stateless, your system is multi-process and has state. A serverless design still has scaling bottlenecks at the persistence layer.
Scaling Bugs Are Bursty
Bursty means that the bugs appear in clusters, usually in ever increasing numbers after ever shorter intervals. Initially the error crops up once every few weeks and does minimal damage, so it gets documented as low priority and never worked on. As your platform scales though, the error starts popping up 5 at a time every few days, then dozens of times a day. Eventually the low priority, low impact bug becomes an extremely expensive support problem.
Scaling Bugs Hit Your Most Valuable Clients Hardest
Which are the clients with the most contacts in a CRM? Which are the ones with the most emails? The most traffic and activity?
The same ones paying the most for the privilege of pushing your platform to the limit.
The impact of scaling bugs falls mostly on your most valuable clients, which makes their potential impact high in dollar terms.
Four ways to buy time
These tactics aren’t solutions, they are ways to buy time to transform your system to one that operates at scale. I’ll cover some scaling tactics in a future post!
Throw money at the problem
There's never a better time to throw money at a problem than the early stages of scaling problems! More clients + larger clients = more dollars available.
Increase the number of servers, upgrade the databases, and increase your network throughput. If you have a multi-tenant setup, add shards and decrease the number of customers running on the same hardware.
If throwing money at the problem helps, then you know you have scaling problems. You can also get a rough estimate of the time-for-money runway. If the improved infrastructure doesn’t help you can downgrade everything and stop spending the extra money.
Keep your Error Rate Low
It’s common for the first time you notice a scaling bug to be when it causes a cascading system failure. However, it’s rare for that to be the first time the bug manifested itself. Resolving those low priority rare bugs is key to keeping catastrophic scaling bugs at bay.
I once worked on a system that ran at over 1 million events per second (100 billion/day). We had a saying: The nice thing about this system is that something that’s 1 in a million happens 60 times a minute. The only known error we let stand: Servers would always fail to process the first event after a restart.
As load and scale increases, transient errors become more common. Take a design cue from RESTful systems and add retry logic. Most modern databases support upsert operations, which go a long way towards making it safe to retry inserts.
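A hedged sketch of that retry logic. The TransientError class and with_retries helper are illustrative, not from any particular library; the real trigger would be your driver's timeout or connection exceptions:

```python
import time

class TransientError(Exception):
    """Stand-in for a driver's transient failure (timeout, dropped connection)."""

def with_retries(fn, attempts=3, backoff=0.0):
    """Run fn, retrying transient failures; the final failure is re-raised."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)  # linear backoff between tries
```

Pair the retry wrapper with an upsert (for example, INSERT ... ON CONFLICT DO UPDATE in PostgreSQL) so that replaying the same insert cannot create duplicate rows.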
Most actions don't need to be processed synchronously. Switching to asynchronous processing makes many scaling bugs disappear for a while because apparent processing speed greatly improves. You still have to do the processing work, and the overall latency of your system may increase. Slowly and reliably processing everything successfully is greatly preferable to praying that everything processes quickly.
Congratulations! You Have Scaling Problems!
Scaling bugs only hit systems that get used. Take solace in the fact that you have something people want to use.
The techniques in this article will help you buy enough time to work up a plan to scale your system. Analyze your scaling pain points to gain insight into which parts of your system are most useful to your clients and prioritize your refactoring accordingly.
Remember that there are always ways to scale your current system without resorting to a total rewrite!
SaaS Scaling anti-patterns: The database as a queue
Using a database as a queue is a natural and organic part of any growing system. It’s an expedient use of the tools you have on hand. It’s also a subtle mistake that will consume hundreds of thousands of dollars in developer time and countless headaches for the rest of your business. Let’s walk down the easy path into this mess, and how to carve a way out.
No matter what your business does on the backend, your client facing platform will be some kind of web front end, which means you have web servers and a database. As your platform grows, you will have work that needs to be done but doesn't make sense in an API/UI format. Daily sales reports and end-of-day reconciliation are common examples.
The initial developer probably didn't realize they were building a queue. The initial version would have been a single table called process which tracked client id, date, and completed status. Your report generator would load a list of active client ids, iterate through them, and write "done" to the database.
Simple, stateful and it works.
For a while.
But some of your clients are bigger, and there are a lot of them, and the process was taking longer and longer, until it wasn't finishing overnight. So, to gain concurrency, your developers added worker processes along with "not started" and "in process" states. Thanks to database concurrency guarantees and atomic updates, it only took a few releases to get everything working smoothly, with end-to-end processing time dropping back to something manageable.
Now the database is a queue and preventing duplicate work.
There’s a list of work, a bunch of workers, and with only a few more days of developer time you can even monitor progress as they chew through the queue.
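The claim step described above boils down to a compare-and-set UPDATE. A minimal sketch, with sqlite3 standing in for the production database and the process table simplified to two columns:

```python
import sqlite3

# sqlite3 stands in for the production database; the table mirrors the
# "process" table described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE process (client_id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO process VALUES (?, 'not started')", [(1,), (2,), (3,)])

def claim_job(conn):
    """Atomically move one job from 'not started' to 'in process'."""
    row = conn.execute(
        "SELECT client_id FROM process WHERE status = 'not started' LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # nothing left to claim
    # The status check in the WHERE clause is the compare-and-set: if
    # another worker claimed this job first, rowcount is 0 and we claim nothing.
    updated = conn.execute(
        "UPDATE process SET status = 'in process' "
        "WHERE client_id = ? AND status = 'not started'",
        (row[0],),
    ).rowcount
    return row[0] if updated else None
```

It works, which is exactly the trap: every new requirement (retries, timeouts, skip states, progress monitoring) becomes more bespoke code on top of this table.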
Except your devs haven't implemented retry logic because failures are rare. If the process dies and doesn't generate a report, then someone, usually support fielding an angry customer call, will notice and ask your developers to stop what they're doing and restart the process. No problem: adding code to move "in process" back to "not started" after some amount of time is only a sprint's worth of work.
Except, sometimes, for some reason, some tasks always fail. So your developers add a counter for retries, and after 5 or so, they set the state to “skip” so that the bad jobs don’t keep sucking up system resources.
Congratulations! For about $100,000 in precious developer time, your SaaS product has a buggy, inefficient, poor scaling implementation of database-as-a-queue. Probably best not to even try to quantify the opportunity costs.
Solutions like SQS and RabbitMQ are available, effectively free, and take an afternoon to set up.
Instead of worrying about how you got here, a better question is how do you stop throwing good developer resources away and migrate?
Every instance is different, but I find it is easiest to work backwards.
You already have worker code to generate reports. Have your developers extend the code to accept a job from a queue like SQS in addition to the DB. In the first iteration, the developers can manually add failed jobs to the queue. Likely you already have a manual retries process; migrate that to use the queue.
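One way to sketch that transition step (next_job and get_db_job are hypothetical names; an in-memory queue.Queue stands in for SQS):

```python
import queue

def next_job(jobs, get_db_job):
    """Prefer the new queue; fall back to the legacy database-as-queue."""
    try:
        return jobs.get_nowait()
    except queue.Empty:
        return get_db_job()  # legacy path, deleted once the tables run empty
```

The worker doesn't care where the job came from, which is what lets you migrate producers one at a time instead of doing a rewrite.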
Once you have the code working smoothly with a queue, you can start having the job generator write to the queue instead of the database. Something magical usually happens at this point. You'll be amazed at how many new types of jobs your developers will want to implement once new functionality no longer requires a database migration.
Soon, you’ll be able to run your system off the db or a queue, but the db tables will be empty.
Only then do you refactor the db queues out of your codebase.
Adding a proper queue system gets your team out of the hole and scratches your developers' itch for shiny new technology. You get improved functionality after the very first sprint, and aren't rewriting your code from scratch.
That’s your best alternative to a total rewrite, start today!