But how do you get started? How do you shorten your stride from shooting the moon, to one small step?
The next series of posts is going to lay out my scaling iterative delivery framework. This site is about scaling SaaS software, and this framework works best if you want an order of magnitude more of what you already offer your clients. This isn’t a general framework, and it certainly isn’t the only way to get started with iterative delivery.
Work your way through these steps:
Pick a goal – 1 sentence, highly aspirational and self-explanatory.
Define the characteristics of your goal – What measurable characteristics does your system need in order to achieve your goal?
What are the implications? – What technical things would have to be true in order for your system to have all the characteristics you need?
What are the blockers? – What is stopping you from making the implications true?
What can you do to weaken the blockers? – Set aside the goal, characteristics and implications; what can you do to weaken the blockers?
Weakening the blockers is where you start delivering iteratively. As the blockers disappear, your system becomes better for your clients and easier for you to implement your technical needs.
We will explore each step in depth in the following posts.
You become a Scaleup when your SaaS’s service offering becomes compelling and you start attracting exponentially more clients.
All at once you have a lot more clients, clients with a lot more data.
Solutions that support 1,000 clients buckle as you pass 5,000. Suddenly, 25,000 clients is only months away.
Services that support hundreds of thousands of transactions a day fall hopelessly behind as you onboard clients with millions of transactions.
You finally know what customers want. You quickly find the edges of your system. Money is rolling in from customers and VCs. You can throw money at the problems to literally buy time to find a solution.
But you’re faced with a looming question – moonshots or baby steps.
Moonshots Are About You, Baby Steps Are About Your Clients
It’s not about you or your SaaS, it’s about your client’s outcomes.
Moonshots are appealing because they take you directly to where you need to be. Your system needs to scale 10x today and 100x next year; why not go straight for 100x?
Baby steps feel like aiming low because the impact on you is small. But it’s not about you! Think about the impact on your clients.
From a technology perspective, sending emails 1% faster is ::yawn::
But for your clients, faster emails means more engagement, which means more sales.
Would your clients rather have more sales this week, compounding every week for the next year, or flat sales for a year while you build a moonshot?
Clients who churn, or go out of business, won’t get value from the moonshot. Even if you deliver greater value eventually, your clients are better off getting some value now.
Are you delivering value to your SaaS or your clients?
The Chestburster is an antipattern that occurs when transitioning from a monolith to services.
The team sees an opportunity to extract a small piece of functionality from the monolith into a new service, but the monolith is the only place that handles security, permissions and composition.
Because the new service can’t face clients directly, the Chestburster hides behind the monolith, hoping to burst through at some later point.
The Chestburster begins as the inverse of the Strangler pattern: the monolith delegates to the new service, instead of the new service delegating to the monolith.
Why it’s appealing
The Chestburster’s appeal is that it gets the New Service up and running quickly. This looks like progress! The legacy code is extracted, possibly rewritten, and maybe better.
Why it fails
There is no business case for building the functionality the new service needs to burst through the monolith. The functionality has already been rewritten into a new service. How do you go back now and ask for time to address security and the other missing pieces? Worse, the missing pieces are usually outside of the team’s control; security is one area you want to leave to the experts.
Even if you get past all the problems on your side, you’ve created new composition complexities for the client. Now the client has to create a new connection to the Chestburster and handle routing themselves. Can you make your clients update? Should you?
Remember The Strangler
If you want to break apart a monolith, it’s always a good idea to start with a Strangler. If you can’t set up a Strangler on your existing monolith, you aren’t ready to start breaking it apart.
That doesn’t mean you’re stuck with the current functionality!
If you have the time and resources to extract the code into a new service, you have the time and resources to decouple the code inside of the monolith. When the time comes to decompose into services, you’ll be ready.
The Chestburster gives the illusion of quick progress, but it quickly stalls as the team runs into problems they can’t control. Overcoming the technical hurdles doesn’t guarantee that clients will ever update their integration.
Success in legacy system replacement comes from integrating first and moving functionality second. With the Chestburster you move functionality first and probably never burst through.
Giving users the ability to define their own searches, data segmentation and processes creates a lot of value for a SaaS. The User Defined parts of the codebase are also always going to contain the most “interesting” performance and scaling problems, as users assemble the pieces in beautiful, powerful and mind-boggling ways.
It’s not a bug, it’s performance
Performance bugs aren’t traditional bugs. The code does come up with the right answer, eventually. But when your clients think your system is slow, they don’t care why. Whether the code does too much work, can’t run in parallel, or lets the customer shoot themselves in the foot, it’s all bugs to your clients.
You need to care about why, because knowing why is what lets you make things better.
Run to a Performance Runbook
A performance runbook can be nothing more than a list of tips and tricks for dealing with issues in User Generated land. Because the problems aren’t bugs, they won’t leave obvious errors in the logs. They require developing specialized techniques, tools, and pattern matching.
By writing down your debugging techniques, a runbook will help you diagnose problems faster.
Reduce Everyone’s Mental Load
Performance issues manifest everywhere in a tech stack. The issues that a client is noticing are often far removed from the bottleneck.
Having a centralized place to document issue triaging reduces the mental load on everyone in your organization. Where do we start looking? What’s that query? A runbook helps you with those first common steps.
Support gets help with common trouble areas and basic solutions. Listening to a client explain an issue and not being able to do anything but escalate is demoralizing for everyone involved. Every issue support can fix improves the experience for the client and support. Even something as simple as improving the questions support asks the client will pay off in time saved.
When senior support and developers are called in, they know that all the common solutions have been tried. The basic questions have been asked and the data gathered. They can skip the basics and move on to the more powerful tools and queries, saving everyone’s time. New diagnoses and solutions go into the runbook, making support more powerful.
The common questions and common solutions become automation targets. You can proactively tell a client that they’re using the system “wrong”, and send them help and training materials. The best support solutions are when you reach out to the client before they even realize they have a problem.
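As a sketch of what that automation can look like (the metric names and thresholds below are illustrative assumptions, not values from any real runbook), a periodic job can scan per-client usage metrics and flag accounts for outreach before anyone opens a ticket:

```python
# Sketch: turn runbook knowledge into proactive outreach. The metric
# names and thresholds are hypothetical, chosen only for illustration.

SLOW_SEARCH_P95_MS = 5000   # "searches feel slow" threshold
SEGMENT_LIMIT = 200         # "too many user-defined segments" threshold

def clients_to_contact(metrics_by_client):
    """Return (client_id, suggested_action) pairs worth a proactive email."""
    flagged = []
    for client_id, m in sorted(metrics_by_client.items()):
        if m.get("p95_search_ms", 0) > SLOW_SEARCH_P95_MS:
            flagged.append((client_id, "send query-design tips"))
        elif m.get("segment_count", 0) > SEGMENT_LIMIT:
            flagged.append((client_id, "send segmentation training"))
    return flagged
```

Each rule here is just a runbook entry ("when searches are slow, check the client’s query design") rewritten as a condition, which is why the runbook is the natural place to mine for automation targets.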
6 Questions To Start A Runbook
Common solutions to common problems? Training? Proactive alerting? Sounds great, but daunting.
Runbooks are living documents. The days when they were printed and bound into manuals ended decades ago.
Talk to the developer who fixed the last issue:
What did they look for in the logs?
What queries did they run?
What did they find?
How did they resolve the issue?
Write down the answers. Repeat every time there’s a performance issue.
After a few incidents, patterns should emerge.
Bring what you’ve got to your support managers and ask:
Could support have done any of the investigative work?
If support had the answer, could they have resolved the issue?
Help train support on what they can do, and create tools for the useful things support can’t do on their own.
Every time a problem gets escalated, that’s a chance to iterate and improve.
Conclusion – Runbooks Help Everyone
Building a performance runbook sounds a lot like accepting performance problems and working on mitigation.
Instead, it is about surfacing the performance problems faster, finding the commonalities, and fixing the underlying system.
Along the way the runbook improves the client experience, empowers support, and reduces the support load on developers.
Rather than Legacy System Rescue, I was hired to do “keep the lights on” work. The company had a 3 developer team working on a next generation system, all I had to do was to keep things running as smoothly as possible until they delivered.
The legacy system was buckling under the weight of their current customers. Potential customers were waiting in line to give them money, and had to be turned away. Active customers were churning because the platform couldn’t keep up.
That’s when I realized – Legacy System Rescue may grudgingly get a single developer, but Scaling gets three developers to scale and one to keep the lights on. Scaling is an expensive problem because it involves churning existing customers and turning away new ones.
Over 10 months I iteratively rescued the legacy system by fixing bugs and removing choke points. After an investment of over 50 developer-months, the next generation system was completely scrapped.
The Lesson – Companies won’t pay to rescue a legacy system, but they’d gladly pay 4x to scaleup and meet demand.
For SaaS with a pure Single Tenant model, infrastructure consolidation usually drives the first two steps towards a Multi-Tenant model: converting the front end servers to Multi-Tenant, and switching the client databases from physical to logical isolation. These two steps are usually taken nearly simultaneously as a SaaS grows beyond a handful of clients, infrastructure costs skyrocket, and things become unmanageable.
Considering the 5 factors laid out in the introduction and addendum – complexity, security, scalability, consistent performance, and synergy – this move greatly increases scalability, at the cost of increased complexity, decreased security, and opening the door to consistent performance problems. Synergy is not immediately impacted, but these changes make adding Synergy at a later date much easier.
Why is this such an early move when it has 3 negative factors and only 1 positive? Because pure Single Tenant designs have nearly insurmountable scalability problems, and these two changes are the fastest, most obvious and most cost effective solution.
Shifting from Single Tenant servers and databases to Multi-Tenant slightly increases software complexity in exchange for massively decreasing platform complexity.
The web servers need to be able to understand which client a request is for, usually through subdomains like client.mySaaS.com, and use that knowledge to validate the user and connect to the correct database to retrieve data.
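In practice the routing layer boils down to mapping the subdomain to a tenant record and its connection settings. A minimal sketch, where the tenant registry and its field names are hypothetical stand-ins for whatever tenant store a real SaaS would use:

```python
# Sketch: route a request to the right client schema based on the
# subdomain in the Host header. The registry below is hypothetical.

TENANTS = {
    "acme":   {"db_host": "db-pool-1.internal", "schema": "acme"},
    "globex": {"db_host": "db-pool-2.internal", "schema": "globex"},
}

def tenant_for_host(host):
    """Map a host like 'acme.mySaaS.com' to acme's connection settings."""
    subdomain = host.split(".", 1)[0].lower()
    try:
        return TENANTS[subdomain]
    except KeyError:
        # Unknown subdomain: reject rather than guess, so a bad Host
        # header can never be routed to another client's data.
        raise ValueError("unknown tenant: " + subdomain)
```

The lookup itself is trivial; the care goes into the failure path, since misrouting a request is exactly the valid-session-wrong-account risk described below.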
The difficult and risky part here is making sure that valid sessions stay associated with the correct account.
Database server consolidation tends to be less tricky. Most database servers support multiple schemas with their own credentials and logical isolation. Logical separation provides unique connection settings for the web servers. Individual client logins are restricted to the client’s schema and the SaaS developers do not need to treat logical and physical separation any differently.
Migrations and Versioning Become Expensive
The biggest database problems with a many-to-many design crop up during migrations. Inevitably, web and database changes will be incompatible between versions. Some SaaS models require all clients on the same version, which limits compatibility issues to the release window (which itself can take days), while other models allow clients to be on different versions for years.
The general solution to the problem of long lived versions is to stand up a pool of web and database servers on the new version, migrate clients to the new pool, and update request routing.
The biggest risk around these changes is database secret handling; every server can now connect to every database. Compromising a single server becomes a vector for exposing data from multiple clients. This risk can be limited by proxy layers that keep database connections away from public facing web servers. Still, a compromised server is now a risk to multiple clients.
Changing from physical to logical database separation is less risky. Each client will still be logically separated with their own schema, and permissioning should make it impossible to do queries across multiple clients.
Scalability is the goal of Multi-Tenant Infrastructure Consolidation.
In addition to helping the SaaS, the consolidation will also help clients. Shared server pools will increase stability and uptime by providing access to a much larger group of active servers. The client also benefits from having more servers and more slack, making it much easier for the SaaS to absorb bursts in client activity.
Likewise, running multiple clients on larger database clusters generally increases uptime and provides slack for bursts and spikes.
These changes only impact response times when the single tenant setup would have been overwhelmed. The minimum response times don’t change, but the maximum response times get lower and occur less frequently.
The flip side to the tenancy change is the introduction of the Noisy Neighbor problem. This mostly impacts the database layer and occurs when large clients overwhelm the database servers and drown out resources for smaller clients.
This can be especially frustrating to clients because it can happen at any time, last for an unknown period, and there’s no warning or notification. Things “get slow” and there are no guarantees about how often clients are impacted, notice, or complain.
There is no direct Synergy impact from changing the web and database servers.
A SaaS starting from a pure Single Tenant model is not pursuing Synergy, otherwise the initial model would have been Multi-Tenant.
Placing distinct client schemas onto a single server does open the door to future Synergy work. Working with data in SQL across different schemas on the same server is much easier than working across physical servers. The work would still require changing the security model and writing quite a bit of code. There is now a doorway if the SaaS has a reason to walk through.
As discussed in the introduction, a SaaS may begin with a purely Single Tenant model for several reasons. High infrastructure bills and poor resource utilization will quickly drive an Infrastructure Consolidation to Multi-Tenant servers and logically separated databases.
The exceptions to this rule are SaaS that have a few very large clients, or clients with high security requirements. These SaaS will have to price and market themselves accordingly.
Infrastructure Consolidation is an early driver away from a pure Single Tenant model to Multi-Tenancy. The change is mostly positive for clients, but does add additional security and client satisfaction risks.
In Part 1 of Making Link Tracking Scale I showed how switching event recording from synchronous to asynchronous processing creates a superior, faster and more consistent user experience. In Part 2, I will discuss how Link Tracking scaling issues are governed by Long Tails, and how to overcome the initial burst using edge caching and tools like Memcached and Redis.
The Long Tails of Link Tracking
When your client sends an email campaign or publishes new content, your link tracker will experience a giant burst of activity, which then quickly decays.
To illustrate with some numbers, imagine an email blast that results in 100,000 link tracking events. 80% of those will occur in the first hour.
In our original design from Part 1, that would mean 22 URL lookups and 22 inserts per second.
For simplicity, pretend that inserts and selects produce similar db load. Your system would need to support 44 events/s to avoid slowdowns and frustrating your clients.
The asynchronous model reduces the load to 22 URL lookups/s and a controllable number of inserts. Again for simplicity, let’s go with 8 inserts/s, for a total of 30 events/s. That’s roughly a 1/3 reduction in load!
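Spelling out the arithmetic behind those numbers, under the same simplifying assumptions:

```python
# Worked numbers from the example: 100,000 tracking events,
# 80% of which arrive in the first hour.
events = 100_000
first_hour_events = int(events * 0.80)     # 80,000 events
lookups_per_s = first_hour_events // 3600  # ~22 URL lookups/s

sync_load = lookups_per_s * 2              # lookup + insert per event: 44 events/s
async_load = lookups_per_s + 8             # lookups + throttled inserts: 30 events/s

reduction = 1 - async_load / sync_load     # ~0.32, roughly a 1/3 cut
```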
But, your system is still looking up the Original URL 22 times/s. That’s a lot of unnecessary db load.
Edge Caching The Original URL
The Original URL is static data that can be cached on the web server instead of loaded from the database for each event. Instead, each server would retrieve the Original URL from the db once, store it in memory, and reuse it as needed.
This effectively drops the lookup rate from 22 events/s to 0 events/s, reducing the db load to 8 events/s, a 73% drop! Combined with the asynchronous processing improvements from Part 1, that’s an over 80% reduction in max database load.
Edge Caching on the servers works for a while, but as your clients expand the number of URLs you’ll need to keep track of won’t fit in server memory. At that point you’ll need to add in tools like Memcached or Redis. Like web servers, these tools are a lot cheaper than scaling your database.
Consistent Load on the Database
The great thing about this design is that you can keep the db load consistent, regardless of the incoming traffic. Whether the load is 44 events/s or 100 events/s, you control the rate of asynchronous processing. So long as you have room on your servers for an internal queue, or if you use an external queue like RabbitMQ or SQS, you can delay processing the events.
Scaling questions become discussions about cost and how quickly your clients need to see results.
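The rate control itself can be as simple as draining the queue in fixed-size batches on a timer. A sketch using Python’s standard `queue` module, where the batch size is the knob you tune against cost (in production a scheduler would run one batch per interval, say 8 inserts once a second):

```python
import queue

# Sketch: cap insert load by draining the event queue in fixed-size
# batches, no matter how fast events arrive.

def drain_in_batches(events, batch_size):
    """Yield lists of at most batch_size events until the queue is empty."""
    while True:
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(events.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return
        yield batch
```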
Caching static data is a great way to reduce database load. You can use prebuilt libraries like Guava for Java, cacheout for Python, or dozens of others. You can also leverage distributed cache systems like Memcached and Redis. While there’s no such thing as a free lunch, web servers and distributed caches are much much cheaper to scale than databases.
You’ll save money and deliver a superior experience to your clients and their users!
Scaling Bugs don’t really exist; you will never find “unable to scale” in your logs. Scaling bugs are timing, concurrency and reliability bugs that emerge as your system scales. Today I’m going to show you 4 signs that your system is being plagued by scaling bugs, and 4 things you can do to buy time and minimize your clients’ pain.
Scaling bugs boil down to “Something that used to be reliable is no longer reliable and your code doesn’t handle the failure gracefully”. This means that they are going to appear in the oldest parts of your codebase, be inconsistent, bursty, and hit your most valuable clients the hardest.
Scaling Bugs appear in older, stable, parts of your codebase
The oldest parts of your codebase are typically the most stable; that’s how they managed to get old. But the code was also written with lower performance needs and higher reliability expectations.
Reliability bugs can lay dormant for years, emerging where you least expect it. I once spent an entire week finding a bug deep in code that hadn’t changed in 10 years. As long as there were no problems, everything was fine, but a database connection hiccup in one specific function would cause a cascading failure on a distributed task being processed on over 30 servers.
Database connectivity is ridiculously stable these days; you can have hundreds of servers and go weeks without an issue. Unless your databases are overloaded, and that’s when the bug struck.
Scaling Bugs Are Inconsistent
Sometimes the system has trouble, sometimes things are fine. Even more perplexing is that they occur regardless of multi-threading or the statefulness of your code.
This makes scaling bugs difficult to find, since you’ll never be able to reproduce them locally. They won’t appear for a single test execution, only when you have hundreds or thousands of events happening simultaneously.
Even if your code is single threaded and stateless, your system is multi-process and has state. A serverless design still has scaling bottlenecks at the persistence layer.
Scaling Bugs Are Bursty
Bursty means that the bugs appear in clusters, usually in ever increasing numbers after ever shorter intervals. Initially the error crops up once every few weeks and does minimal damage, so it gets documented as low priority and never worked on. As your platform scales though, the error starts popping up 5 at a time every few days, then dozens of times a day. Eventually the low priority, low impact bug becomes an extremely expensive support problem.
Scaling Bugs Hit Your Most Valuable Clients Hardest
Which are the clients with the most contacts in a CRM? Which are the ones with the most emails? The most traffic and activity?
The same ones paying the most for the privilege of pushing your platform to the limit.
The impact of scaling bugs falls mostly on your most valuable clients, which makes the potential damage high in dollar terms.
Four ways to buy time
These tactics aren’t solutions, they are ways to buy time to transform your system to one that operates at scale. I’ll cover some scaling tactics in a future post!
Throw money at the problem
There’s never a better time to throw money at a problem than in the early stages of scaling problems! More clients + larger clients = more dollars available.
Increase the number of servers, upgrade the databases, and increase your network throughput. If you have a multi-tenant setup, add shards and decrease the number of customers running on the same hardware.
If throwing money at the problem helps, then you know you have scaling problems. You can also get a rough estimate of the time-for-money runway. If the improved infrastructure doesn’t help you can downgrade everything and stop spending the extra money.
Keep your Error Rate Low
The first time you notice a scaling bug is often when it causes a cascading system failure. However, it’s rare for that to be the first time the bug manifested itself. Resolving those rare, low priority bugs is key to keeping catastrophic scaling bugs at bay.
I once worked on a system that ran at over 1 million events per second (100 billion/day). We had a saying: The nice thing about this system is that something that’s 1 in a million happens 60 times a minute. The only known error we let stand: Servers would always fail to process the first event after a restart.
Add Retry Logic
As load and scale increase, transient errors become more common. Take a design cue from RESTful systems and add retry logic. Most modern databases support upsert operations, which go a long way towards making it safe to retry inserts.
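Sketched with SQLite’s upsert syntax (available since SQLite 3.24; the table and retry policy are illustrative), a retried insert stays safe because replaying it cannot create duplicate rows:

```python
import sqlite3

def record_event(conn, event_id, payload, attempts=3):
    """Insert an event, retrying on transient errors.

    The upsert makes the write idempotent: retrying after an ambiguous
    failure (did the first attempt commit?) can't create a duplicate row.
    """
    for attempt in range(attempts):
        try:
            conn.execute(
                "INSERT INTO events (id, payload) VALUES (?, ?) "
                "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload",
                (event_id, payload),
            )
            conn.commit()
            return
        except sqlite3.OperationalError:
            if attempt == attempts - 1:
                raise

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id TEXT PRIMARY KEY, payload TEXT)")
record_event(conn, "e1", "first")
record_event(conn, "e1", "retry")   # a replay of the same event is harmless
```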
Process Asynchronously
Most actions don’t need to be processed synchronously. Switching to asynchronous processing makes many scaling bugs disappear for a while because the apparent processing speed greatly improves. You still have to do the processing work, and the overall latency of your system may increase. Slowly and reliably processing everything successfully is greatly preferable to praying that everything processes quickly.
Congratulations! You Have Scaling Problems!
Scaling bugs only hit systems that get used. Take solace in the fact that you have something people want to use.
The techniques in this article will help you buy enough time to work up a plan to scale your system. Analyze your scaling pain points to gain insight into which parts of your system are most useful to your clients and prioritize your refactoring accordingly.
Remember that there are always ways to scale your current system without resorting to a total rewrite!