Run to A Runbook

Giving users the ability to define their own searches, data segmentation and processes creates a lot of value for a SaaS.  The User Defined Parts of the codebase are also always going to contain the most “interesting” performance and scaling problems as users assemble the pieces into beautiful, powerful and mind boggling ways.

It’s not a bug, it’s performance

Performance bugs aren’t traditional bugs.  The code does come up with the right answer, eventually.  But when your clients think your system is slow, they don’t care why.  Whether it does too much work, can’t be run in parallel, or if your system allows the customer to shoot themselves in the foot, it’s all bugs to your clients.

You need to care about why because you get to do something to make things better.

Run to a Performance Runbook

A performance runbook can be nothing more than a list of tips and tricks for dealing with issues in User Generated land.  Because the problems aren’t bugs, they won’t leave obvious errors in the logs.  They require developing specialized techniques, tools, and pattern matching.

By writing down your debugging techniques, a runbook will help you diagnose problems faster.

Reduce Everyone’s Mental Load

Performance issues manifest everywhere in a tech stack.  The issues that a client is noticing are often far removed from the bottleneck.  

Having a centralized place to document issue triaging reduces the mental load on everyone in your organization.  Where do we start looking?  What’s that query?  A runbook helps you with those first common steps.

Support gets help with common trouble areas and basic solutions.  Listening to a client explain an issue and not being able to do anything but escalate is demoralizing for everyone involved.  Every issue support can fix improves the experience for the client and support.  Even something as simple as improving the questions support asks the client will pay off in time saved.

When senior support and developers are called in, they know that all the common solutions have been tried.  The basic questions have been asked and the data gathered.  They can skip the basics and move on to the more powerful tools and queries, saving everyone’s time.  New diagnosis and solutions go into the runbook making support more powerful.

The common questions and common solutions become automation targets.  You can proactively tell a client that they’re using the system “wrong”, and send them help and training materials.  The best support solutions are when you reach out to the client before they even realize they have a problem.

6 Questions To Start A Runbook

Common solutions to common problems?  Training?  Proactive alerting?  Sounds great, but daunting.

Runbooks are living documents.  The days when they were printed and bound into manuals ended decades ago.

Start small.

Talk to the developer who fixed the last issue:

  1. What did they look for in the logs?  
  2. What queries did they run?  
  3. What did they find? 
  4. How did they resolve the issue?

Write down the answers.  Repeat every time there’s a performance issue.

After a few incidents, patterns should emerge.

Bring what you’ve got to your support managers and ask:

  1. Could support have done any of the investigative work?
  2. If support had the answer, could they have resolved the issue? 

Help train support on what they can do, create tools for useful things support can’t do on their own.

Every time a problem gets escalated, that’s a chance to iterate and improve.

Conclusion – Runbooks Help Everyone

Building a performance runbook sounds a lot like accepting performance problems and working on mitigation.

Instead, it is about surfacing the performance problems faster, finding the commonalities, and fixing the underlying system.

Along the way the runbook improves the client experience, empowers support, and reduces the support load on developers.

Everyone wins when you run to a runbook!

Scaling is Legacy System Rescue That Pays 4x

In my last article, You Won’t Pay Me to Rescue Your Legacy System, I talked about my original attempt at specializing, and why it didn’t work.  I bumbled along until I lucked into a client that helped me understand when Legacy System Rescue becomes an Expensive Problem.

Rather than Legacy System Rescue, I was hired to do “keep the lights on” work.  The company had a 3 developer team working on a next generation system, all I had to do was to keep things running as smoothly as possible until they delivered.

The legacy system was buckling under the weight of their current customers.  Potential customers were waiting in line to give them money, and had to be turned.  Active customers were churning because the platform was buckling.

That’s when I realized – Legacy System Rescue may grudgingly get a single developer, but Scaling gets three developers to scale and one to keep the lights on.  Scaling is an expensive problem because it involves churning existing customers and turning away new ones.

Over 10 months I iteratively rescued the legacy system by fixing bugs and removing choke points.  After investing over 50 developer months, the next generation system was completely scrapped.

The Lesson – Companies won’t pay to rescue a legacy system, but they’d gladly pay 4x to scaleup and meet demand.

Tenancy Model Roundup

Over the past few months I have been ruminating on SaaS Tenancy Models and how they drive architectural decisions.  I hope you’ve enjoyed the series as I’ve scratched my itch.

Here is a roundup of the 7 articles In case you missed any of the parts, or need a handy index to what I’m sure is the most in depth discussion of SaaS Tenancy Models ever written.

Part 1 – An introduction to SaaS Tenancy Models

Part 2 – An addendum to the introduction

Part 3 – How growth and scale drive tenancy model changes

Part 4 – Regaining Effective Single Tenancy through Cell Isolation

Part 5 – Why your job service should be Multi-Tenant even if your model is Single Tenant

Part 6 – Whose data is it anyway, why you need to separate your SaaS’s data from your clients

Part 7 – 3 Signs your resource allocation model is working against you

Infrastructure Consolidation Drives Early Tenancy Migrations

For SaaS with a pure Single Tenant model, infrastructure consolidation usually drives the first two, nearly simultaneous, steps towards a Multi-Tenant model.  The two steps convert the front end servers to be Multi-Tenant and switch the client databases from physical to logical isolation.  These two steps are usually done nearly simultaneously as a SaaS grows beyond a handful of clients, infrastructure costs skyrocket and things become unmanageable.

Diagram of a single tenant architecture become multi-tenant

Considering the 5 factors laid out in the introduction and addendumcomplexity, security, scalability, consistent performance, and synergy this move greatly increases scalability, at the cost of increased complexity, decreased security, and opening the door to consistent performance problems.  Synergy is not immediately impacted, but these changes make adding Synergy at a later date much easier.

Why is this such an early move when it has 3 negative factors and only 1 positive?  Because pure Single Tenant designs have nearly insurmountable scalability problems, and these two changes are the fastest, most obvious and most cost effective solution.

Complexity 

Shifting from Single Tenant servers and databases to Multi-Tenant slightly increases software complexity in exchange for massively decreasing platform complexity.

The web servers need to be able to understand which client a request is for, usually through sub domains like client.mySaaS.com, and use that knowledge to validate the user and connect to the correct database to retrieve data.

Increased complexity from consolidation

The difficult and risky part here is making sure that valid sessions stay associated with the correct account.  

Database server consolidation tends to be less tricky.  Most database servers support multiple schemas with their own credentials and logical isolation.  Logical separation provides unique connection settings for the web servers.  Individual client logins are restricted to the client’s schema and the SaaS developers do not need to treat logical and physical separation any differently.

Migrations and Versioning Become Expensive

The biggest database problems with a many-to-many design crop up during migrations.  Inevitably, web and database changes will be incomparable between versions.  Some SaaS models require all clients on the same version, which limits comparability issues to the release window (which itself can take days), while other models allow clients to be on different versions for years.

Versioning and Migration Diagram

The general solution to the problem of long lived versions is to stand up a pool of web and database servers on the new version, migrate clients to the new pool, and update request routing.

Security

The biggest risk around these changes is database secret handling; every server can now connect to every database.  Compromising a single server becomes a vector for exposing data from multiple clients.  This risk can be limited by proxy layers that keep database connections away from public facing web servers.  Still a compromised server is now a risk to multiple clients.

Changing from physical to logical database separation is less risky.  Each client will still be logically separated with their own schema, and permissioning should make it impossible to do queries across multiple clients.

Scalability

Scalability is the goal of Multi-Tenant Infrastructure Consolidation.

In addition to helping the SaaS, the consolidation will also help clients.  Shared server pools will increase stability and uptime by providing access to a much larger group of active servers.  The client also benefits from having more servers and more slack, making it much easier for the SaaS to absorb bursts in client activity.

Likewise, running multiple clients on larger database clusters generally increases uptime and provides slack for bursts and spikes.

These changes only impact response times when the single tenant setup would have been overwhelmed.  The minimum response times don’t change, but the maximum response times get lower and occur less frequently.

Consistent Performance

The flip side to the tenancy change is the introduction of the Noisy Neighbor problem.  This mostly impacts the database layer and occurs when large clients overwhelm the database servers and drown out resources for smaller clients.

This can be especially frustrating to clients because it can happen at any time, last for an unknown period, and there’s no warning or notification.  Things “get slow” and there are no guarantees about how often clients are impacted, notice, or complain.

Synergy

There is no direct Synergy impact from changing the web and database servers.

A SaaS starting from a pure Single Tenant model is not pursuing Synergy, otherwise the initial model would have been Multi-Tenant.

Placing distinct client schemas onto a single server does open the door to future Synergy work.  Working with data in SQL across different schemas on the same server is much easier than working across physical servers.  The work would still require changing the security model and writing quite a bit of code.  There is now a doorway if the SaaS has a reason to walk through.

Conclusion

As discussed in the introduction, a SaaS may begin with a purely Single Tenant model for several reasons.  High infrastructure bills and poor resource utilization will quickly drive an Infrastructure Consolidation to Multi-Tenant servers and logically separated databases.

The exceptions to this rule are SaaS that have few very large clients or clients with high security requirements.  These SaaS will have to price and market themselves accordingly.

Infrastructure Consolidation is an early driver away from a pure Single Tenant model to Multi-Tenancy. The change is mostly positive for clients, but does add additional security and client satisfaction risks.

If you are enjoying this series, please subscribe to my mailing list so that you don’t miss an installment!

Making Link Tracking Scale – Part 2 Edge Caching

In Part 1 of Making Link Tracking Scale I showed how switching event recording from synchronous to asynchronous processing creates a superior, faster and more consistent user experience.  In Part 2, I will discuss how Link Tracking scaling issues are governed by Long Tails, and how to overcome the initial burst using edge caching and tools like Memcache and Redis.

The Long Tails of Link Tracking

When your client sends an email campaign, publishes new content your link tracker will experience a giant burst of activity, which will quickly decay like this:

To illustrate with some numbers, imagine an email blast that results in 100,000 link tracking events.  80% of those will occur in the first hour.  

In our original design from Part 1, that would 22 URL lookups, and 22 inserts per second:

For simplicity, pretend that inserts and selects produce similar db load.  Your system would need to support 44 events/s to avoid slowdowns and frustrating your clients.

The asynchronous model:

Reduces the load to 22 URL lookups, and a controllable number of inserts.  Again for simplicity let’s go with 8 inserts/s, for a total of 30 events/s.  That’s a 1/3 reduction in load!

But, your system is still looking up the Original URL 22 times/s.  That’s a lot of unnecessary db load.

Edge Caching The Original URL

The Original URL is static data that can be cached on the web server instead of loaded from the database for each event.  Instead, each server would retrieve the Original URL from the db once, store it in memory, and reuse it as needed.

This effectively drops the lookup rate from 22 events/s to 0 events/s, reducing the db load to 8 events/s, a 55% drop!  Combined with the asynchronous processing improvements from Part 1, that’s an 80% reduction in max database load.

Edge Caching on the servers works for a while, but as your clients expand the number of URLs you’ll need to keep track of won’t fit in server memory.  At that point you’ll need to add in tools like Memcached or Redis.  Like web servers, these tools are a lot cheaper than scaling your database.

Consistent Load on the Database

The great thing about this design is that you can keep the db load consistent, regardless of the incoming traffic.  Whether the load is 44 events/s or 100 events/s you control the rate of asynchronous processing. So long as you have room on your servers for an internal queue, or if you use an external queue like RabbitMQ or SQS you can delay processing the events.

Scaling questions become discussions about cost and how quickly your clients need to see results.

Conclusion

Caching static data is a great way to reduce database load.  You can use prebuilt libraries like Guava for Java, cacheout for Python, or dozens of others.  You can also leverage distributed cache systems like Memcached and Redis. While there’s no such thing as a free lunch, web servers and distributed caches are much much cheaper to scale than databases. 

You’ll save money and deliver a superior experience to your clients and their users!

Four ways Scaling Bugs are Different

Scaling Bugs don’t really exist, you will never find “unable to scale” in your logs.  Scaling bugs are timing, concurrency and reliability bugs that emerge as your system scales.  Today I’m going to show you 4 signs that your system is being plagued by scaling bugs, and 4 things you can do to buy time and minimize your client’s pain.

Scaling bugs boil down to “Something that used to be reliable is no longer reliable and your code doesn’t handle the failure gracefully”.  This means that they are going to appear in the oldest parts of your codebase, be inconsistent, bursty, and hit your most valuable clients the hardest.

Scaling Bugs appear in older, stable, parts of your codebase

The oldest parts of your are typically the most stable, that’s how they managed to get old.  But, the code was also written with lower performance needs and higher reliability expectations.

Reliability bugs can lay dormant for years, emerging where you least expect it.  I once spent an entire week finding a bug deep in code that hadn’t changed in 10 years.  As long as there were no problems, everything was fine, but a database connection hiccup in one specific function would cause a cascading failure on a distributed task being processed on over 30 servers.

Database connectivity is ridiculously stable these days, you can have hundreds of servers and go weeks without an issue.  Unless your databases are overloaded, and that’s when the bug struck.

Scaling Bugs Are Inconsistent

Sometimes the system has trouble, sometimes things are fine.  Even more perplexing is that they occur regardless of multi-threading or the statefulness of your code.

This makes scaling bugs difficult to find, since you’ll never be able to reproduce them locally.  They won’t appear for a single test execution, only when you have hundreds or thousands of events happening simultaneously.

Even if your code is single threaded and stateless, your system is multi-process and has state.  A serverless design still has scaling bottlenecks at the persistence layer.

Scaling Bugs Are Bursty

Bursty means that the bugs appear in clusters, usually in ever increasing numbers after ever shorter intervals.  Initially the error crops up once every few weeks and does minimal damage, so it gets documented as low priority and never worked on.  As your platform scales though, the error starts popping up 5 at a time every few days, then dozens of time once a day. Eventually the low priority, low impact bug becomes an extremely expensive support problem.

Scaling Bugs Hit Your Most Valuable Clients Hardest

Which are the clients with the most contacts in a CRM?  Which are the ones with the most emails? The most traffic and activity?

The same ones paying the most for the privilege of pushing your platform to the limit.

The impact of scaling bugs mostly fall on your most valuable clients, which makes their potential impact high in dollar terms.

Four ways to buy time

These tactics aren’t solutions, they are ways to buy time to transform your system to one that operates at scale.   I’ll cover some scaling tactics in a future post!

Throw money at the problem

There’s never a better time to throw money at a problem then the early stages of scaling problems!  More clients + larger clients = more dollars available.

Increase the number of servers, upgrade the databases, and increase your network throughput.  If you have a multi-tenant setup, add shards and decrease the number of customers running on the same hardware.

If throwing money at the problem helps, then you know you have scaling problems.  You can also get a rough estimate of the time-for-money runway. If the improved infrastructure doesn’t help you can downgrade everything and stop spending the extra money.

Keep your Error Rate Low

It’s common for the first time you notice a scaling bug to be when it causes a cascading system failure.  However, it’s rare for that to be the first time the bug manifested itself. Resolving those low priority rare bugs is key to keeping catastrophic scaling bugs at bay.

I once worked on a system that ran at over 1 million events per second (100 billion/day).  We had a saying: The nice thing about this system is that something that’s 1 in a million happens 60 times a minute.  The only known error we let stand: Servers would always fail to process the first event after a restart.

Retries

As load and scale increases, transient errors become more  common. Take a design cue from RESTful systems and add retry logic.  Most modern databases support upsert operations, which go a long way towards making it safe to retry inserts.

Asynchronous Processing

Most actions don’t need to be processed synchronously.  Switching to asynchronous processing makes many scaling bugs disappear for a while because the apparent processing greatly improves.  You still have to do the processing work, and the overall latency of your system may increase. Slowly and reliably processing everything successfully is greatly preferable to praying that everything processes quickly.

Congratulations!  You Have Scaling Problems!

Scaling bugs only hit systems that gets used.  Take solace in the fact that you have something people want to use.

The techniques in this article will help you buy enough time to work up a plan to scale your system.  Analyze your scaling pain points to gain insight into which parts of your system are most useful to your clients and prioritize your refactoring accordingly.

Remember that there are always ways to scale your current system without resorting to a total rewrite!

Your Database is not a Queue

SaaS Scaling anti-patterns: The database as a queue

Using a database as a queue is a natural and organic part of any growing system.  It’s an expedient use of the tools you have on hand. It’s also a subtle mistake that will consume hundreds of thousands of dollars in developer time and countless headaches for the rest of your business.  Let’s walk down the easy path into this mess, and how to carve a way out.

No matter what your business does on the backend, your client facing platform will be some kind of web front end, which means you have web servers and a database.  As your platform grows, you will have work that needs to be done, but doesn’t make sense in an api / ui format. Daily sales reports and end of day reconciliation, are common examples.

Simple Straight Through Processing

The initial developer probably didn’t realize he was building a queue.  The initial version would have been a single table called process which tracked client id, date and completed status.  Your report generator would load a list of active client ids, iterate through them, and write done to the database.

Still pretty simple

Simple, stateful and it works.

For a while.

But, some of your clients are bigger, and there are a lot of them, and the process was taking longer and longer, until it wasn’t finishing overnight.  So to gain concurrency and added worker processes your developers added “Not started” and “in process” states. Thanks to database concurrency guarantees and atomic updates, it only took a few releases to get everything working smoothly with the end-to-end processing time dropping back to something manageable.

Databases are atomic, what could go wrong?

Now the database is a queue and preventing duplicate work.

There’s a list of work, a bunch of workers, and with only a few more days of developer time you can even monitor progress as they chew through the queue.

Except your devs haven’t implemented retry logic because failures are rare.  If the process dies and doesn’t generate a report, then someone, usually support fielding an angry customer call, will notice and ask your developers to stop what they’re doing and restart the process.  No problem, adding code to move “in-process” back to “not started” after some amount of time is only a sprint worth of work.

Except, sometimes, for some reason, some tasks always fail.  So your developers add a counter for retries, and after 5 or so, they set the state to “skip” so that the bad jobs don’t keep sucking up system resources.

Just a few more sprints and we’ll finally have time to add all kinds of new processes!

Congratulations!  For about $100,000 in precious developer time, your SaaS product has a buggy, inefficient, poor scaling implementation of database-as-a-queue.  Probably best not to even try to quantify the opportunity costs.

Solutions like SQS and RabbitMQ are available, effectively free, and take an afternoon to set up.

Instead of worrying about how you got here, a better question is how do you stop throwing good developer resources away and migrate?

Every instance is different, but I find it is easiest to work backwards.

You already have worker code to generate reports.  Have your developers extend the code to accept a job from a queue like SQS in addition to the DB.  In the first iteration, the developers can manually add failed jobs to the queue. Likely you already have a manual retries process; migrate that to use the queue.

Queue is integrated within the existing system

Once you have the code working smoothly with a queue, you can start having the job generator write to the queue instead of the database.  Something magically usually happens at this point. You’ll be amazed at how many new types of jobs your developers will want to implement once new functionality no longer requires a database migration.

Soon, you’ll be able to run your system off the db or a queue, but the db tables will be empty.

Only then do you refactor the db queues out of your codebase.

Queue runs the system

Adding a proper queue system gets your team out of the hole and scratches your developers itch for shiny and new technology.  You get improved functionality after the very first sprint, and aren’t rewriting your code from scratch.

That’s your best alternative to a total rewrite, start today!