The Opposite of Iterative Delivery

Iterative Delivery is a uniquely powerful method for adding value to a SaaS.  Other than iterative, there are no respectably named ways to deliver features, reskins, updates, and bug fixes.  Big bangs, waterfalls, and quarterly releases fill your customers’ hearts with dread.

Look at 5 antonyms for Iterative Delivery:

  • Erratic Delivery
  • Infrequent Delivery
  • Irregular Delivery
  • Overwhelming Delivery
  • Sporadic Delivery

If your customers used these terms to describe updates to your SaaS, would they be wrong?

Iterative Delivery is about delivering pieces of value to your customers often enough that they know you’re improving the Service, but small enough that they barely notice the changes.

Don’t be overwhelming, erratic or infrequent - be iterative and delight your customers.

3 Signs Your Resource Allocation Model Is Working Against You

After 6 posts on SaaS Tenancy Models, I want to bring it back to some concrete examples.  When your SaaS has a Single Tenant model, clients expect to allocate all the resources they need, whenever they want.  When every client is entitled to the entire resource pool, no client gets a great customer experience.

Here are 3 signs your Resource Allocation Model is working against you:

  1. Large clients cause small clients’ work to stall
  2. You have to rebalance the mix of clients in a cell for stability
  3. “Run your job at night for best performance”

Large clients cause small clients’ work to stall

This is a classic “noisy neighbor” problem.  Each client tries to claim all the shared resources needed to do their work.  This isn’t much of a problem when none of the clients need a significant percentage of the pool.  When a large client comes along, it drains the pool, and leaves your small clients flopping like fish out of water.
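One common mitigation, not described in this post but sketched below under stated assumptions (a fixed shared worker pool and a per-client cap; all names and sizes are hypothetical), is to stop any single client from claiming the whole pool:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

POOL_SIZE = 10       # shared worker pool (hypothetical size)
PER_CLIENT_CAP = 4   # no client may hold more than 4 workers at once

executor = ThreadPoolExecutor(max_workers=POOL_SIZE)
_client_slots = {}   # client_id -> Semaphore enforcing the cap
_lock = threading.Lock()

def submit_for_client(client_id, fn, *args):
    """Submit work, blocking if the client already holds its full share."""
    with _lock:
        sem = _client_slots.setdefault(
            client_id, threading.Semaphore(PER_CLIENT_CAP))
    sem.acquire()  # a large client queues here instead of draining the pool

    def wrapped():
        try:
            return fn(*args)
        finally:
            sem.release()  # free the slot for the client's next job

    return executor.submit(wrapped)
```

With a cap in place, a burst from a large client waits on its own semaphore while small clients keep their share of the pool.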

You have to rebalance the mix of clients in a cell for stability

When having multiple large clients in a cell affects stability, the short term solution is to migrate some clients to another cell.  Large clients can impact performance, but they should not be able to impact stability.  Moving clients around buys you time, but it also forces you to focus on smaller, less profitable clients.

Run your job at night for best performance

This is advice that often pops up on SaaS message boards.  Don’t try to run your job during the day, schedule it to run in the evening so it is ready for the morning.  When clients start posting workarounds to your problems, it’s a clear sign of frustration.  Your clients are noticing that performance varies by the time of day.  They are building mental models of your platform and deciding you have load and scale issues.  By being helpful to each other, your clients are advertising your problems.

Conclusion

These 3 issues have the same root cause: your SaaS’s operational data is mixed in with client data.  If you have any of these three problems, the time has come to separate your data from the clients’.

Fixing these problems won’t be easy or quick!  

The good news is that you can separate the data and change your resource allocation model in an iterative fashion.  Start by pushing your job service across the tenancy line.

Get value and regain control one incremental step at a time, and never do a rewrite!

Cell Based Single Tenancy

This is part 4 in a series on SaaS Tenancy Models.  Parts 1, 2, and 3.

SaaS companies are often approached by potential clients who want their instance to be completely separate from any other client.  Sometimes the request is driven by legal requirements (primarily healthcare and defense), sometimes it is a desire for enhanced security.

Often, running a Multi-Tenant service with a single client will satisfy the client’s needs.  Clients are often willing to pay for the privilege of having their account run Single Tenant, making it a potentially lucrative option for a SaaS.

What is a Cell?

A Cell is an independent instance of a SaaS’s software setup.  This is different from having software running in multiple datacenters or even multiple continents.  If the services talk to each other, they are in the same Cell regardless of physical location.

Cells can differ with the number and power of servers and databases.  Cells can even have entirely different caching options depending on need.

The 3 most common Cell setups are Production, Staging (or Test), and Local.

Cell Properties

Cell architecture comes with a few distinct properties:

  • Cell structures allow SaaS to grow internationally and offer clients low latency and localized data policies (think GDPR).  Latency from the US to Europe, Asia and South America is noticeable and degrades the client experience.
  • Clients exist in 1 cell at a time.  They can migrate, but they can’t exist in multiple cells.
  • Generally speaking, Cells cannot be part of a disaster recovery plan.  Switching clients between Cells usually involves copying the database, which can’t be done if the client’s original Cell is down.

Cell Isolation as a Single Tenant Option

In part 3 I covered the difficulties in operating in a true Single Tenant model at scale.  A Cell with a single client effectively recreates the Single Tenancy experience.

Few clients want this level of isolation, but those that need it are prepared to pay for the extra infrastructure costs of an additional Cell.

Conclusion

For SaaS without global services, a Cell model enables a mix of clients on logically separated Multi-Tenant infrastructure and clients with effectively Single Tenant infrastructure.  This allows the company to pursue clients with Single Tenant needs, and the higher price point they offer.

The catch is that Single Tenant Cells can’t exist in an architecture with global services.  If there is a single service that must have access to all client data, Single Tenant Cells are out.


If you are enjoying this series, consider subscribing to my mailing list (https://shermanonsoftware.com/subscribe/) so that you don’t miss an installment!

Infrastructure Consolidation Drives Early Tenancy Migrations

For SaaS with a pure Single Tenant model, infrastructure consolidation usually drives the first two, nearly simultaneous, steps towards a Multi-Tenant model: converting the front end servers to Multi-Tenant and switching the client databases from physical to logical isolation.  These steps usually happen together as a SaaS grows beyond a handful of clients, infrastructure costs skyrocket, and operations become unmanageable.

Diagram of a single tenant architecture becoming multi-tenant

Considering the 5 factors laid out in the introduction and addendum - complexity, security, scalability, consistent performance, and synergy - this move greatly increases scalability, at the cost of increased complexity, decreased security, and opening the door to consistent performance problems.  Synergy is not immediately impacted, but these changes make adding Synergy at a later date much easier.

Why is this such an early move when it has 3 negative factors and only 1 positive?  Because pure Single Tenant designs have nearly insurmountable scalability problems, and these two changes are the fastest, most obvious and most cost effective solution.

Complexity 

Shifting from Single Tenant servers and databases to Multi-Tenant slightly increases software complexity in exchange for massively decreasing platform complexity.

The web servers need to understand which client a request is for, usually through subdomains like client.mySaaS.com, and use that knowledge to validate the user and connect to the correct database to retrieve data.
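As a rough sketch, the request handling might resolve the tenant from the subdomain before touching any data (the client names, DSNs, and helper functions here are illustrative assumptions, not a specific framework’s API):

```python
# Map each client subdomain to its own database settings (made-up examples).
DB_SETTINGS = {
    "acme":   {"dsn": "postgres://db-1.mySaaS.internal/acme"},
    "globex": {"dsn": "postgres://db-1.mySaaS.internal/globex"},
}

def resolve_tenant(host):
    """Extract the client from a host like 'acme.mySaaS.com'."""
    subdomain = host.split(".")[0].lower()
    if subdomain not in DB_SETTINGS:
        raise KeyError("unknown tenant: " + subdomain)
    return subdomain

def connection_settings(host):
    """Every request is validated against, and routed to, one client's db."""
    return DB_SETTINGS[resolve_tenant(host)]
```

The important property is that the tenant is resolved once, up front, and everything downstream works from that single answer.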

Increased complexity from consolidation

The difficult and risky part here is making sure that valid sessions stay associated with the correct account.  

Database server consolidation tends to be less tricky.  Most database servers support multiple schemas with their own credentials and logical isolation.  Logical separation provides unique connection settings for the web servers.  Individual client logins are restricted to the client’s schema and the SaaS developers do not need to treat logical and physical separation any differently.
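A sketch of that provisioning, assuming PostgreSQL-style schemas and roles (the naming convention and exact grants are illustrative, not prescriptive):

```python
def provision_client_schema(client):
    """Generate DDL that logically isolates one client on a shared server."""
    schema = f"client_{client}"
    role = f"{schema}_login"
    return [
        f"CREATE SCHEMA {schema};",
        f"CREATE ROLE {role} LOGIN;",
        # The login can only see its own schema...
        f"GRANT USAGE ON SCHEMA {schema} TO {role};",
        # ...and unqualified table names resolve there by default.
        f"ALTER ROLE {role} SET search_path = {schema};",
    ]
```

Each client gets its own credentials and its own namespace, so application code written against physical separation keeps working against logical separation.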

Migrations and Versioning Become Expensive

The biggest database problems with a many-to-many design crop up during migrations.  Inevitably, web and database changes will be incompatible between versions.  Some SaaS models require all clients to be on the same version, which limits compatibility issues to the release window (which itself can take days), while other models allow clients to be on different versions for years.

Versioning and Migration Diagram

The general solution to the problem of long lived versions is to stand up a pool of web and database servers on the new version, migrate clients to the new pool, and update request routing.
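That routing update can be as simple as flipping an entry in a client-to-version map (a toy sketch; the pool layout and names are hypothetical):

```python
# Which version each client runs, and which server pool serves each version.
CLIENT_VERSIONS = {"acme": "v1", "globex": "v2"}
VERSION_POOLS = {
    "v1": ["web-v1-a", "web-v1-b"],
    "v2": ["web-v2-a", "web-v2-b"],
}

def route(client):
    """Pick a server from the pool matching the client's current version."""
    pool = VERSION_POOLS[CLIENT_VERSIONS[client]]
    return pool[hash(client) % len(pool)]

def migrate(client, new_version):
    """After the client's data is migrated, one map update moves its traffic."""
    CLIENT_VERSIONS[client] = new_version
```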

Security

The biggest risk around these changes is database secret handling; every server can now connect to every database.  Compromising a single server becomes a vector for exposing data from multiple clients.  This risk can be limited by proxy layers that keep database connections away from public facing web servers.  Still, a compromised server is now a risk to multiple clients.

Changing from physical to logical database separation is less risky.  Each client will still be logically separated with their own schema, and permissioning should make it impossible to do queries across multiple clients.

Scalability

Scalability is the goal of Multi-Tenant Infrastructure Consolidation.

In addition to helping the SaaS, the consolidation will also help clients.  Shared server pools will increase stability and uptime by providing access to a much larger group of active servers.  The client also benefits from having more servers and more slack, making it much easier for the SaaS to absorb bursts in client activity.

Likewise, running multiple clients on larger database clusters generally increases uptime and provides slack for bursts and spikes.

These changes only impact response times when the single tenant setup would have been overwhelmed.  The minimum response times don’t change, but the maximum response times get lower and occur less frequently.

Consistent Performance

The flip side to the tenancy change is the introduction of the Noisy Neighbor problem.  This mostly impacts the database layer and occurs when large clients overwhelm the database servers and drown out resources for smaller clients.

This can be especially frustrating to clients because it can happen at any time, last for an unknown period, and there’s no warning or notification.  Things “get slow” and there are no guarantees about how often clients are impacted, notice, or complain.

Synergy

There is no direct Synergy impact from changing the web and database servers.

A SaaS starting from a pure Single Tenant model is not pursuing Synergy, otherwise the initial model would have been Multi-Tenant.

Placing distinct client schemas onto a single server does open the door to future Synergy work.  Working with data in SQL across different schemas on the same server is much easier than working across physical servers.  The work would still require changing the security model and writing quite a bit of code, but there is now a doorway if the SaaS has a reason to walk through it.
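To make the contrast concrete, here is a toy cross-schema query using sqlite’s attached databases to stand in for schemas on one server (the tables and data are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Two attached in-memory databases play the role of two client schemas.
db.execute("ATTACH DATABASE ':memory:' AS client_a")
db.execute("ATTACH DATABASE ':memory:' AS client_b")
for schema in ("client_a", "client_b"):
    db.execute(f"CREATE TABLE {schema}.contacts (email TEXT)")
db.execute("INSERT INTO client_a.contacts VALUES ('x@example.com')")
db.execute("INSERT INTO client_b.contacts VALUES ('x@example.com')")

# One JOIN across schemas - trivial on a shared server, painful across
# physically separate ones.
shared = db.execute(
    "SELECT a.email FROM client_a.contacts a "
    "JOIN client_b.contacts b ON a.email = b.email").fetchall()
```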

Conclusion

As discussed in the introduction, a SaaS may begin with a purely Single Tenant model for several reasons.  High infrastructure bills and poor resource utilization will quickly drive an Infrastructure Consolidation to Multi-Tenant servers and logically separated databases.

The exceptions to this rule are SaaS that have a few very large clients, or clients with high security requirements.  These SaaS will have to price and market themselves accordingly.

Infrastructure Consolidation is an early driver away from a pure Single Tenant model to Multi-Tenancy. The change is mostly positive for clients, but does add additional security and client satisfaction risks.

If you are enjoying this series, please subscribe to my mailing list so that you don’t miss an installment!

Tenancy Models – Intro Addendum

In the first post on SaaS Tenancy Models, I introduced the two idealized models - Single and Multi-Tenant.  Many SaaS companies start off as Single Tenant by default, rather than by strategy, and migrate towards increasingly Multi-Tenant models under the influence of 4 main factors - complexity, security, scalability, and consistent performance.

After publishing, I realized that I left out an important fifth factor, synergy.

Synergy

In the context of this series, synergy is the increased value to the client that results from mixing the client’s data with other clients’ data.  A SaaS may even become a platform if the synergies become more valuable to the clients than the original service.

Another aspect of synergy is that the clients only gain the extra value so long as they remain customers of the SaaS.  When clients churn, the SaaS usually retains the extra value, even after deleting the client’s data.  This organically strengthens client lock in and increases the SaaS value over time.  The existing data set becomes ever more valuable, making it increasingly difficult for clients to leave.

Some types of businesses, like retargeting ad buyers, create a lot of value for their clients by mixing client data.  Ad buyers increase effectiveness of their ad purchases by building larger consumer profiles.  This makes the ad purchases more effective for all clients.

On the other hand, a traditional CRM, or a codeless service like Zapier, would be very hard pressed to increase client value by mixing client data.  Having the same physical person in multiple client instances in a CRM doesn’t open a lot of avenues; what could you offer - track which clients a contact responds to?  No code services may mix client data as part of bulk operations, but that doesn’t add value to the clients.

Sometimes there might be potential synergy, like in Healthcare and Education, but it would be unethical and illegal to mix the data.

Not All Factors Are Client Facing

Two of the factors, complexity and scalability, are generally invisible to clients.  When complexity and scalability are noticed, it is negative:

  • Why do new features take so long to develop?  
  • Why are bugs so difficult to resolve?  
  • Why does the client experience get worse as usage grows?

A SaaS never wants a client asking these questions.

Security, Consistent Performance and Synergy are discussion points with clients.

Many SaaS companies can address Security concerns and Consistent Performance through configurable isolation.

Synergy is a highly marketable service differentiator and generally not negotiable.

Simplified Drawings

As much as possible I’m going to treat and draw things as 2-tier systems rather than N-tier.  As long as the principles are similar, I’ll default to simplified 2-tier diagrams over N-tier or microservice diagrams.

Next Time

Coming up I’ll be breaking down single to multi-tenant transformations.

Why a SaaS would want the transformation, what the tradeoffs are, and what the potential pitfalls are.

Please subscribe to my mailing list to make sure you don’t miss out!

Introduction to SaaS Tenancy models

Recently, I’ve spent a lot of time discussing the evolution of SaaS company Tenancy Models with my colleague Benjamin. These conversations have revealed that my thinking on the subject is vague and needs focus and sharpening through writing.

This is the first in a series of posts where I will dive deep on the technical aspect of tenancy models, the tradeoffs, which factors go into deciding on appropriate models, and how implementations evolve over time.

What are Tenancy Models?

There are 2 ideal models, single-tenant and multi-tenant, but most actual implementations are a hybrid mix.

In the computer realm, single-tenant systems are ones where the client is the only user of the servers, databases and other system tiers.  Software is installed on the system and it runs for one client.  Multi-tenant means that there are multiple clients on the servers and client data is mingled in the databases.

Pre-web software tended to be single-tenant because it ran on the client’s hardware.  As software migrated online and the SaaS model took off, more complicated models became possible.  Moving from Offline to Online to the Cloud was mostly an exercise in who owned the hardware, and how difficult it was to get more.

When the software ran on the client’s hardware, at the client’s site, the hardware was basically unchangeable.  As things moved online, software became much easier to update, but hardware considerations were often made years in advance.  With cloud services, more hardware is just a click away allowing continuous evolution.

Main factors driving Technical Tenancy Decisions

The main factors driving tenancy decisions are complexity, security, scalability, and consistent performance.

Complexity

Keeping client data mingled on the servers without exposing anything to the wrong client tends to make multi-tenant software more complex than single-tenant.  The extra complexity translates to longer development cycles and higher developer costs.

Most SaaS software starts off with a single-tenant design by accident.  It isn’t a case of tech debt or cutting corners; Version 1 of the software needs to support a single client.  Supporting 10 clients with 10 instances is usually easier than developing 1 instance that supports 10 clients.  Being overwhelmed by interested clients is a good problem to have!

Eventually the complexity cost of running massive numbers of single instances outweighs development savings, and the model begins evolving towards a multi-tenant model.

Security

The biggest driver of complexity is the second most pressing factor - security.  Ensuring that data doesn’t leak between clients is difficult.

A setup like this looks simple, but is extremely dangerous:
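The original diagram is gone, but the shape is easy to reconstruct: one shared table holding every client’s rows, distinguished only by a client_id column (sqlite and the contacts table here stand in for the real system):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE contacts (client_id INTEGER, name TEXT)")
db.executemany("INSERT INTO contacts VALUES (?, ?)",
               [(1, "Alice"), (1, "Bob"), (2, "Carol")])

# Correct: every query is scoped to one client.
safe = db.execute(
    "SELECT name FROM contacts WHERE client_id = ?", (1,)).fetchall()

# Dangerous: one forgotten clause and client 1's report includes client 2.
leak = db.execute("SELECT name FROM contacts").fetchall()
```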

Forgetting to include client_id in any SQL WHERE clause will result in a data leak.

On the server side, it is also very easy to have a user log in, but lose track of which client an active session belongs to, and which data it can access.  This creates a whole collection of bugs around guessing and iterating contact ids.

Single-tenant systems don’t have these types of security problems.  No matter how badly a system is secured, each instance can only leak data for a single client.  Systems in industries with heavy penalties for leaking data, like Healthcare and Education, tend to be more single-tenant.  Single tenant models make audits easier and reduce overall company risk.

Scalability

Scalability concerns come in after complexity and security because they fall into the “good problems to have” category.  Scaling problems are a sign of product market fit and paying customers.  Being able to go internet scale and process 1 million events a second is nice, but it is meaningless without customers.

Single-tenant systems scale poorly.  Each client needs separate servers, databases, caches, and other resources.  There are no economies or efficiencies of scale.  The smallest, least powered machines are generally far more powerful than any single client needs.  Worse, usage patterns mean that these resources will mostly eat money and sit idle.

Finally, all of those machines have to be maintained.  That’s not a big deal with 10 clients, or even 100.  With 100,000 clients, completely separate stacks would require teams of people to maintain.

Multi-tenant models scale much better because the clients share resources.  Cloud services make it easy to add another server to a pool, and large pools make the impact of adding clients negligible.  Adding database nodes is more difficult, but the principle holds - serving dozens to hundreds of clients on a single database allows the SaaS to minimize wasted resources and keeps teams smaller.

Consistent Performance

Consistent Performance, also known as the Noisy Neighbor Problem, comes up as a negative side effect of multi-tenant systems.

Perfectly even load distribution is impossible.  At any given moment, some clients will have greater needs than others.  Whichever servers and databases those clients are on will run hotter than others.  Other clients will experience worse performance than normal because there are fewer resources available on the server.

Bursty and compute intensive SaaS will feel these problems more than SaaS with a regular cadence.  For example, a URL shortening service will have a long tail of links that rarely, if ever, get hit, while some links will suddenly go viral and suck up massive amounts of resources.  At the other extreme, a company that does End Of Day processing for retail stores knows when the data processing starts, and the amount of sales in any one store is limited by the number of registers.

Single tenant systems don’t have these problems because there are no neighbors sucking up resources.  But, due to their higher operating costs, they also don’t have as many spare resources available to handle bursts.

Consistent performance is rarely a driver in initial single vs multi-tenant design because the problems appear as a side effect of scale.  By the time the issue comes up, the design has been in production for years.  Instead, consistent performance becomes a major factor as designs evolve.  

Initial forays into multi-tenant design are especially vulnerable to these problems.  Multi-tenant worker pools fed from single-tenant client repositories are ripe for bursty and long running process problems.

Fully multi-tenant systems, with large resource pools, have more resilience.  Additionally, processing layers have access to all of the data needed to orchestrate and balance between clients.

Conclusion

In this post I covered the two tenancy models, touched on why most SaaS companies start off with single-tenant models, and the major factors impacting and influencing tenancy design.

Single tenant systems tend to be simpler to develop and more secure, but are more expensive to run on a per client basis and don’t scale well.  Multi tenant systems are harder to develop and secure, but have economic and performance advantages as they scale.  As a result, SaaS companies usually start with single tenant designs and iterate towards multi-tenancy.  

Next up, I will cover the gray dividing line between single and multi-tenant data within a SaaS, The Tenancy Line.

Dead Database Pixel Tracking

Pixel Tracking is a common Marketing SaaS activity used to track page loads.  Today I am going to try to tie several earlier posts together and show how to evolve a frustrating Pixel Tracking architecture into one that can survive database outages.

Pixel Tracking events are synchronously written to the database.  A job processor uses the database as a queue to find updates, and farms out processing tasks.

Designed to Punish Users

This design is governed by database performance.  As the load ramps up, users are going to notice lagging page loads.  Worse, each event recorded will have to be processed, tripling the database load.

Designed to Scale

You can relieve the pressure on the user by making your Pixel Tracking asynchronous.  Moving away from using your database as a queue is more complicated, but critical for scaling.  Finally, using Topics makes it easy to expand the types of processing tasks your platform supports.

Users are now completely insulated from scale and processing issues.
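A minimal sketch of that shape, with an in-memory queue standing in for a topic/queue service like SNS/SQS (the handler signature and status code are illustrative assumptions):

```python
import queue
import threading

events = queue.Queue()  # stands in for SNS/SQS or a similar queue service

def track_pixel(params):
    """User-facing handler: enqueue the event and return immediately."""
    events.put(params)
    return 204  # the pixel response never waits on the database

def worker(handle_event, stop):
    """Background consumer: drains the queue at its own pace."""
    while not stop.is_set() or not events.empty():
        try:
            handle_event(events.get(timeout=0.1))
        except queue.Empty:
            pass
```

If the processing side falls behind, the queue grows; the user-facing response time does not change.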

Dead Database Design

There is no database in the final design because it is no longer relevant to the users’ interactions with your services.  The performance is the same whether your database is at 0% or 100% load.  

The performance is the same if your database falls over and you have to switch to a hot standby or even restore from a backup.

With a bit of effort your SaaS could have a database fall over on the run up to Black Friday and recover without data loss or clients noticing.  If you are using SNS/SQS on AWS, a standard queue allows on the order of 100,000 in-flight messages by default.  It may take a while to chew through the queues, but the data won’t disappear.

When your Pixel Tracking is causing your users headaches, going asynchronous is your Best Alternative to a Total Rewrite.

Making Link Tracking Scale – Part 2 Edge Caching

In Part 1 of Making Link Tracking Scale I showed how switching event recording from synchronous to asynchronous processing creates a superior, faster and more consistent user experience.  In Part 2, I will discuss how Link Tracking scaling issues are governed by Long Tails, and how to overcome the initial burst using edge caching and tools like Memcache and Redis.

The Long Tails of Link Tracking

When your client sends an email campaign or publishes new content, your link tracker will experience a giant burst of activity, followed by a rapid decay.

To illustrate with some numbers, imagine an email blast that results in 100,000 link tracking events.  80% of those will occur in the first hour.  

In our original design from Part 1, that would be 22 URL lookups and 22 inserts per second.

For simplicity, pretend that inserts and selects produce similar db load.  Your system would need to support 44 events/s to avoid slowdowns and frustrating your clients.

The asynchronous model reduces the load to 22 URL lookups/s and a controllable number of inserts.  Again, for simplicity, let’s go with 8 inserts/s, for a total of 30 events/s.  That’s roughly a 1/3 reduction in load!

But, your system is still looking up the Original URL 22 times/s.  That’s a lot of unnecessary db load.

Edge Caching The Original URL

The Original URL is static data that can be cached on the web server instead of being loaded from the database for each event.  Each server would retrieve the Original URL from the db once, store it in memory, and reuse it as needed.
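A sketch of that per-server cache (the fetch function stands in for the database lookup; all names are illustrative):

```python
_url_cache = {}  # short_code -> Original URL, held in server memory

def original_url(short_code, fetch_from_db):
    """Return the Original URL, hitting the database at most once per code."""
    if short_code not in _url_cache:
        _url_cache[short_code] = fetch_from_db(short_code)
    return _url_cache[short_code]
```

In practice you would likely reach for a ready-made memoizer (functools.lru_cache in Python, for example) rather than a bare dict, to get eviction for free.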

This effectively drops the lookup rate from 22 events/s to 0, reducing the db load to 8 events/s - a 73% drop from the asynchronous model’s 30 events/s.  Combined with the asynchronous processing improvements from Part 1, that’s roughly an 80% reduction from the original 44 events/s of max database load.

Edge Caching on the servers works for a while, but as your clients expand the number of URLs you’ll need to keep track of won’t fit in server memory.  At that point you’ll need to add in tools like Memcached or Redis.  Like web servers, these tools are a lot cheaper than scaling your database.

Consistent Load on the Database

The great thing about this design is that you can keep the db load consistent, regardless of the incoming traffic.  Whether the load is 44 events/s or 100 events/s, you control the rate of asynchronous processing.  So long as you have room on your servers for an internal queue, or you use an external queue like RabbitMQ or SQS, you can delay processing the events.
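The control knob is the consumer: it writes in fixed-size batches, so the database sees the batch size, not the burst size. A toy sketch (the batch size and function names are hypothetical):

```python
import queue

def drain(q, write_batch, batch_size=8):
    """Flush queued events to the db in batches; returns events written."""
    written = 0
    while True:
        items = []
        while len(items) < batch_size:
            try:
                items.append(q.get_nowait())
            except queue.Empty:
                break
        if not items:
            return written
        write_batch(items)  # one db write per batch, regardless of burst size
        written += len(items)
```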

Scaling questions become discussions about cost and how quickly your clients need to see results.

Conclusion

Caching static data is a great way to reduce database load.  You can use prebuilt libraries like Guava for Java, cacheout for Python, or dozens of others.  You can also leverage distributed cache systems like Memcached and Redis.  While there’s no such thing as a free lunch, web servers and distributed caches are much, much cheaper to scale than databases.

You’ll save money and deliver a superior experience to your clients and their users!

Your Database is not a queue – A Live Example

A while ago I wrote an article, Your Database is not a Queue, where I talked about this common SaaS scaling anti-pattern. At the time I said:

Using a database as a queue is a natural and organic part of any growing system.  It’s an expedient use of the tools you have on hand. It’s also a subtle mistake that will consume hundreds of thousands of dollars in developer time and countless headaches for the rest of your business.  Let’s walk down the easy path into this mess, and how to carve a way out.

Today I have a live example of a SaaS company, layerci.com, proudly embracing the anti-pattern. In this article I will compare my descriptions with theirs, and point out expensive and time consuming problems they will face down the road.

None of this is to hate on layerci.com. An expedient solution that gets your product to market is worth infinitely more than a philosophically correct solution that delays giving value to your clients. My goal is to understand how SaaS companies get themselves into this situation, and offer paths out of the hole.

What's the same

In my article I described a system evolving out of reporting, layerci's problem:

We hit it quickly at LayerCI - we needed to keep the viewers of a test run's page and the github API notified about a run as it progressed.

I described an accidental queue, while layerci is building one explicitly:

CREATE TYPE ci_job_status AS ENUM ('new', 'initializing', 'initialized', 'running', 'success', 'error');

CREATE TABLE ci_jobs(
    id SERIAL,
    repository varchar(256),
    status ci_job_status,
    status_change_time timestamp
);

/* on API call */
INSERT INTO ci_jobs(repository, status, status_change_time) VALUES ('https://github.com/colinchartier/layerci-color-test', 'new', NOW());

I suggested that after you have an explicit, atomic, queue your next scaling problem is with failures. Layerci punts on this point:

As a database, Postgres has very good persistence guarantees - It's easy to query "dead" jobs with, e.g., SELECT * FROM ci_jobs WHERE status='initializing' AND NOW() - status_change_time > '1 hour'::interval to handle workers crashing or hanging.

What's different

There are a couple of differences between the two scenarios. They aren't material towards my point so I'll give them a quick summary:

  • My system imagines multiple job types, while layerci sticks to a single process type
  • layerci is doing some slick leveraging of PostgreSQL to alleviate the need for a Process Manager. This greatly reduces the amount of work needed to make the system work.

What's the problem?

The main problem with layerci's solution is the amount of developer time spent designing the solution. As a small startup, the time and effort invested in their home grown solution would almost certainly have been better spent developing new features or talking with clients.

It's the failures

From a technical perspective, the biggest problem is lack of failure handling. layerci punts on retries:

As a database, Postgres has very good persistence guarantees - It's easy to query "dead" jobs with, e.g., SELECT * FROM ci_jobs WHERE status='initializing' AND NOW() - status_change_time > '1 hour'::interval to handle workers crashing or hanging.

Handling failures is a lot of work, and something you get for free as part of a queue.

Without retries and poison queue handling, these failures will immediately impact layerci's clients and require manual human intervention. You can add failure support, but that's throwing good developer time after bad. Queues give you great support out of the box.

Monitoring should not be an afterthought

In addition to not handling failure, layerci's solution doesn't handle monitoring either:

Since jobs are defined in SQL, it's easy to generate graphql and protobuf representations of them (i.e., to provide APIs that checks the run status.)

This means that initially you'll be running blind on a solution with no retries. This is the "Our customers will tell us when there's a problem" school of monitoring. That's betting your client relationships on perfect software with no hiccups. I don't like those odds.

Scaling databases is expensive

The design uses a single, ever growing jobs table, ci_jobs, which will store a row for every job forever.  The article points out PostgreSQL's amazing ability to scale, which could keep you ahead of the curve forever.  But database scaling is the most expensive piece in any cloud application stack.

Why pay to scale a database to support quick inserts, updates, and triggers on a million-row table? The database is your permanent record; a queue is ephemeral.

Conclusion

No judgment if you build a queue into your database to get your product to market. layerci has a clever solution, but it is incomplete, and by the time you get it working at scale in production you will have squandered tons of developer resources on a system that is more expensive to run than out-of-the-box solutions.

Do you have a queue in your database? Read my original article for suggestions on how to get out of the hole without doing a total rewrite.

Is your situation unique? I'd love to hear more about it!

Your Database is not a Queue

SaaS Scaling anti-patterns: The database as a queue

Using a database as a queue is a natural and organic part of any growing system.  It’s an expedient use of the tools you have on hand. It’s also a subtle mistake that will consume hundreds of thousands of dollars in developer time and cause countless headaches for the rest of your business.  Let’s walk down the easy path into this mess, and then carve a way out.

No matter what your business does on the backend, your client-facing platform will be some kind of web front end, which means you have web servers and a database.  As your platform grows, you will have work that needs to be done but doesn’t make sense in an API/UI format. Daily sales reports and end-of-day reconciliation are common examples.

Simple Straight Through Processing

The initial developer probably didn’t realize they were building a queue.  The first version would have been a single table, process, tracking client id, date, and completed status.  The report generator would load a list of active client ids, iterate through them, and write “done” back to the database.
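That first version can be sketched in a few lines. This is a hypothetical reconstruction, not anyone's actual code: the table shape and column names are assumptions, SQLite stands in for the real database, and generate_report is a placeholder for the actual reporting work.

```python
import sqlite3

# Hypothetical first version: one table, one loop, one status flag.
def run_reports(conn, generate_report):
    """Iterate every unfinished client and mark each one done."""
    clients = [row[0] for row in conn.execute(
        "SELECT client_id FROM process WHERE completed = 0")]
    for client_id in clients:
        generate_report(client_id)   # the real work: render the daily report
        conn.execute(
            "UPDATE process SET completed = 1 WHERE client_id = ?",
            (client_id,))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE process (client_id INTEGER, run_date TEXT, completed INTEGER)")
conn.executemany(
    "INSERT INTO process VALUES (?, '2020-01-01', 0)", [(1,), (2,), (3,)])
run_reports(conn, lambda client_id: None)
print(conn.execute(
    "SELECT COUNT(*) FROM process WHERE completed = 1").fetchone()[0])  # → 3
```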

Still pretty simple

Simple, stateful and it works.

For a while.

But some of your clients are bigger, and there are a lot of them, and the process was taking longer and longer, until it wasn’t finishing overnight.  So, to gain concurrency, your developers added worker processes along with “not started” and “in process” states. Thanks to database concurrency guarantees and atomic updates, it only took a few releases to get everything working smoothly, with end-to-end processing time dropping back to something manageable.

Databases are atomic, what could go wrong?

Now the database is a queue and preventing duplicate work.

There’s a list of work, a bunch of workers, and with only a few more days of developer time you can even monitor progress as they chew through the queue.
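The heart of that concurrency work is the atomic claim: each worker flips one row from "not started" to "in process" so no two workers grab the same job. Here is a minimal sketch of that step, with an assumed schema, SQLite standing in for the real database, and made-up worker ids; a Postgres version would typically pair the claim with SELECT ... FOR UPDATE SKIP LOCKED.

```python
import sqlite3

# Illustrative atomic claim: the UPDATE either wins one row or touches nothing.
def claim_next(conn, worker_id):
    """Atomically move one 'not started' row to 'in process' for this worker."""
    cur = conn.execute(
        "UPDATE process SET status = 'in process', worker = ? "
        "WHERE client_id = (SELECT MIN(client_id) FROM process "
        "                   WHERE status = 'not started')",
        (worker_id,))
    conn.commit()
    if cur.rowcount == 0:
        return None                      # queue drained, nothing claimed
    row = conn.execute(
        "SELECT client_id FROM process "
        "WHERE worker = ? AND status = 'in process'",
        (worker_id,)).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE process (client_id INTEGER, status TEXT, worker TEXT)")
conn.executemany("INSERT INTO process VALUES (?, 'not started', NULL)", [(1,), (2,)])
print(claim_next(conn, "w1"), claim_next(conn, "w2"), claim_next(conn, "w3"))
# → 1 2 None
```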

Except your devs haven’t implemented retry logic, because failures are rare.  If the process dies and doesn’t generate a report, then someone, usually support fielding an angry customer call, will notice and ask your developers to stop what they’re doing and restart the process.  No problem: adding code to move “in process” back to “not started” after some amount of time is only a sprint’s worth of work.

Except, sometimes, for some reason, some tasks always fail.  So your developers add a retry counter, and after 5 or so attempts they set the state to “skip” so that bad jobs don’t keep sucking up system resources.

Just a few more sprints and we'll finally have time to add all kinds of new processes!

Congratulations!  For about $100,000 in precious developer time, your SaaS product has a buggy, inefficient, poor scaling implementation of database-as-a-queue.  Probably best not to even try to quantify the opportunity costs.

Solutions like SQS and RabbitMQ are available, effectively free, and take an afternoon to set up.

Instead of worrying about how you got here, a better question is how do you stop throwing good developer resources away and migrate?

Every instance is different, but I find it is easiest to work backwards.

You already have worker code to generate reports.  Have your developers extend it to accept jobs from a queue like SQS in addition to the DB.  In the first iteration, developers can manually add failed jobs to the queue. You likely already have a manual retry process; migrate that to use the queue.
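The transitional worker is the easy part: try the queue first, fall back to the legacy table. The sketch below uses Python's queue.Queue as a stand-in for SQS (a real worker would call receive_message via boto3), and the table and state names are illustrative assumptions.

```python
import queue
import sqlite3

# Transitional worker sketch: prefer the new queue, fall back to the old table.
def next_job(q, conn):
    try:
        return q.get_nowait()            # new path: job came from the queue
    except queue.Empty:
        pass
    row = conn.execute(
        "SELECT client_id FROM process WHERE status = 'not started' LIMIT 1"
    ).fetchone()
    if row is None:
        return None                      # nothing to do anywhere
    conn.execute(
        "UPDATE process SET status = 'in process' WHERE client_id = ?", row)
    conn.commit()
    return row[0]

q = queue.Queue()
q.put(42)                                # a manually requeued failed job
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE process (client_id INTEGER, status TEXT)")
conn.execute("INSERT INTO process VALUES (7, 'not started')")
print(next_job(q, conn), next_job(q, conn), next_job(q, conn))  # → 42 7 None
```

Because both paths feed the same worker code, you can shift traffic to the queue gradually and keep the DB path as a safety net until the tables run dry.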

Queue is integrated within the existing system

Once the code is working smoothly with a queue, you can have the job generator write to the queue instead of the database.  Something magical usually happens at this point: you’ll be amazed at how many new types of jobs your developers want to implement once new functionality no longer requires a database migration.

Soon you’ll be able to run your system off the DB or the queue, but the DB tables will be empty.

Only then do you refactor the db queues out of your codebase.

Queue runs the system

Adding a proper queue system gets your team out of the hole and scratches your developers’ itch for shiny new technology.  You get improved functionality after the very first sprint, and you aren’t rewriting your code from scratch.

That’s your best alternative to a total rewrite, start today!
