Pushing the Jobs Service across the Tenancy Line

A Jobs Service is a very common service for SaaS companies.  It provides a way to run work on a schedule, on demand, and independent of human activity.  Often, everything that isn’t done through the website is done by a Job Service.

I have never worked at a SaaS without some version of a Job Service, usually homegrown and built off a database instead of a queue.  They usually have descriptive and funny names – Task Processor, Crons, Crontabulous, Maestro, Batch Processor and of course Polite Batch Jobs.

Starting early in the SaaS’s life, they also evolve and grow with the SaaS, creating problems as they migrate from Single Tenant to a logically shared environment.

Single Tenant Job Service

In a single tenant model, provisioning a Job Service with a pool of workers is fairly straightforward.  Jobs are generated and put onto a queue (and not a database!).

The Job Service takes jobs off of the queue and fans them out to the worker pool.  This is simple and works well because the Queue handles the complexities of tracking and retrying jobs.

Logically Separate Job Service

After an initial infrastructure consolidation the Job Service might look like this:

Multiple clients exist on a single database cluster, each with their own logically separate schema.  

The CRUD service has become a pool of servers that can act on behalf of any client.

There is still only 1 queue and 1 Job Service; Workers can act on behalf of any client, just like the CRUD servers.

Jobs get added haphazardly, and processed in a FIFO manner.

This model is much more resource efficient – sharing workers allows you to size the pool to keep things busy.

But this design is a disaster from a Noisy Neighbor standpoint.

Because the Queue is FIFO, the Job Service has no visibility into the client composition of the pending jobs, and a large client can easily starve a small one of resources by adding hundreds or thousands of jobs to the queue.  The large client will see progress as its jobs are processed, but nothing happens for the small client until the large client’s jobs finish.
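To make the starvation concrete, here is a minimal sketch of a FIFO queue shared by two clients.  The client names and job counts are made up for illustration; the only thing being modeled is insertion order.

```python
from collections import deque


def fifo_wait_positions(jobs):
    """For each client, count how many jobs run before that client's first job starts."""
    queue = deque(jobs)
    first_start = {}
    processed = 0
    while queue:
        client = queue.popleft()
        # Record the queue position of each client's first job only.
        first_start.setdefault(client, processed)
        processed += 1
    return first_start


# Hypothetical workload: a large client adds 1,000 jobs just before
# a small client adds a single job.
jobs = ["large"] * 1000 + ["small"]
waits = fifo_wait_positions(jobs)
print(waits)
```

The small client’s one job sits behind every job the large client added, no matter how quick it would be to run.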

Things get even worse if the Queue and Job Service are Global instead of Cell based.  A global queue feeding a global worker pool that works on clients spread across multiple database clusters will naturally cause database cluster hot spots.  Performance will degrade for everyone on the cluster while the workers do massive jobs for a few large clients.

You can add bandaids like limiting the number of jobs per client and moving excess work onto overflow queues.  This will help smaller clients somewhat, but natural hotspots will still occur.
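A rough sketch of that bandaid, assuming an in-memory queue and a hypothetical cap of 3 pending jobs per client.  The class and method names are invented for illustration, and a real version would sit in front of an actual queue.

```python
from collections import Counter, deque


class CappedQueue:
    """FIFO queue that parks a client's excess jobs on an overflow queue."""

    def __init__(self, cap=3):
        self.cap = cap              # hypothetical per-client limit
        self.main = deque()         # jobs eligible to run, in FIFO order
        self.overflow = deque()     # jobs over the cap, waiting for a slot
        self.pending = Counter()    # jobs per client currently on main

    def enqueue(self, client, job):
        if self.pending[client] < self.cap:
            self.main.append((client, job))
            self.pending[client] += 1
        else:
            self.overflow.append((client, job))

    def dequeue(self):
        if not self.main:
            return None
        client, job = self.main.popleft()
        self.pending[client] -= 1
        self._promote(client)
        return client, job

    def _promote(self, client):
        # Move this client's oldest overflow job (if any) onto the main queue.
        idx = next((i for i, (c, _) in enumerate(self.overflow) if c == client), None)
        if idx is not None:
            item = self.overflow[idx]
            del self.overflow[idx]
            self.main.append(item)
            self.pending[client] += 1


q = CappedQueue(cap=3)
for n in range(5):
    q.enqueue("large", n)
q.enqueue("small", 0)

order = []
while (item := q.dequeue()) is not None:
    order.append(item)
print(order)
```

In this sketch the small client waits behind 3 jobs instead of 5.  The cap bounds how far ahead a large client can get, but it does nothing about which database cluster the work lands on.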

Cross The Tenancy Line – Become Multi-Tenant

The Job Service needs to evolve from being Logically Separated into a Multi-Tenant service.

It needs to know how many jobs each client has pending, how long the jobs are taking, and how hot the database clusters are running so that it can operate a priority queue instead of FIFO.

The Jobs Service needs to move across the Tenancy Line.

What is the Tenancy Line?

With Logically Separate infrastructure the clients share infrastructure, but the data and services all behave as if there is only one client at a time.  As a result each client can regulate its own behavior, but has no visibility into the infrastructure as a whole.

To stop acting like a Single Tenant service, the Jobs Service needs to cross the line into Multi-Tenancy.

This change is conceptually simple, but has a lot of subtle implications.

The Service can control load across clients

In the original model work loads are random based on when jobs are added to the queue.  When a hotspot emerges, there’s not much that the service can do without manual intervention.  When there’s a noisy neighbor you can’t do much to stop them from starving smaller clients because you don’t know where those clients are in the queue.

With a Multi-Tenant job service, you can control resources across cells and the entire platform.  Small clients can be protected by raising the priority of jobs for clients that have had few jobs run recently.
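One way this prioritization could look, as a sketch only: keep a pending queue per client and always pick the client with the fewest recent jobs.  The class and field names are assumptions.  This version counts jobs when they are dispatched, never decays the counts, and ignores job duration and database cluster load, all of which a real service would need to handle.

```python
from collections import defaultdict, deque


class FairScheduler:
    """Picks the next job from the client with the fewest recently dispatched jobs."""

    def __init__(self):
        self.pending = defaultdict(deque)    # client -> that client's jobs, FIFO
        self.recent_done = defaultdict(int)  # client -> recently dispatched job count

    def enqueue(self, client, job):
        self.pending[client].append(job)

    def next_job(self):
        candidates = [c for c, jobs in self.pending.items() if jobs]
        if not candidates:
            return None
        # Fewest recent jobs wins; ties are broken by client name for determinism.
        client = min(candidates, key=lambda c: (self.recent_done[c], c))
        job = self.pending[client].popleft()
        self.recent_done[client] += 1  # simplification: counted at dispatch time
        return client, job


scheduler = FairScheduler()
for n in range(1000):
    scheduler.enqueue("large", n)
for n in range(2):
    scheduler.enqueue("small", n)

first_four = [scheduler.next_job()[0] for _ in range(4)]
print(first_four)
```

Because the service can see every client’s backlog, the small client’s two jobs are interleaved with the large client’s work instead of waiting behind all 1,000 jobs.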

Jobs will finish faster as worker loads can be managed across cells, preventing hotspots.

Overall throughput will rise, smaller client performance will improve dramatically, and large clients will see more consistent execution times.

The Job Service Becomes a Queue

The original design used a single simple queue.  Every client adds jobs directly to the queue, and the Job Service’s responsibility is to take work, pass it to a worker, and mark the job as complete.  If there’s a failure, the queue will time the job out and put the work back on the queue.
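The timeout behavior is what the queue gives you for free.  Here is a minimal in-memory sketch of the idea, with time passed in explicitly so it is easy to follow.  The class name, the 30 second timeout, and the job name are all hypothetical, and real queue products implement this in their own ways.

```python
from collections import deque


class TimeoutQueue:
    """Queue that puts a job back if it is not completed before its deadline."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.ready = deque()
        self.in_flight = {}  # job -> deadline (seconds)

    def put(self, job):
        self.ready.append(job)

    def take(self, now):
        self._requeue_expired(now)
        if not self.ready:
            return None
        job = self.ready.popleft()
        self.in_flight[job] = now + self.timeout_s
        return job

    def complete(self, job):
        # Marking a job complete removes it for good.
        self.in_flight.pop(job, None)

    def _requeue_expired(self, now):
        expired = [j for j, deadline in self.in_flight.items() if deadline <= now]
        for job in expired:
            del self.in_flight[job]
            self.ready.append(job)


q = TimeoutQueue(timeout_s=30)
q.put("report-1")
first = q.take(now=0)      # a worker takes the job, then crashes
hidden = q.take(now=10)    # still inside the timeout, nothing to hand out
retry = q.take(now=31)     # timeout passed, the job is available again
q.complete(retry)
after = q.take(now=100)
print(first, hidden, retry, after)
```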

A FIFO queue prioritizes by insertion order and doesn’t have any mechanism for reordering.  The Job Service will have to build prioritization logic and find a way to integrate into a queuing mechanism.  Do not give in to temptation and turn your database into a queue!

Conclusion

Pushing the Jobs Service across the Tenancy Line is a major coming of age step in the evolution of a SaaS company.

It trades significant development resources and complexity for consistent execution and a solution to the Noisy Neighbor Problem.  The SaaS benefits from the synergy this creates with better resource utilization and reduced database hotspotting.

Once a SaaS has enough clients to warrant the change, making the Jobs Service Multi-Tenant is a major step forward.

Infrastructure Consolidation Drives Early Tenancy Migrations

For SaaS with a pure Single Tenant model, infrastructure consolidation usually drives the first two steps towards a Multi-Tenant model.  The two steps convert the front end servers to be Multi-Tenant and switch the client databases from physical to logical isolation.  They are usually done nearly simultaneously, as the SaaS grows beyond a handful of clients, infrastructure costs skyrocket, and things become unmanageable.

Diagram of a single tenant architecture becoming multi-tenant

Considering the 5 factors laid out in the introduction and addendum (complexity, security, scalability, consistent performance, and synergy), this move greatly increases scalability, at the cost of increased complexity, decreased security, and opening the door to consistent performance problems.  Synergy is not immediately impacted, but these changes make adding Synergy at a later date much easier.

Why is this such an early move when it has 3 negative factors and only 1 positive?  Because pure Single Tenant designs have nearly insurmountable scalability problems, and these two changes are the fastest, most obvious and most cost effective solution.

Complexity 

Shifting from Single Tenant servers and databases to Multi-Tenant slightly increases software complexity in exchange for massively decreasing platform complexity.

The web servers need to be able to understand which client a request is for, usually through sub domains like client.mySaaS.com, and use that knowledge to validate the user and connect to the correct database to retrieve data.
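A minimal sketch of the lookup, assuming the sub domain convention above.  The client names, the schema names, and the in-memory registry are hypothetical; a real server would load this mapping from configuration or a directory service.

```python
BASE_DOMAIN = "mysaas.com"  # lowercased form of the example domain

# Hypothetical registry mapping each client sub domain to its database schema.
CLIENT_DATABASES = {
    "acme": "acme_schema",
    "globex": "globex_schema",
}


def resolve_client(host):
    """Map a request's Host header to (client, schema), or raise ValueError."""
    hostname = host.split(":")[0].lower().rstrip(".")
    suffix = "." + BASE_DOMAIN
    if not hostname.endswith(suffix):
        raise ValueError(f"unexpected host: {host!r}")
    client = hostname[: -len(suffix)]
    # Reject empty, nested, or unknown sub domains instead of guessing.
    if not client or "." in client or client not in CLIENT_DATABASES:
        raise ValueError(f"unknown client: {host!r}")
    return client, CLIENT_DATABASES[client]


print(resolve_client("acme.mySaaS.com"))
```

Failing closed matters here: a request that cannot be tied to exactly one known client should never reach a database.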

Increased complexity from consolidation

The difficult and risky part here is making sure that valid sessions stay associated with the correct account.  

Database server consolidation tends to be less tricky.  Most database servers support multiple schemas with their own credentials and logical isolation.  Logical separation provides unique connection settings for the web servers.  Individual client logins are restricted to the client’s schema and the SaaS developers do not need to treat logical and physical separation any differently.

Migrations and Versioning Become Expensive

The biggest database problems with a many-to-many design crop up during migrations.  Inevitably, web and database changes will be incompatible between versions.  Some SaaS models require all clients on the same version, which limits compatibility issues to the release window (which itself can take days), while other models allow clients to be on different versions for years.

Versioning and Migration Diagram

The general solution to the problem of long lived versions is to stand up a pool of web and database servers on the new version, migrate clients to the new pool, and update request routing.
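Sketched in code, the routing side of that solution might look like the following.  The pool hostnames, version labels, and the run_migration callback are all hypothetical stand-ins.

```python
# Hypothetical pools, one per software version.
POOLS = {
    "v1": "pool-v1.internal",
    "v2": "pool-v2.internal",
}

# Which version each client is currently on.
client_version = {"acme": "v1", "globex": "v1"}


def route(client):
    """Return the server pool that should handle this client's requests."""
    return POOLS[client_version[client]]


def migrate(client, new_version, run_migration):
    """Migrate one client's data, then point its requests at the new pool."""
    if new_version not in POOLS:
        raise ValueError(f"no pool for version {new_version!r}")
    # The data migration must succeed before routing changes; if it raises,
    # the client stays on the old pool.
    run_migration(client, new_version)
    client_version[client] = new_version


migrate("acme", "v2", run_migration=lambda client, version: None)
print(route("acme"), route("globex"))
```

Clients move one at a time, so both versions keep serving traffic for as long as the migration takes.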

Security

The biggest risk around these changes is database secret handling; every server can now connect to every database.  Compromising a single server becomes a vector for exposing data from multiple clients.  This risk can be limited by proxy layers that keep database connections away from public facing web servers.  Still, a compromised server is now a risk to multiple clients.

Changing from physical to logical database separation is less risky.  Each client will still be logically separated with their own schema, and permissioning should make it impossible to do queries across multiple clients.

Scalability

Scalability is the goal of Multi-Tenant Infrastructure Consolidation.

In addition to helping the SaaS, the consolidation will also help clients.  Shared server pools will increase stability and uptime by providing access to a much larger group of active servers.  The client also benefits from having more servers and more slack, making it much easier for the SaaS to absorb bursts in client activity.

Likewise, running multiple clients on larger database clusters generally increases uptime and provides slack for bursts and spikes.

These changes only impact response times when the single tenant setup would have been overwhelmed.  The minimum response times don’t change, but the maximum response times get lower and occur less frequently.

Consistent Performance

The flip side to the tenancy change is the introduction of the Noisy Neighbor problem.  This mostly impacts the database layer and occurs when large clients overwhelm the database servers and drown out resources for smaller clients.

This can be especially frustrating to clients because it can happen at any time, last for an unknown period, and there’s no warning or notification.  Things “get slow” and there are no guarantees about how often clients are impacted, notice, or complain.

Synergy

There is no direct Synergy impact from changing the web and database servers.

A SaaS starting from a pure Single Tenant model is not pursuing Synergy, otherwise the initial model would have been Multi-Tenant.

Placing distinct client schemas onto a single server does open the door to future Synergy work.  Working with data in SQL across different schemas on the same server is much easier than working across physical servers.  The work would still require changing the security model and writing quite a bit of code.  There is now a doorway if the SaaS has a reason to walk through.

Conclusion

As discussed in the introduction, a SaaS may begin with a purely Single Tenant model for several reasons.  High infrastructure bills and poor resource utilization will quickly drive an Infrastructure Consolidation to Multi-Tenant servers and logically separated databases.

The exceptions to this rule are SaaS that have few very large clients or clients with high security requirements.  These SaaS will have to price and market themselves accordingly.

Infrastructure Consolidation is an early driver away from a pure Single Tenant model to Multi-Tenancy. The change is mostly positive for clients, but does add additional security and client satisfaction risks.


Introduction to SaaS Tenancy models

Recently, I’ve spent a lot of time discussing the evolution of SaaS company Tenancy Models with my colleague Benjamin. These conversations have revealed that my thinking on the subject is vague and needs focus and sharpening through writing.

This is the first in a series of posts where I will dive deep on the technical aspect of tenancy models, the tradeoffs, which factors go into deciding on appropriate models, and how implementations evolve over time.

What are Tenancy Models?

There are 2 ideal models, single-tenant and multi-tenant, but most actual implementations are a hybrid mix.

In the computer realm, single-tenant systems are ones where the client is the only user of the servers, databases and other system tiers.  Software is installed on the system and it runs for one client.  Multi-tenant means that there are multiple clients on the servers and client data is mingled in the databases.

Pre-web software tended to be single-tenant because it ran on the client’s hardware.  As software migrated online and the SaaS model took off more complicated models became possible.  Moving from Offline to Online to the Cloud was mostly an exercise in who owned the hardware, and how difficult it was to get more.

When the software ran on the client’s hardware, at the client’s site, the hardware was basically unchangeable.  As things moved online, software became much easier to update, but hardware considerations were often made years in advance.  With cloud services, more hardware is just a click away allowing continuous evolution.

Main factors driving Technical Tenancy Decisions

The main factors driving tenancy decisions are complexity, security, scalability, and consistent performance.

Complexity

Keeping client data mingled on the servers without exposing anything to the wrong client tends to make multi-tenant software more complex than single-tenant.  The extra complexity translates to longer development cycles and higher developer costs.

Most SaaS software starts off with a single-tenant design by accident.  It isn’t a case of tech debt or cutting corners; Version 1 of the software needs to support a single client.  Supporting 10 clients with 10 instances is usually easier than developing 1 instance that supports 10 clients.  Being overwhelmed by interested clients is a good problem to have!

Eventually the complexity cost of running massive numbers of single instances outweighs development savings, and the model begins evolving towards a multi-tenant model.

Security

The biggest driver of complexity is the second most pressing factor – security.  Ensuring that data doesn’t leak between clients is difficult.

A setup like this looks simple, but is extremely dangerous:

Forgetting to include client_id in any SQL Where clause will result in a data leak.
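A small illustration using an in-memory SQLite database.  The contacts table, its columns, and the sample rows are made up for the example; the point is only the difference between the two queries.

```python
import sqlite3

# Hypothetical shared table holding contacts for every client.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, client_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO contacts (client_id, name) VALUES (?, ?)",
    [(1, "Alice"), (1, "Bob"), (2, "Carol")],
)


def contacts_leaky(conn):
    # No client_id filter: every client's contacts come back.
    return [row[0] for row in conn.execute("SELECT name FROM contacts ORDER BY id")]


def contacts_for_client(conn, client_id):
    # The client_id filter is the only thing separating tenants.
    rows = conn.execute(
        "SELECT name FROM contacts WHERE client_id = ? ORDER BY id",
        (client_id,),
    )
    return [row[0] for row in rows]


print(contacts_leaky(conn))
print(contacts_for_client(conn, 2))
```

Both queries are valid SQL and both return data, which is why this kind of bug is easy to ship.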

On the server side, it is also very easy to have a user log in, but lose track of which client an active session belongs to, and which data it can access.  This creates a whole collection of bugs around guessing and iterating contact ids.
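A sketch of the check that prevents the id guessing bug, assuming the session records which client the user belongs to.  The Session type, the in-memory contact store, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    user: str
    client_id: int  # the client this login belongs to


# Hypothetical contact store shared by all clients.
CONTACTS = {
    1: {"client_id": 1, "name": "Alice"},
    2: {"client_id": 2, "name": "Carol"},
}


def get_contact(session, contact_id):
    """Return a contact's name only if it belongs to the session's client."""
    contact = CONTACTS.get(contact_id)
    # Treat another client's contact exactly like a missing one, so that
    # iterating ids reveals nothing about other clients' data.
    if contact is None or contact["client_id"] != session.client_id:
        raise LookupError("contact not found")
    return contact["name"]


session = Session(user="pat", client_id=1)
print(get_contact(session, 1))
```

Without the ownership check, the logged in user for client 1 could request contact 2 and read another client’s data just by changing the id.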

Single-tenant systems don’t have these types of security problems.  No matter how badly a system is secured, each instance can only leak data for a single client.  Systems in industries with heavy penalties for leaking data, like Healthcare and Education, tend to be more single-tenant.  Single tenant models make audits easier and reduce overall company risk.

Scalability

Scalability concerns come in after complexity and security because they fall into the “good problems to have” category.  Scaling problems are a sign of product market fit and paying customers.  Being able to go internet scale and process 1 million events a second is nice, but it is meaningless without customers.

Single-tenant systems scale poorly.  Each client needs separate servers, databases, caches, and other resources.  There are no economies or efficiencies of scale.  The smallest, least powered machines are generally way more powerful than any single client needs.  Worse, usage patterns mean that these resources will mostly eat money and sit idle.

Finally, all of those machines have to be maintained.  That’s not a big deal with 10 clients, or even 100.  With 100,000 clients, completely separate stacks would require teams of people to maintain.

Multi-tenant models scale much better because the clients share resources.  Cloud services make it easy to add another server to a pool, and large pools make the impact of adding clients negligible.  Adding database nodes is more difficult, but the principle holds – serving dozens to hundreds of clients on a single database allows the SaaS to minimize wasted resources and keeps teams smaller.

Consistent Performance

Consistent Performance, also known as the Noisy Neighbor Problem, comes up as a negative side effect of multi-tenant systems.

Perfectly even load distribution is impossible.  At any given moment, some clients will have greater needs than others.  Whichever servers and databases those clients are on will run hotter than others.  Other clients will experience worse performance than normal because there are fewer resources available on the server.

Bursty and compute intensive SaaS will feel these problems more than SaaS with a regular cadence.  For example a URL shortening service will have a long tail of links that rarely, if ever, get hit.  Some links will suddenly go viral and suck up massive amounts of resources.  On the other extreme – a company that does End Of Day processing for retail stores knows when the data processing starts, and the amount of sales in any one store is limited by the number of registers.

Single tenant systems don’t have these problems because there are no neighbors sucking up resources.  But, due to their higher operating costs, they also don’t have as many extra resources available to handle bursts.

Consistent performance is rarely a driver in initial single vs multi-tenant design because the problems appear as a side effect of scale.  By the time the issue comes up, the design has been in production for years.  Instead, consistent performance becomes a major factor as designs evolve.  

Initial forays into multi-tenant design are especially vulnerable to these problems.  Multi-tenant worker pools fed from single-tenant client repositories are ripe for bursty and long running process problems.

Fully multi-tenant systems, with large resource pools, have more resilience.  Additionally, processing layers have access to all of the data needed to orchestrate and balance between clients.

Conclusion

In this post I covered the two tenancy models, touched on why most SaaS companies start off with single-tenant models, and the major factors impacting and influencing tenancy design.

Single tenant systems tend to be simpler to develop and more secure, but are more expensive to run on a per client basis and don’t scale well.  Multi tenant systems are harder to develop and secure, but have economic and performance advantages as they scale.  As a result, SaaS companies usually start with single tenant designs and iterate towards multi-tenancy.  

Next up, I will cover the gray dividing line between single and multi-tenant data within a SaaS, The Tenancy Line.