Pushing the Jobs Service across the Tenancy Line

A Jobs Service is a very common service for SaaS companies.  It provides a way to run work on a schedule, on demand, and independent of human activity.  Often, everything that isn’t done through the website is done by a Job Service.

I have never worked at a SaaS without some version of a Job Service, usually homegrown and built off a database instead of a queue.  They usually have descriptive and funny names – Task Processor, Crons, Crontabulous, Maestro, Batch Processor and of course Polite Batch Jobs.

Because they start early in the SaaS’s life, they evolve and grow with it, and they create problems as the SaaS migrates from Single Tenant to a logically shared environment.

Single Tenant Job Service

In a single tenant model, provisioning a Job Service with a pool of workers is fairly straightforward.  Jobs are generated and put onto a queue (and not a database!).

The Job Service takes jobs off of the queue and fans them out to the worker pool.  This is simple and works well because the Queue handles the complexities of tracking and retrying jobs.

Logically Separate Job Service

After an initial infrastructure consolidation the Job Service might look like this:

Multiple clients exist on a single database cluster, each with their own logically separate schema.  

The CRUD service has become a pool of servers that can act on behalf of any client.

There is still only 1 queue and 1 Job Service; Workers can act on behalf of any client, just like the CRUD servers.

Jobs get added haphazardly, and processed in a FIFO manner.

This model is much more resource efficient – sharing workers allows you to size the pool to keep things busy.

But this design is a disaster from a Noisy Neighbor standpoint.

Because the Queue is FIFO, the Job Service has no visibility into the client composition of the pending jobs, and a large client can easily starve a small one of resources by adding hundreds or thousands of jobs to the queue.  The large client will see progress as its jobs are processed, but nothing happens for the small client until the large client’s backlog has drained.
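To make the starvation concrete, here is a minimal sketch of a strict FIFO queue shared by two clients. The client names and the function `fifo_first_positions` are invented for illustration; it only reports how many jobs are ahead of each client's first job.

```python
from collections import deque


def fifo_first_positions(jobs):
    """Return, per client, how many jobs are dequeued before that client's first job."""
    queue = deque(jobs)
    first_seen = {}
    position = 0
    while queue:
        client = queue.popleft()
        # Record the position only the first time a client reaches the front.
        first_seen.setdefault(client, position)
        position += 1
    return first_seen


# A large client enqueues 1,000 jobs just before a small client enqueues one.
jobs = ["large"] * 1000 + ["small"]
print(fifo_first_positions(jobs))  # {'large': 0, 'small': 1000}
```

The small client's single job sits behind all 1,000 of the large client's jobs, no matter how quick it would be to run.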

Things get even worse if the Queue and Job Service are Global instead of Cell based.  A global queue feeding a global worker pool that works on clients spread across multiple database clusters will naturally cause database cluster hot spots.  Performance will degrade for everyone on the cluster while the workers do massive jobs for a few large clients.

You can add band-aids, such as limiting the number of jobs per client and moving excess work onto overflow queues.  This will help smaller clients somewhat, but natural hotspots will still occur.
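One possible shape for that band-aid is sketched below, assuming an in-memory queue and a per-client cap of at least 1. The class `CappedQueue` and its method names are hypothetical, not taken from any real queueing product.

```python
from collections import Counter, deque


class CappedQueue:
    """FIFO queue that caps each client's jobs on the main queue (illustrative only)."""

    def __init__(self, max_pending_per_client):
        self.max_pending = max_pending_per_client  # assumed to be >= 1
        self.main = deque()
        self.overflow = deque()
        self.pending = Counter()  # jobs per client currently on the main queue

    def enqueue(self, client, job):
        if self.pending[client] < self.max_pending:
            self.main.append((client, job))
            self.pending[client] += 1
        else:
            self.overflow.append((client, job))

    def dequeue(self):
        if not self.main:
            return None
        client, job = self.main.popleft()
        self.pending[client] -= 1
        self._promote()
        return client, job

    def _promote(self):
        # Move overflow jobs onto the main queue for clients that now have room,
        # keeping the remaining overflow jobs in their original order.
        remaining = deque()
        while self.overflow:
            client, job = self.overflow.popleft()
            if self.pending[client] < self.max_pending:
                self.main.append((client, job))
                self.pending[client] += 1
            else:
                remaining.append((client, job))
        self.overflow = remaining


queue = CappedQueue(max_pending_per_client=2)
for i in range(5):
    queue.enqueue("large", f"L{i}")
queue.enqueue("small", "S0")
print([queue.dequeue() for _ in range(6)])
```

With a cap of 2, the small client's job runs third instead of sixth. The workers still have no idea which database cluster each job will hit, so the hotspot problem is untouched.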

Cross The Tenancy Line – Become Multi-Tenant

The Job Service needs to evolve from being Logically Separated into a Multi-Tenant service.

It needs to know how many jobs each client has pending, how long the jobs are taking, and how hot the database clusters are running so that it can operate a priority queue instead of FIFO.
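As a rough sketch of the bookkeeping this implies, the hypothetical `JobLedger` below tracks pending jobs and average duration per client, plus a load figure per database cluster. All names are assumptions, and the load value is assumed to arrive from whatever monitoring already exists.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ClientStats:
    pending: int = 0
    completed: int = 0
    total_seconds: float = 0.0

    @property
    def average_seconds(self):
        # Average duration of finished jobs; 0.0 until the first job completes.
        return self.total_seconds / self.completed if self.completed else 0.0


class JobLedger:
    """Cross-client view of the backlog (illustrative, in-memory only)."""

    def __init__(self):
        self.clients = defaultdict(ClientStats)
        self.cluster_load = {}  # cluster name -> load between 0.0 and 1.0

    def job_added(self, client):
        self.clients[client].pending += 1

    def job_finished(self, client, seconds):
        stats = self.clients[client]
        stats.pending -= 1
        stats.completed += 1
        stats.total_seconds += seconds

    def report_cluster_load(self, cluster, load):
        self.cluster_load[cluster] = load


ledger = JobLedger()
for _ in range(3):
    ledger.job_added("large")
ledger.job_finished("large", seconds=2.0)
ledger.job_finished("large", seconds=4.0)
print(ledger.clients["large"].pending, ledger.clients["large"].average_seconds)  # 1 3.0
```

None of this is visible to a service that only ever sees the job at the head of a FIFO queue.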

The Jobs Service needs to move across the Tenancy Line.

What is the Tenancy Line?

With Logically Separate infrastructure the clients share infrastructure, but the data and services all behave as if there is only one client at a time.  As a result each client can regulate its own behavior, but has no visibility into the infrastructure as a whole.

To stop acting like a Single Tenant service, the Jobs Service needs to cross the line into Multi-Tenancy.

This change is conceptually simple, but has a lot of subtle implications.

The Service can control load across clients

In the original model, workloads are effectively random, driven by when jobs happen to be added to the queue.  When a hotspot emerges, there’s not much that the service can do without manual intervention.  When there’s a noisy neighbor, you can’t do much to stop them from starving smaller clients because you don’t know where those clients are in the queue.

With a Multi-Tenant job service, you can control resources across cells and the entire platform.  Small clients can be protected by moving jobs up in priority based on how many recent jobs they have completed.

Jobs will finish faster as worker loads can be managed across cells, preventing hotspots.
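A minimal sketch of that kind of scheduling follows, assuming one FIFO per client, a simple count of recent completions, and a load figure per database cluster. `FairScheduler`, the 0.8 threshold, and the client-to-cluster mapping are all invented for illustration; a real service would also decay the completion counts over time.

```python
from collections import defaultdict, deque


class FairScheduler:
    """Picks the next job across clients instead of in insertion order."""

    def __init__(self, client_cluster, max_cluster_load=0.8):
        self.client_cluster = client_cluster      # client -> database cluster
        self.max_cluster_load = max_cluster_load  # assumed threshold
        self.queues = defaultdict(deque)          # one FIFO per client
        self.recent_completed = defaultdict(int)  # would decay over time in practice
        self.cluster_load = defaultdict(float)    # cluster -> load in [0.0, 1.0]

    def enqueue(self, client, job):
        self.queues[client].append(job)

    def record_completion(self, client):
        self.recent_completed[client] += 1

    def next_job(self):
        # Only consider clients with work whose cluster is not running hot.
        candidates = [
            client
            for client, queue in self.queues.items()
            if queue and self.cluster_load[self.client_cluster[client]] < self.max_cluster_load
        ]
        if not candidates:
            return None
        # Fewest recent completions goes first; the name breaks ties deterministically.
        client = min(candidates, key=lambda c: (self.recent_completed[c], c))
        return client, self.queues[client].popleft()


scheduler = FairScheduler({"large": "db1", "small": "db1"})
for i in range(3):
    scheduler.enqueue("large", f"L{i}")
scheduler.enqueue("small", "S0")
for _ in range(5):
    scheduler.record_completion("large")  # the large client has been busy lately
print(scheduler.next_job())  # ('small', 'S0')
```

The small client jumps ahead because it has completed nothing recently, and any client on a cluster above the threshold simply waits until the load drops.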

Overall throughput will rise, smaller client performance will improve dramatically, and large clients will see more consistent execution times.

The Job Service Becomes a Queue

The original design used a single simple queue.  Every client adds jobs directly to the queue, and the Job Service’s responsibility is to take work, pass it to a worker, and mark the job as complete.  If there’s a failure, the queue will time the job out and put the work back on the queue.
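The timeout-and-requeue behavior can be sketched as follows. This is an in-memory illustration with a caller-supplied clock, not the API of any particular queue product; `TimeoutQueue` and `visibility_timeout` are assumed names.

```python
from collections import deque


class TimeoutQueue:
    """FIFO queue that puts a job back if it is not completed in time."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.ready = deque()
        self.in_flight = {}  # job -> deadline

    def put(self, job):
        self.ready.append(job)

    def take(self, now):
        self._requeue_expired(now)
        if not self.ready:
            return None
        job = self.ready.popleft()
        self.in_flight[job] = now + self.visibility_timeout
        return job

    def complete(self, job):
        # Called by the Job Service once a worker reports success.
        self.in_flight.pop(job, None)

    def _requeue_expired(self, now):
        expired = [job for job, deadline in self.in_flight.items() if deadline <= now]
        for job in expired:
            del self.in_flight[job]
            self.ready.append(job)


queue = TimeoutQueue(visibility_timeout=30)
queue.put("report-1")
print(queue.take(now=0))   # report-1
print(queue.take(now=10))  # None: the job is in flight and has not timed out
print(queue.take(now=30))  # report-1 again: the worker never completed it
```

Note that the requeued job goes to the back of the line, which is all a plain FIFO queue can offer.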

A FIFO queue prioritizes by insertion order and doesn’t have any mechanism for reordering.  The Job Service will have to build prioritization logic and find a way to integrate into a queuing mechanism.  Do not give in to temptation and turn your database into a queue!

Conclusion

Pushing the Jobs Service across the Tenancy Line is a major coming of age step in the evolution of a SaaS company.

It trades significant development resources and complexity for consistent execution and a solution to the Noisy Neighbor Problem.  The SaaS also benefits from better resource utilization and reduced database hotspotting.

Once a SaaS has enough clients to warrant the change, making the Jobs Service Multi-Tenant is well worth the investment.

2 thoughts on “Pushing the Jobs Service across the Tenancy Line”

  1. We have evolved from using a single global database-based queue, to using one database queue per customer, to using “proper” queues with RDBMS as a state store. (These are however still a kind of dual write!)

    The kinds of crunches we see around noisy neighbors are like you describe, and they’re even worse because of the lack of visibility into the backlog of a queue.

    I think there are two tactics / technologies that would help us get into a position where the backlog is visible (and so scheduling priorities can be changed before infra pressures make someone a noisy neighbor, forcing us or a robot to react):

    * an excellent data model and access pattern that avoids all lock contention — perhaps append only tables that are regularly cleaned up — so you can add more workers without degrading queue query performance and still query the backlog for analysis and scheduling decisions. This tactic may end up looking a bit like we are turning the database inside out by implementing something like MVCC or WAL on the application side… too.
    * giving more appropriate inputs to the scheduler. On the business side, we might want to say that some customers have a higher limit on simultaneous jobs, but really the infrastructure’s performance for those jobs is the decider. https://bistro.io/ is one system that claims to do this.
