The Chestburster Antipattern

The Chestburster is an antipattern that occurs when transitioning from a monolith to services. 

The team sees an opportunity to extract a small piece of functionality from the monolith into a new service, but the monolith is the only place that handles security, permissions, and composition.

Because the new service can’t face clients directly, the Chestburster hides behind the monolith, hoping to burst through at some later point.

The Chestburster begins as the inverse of the Strangler pattern, with the monolith delegating to the new service instead of the new service delegating to the monolith.

Why it’s appealing

The Chestburster’s appeal is that it gets the New Service up and running quickly.  This looks like progress!  The legacy code is extracted, possibly rewritten, and maybe better.

Why it fails

There is no business case for building the functionality the new service needs to burst through the monolith.  The functionality has been rewritten.  It’s been rewritten into a new service.  How do you go back now and ask for time to address security and the other missing pieces?  Worse, the missing pieces are usually outside of the team’s control; security is one area you want to leave to the experts.

Even if you get past all the problems on your side, you’ve created new composition complexities for the client.  Now the client has to create a new connection to the Chestburster and handle routing themselves.  Can you make your clients update?  Should you?

Remember The Strangler

If you want to break apart a monolith, it’s always a good idea to start with a Strangler. If you can’t set up a Strangler in front of your existing monolith, you aren’t ready to start breaking it apart.

That doesn’t mean you’re stuck with the current functionality!

If you have the time and resources to extract the code into a new service, you have the time and resources to decouple the code inside of the monolith.  When the time comes to decompose into services, you’ll be ready.

Conclusion

The Chestburster gives the illusion of quick progress, but it quickly stalls as the team runs into problems they can’t control.  Overcoming the technical hurdles doesn’t guarantee that clients will ever update their integration.

Success in legacy system replacement comes from integrating first and moving functionality second.  With the Chestburster you move functionality first and probably never burst through.

Tenancy Model Roundup

Over the past few months I have been ruminating on SaaS Tenancy Models and how they drive architectural decisions.  I hope you’ve enjoyed the series as I’ve scratched my itch.

Here is a roundup of the 7 articles, in case you missed any of the parts or need a handy index to what I’m sure is the most in-depth discussion of SaaS Tenancy Models ever written.

Part 1 – An introduction to SaaS Tenancy Models

Part 2 – An addendum to the introduction

Part 3 – How growth and scale drive tenancy model changes

Part 4 – Regaining Effective Single Tenancy through Cell Isolation

Part 5 – Why your job service should be Multi-Tenant even if your model is Single Tenant

Part 6 – Whose data is it anyway, why you need to separate your SaaS’s data from your clients

Part 7 – 3 Signs your resource allocation model is working against you

3 Signs Your Resource Allocation Model Is Working Against You

After 6 posts on SaaS Tenancy Models, I want to bring it back to some concrete examples.  When your SaaS has a Single Tenant model, clients expect to allocate all the resources they need, whenever they want.  When every client is entitled to the entire resource pool, no client gets a great customer experience.

Here are 3 signs your Resource Allocation Model is working against you:

  1. Large clients cause small clients’ work to stall
  2. You have to rebalance the mix of clients in a cell for stability
  3. Run your job at night for best performance

Large clients cause small clients’ work to stall

This is a classic “noisy neighbor” problem.  Each client tries to claim all the shared resources needed to do their work.  This isn’t much of a problem when none of the clients need a significant percentage of the pool.  When a large client comes along, it drains the pool, and leaves your small clients flopping like fish out of water.

You have to rebalance the mix of clients in a cell for stability

When having multiple large clients in a cell affects stability, the short term solution is to migrate some clients to another cell.  Large clients can impact performance, but they should not be able to impact stability.  Moving clients around buys you time, but it also forces you to focus on smaller, less profitable clients.

Run your job at night for best performance

This is advice that often pops up on SaaS message boards.  Don’t try to run your job during the day; schedule it to run in the evening so it is ready for the morning.  When clients start posting workarounds to your problems, it’s a clear sign of frustration.  Your clients are noticing that performance varies by the time of day.  They are building mental models of your platform and deciding you have load and scale issues.  By being helpful to each other, your clients are advertising your problems.

Conclusion

These 3 issues have the same root cause: your SaaS’s operational data is mixed in with client data.  If you have any of these three problems, the time has come to separate your data from the clients’.

Fixing these problems won’t be easy or quick!  

The good news is that you can separate the data and change your resource allocation model in an iterative fashion.  Start by pushing your job service across the tenancy line.

Get value and regain control one incremental step at a time, and never do a rewrite!

Pushing the Jobs Service across the Tenancy Line

A Jobs Service is a very common service for SaaS companies.  It provides a way to run work on a schedule, on demand, and independent of human activity.  Often, everything that isn’t done through the website is done by a Job Service.

I have never worked at a SaaS without some version of a Job Service, usually homegrown and built off a database instead of a queue.  They usually have descriptive and funny names – Task Processor, Crons, Crontabulous, Maestro, Batch Processor and of course Polite Batch Jobs.

Starting early in the SaaS’s life, they also evolve and grow with the SaaS, creating problems as they migrate from Single Tenant to a logically shared environment.

Single Tenant Job Service

In a single tenant model, provisioning a Job Service with a pool of workers is fairly straightforward.  Jobs are generated and put onto a queue (and not a database!).

The Job Service takes jobs off of the queue and fans them out to the worker pool.  This is simple and works well because the Queue handles the complexities of tracking and retrying jobs.
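
As a minimal sketch of that dispatch loop (the in-process queue and thread pool below are stand-ins for the real queue service and worker fleet, and the names are mine, not from the original diagram):

```python
import queue
from concurrent.futures import ThreadPoolExecutor

job_queue = queue.Queue()                     # stand-in for the real queue (SQS, RabbitMQ, ...)
workers = ThreadPoolExecutor(max_workers=8)   # the worker pool

def handle_job(job):
    # Do the actual work; in the real setup the queue tracks completion and retries.
    print(f"processing {job}")

def job_service():
    # Single tenant dispatch loop: take a job off the queue, fan it out to a worker.
    while True:
        job = job_queue.get()                 # blocks until a job is available
        workers.submit(handle_job, job)
```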

Logically Separate Job Service

After an initial infrastructure consolidation, the Job Service might look like this:

Multiple clients exist on a single database cluster, each with their own logically separate schema.  

The CRUD service has become a pool of servers that can act on behalf of any client.

There is still only 1 queue and 1 Job Service; Workers can act on behalf of any client, just like the CRUD servers.

Jobs get added haphazardly, and processed in a FIFO manner.

This model is much more resource efficient – sharing workers allows you to size the pool to keep things busy.

But this design is a disaster from a Noisy Neighbor standpoint.

Because the Queue is FIFO, the Job Service has no visibility into the client composition of the pending jobs, and a large client can easily starve a small one of resources by adding hundreds or thousands of jobs to the queue.  The large client will see progress as the jobs are processed, but nothing happens for the small client until the large job finishes.

Things get even worse if the Queue and Job Service are Global instead of Cell based.  A global queue feeding a global worker pool that works on clients spread across multiple database clusters will naturally cause database cluster hot spots.  Performance will degrade for everyone on the cluster while the workers do massive jobs for a few large clients.

You can add bandaids like limiting the number of jobs per client and moving excess work onto overflow queues.  This will help smaller clients somewhat, but natural hotspots will still occur.

Cross The Tenancy Line – Become Multi-Tenant

The Job Service needs to evolve from being Logically Separated into a Multi-Tenant service.

It needs to know how many jobs each client has pending, how long the jobs are taking, and how hot the database clusters are running so that it can operate a priority queue instead of FIFO.

The Jobs Service needs to move across the Tenancy Line.

What is the Tenancy Line?

With Logically Separate infrastructure, clients share the underlying infrastructure, but the data and services all behave as if there is only one client at a time.  As a result, each client can regulate its own behavior but has no visibility into the infrastructure as a whole.

To stop acting like a Single Tenant service, the Jobs Service needs to cross the line into Multi-Tenancy.

This change is conceptually simple, but has a lot of subtle implications.

The Service can control load across clients

In the original model, workloads are random, driven by when jobs are added to the queue.  When a hotspot emerges, there’s not much that the service can do without manual intervention.  When there’s a noisy neighbor, you can’t do much to stop them from starving smaller clients because you don’t know where those clients are in the queue.

With a Multi-Tenant job service, you can control resources across cells and the entire platform.  Small clients can be protected by moving jobs up in priority based on how many recent jobs they have completed.

Jobs will finish faster as worker loads can be managed across cells, preventing hotspots.

Overall throughput will rise, smaller client performance will improve dramatically, and large clients will see more consistent execution times.

The Job Service Becomes a Queue

The original design used a single simple queue.  Every client adds jobs directly to the queue, and the Job Service’s responsibility is to take work, pass it to a worker, and mark the job as complete.  If there’s a failure, the queue will time the job out and put the work back on the queue.

A FIFO queue prioritizes by insertion order and doesn’t have any mechanism for reordering.  The Job Service will have to build prioritization logic and find a way to integrate it with a queuing mechanism.  Do not give in to temptation and turn your database into a queue!
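
A minimal sketch of that prioritization logic, assuming a hypothetical recent_job_count(client_id) metric and ignoring the real queue integration entirely: clients with fewer recently completed jobs get pulled first, so one large client can no longer starve everyone else.

```python
import heapq
import itertools

class TenantAwarePriorityQueue:
    # Orders jobs by how much work each client has had done recently,
    # instead of by insertion order (FIFO).

    def __init__(self, recent_job_count):
        self._recent_job_count = recent_job_count   # callable: client_id -> jobs completed recently
        self._heap = []
        self._tie_breaker = itertools.count()       # preserves FIFO order within a priority level

    def add_job(self, client_id, job):
        # Fewer recently completed jobs -> lower number -> higher priority.
        priority = self._recent_job_count(client_id)
        heapq.heappush(self._heap, (priority, next(self._tie_breaker), client_id, job))

    def next_job(self):
        priority, _, client_id, job = heapq.heappop(self._heap)
        return client_id, job
```

A real implementation would also recompute priorities as jobs complete and fold in database cluster load, but the shape is the same: the service, not the queue, decides what runs next.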

Conclusion

Pushing the Jobs Service across the Tenancy Line is a major coming of age step in the evolution of a SaaS company.

It trades significant development resources and complexity for consistent execution and a solution to the Noisy Neighbor Problem.  The SaaS benefits from the synergy this creates with better resource utilization and reduced database hotspotting.

Once a SaaS has enough clients to warrant the change, making the Jobs Service Multi-Tenant is a major step forward.

Cell Based Single Tenancy

This is part 4 in a series on SaaS Tenancy Models.  Parts 1, 2, and 3.

SaaS companies are often approached by potential clients who want their instance to be completely separate from any other client.  Sometimes the request is driven by legal requirements (primarily healthcare and defense), sometimes it is a desire for enhanced security.

Often, running a Multi-Tenant service with a single client will satisfy the client’s needs.  Clients are often willing to pay for the privilege of having their account run Single Tenant, making it a potentially lucrative option for a SaaS.

What is a Cell?

A Cell is an independent instance of a SaaS’ software setup.  This is different from having software running in multiple datacenters or even multiple continents.  If the services talk to each other, they are in the same cell regardless of physical location.

Cells can differ with the number and power of servers and databases.  Cells can even have entirely different caching options depending on need.

The 3 most common Cell setups are Production, Staging (or Test), and Local.

Cell Properties

Cell architecture comes with a few distinct properties:

  • Cell structures allow SaaS to grow internationally and offer clients low latency and localized data policies (think GDPR).  Latency from the US to Europe, Asia and South America is noticeable and degrades the client experience.
  • Clients exist in 1 cell at a time.  They can migrate, but they can’t exist in multiple cells.
  • Generally speaking, Cells cannot be part of a disaster recovery plan.  Switching clients between Cells usually involves copying the database, and can’t be done if the client’s original Cell is down.

Cell Isolation as a Single Tenant Option

In part 3 I covered the difficulties in operating in a true Single Tenant model at scale.  A Cell with a single client effectively recreates the Single Tenancy experience.

Few clients want this level of isolation, but those that need it are prepared to pay for the extra infrastructure costs of an additional Cell.

Conclusion

For SaaS without global services, a Cell model enables a mix of clients on logically separated Multi-Tenant infrastructure and clients with effectively Single Tenant infrastructure.  This allows the company to pursue clients with Single Tenant needs, and the higher price point they offer.

The catch is that Single Tenant Cells can’t exist in an architecture with global services.  If there is a single service that must have access to all client data, Single Tenant Cells are out.


If you are enjoying this series, consider subscribing to my mailing list (https://shermanonsoftware.com/subscribe/) so that you don’t miss an installment!

Tenancy Models – Intro Addendum

In the first post on SaaS Tenancy Models, I introduced the two idealized models – Single and Multi-Tenant.  Many SaaS companies start off as Single Tenant by default rather than by strategy, and migrate towards increasingly multi-tenant models under the influence of 4 main factors – complexity, security, scalability, and consistent performance.

After publishing, I realized that I left out an important fifth factor, synergy.

Synergy

In the context of this series, synergy is the increased value to the client that results from mixing the client’s data with other clients’ data.  A SaaS may even become a platform if the synergies become more valuable to the clients than the original service.

Another aspect of synergy is that the clients only gain the extra value so long as they remain customers of the SaaS.  When clients churn, the SaaS usually retains the extra value, even after deleting the client’s data.  This organically strengthens client lock-in and increases the SaaS’s value over time.  The existing data set becomes ever more valuable, making it increasingly difficult for clients to leave.

Some types of businesses, like retargeting ad buyers, create a lot of value for their clients by mixing client data.  Ad buyers increase the effectiveness of their ad purchases by building larger consumer profiles.  This makes the ad purchases more effective for all clients.

On the other hand, a traditional CRM, or a no-code service like Zapier, would be very hard pressed to increase client value by mixing client data.  Having the same physical person in multiple client instances of a CRM doesn’t open a lot of avenues; what could you offer – track which clients a contact responds to?  No-code services may mix client data as part of bulk operations, but that doesn’t add value to the clients.

Sometimes there might be potential synergy, like in Healthcare and Education, but it would be unethical and illegal to mix the data.

Not All Factors Are Client Facing

Two of the factors, complexity and scalability, are generally invisible to clients.  When complexity and scalability are noticed, it is negative:

  • Why do new features take so long to develop?  
  • Why are bugs so difficult to resolve?  
  • Why does the client experience get worse as usage grows?

A SaaS never wants a client asking these questions.

Security, Consistent Performance and Synergy are discussion points with clients.

Many SaaS companies can address Security concerns and Consistent Performance through configuration and isolation.

Synergy is a highly marketable service differentiator and generally not negotiable.

Simplified Drawings

As much as possible I’m going to treat and draw things as 2-tier systems rather than N-tier.  As long as the principles are similar, I’ll default to simplified 2-tier diagrams over N-tier or microservice diagrams.

Next Time

Coming up, I’ll be breaking down single to multi-tenant transformations: why a SaaS would want the transformation, what the tradeoffs are, and what the potential pitfalls are.

Please subscribe to my mailing list to make sure you don’t miss out!

Introduction to SaaS Tenancy Models

Recently, I’ve spent a lot of time discussing the evolution of SaaS company Tenancy Models with my colleague Benjamin. These conversations have revealed that my thinking on the subject is vague and needs focus and sharpening through writing.

This is the first in a series of posts where I will dive deep on the technical aspect of tenancy models, the tradeoffs, which factors go into deciding on appropriate models, and how implementations evolve over time.

What are Tenancy Models?

There are 2 ideal models, single-tenant and multi-tenant, but most actual implementations are a hybrid mix.

In the computer realm, single-tenant systems are ones where the client is the only user of the servers, databases and other system tiers.  Software is installed on the system and it runs for one client.  Multi-tenant means that there are multiple clients on the servers and client data is mingled in the databases.

Pre-web software tended to be single-tenant because it ran on the client’s hardware.  As software migrated online and the SaaS model took off, more complicated models became possible.  Moving from Offline to Online to the Cloud was mostly an exercise in who owned the hardware, and how difficult it was to get more.

When the software ran on the client’s hardware, at the client’s site, the hardware was basically unchangeable.  As things moved online, software became much easier to update, but hardware considerations were often made years in advance.  With cloud services, more hardware is just a click away allowing continuous evolution.

Main factors driving Technical Tenancy Decisions

The main factors driving tenancy decisions are complexity, security, scalability, and consistent performance.

Complexity

Keeping client data mingled on the servers without exposing anything to the wrong client tends to make multi-tenant software more complex than single-tenant.  The extra complexity translates to longer development cycles and higher developer costs.

Most SaaS software starts off with a single-tenant design by accident.  It isn’t a case of tech debt or cutting corners; Version 1 of the software needs to support a single client.  Supporting 10 clients with 10 instances is usually easier than developing 1 instance that supports 10 clients.  Being overwhelmed by interested clients is a good problem to have!

Eventually the complexity cost of running massive numbers of single instances outweighs development savings, and the model begins evolving towards a multi-tenant model.

Security

The biggest driver of complexity is the second most pressing factor – security.  Ensuring that data doesn’t leak between clients is difficult.

A setup where every client’s rows live in shared tables, distinguished only by a client_id column, looks simple but is extremely dangerous:

Forgetting to include client_id in any SQL Where clause will result in a data leak.
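
One common mitigation (a sketch of the idea, not a prescription; the table and helper names are hypothetical) is to stop writing raw per-request queries and force every read through a helper that always applies the tenant filter:

```python
ALLOWED_TABLES = {"contacts", "campaigns"}     # whitelist so the table name can't be injected

def fetch_for_client(conn, table, client_id):
    # Every read goes through this helper, so the client_id filter can't be forgotten.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    sql = f"SELECT * FROM {table} WHERE client_id = ?"
    return conn.execute(sql, (client_id,)).fetchall()
```

ORMs attack the same problem with default scopes, and some databases offer row-level security; the point is that the tenant filter lives in one place instead of in every query.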

On the server side, it is also very easy to have a user log in, but lose track of which client an active session belongs to, and which data it can access.  This creates a whole collection of bugs around guessing and iterating contact ids.

Single-tenant systems don’t have these types of security problems.  No matter how badly a system is secured, each instance can only leak data for a single client.  Systems in industries with heavy penalties for leaking data, like Healthcare and Education, tend to be more single-tenant.  Single tenant models make audits easier and reduce overall company risk.

Scalability

Scalability concerns come in after complexity and security because they fall into the “good problems to have” category.  Scaling problems are a sign of product market fit and paying customers.  Being able to go internet scale and process 1 million events a second is nice, but it is meaningless without customers.

Single-tenant systems scale poorly.  Each client needs separate servers, databases, caches, and other resources.  There are no economies or efficiencies of scale.  The smallest, least powerful machines are generally far more powerful than any single client needs.  Worse, usage patterns mean that these resources will mostly eat money and sit idle.

Finally, all of those machines have to be maintained.  That’s not a big deal with 10 clients, or even 100.  With 100,000 clients, completely separate stacks would require teams of people to maintain.

Multi-tenant models scale much better because the clients share resources.  Cloud services make it easy to add another server to a pool, and large pools make the impact of adding clients negligible.  Adding database nodes is more difficult, but the principle holds – serving dozens to hundreds of clients on a single database allows the SaaS to minimize wasted resources and keeps teams smaller.

Consistent Performance

Consistent Performance, also known as the Noisy Neighbor Problem, comes up as a negative side effect of multi-tenant systems.

Perfectly even load distribution is impossible.  At any given moment, some clients will have greater needs than others.  Whichever servers and databases those clients are on will run hotter than others.  Other clients will experience worse performance than normal because there are fewer resources available on the server.

Bursty and compute-intensive SaaS will feel these problems more than SaaS with a regular cadence.  For example, a URL shortening service will have a long tail of links that rarely, if ever, get hit.  Some links will suddenly go viral and suck up massive amounts of resources.  On the other extreme, a company that does End Of Day processing for retail stores knows when the data processing starts, and the amount of sales in any one store is limited by the number of registers.

Single tenant systems don’t have these problems because there are no neighbors sucking up resources.  But, due to their higher operating costs, they also don’t have as much spare capacity available to handle bursts.

Consistent performance is rarely a driver in initial single vs multi-tenant design because the problems appear as a side effect of scale.  By the time the issue comes up, the design has been in production for years.  Instead, consistent performance becomes a major factor as designs evolve.  

Initial forays into multi-tenant design are especially vulnerable to these problems.  Multi-tenant worker pools fed from single-tenant client repositories are ripe for bursty and long running process problems.

Fully multi-tenant systems, with large resource pools, have more resilience.  Additionally, processing layers have access to all of the data needed to orchestrate and balance between clients.

Conclusion

In this post I covered the two tenancy models, touched on why most SaaS companies start off with single-tenant models, and the major factors impacting and influencing tenancy design.

Single tenant systems tend to be simpler to develop and more secure, but are more expensive to run on a per client basis and don’t scale well.  Multi tenant systems are harder to develop and secure, but have economic and performance advantages as they scale.  As a result, SaaS companies usually start with single tenant designs and iterate towards multi-tenancy.  

Next up, I will cover the gray dividing line between single and multi-tenant data within a SaaS: the Tenancy Line.

Links vs Tags, a Rabbit Hole

This tweet from Tyler Tringas sent me down a rabbit hole.

When designing a system, what are the differences between bidirectional links and tags?

Tags build value Asynchronously

The most obvious difference is that tags are asynchronous and become more useful as tags are added.  Tags can return results with as few as 1 member, and grow over time.  Links require two items, can’t be set until both items exist, and will never contain more than the two items.

Links are static, while Tags are living documents.  Links will be the same when you come back to them, tags will be different over time.

Tags are Clusters, Links are Paths

Tags can quickly provide a cluster of related items, but don’t offer guidance on what to read next.  Links highlight related, highly branching items.  Readers can swim in a pool of tags, or follow a path of links.

Tags are great if you want more on the topic, links help you find the next topic.

Links are Curated

To create a link, you have to know that the other item exists.  Your ability to create links will always be constrained by your knowledge of previously published items, or your time and desire to search out related content.  Tags are a shot in the dark.  They work regardless of your knowledge of the past.  As a result, links are a more curated source.

Tags are Context

Tag names are context.  If you add a “business” tag to a bunch of articles, someone else will know why you grouped the articles together.  Links do not retain any context; later users (including yourself in 6 months) will need to examine both items to know why you linked them.

Bidirectional Links are more DB Intensive

Assuming your database is set up something like this:

Bidirectional links require 2 rows for each entry.

Tags require 1 row per entry plus 1 row for the tag.

Big O says that 2N and N + 1 are both O(n).  Anyone who has worked on an overwhelmed database will tell you that 2 insertions can be way more than twice as expensive as 1.
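
The schema diagram from the original post isn’t reproduced here, but as a rough sketch under an assumed two-table layout, the write-cost difference looks like this: a bidirectional link costs two inserts, while tagging an item costs one (plus a one-time row for the tag itself).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE links (from_item INTEGER, to_item INTEGER);
    CREATE TABLE item_tags (tag_id INTEGER, item_id INTEGER);
""")

def add_link(a, b):
    # Bidirectional link: 2 rows per entry, one for each direction.
    conn.execute("INSERT INTO links VALUES (?, ?)", (a, b))
    conn.execute("INSERT INTO links VALUES (?, ?)", (b, a))

def add_tag(tag_id, item_id):
    # Tag: 1 row per entry (the "+ 1" is the tag's own row, created once elsewhere).
    conn.execute("INSERT INTO item_tags VALUES (?, ?)", (tag_id, item_id))
```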

Conclusion

As a practical matter, tags are more friendly to casual content creation and casual users.

Links are better when created by subject matter experts and consumed by people trying to learn an entire topic.

Amazon’s Elastic Load Balancer is a Strangler

The Strangler is an extremely effective technique for phasing out legacy systems over time.  Instead of spending months getting a new system up to parity with the current system so that clients can use it, you place The Strangler between the original system and the clients.  

The Strangler passes any request that the new system can’t handle on to the legacy system.  Over time, the new system handles more and more, the legacy system does less and less. Most importantly, your clients won’t have to do any migration work and won’t notice a thing as the legacy system fades away.

A common objection to setting up a Strangler is that it is Yet Another Thing that your overloaded team needs to build.  Write a request proxy on top of rewriting the original requests!  Who has time?

Except, AWS customers already have a fully featured Strangler on their shelf.  The Elastic Load Balancer (ELB) is a tool that takes incoming requests and forwards them on to another server.

The only requirement is that your clients access your application through a DNS name.

With an afternoon’s worth of effort you can set up a Strangler for your legacy application.

You no longer need to get the new system up to feature parity for clients to start using it!  Instead, new features get routed to the new server, while old ones stay with the legacy system.  When you do have time or a business reason to replace an existing feature, the release is nothing more than a config change.
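
As a hedged sketch of what that config change can look like (this assumes the Application Load Balancer flavor of ELB, which supports path-based listener rules; the ARNs and the /reports/* path are placeholders): add a rule that forwards the new feature’s paths to the new system’s target group, and everything else keeps falling through to the default rule pointing at the legacy system.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Forward /reports/* to the new service's target group; every other path
# still falls through to the listener's default action (the legacy system).
elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:region:account:listener/app/my-app/abc/def",  # placeholder
    Priority=10,
    Conditions=[{"Field": "path-pattern", "Values": ["/reports/*"]}],
    Actions=[{"Type": "forward",
              "TargetGroupArn": "arn:aws:elasticloadbalancing:region:account:targetgroup/new-service/123"}],  # placeholder
)
```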

Getting a new system up to parity with the legacy system is a long process with little business value.  The Strangler lets your new system leverage the legacy system, and you don’t even have to let your clients know about the migration.  The Strangler is your Best Alternative to a Total Rewrite!

Dead Database Pixel Tracking

Pixel Tracking is a common Marketing SaaS activity used to track page loads.  Today I am going to try and tie several earlier posts together and show how to evolve a frustrating Pixel Tracking architecture into one that can survive database outages.

In the starting design, Pixel Tracking events are synchronously written to the database during the page load.  A job processor uses the database as a queue to find updates, and farms out processing tasks.

Designed to Punish Users

This design is governed by database performance.  As the load ramps up, users are going to notice lagging page loads.  Worse, each event recorded will have to be processed, tripling the database load.
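
To make the problem concrete, here is a sketch of what the synchronous version tends to look like (the table and column names are hypothetical): the page load blocks on an insert, and the job processor polls that same table as a work queue, so every event costs a write, a read, and an update.

```python
def track_pixel_sync(conn, page_url, visitor_id):
    # The browser waits on this insert before the pixel response goes back.
    conn.execute(
        "INSERT INTO pixel_events (page, visitor, processed) VALUES (?, ?, 0)",
        (page_url, visitor_id),
    )
    conn.commit()

def job_processor_poll(conn):
    # The database doubles as a queue: poll for unprocessed rows, then mark them done.
    rows = conn.execute(
        "SELECT id, page, visitor FROM pixel_events WHERE processed = 0 LIMIT 100"
    ).fetchall()
    for event_id, page, visitor in rows:
        # ... farm out the processing task ...
        conn.execute("UPDATE pixel_events SET processed = 1 WHERE id = ?", (event_id,))
    conn.commit()
```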

Designed to Scale

You can relieve the pressure on the user by making your Pixel Tracking asynchronous.  Moving away from using your database as a queue is more complicated, but critical for scaling.  Finally, using Topics makes it easy to expand the types of processing tasks your platform supports.
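
A minimal sketch of the asynchronous version, assuming an SNS topic already exists (the topic ARN and event fields are placeholders): the tracking endpoint publishes the event to a topic and returns immediately, and the processing tasks consume it through their own SQS subscriptions at their own pace.

```python
import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pixel-events"  # placeholder topic

def track_pixel(page_url, visitor_id):
    # Publish the page load event; no database is touched in the request path.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"page": page_url, "visitor": visitor_id}),
    )
    # Return the 1x1 pixel to the browser immediately; downstream processors
    # (analytics, retargeting, etc.) each subscribe an SQS queue to the topic.
```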

Users are now completely insulated from scale and processing issues.

Dead Database Design

There is no database in the final design because it is no longer relevant to the users’ interactions with your services.  The performance is the same whether your database is at 0% or 100% load.  

The performance is the same if your database falls over and you have to switch to a hot standby or even restore from a backup.

With a bit of effort your SaaS could have a database fall over on the run-up to Black Friday and recover without data loss or clients noticing. If you are using SNS/SQS on AWS, the default queue limits are over 100,000 events! It may take a while to chew through the queues, but the data won’t disappear.

When your Pixel Tracking is causing your users headaches, going asynchronous is your Best Alternative to a Total Rewrite.