Introduction to SaaS Tenancy Models

Recently, I’ve spent a lot of time discussing the evolution of SaaS tenancy models with my colleague Benjamin. These conversations have revealed that my thinking on the subject is vague and needs focus and sharpening through writing.

This is the first in a series of posts where I will dive deep into the technical aspects of tenancy models: the tradeoffs, the factors that go into choosing an appropriate model, and how implementations evolve over time.

What are Tenancy Models?

There are two idealized models, single-tenant and multi-tenant, but most real implementations are a hybrid of the two.

In computing, a single-tenant system is one where the client is the only user of the servers, databases, and other system tiers.  The software is installed on the system and runs for exactly one client.  Multi-tenant means that multiple clients share the servers, and client data is mingled in the same databases.

Pre-web software tended to be single-tenant because it ran on the client’s hardware.  As software migrated online and the SaaS model took off, more complicated models became possible.  Moving from Offline to Online to the Cloud was mostly an exercise in who owned the hardware, and how difficult it was to get more.

When the software ran on the client’s hardware, at the client’s site, the hardware was basically unchangeable.  As things moved online, software became much easier to update, but hardware decisions were often made years in advance.  With cloud services, more hardware is just a click away, allowing continuous evolution.

Main Factors Driving Technical Tenancy Decisions

The main factors driving tenancy decisions are complexity, security, scalability, and consistent performance.

Complexity

Keeping client data mingled on the servers without exposing anything to the wrong client tends to make multi-tenant software more complex than single-tenant.  The extra complexity translates to longer development cycles and higher developer costs.

Most SaaS software starts off with a single-tenant design by accident.  It isn’t a case of tech debt or cutting corners: Version 1 of the software only needs to support a single client.  Supporting 10 clients with 10 instances is usually easier than developing 1 instance that supports 10 clients.  Being overwhelmed by interested clients is a good problem to have!

Eventually the complexity cost of running massive numbers of single instances outweighs development savings, and the model begins evolving towards a multi-tenant model.

Security

The biggest driver of complexity is the second most pressing factor - security.  Ensuring that data doesn’t leak between clients is difficult.

A shared-table setup, where every client’s rows live in the same tables and are distinguished only by a client_id column, looks simple but is extremely dangerous.

Forgetting to include client_id in any SQL WHERE clause will result in a data leak.
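Here’s a minimal sketch of the failure mode, assuming a hypothetical shared contacts table and a psycopg2-style connection (table and column names are illustrative, not from any real design):

    # Safe: every query is scoped to the requesting client.
    SAFE_QUERY = """
        SELECT id, email FROM contacts
        WHERE client_id = %(client_id)s
    """

    # Dangerous: drop the predicate and every client's contacts
    # come back in the same result set.
    LEAKY_QUERY = "SELECT id, email FROM contacts"

    def fetch_contacts(conn, client_id):
        # conn is a psycopg2-style DB-API connection.
        with conn.cursor() as cur:
            cur.execute(SAFE_QUERY, {"client_id": client_id})
            return cur.fetchall()

The difference between the two queries is one forgotten WHERE clause, and nothing in the schema or the database will catch the mistake for you.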

On the server side, it is also very easy to have a user log in, but lose track of which client the active session belongs to, and which data it can access.  This creates a whole collection of bugs around guessing and iterating over contact IDs.

Single-tenant systems don’t have these types of security problems.  No matter how badly a system is secured, each instance can only leak data for a single client.  Systems in industries with heavy penalties for leaking data, like Healthcare and Education, tend to be more single-tenant.  Single-tenant models make audits easier and reduce overall company risk.

Scalability

Scalability concerns come in after complexity and security because they fall into the “good problems to have” category.  Scaling problems are a sign of product-market fit and paying customers.  Being able to operate at internet scale and process 1 million events a second is nice, but it is meaningless without customers.

Single-tenant systems scale poorly.  Each client needs separate servers, databases, caches, and other resources.  There are no economies or efficiencies of scale.  The smallest, lowest-powered machines are generally far more powerful than any single client needs.  Worse, usage patterns mean that these resources will mostly sit idle, eating money.

Finally, all of those machines have to be maintained.  That’s not a big deal with 10 clients, or even 100.  With 100,000 clients, completely separate stacks would require teams of people to maintain.

Multi-tenant models scale much better because the clients share resources.  Cloud services make it easy to add another server to a pool, and large pools make the impact of adding clients negligible.  Adding database nodes is more difficult, but the principle holds - serving dozens to hundreds of clients on a single database allows the SaaS to minimize wasted resources and keeps teams smaller.

Consistent Performance

Inconsistent performance, better known as the Noisy Neighbor Problem, comes up as a negative side effect of multi-tenant systems.

Perfectly even load distribution is impossible.  At any given moment, some clients will have greater needs than others.  Whichever servers and databases those clients are on will run hotter than others.  Other clients will experience worse performance than normal because there are fewer resources available on the server.

Bursty and compute-intensive SaaS will feel these problems more than SaaS with a regular cadence.  For example, a URL shortening service will have a long tail of links that rarely, if ever, get hit, while some links suddenly go viral and suck up massive amounts of resources.  At the other extreme, a company that does End of Day processing for retail stores knows when the data processing starts, and the amount of sales in any one store is limited by the number of registers.

Single-tenant systems don’t have these problems because there are no neighbors sucking up resources.  But, due to their higher operating costs, they also don’t have as many spare resources available to handle bursts.

Consistent performance is rarely a driver in initial single vs multi-tenant design because the problems appear as a side effect of scale.  By the time the issue comes up, the design has been in production for years.  Instead, consistent performance becomes a major factor as designs evolve.  

Initial forays into multi-tenant design are especially vulnerable to these problems.  Multi-tenant worker pools fed from single-tenant client repositories are ripe for bursty and long running process problems.

Fully multi-tenant systems, with large resource pools, have more resilience.  Additionally, processing layers have access to all of the data needed to orchestrate and balance between clients.

Conclusion

In this post I covered the two tenancy models, touched on why most SaaS companies start off with single-tenant models, and walked through the major factors influencing tenancy design.

Single-tenant systems tend to be simpler to develop and more secure, but are more expensive to run on a per-client basis and don’t scale well.  Multi-tenant systems are harder to develop and secure, but have economic and performance advantages as they scale.  As a result, SaaS companies usually start with single-tenant designs and iterate towards multi-tenancy.

Next up, I will cover the gray dividing line between single and multi-tenant data within a SaaS, The Tenancy Line.

Links vs Tags, a Rabbit Hole

This tweet from Tyler Tringas sent me down a rabbit hole:

https://twitter.com/tylertringas/status/1264943920380882944?s=21

When designing a system, what are the differences between bidirectional links and tags?

Tags Build Value Asynchronously

The most obvious difference is that tags are asynchronous: a tag becomes more useful as items are added to it.  Tags can return results with as few as one member, and grow over time.  Links require two items, can’t be set until both items exist, and will never contain more than those two items.

Links are static, while Tags are living documents.  Links will be the same when you come back to them, tags will be different over time.

Tags are Clusters, Links are Paths

Tags can quickly provide a cluster of related items, but don’t offer guidance over what to read next.  Links highlight related, highly branching items.  Readers can swim in a pool of tags, or follow a path of links.

Tags are great if you want more on the topic, links help you find the next topic.

Links are Curated

To create a link, you have to know that the other item exists.  Your ability to create links will always be constrained by your knowledge of previously published items, or your time and desire to search out related content.  Tags are a shot in the dark.  They work regardless of your knowledge of the past.  As a result, links are a more curated source.

Tags are Context

Tag names are context.  If you add a “business” tag to a bunch of articles, someone else will know why you grouped the articles together.  Links do not retain any context; later users (including yourself in six months) will need to examine both items to know why you linked them.

Bidirectional Links are more DB Intensive

Assume your database stores links and tags in simple join tables.
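A minimal sketch of such a schema, with hypothetical table and column names (SQLite is used purely for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- One row per direction: a bidirectional link is two rows.
        CREATE TABLE links (
            from_item INTEGER NOT NULL,
            to_item   INTEGER NOT NULL
        );

        -- One row for the tag itself, plus one row per tagged item.
        CREATE TABLE tags (
            tag_id INTEGER PRIMARY KEY,
            name   TEXT NOT NULL
        );
        CREATE TABLE item_tags (
            tag_id  INTEGER NOT NULL REFERENCES tags(tag_id),
            item_id INTEGER NOT NULL
        );
    """)

    # Linking items 1 and 2 bidirectionally costs two inserts...
    conn.execute("INSERT INTO links VALUES (1, 2)")
    conn.execute("INSERT INTO links VALUES (2, 1)")

    # ...while tagging item 1 costs one insert, plus one for the tag itself.
    conn.execute("INSERT INTO tags VALUES (1, 'business')")
    conn.execute("INSERT INTO item_tags VALUES (1, 1)")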

Bidirectional links require 2 rows for each entry.

Tags require 1 row per entry plus 1 row for the tag.

Big O says that 2N and N + 1 are both O(N).  But anyone who has worked on an overwhelmed database will tell you that two insertions can cost far more than twice as much as one.

Conclusion

As a practical matter, tags are more friendly to casual content creation and casual users.

Links are better when created by subject matter experts and consumed by people trying to learn an entire topic.

Amazon’s Elastic Load Balancer is a Strangler

The Strangler is an extremely effective technique for phasing out legacy systems over time.  Instead of spending months getting a new system up to parity with the current system so that clients can use it, you place The Strangler between the original system and the clients.  

The Strangler passes any request that the new system can’t handle on to the legacy system.  Over time, the new system handles more and more, the legacy system does less and less. Most importantly, your clients won’t have to do any migration work and won’t notice a thing as the legacy system fades away.

A common objection to setting up a Strangler is that it is Yet Another Thing that your overloaded team needs to build.  Writing a request proxy on top of rewriting the original system? Who has time?

Except, AWS customers already have a fully featured Strangler on their shelf.  The Elastic Load Balancer (ELB) is a tool that takes incoming requests and forwards them on to another server.

The only requirement is that your clients access your application through a DNS name.

With an afternoon’s worth of effort you can set up a Strangler for your legacy application.

You no longer need to get the new system up to feature parity for clients to start using it!  Instead, new features get routed to the new server, while old ones stay with the legacy system.  When you do have time or a business reason to replace an existing feature, the release is nothing more than a config change.
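As a sketch of what that config change can look like: with an Application Load Balancer (the modern member of the ELB family), a path-based listener rule forwards matching requests to the new system’s target group, and everything else falls through to the legacy system’s default action.  The ARNs and path here are placeholders for your own.

    import boto3

    elbv2 = boto3.client("elbv2")

    # Placeholder ARNs; substitute your own listener and target group.
    LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-app/abc/def"
    NEW_SYSTEM_TG = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/new-system/123"

    # Send the new feature to the new system; every unmatched request
    # still falls through to the legacy system's default rule.
    elbv2.create_rule(
        ListenerArn=LISTENER_ARN,
        Priority=10,
        Conditions=[{"Field": "path-pattern", "Values": ["/reports/*"]}],
        Actions=[{"Type": "forward", "TargetGroupArn": NEW_SYSTEM_TG}],
    )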

Getting a new system up to parity with the legacy system is a long process with little business value.  The Strangler lets your new system leverage the legacy system, and you don’t even have to let your clients know about the migration.  The Strangler is your Best Alternative to a Total Rewrite!

Dead Database Pixel Tracking

Pixel Tracking is a common Marketing SaaS activity used to track page loads.  Today I am going to tie several earlier posts together and show how to evolve a frustrating Pixel Tracking architecture into one that can survive database outages.

In the initial design, Pixel Tracking events are synchronously written to the database, and a job processor uses the database as a queue to find new events and farm out processing tasks.

Designed to Punish Users

This design is governed by database performance.  As the load ramps up, users are going to notice lagging page loads.  Worse, each recorded event also has to be found and then marked as processed, tripling the database load.
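A minimal sketch of why the load triples, assuming a hypothetical pixel_events table (SQLite is used purely for illustration): every event is written once, scanned for once, and updated once.

    import sqlite3

    conn = sqlite3.connect("tracking.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pixel_events ("
        " id INTEGER PRIMARY KEY, client_id INTEGER,"
        " page_url TEXT, processed INTEGER DEFAULT 0)"
    )

    def record_event(client_id, page_url):
        # Touch 1: the synchronous write the user's page load waits on.
        conn.execute(
            "INSERT INTO pixel_events (client_id, page_url) VALUES (?, ?)",
            (client_id, page_url),
        )
        conn.commit()

    def poll_for_work():
        # Touch 2: the job processor repeatedly scans for unprocessed rows.
        return conn.execute(
            "SELECT id FROM pixel_events WHERE processed = 0 LIMIT 100"
        ).fetchall()

    def mark_done(event_id):
        # Touch 3: every event is updated once its processing finishes.
        conn.execute(
            "UPDATE pixel_events SET processed = 1 WHERE id = ?", (event_id,)
        )
        conn.commit()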

Designed to Scale

You can relieve the pressure on the user by making your Pixel Tracking asynchronous.  Moving away from using your database as a queue is more complicated, but critical for scaling.  Finally, using Topics makes it easy to expand the types of processing tasks your platform supports.

Users are now completely insulated from scale and processing issues.
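A sketch of the asynchronous write path, using AWS SNS as the Topic (the topic ARN is a placeholder): the endpoint publishes the event and returns the pixel immediately, and SNS fans the message out to one SQS queue per processing task.

    import json
    import boto3

    sns = boto3.client("sns")

    # Placeholder ARN; subscribe one SQS queue per processing task.
    TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pixel-events"

    def handle_pixel_request(client_id, page_url):
        # Fire and forget: publish the event, then the web layer returns
        # the 1x1 pixel.  The user never waits on a database.
        sns.publish(
            TopicArn=TOPIC_ARN,
            Message=json.dumps({"client_id": client_id, "page_url": page_url}),
        )

Adding a new type of processing is just subscribing another queue to the topic; the write path never changes.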

Dead Database Design

There is no database in the final design’s user-facing path because it is no longer relevant to the users’ interactions with your services.  The performance is the same whether your database is at 0% or 100% load.

The performance is the same if your database falls over and you have to switch to a hot standby or even restore from a backup.

With a bit of effort your SaaS could have a database fall over in the run-up to Black Friday and recover without data loss or clients noticing. If you are using SNS/SQS on AWS, a standard queue will buffer an effectively unlimited backlog, and even the in-flight limit is 120,000 messages. It may take a while to chew through the queues, but the data won’t disappear.

When your Pixel Tracking is causing your users headaches, going asynchronous is your Best Alternative to a Total Rewrite.

I wouldn’t have done it this way.

I confess to having said, “I wouldn’t have done it this way.”  The phrase seems like a polite way of trashing the current system architecture while implying that you know the correct design.

You might get a snicker and feel smart and clever, but you’re creating problems for yourself and your project.

It puts the original programmers on the defensive.  

You may be politely trashing their design, but you’re still trashing their design.  Instead of listening to you describe your superior solution, the original programmers are going to be thinking of arguments defending their work.

The Business People Don’t Care

The managers and executives in the room don’t care about how the original programmers designed the system.  They don’t care about how you would have designed it, either.  They want to know what you’re going to do about their business problems.

It leaves the listener questioning your intent.  

Are you changing the design because you don’t like it, or because it needs to be done?  You never want anyone wondering if you are proposing a refactor or rewrite because of style.

Your way might not be possible

I once used PostgreSQL as a NoSQL system because all the other AWS options had row size limits that were too small.  For years afterward, new developers would tell me that I should have used a different technology.  I would explain the size constraint and, more often than not, show that AWS still had limits that would prevent us from using their suggestion.

Instead of trashing the original design, try these approaches:

Talk about the business problem with the current design.

“The current design won’t scale to the levels we need.”  Maybe it won’t scale because it was a bad design; maybe it was a massively successful MVP.  Either way, you need to replace it to go forward.

Be positive

“The current design was a great way to get started.”  If the original design was an abject failure, no one would be asking you to rebuild or expand it.  Acknowledge that, whatever its shortcomings, the original design moved the business forward.

Acknowledge your ignorance

“I don’t know what the original requirements were, but the current design isn’t a good fit for our needs.”  It’s useful to know why past choices were made so that you don’t miss requirements, and people are much more likely to tell you if you’re the first to admit you don’t know.

“I wouldn’t have done it this way” is an old developer cliché.  The developers who inherited your work are probably saying it about you right now.

Making Link Tracking Scale – Part 1: Asynchronous Processing

Link Tracking is a core activity for Marketing and CRM SaaS companies.  It is often an early system bottleneck, one that creates a lousy user experience and frustrates your clients.  In this article I’m going to show a common synchronous design, discuss why it fails to scale, and show how to overcome scaling issues by making the design asynchronous.

What Is Link Tracking?

Link Tracking allows you to track who clicks on a link.  This lets you measure the effectiveness of your marketing, learn what offers appeal to which clients, and generally track user engagement.  When you see a link starting with fb.me or lnkd.in, those are tracking links for Facebook and LinkedIn.

Instead of having a link go to its original target, the link is changed to a tracking link.  The system records three pieces of data: which client, which user, and which URL, and then redirects the user’s browser to the original link.

A Simple Synchronous Design

Here’s how the sequence plays out.

There are two trips to the database: the first to discover what the original link is, and a second to record the click.  After all of that is done, the original link is returned to the user and their browser is redirected to the actual content they are looking for.
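A minimal sketch of the synchronous handler, assuming Flask and SQLite (both purely illustrative, as are the table names):

    import sqlite3
    from flask import Flask, redirect

    app = Flask(__name__)
    db = sqlite3.connect("links.db", check_same_thread=False)

    @app.route("/t/<token>")
    def tracked_link(token):
        # Trip 1: look up the original URL for this tracking token.
        client_id, user_id, url = db.execute(
            "SELECT client_id, user_id, url FROM links WHERE token = ?",
            (token,),
        ).fetchone()

        # Trip 2: record the click while the user's browser waits.
        db.execute(
            "INSERT INTO clicks (client_id, user_id, url) VALUES (?, ?, ?)",
            (client_id, user_id, url),
        )
        db.commit()

        # Only after both round trips does the redirect go out.
        return redirect(url)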

Best case, on a cloud host like AWS, the server + database time will be about 10ms.  That time is dwarfed by the 50-100ms of general network latency getting to AWS, through the ELB, and to the server.

This design is simple, speedy, and works well enough for your early days.

Why Synchronous Processing Fails to Scale

Link Tracking events tend to be spiky - there’s an email blast, an article is published, or some tweet goes viral.  Instead of 150,000 events/day arriving at a uniform 2 events/s, your system will suddenly be hit with 100 events/s, or even 10,000/s.  Looking up the URL and recording the event will spike from 10ms to 1s or even 10s.

While your system records the event, the user waits.  And waits. Often closing the browser tab without ever seeing your content.

Upgrading the database’s hardware is an expensive way to buy time, but it’ll work for a while.  Eventually though, you’ll have to go asynchronous.

How Asynchronous Scales

With Asynchronous Processing, it becomes the responsibility of the server to remember the Link Tracking event and process it later.  Depending on your tech stack, this can be done many different ways, including threads, callbacks, external queues, and other forms of buffering the data until it can be processed.

The important part, from a scaling perspective, is that the user is redirected to the original URL as quickly as possible.

The user doesn’t care about Link Tracking, and with Asynchronous Processing you won’t make your users wait while you write to the database.
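A sketch of the same handler made asynchronous, here using an SQS queue as the buffer (the queue URL is a placeholder, and lookup_original_url is a hypothetical helper for the URL lookup, which part 2 will cache):

    import json
    import boto3
    from flask import Flask, redirect

    app = Flask(__name__)
    sqs = boto3.client("sqs")

    # Placeholder queue URL; a background worker drains it into the DB.
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/click-events"

    @app.route("/t/<token>")
    def tracked_link(token):
        url = lookup_original_url(token)  # hypothetical helper; see part 2

        # Buffer the event; a worker records it later, at whatever pace
        # the database can handle.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"token": token}),
        )

        # The user is redirected immediately, without waiting on a write.
        return redirect(url)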

Making the event processing asynchronous is an important first step towards making a scalable system. In part 2 I will discuss how caching the URLs will improve the design further.

What is a Queue Anyway?

Several readers wrote in response to my article Your Database is not a Queue to tell me that they don’t know what a queue is, or why they would use one.  As one reader put it, “assume I have 3 tools: a hammer, a screwdriver, and a database”.  Fair enough.  This week I want to talk about what a message queue is, what features it offers, and why it is a superior solution for batch processing.

To keep things as concrete as possible, I’ll use a real world example, Nightly Report Generation, and AWS technologies.

Your customers want to know how things are going.  If your SaaS integrates with a shopping cart, how many sales did you do?  If you do marketing, how many potential customers did you reach? How many leads were generated?  Emails sent? Whatever your service, you should be letting your customers know that you’re killing it for them.

Most SaaS companies have some form of RESTful API for customers, and initially you can ask clients to help themselves and generate reports on demand.  But as you grow, on-demand reporting becomes too slow.  Code that worked for a client with 200 customer shopping carts a day may be too slow at 2,000 or 20,000 carts.  These are great problems to have!

To meet your clients’ needs, you need a system to generate reports overnight.  It’s not client facing, so it doesn’t need to be RESTful, and it’s not driven by client activity, so it won’t run itself.

Enter queues!

For this article we will use AWS’s SQS, which stands for Simple Queue Service.  There are lots of other wonderful open source solutions like RabbitMQ, but you are probably already using AWS, and SQS is cheap and fully managed.

What does a queue do?

  • A queue gives you a list of messages that can only be accessed in order.
  • A queue guarantees that one, and only one, consumer can see a message at a time.
  • A queue guarantees that messages do not get lost: if a consumer takes a message from the queue, it has a certain amount of time to tell the queue that the message is done, or the queue will assume something went wrong and make the message available again.
  • A queue offers a Dead Letter Queue: if something goes wrong too many times, the message is removed and placed on the Dead Letter Queue so that the rest of the list can be processed.

As a practical matter all that boils down to:

  • You can run multiple instances of the report generator without worrying about missing a client, or sending the same report twice.  You get to skip the early concurrency and scaling problems you’d encounter if you wrote your own code, or tried to use a database.
  • When your code has a bug in an edge case, you’ll still be able to generate all of the reports that don’t hit the edge case.
  • You can alert on failures, see which reports failed, and after you fix the bug, put them back into the queue to run again with the click of a button.
  • Because SQS is managed by AWS, you get monitoring and alerts out of the box.
  • Because SQS is managed by AWS, you don’t have to worry about the queue’s scale or performance.
  • You can add API endpoints to generate “on demand” messages and put them on the queue.  This means that “standard” report messages and client-generated “on demand” report requests follow the same process, and you don’t need to build and maintain separate reporting pipelines for internal and external requests.
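In practice, the consumer side boils down to a small loop.  Here is a sketch of a report worker against SQS using boto3 (the queue URL and generate_report are placeholders for your own):

    import json
    import boto3

    sqs = boto3.client("sqs")

    # Placeholder queue URL; each message is one report request.
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/nightly-reports"

    while True:
        # Long-poll for work; messages we hold are invisible to other workers.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            generate_report(job["client_id"])  # your report code goes here

            # Deleting acknowledges success; if we crash first, the message
            # reappears after the visibility timeout and is retried.
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )

Run two of these workers, or twenty; the queue’s one-consumer-at-a-time guarantee is what makes that safe.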

Everything is tradeoffs, and queues don’t solve all problems well.  What kinds of problems are you going to have with SQS?

  • Priority/Re-sorting existing messages - You have 10,000 messages on the queue and an important client needs a report NOW!  Prioritizing messages, or changing the order of messages already on the queue, is not something SQS supports.  You can have a “fast lane” queue that gets high-priority messages, but it’s clunky.
  • Random Access - You have 10,000 messages and want to know where in the queue a specific client’s report is?  Queues let you add at the end and take from the front; that’s it.  If you need to know what’s in the middle, you’ll have to maintain that information in a separate system.
  • Random Deleting - Probably not important for reports, but for cases like bulk importing, if a client changes her mind and wants to cancel a job, you can’t reach into the middle of the queue and remove the message.
  • Process order is not guaranteed - If you have a 3 part task, and you cannot start Task 2 until Task 1 finishes, you cannot add all 3 tasks onto the queue at once.  It is highly likely that a second worker will come along and start Task 2 while Task 1 is still in progress.  Instead, have Task 1 add Task 2 onto the queue when it finishes.

None of these problems will crop up in your early iterations, and they are great problems to have!  They are signs that your SaaS is growing to meet your clients’ needs; you and your clients are thriving!

To bring it full circle, what if you already have a home grown system for running reports?  How can you get on the path to queues?  See my article about common ways into the DB-as-a-Queue mess, and some suggestions on how to get out.
