Tenancy Models – Intro Addendum

In the first post on SaaS Tenancy Models, I introduced the two idealized models – Single and Multi-Tenant.  Many SaaS companies start off as Single Tenant by default rather than by strategy, and migrate towards increasingly multi-tenant models under the influence of four main factors – complexity, security, scalability, and consistent performance.

After publishing, I realized that I left out an important fifth factor, synergy.

Synergy

In the context of this series, synergy is the increased value to a client that comes from mixing the client’s data with other clients’ data.  A SaaS may even become a platform if the synergies become more valuable to clients than the original service.

Another aspect of synergy is that clients only gain the extra value so long as they remain customers of the SaaS.  When clients churn, the SaaS usually retains the extra value, even after deleting the client’s data.  This organically strengthens client lock-in and increases the SaaS’s value over time.  The existing data set becomes ever more valuable, making it increasingly difficult for clients to leave.

Some types of businesses, like retargeting ad buyers, create a lot of value for their clients by mixing client data.  Ad buyers increase the effectiveness of their ad purchases by building larger consumer profiles, which makes the purchases more effective for all clients.

On the other hand, a traditional CRM, or a no-code service like Zapier, would be very hard pressed to increase client value by mixing client data.  Having the same physical person in multiple client instances of a CRM doesn’t open a lot of avenues; what could you offer – tracking which clients a contact responds to?  No-code services may mix client data as part of bulk operations, but that doesn’t add value for the clients.

Sometimes there might be potential synergy, like in Healthcare and Education, but it would be unethical and illegal to mix the data.

Not All Factors Are Client Facing

Two of the factors, complexity and scalability, are generally invisible to clients.  When they are noticed, it is for negative reasons:

  • Why do new features take so long to develop?  
  • Why are bugs so difficult to resolve?  
  • Why does the client experience get worse as usage grows?

A SaaS never wants a client asking these questions.

Security, Consistent Performance and Synergy are discussion points with clients.

Many SaaS companies can address Security concerns and Consistent Performance requirements through configurable isolation.

Synergy is a highly marketable service differentiator and generally not negotiable.

Simplified Drawings

As much as possible I’m going to treat and draw things as 2-tier systems rather than N-tier.  As long as the principles are similar, I’ll default to simplified 2-tier diagrams over N-tier or microservice diagrams.

Next Time

Coming up, I’ll break down single- to multi-tenant transformations: why a SaaS would want the transformation, what the tradeoffs are, and what the potential pitfalls are.

Please subscribe to my mailing list to make sure you don’t miss out!

Introduction to SaaS Tenancy Models

Recently, I’ve spent a lot of time discussing the evolution of SaaS company Tenancy Models with my colleague Benjamin. These conversations have revealed that my thinking on the subject is vague and needs focus and sharpening through writing.

This is the first in a series of posts where I will dive deep on the technical aspects of tenancy models: the tradeoffs, the factors that go into deciding on appropriate models, and how implementations evolve over time.

What are Tenancy Models?

There are two idealized models, single-tenant and multi-tenant, but most actual implementations are a hybrid of the two.

In the computer realm, single-tenant systems are ones where the client is the only user of the servers, databases, and other system tiers.  Software is installed on the system and runs for one client.  Multi-tenant means that there are multiple clients on the same servers, and client data is mingled in the databases.

Pre-web software tended to be single-tenant because it ran on the client’s hardware.  As software migrated online and the SaaS model took off, more complicated models became possible.  Moving from offline to online to the cloud was mostly an exercise in who owned the hardware, and how difficult it was to get more.

When the software ran on the client’s hardware, at the client’s site, the hardware was basically unchangeable.  As things moved online, software became much easier to update, but hardware considerations were often made years in advance.  With cloud services, more hardware is just a click away allowing continuous evolution.

Main Factors Driving Technical Tenancy Decisions

The main factors driving tenancy decisions are complexity, security, scalability, and consistent performance.

Complexity

Keeping client data mingled on the servers without exposing anything to the wrong client tends to make multi-tenant software more complex than single-tenant.  The extra complexity translates to longer development cycles and higher developer costs.

Most SaaS software starts off with a single-tenant design by accident.  It isn’t a case of tech debt or cutting corners; Version 1 of the software needs to support a single client.  Supporting 10 clients with 10 instances is usually easier than developing 1 instance that supports 10 clients.  Being overwhelmed by interested clients is a good problem to have!

Eventually, the complexity cost of running massive numbers of single-tenant instances outweighs the development savings, and the design begins evolving towards a multi-tenant model.

Security

The biggest driver of complexity is the second most pressing factor – security.  Ensuring that data doesn’t leak between clients is difficult.

A setup where every client’s data is mingled in shared tables looks simple, but is extremely dangerous.
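As a minimal sketch (the schema, names, and data here are illustrative assumptions):

```python
import sqlite3

# One shared table holds every client's contacts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, client_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO contacts (client_id, name) VALUES (?, ?)",
    [(1, "Alice"), (1, "Bob"), (2, "Carol")],
)

# Dangerous: the WHERE clause forgot client_id, so a request made on
# behalf of client 1 also returns client 2's contact.
leaked = conn.execute("SELECT name FROM contacts WHERE name LIKE 'C%'").fetchall()

# Safe: the same query, scoped to the requesting client.
scoped = conn.execute(
    "SELECT name FROM contacts WHERE client_id = ? AND name LIKE 'C%'", (1,)
).fetchall()
```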

Forgetting to include client_id in any SQL WHERE clause will result in a data leak.

On the server side, it is also very easy to have a user log in but lose track of which client the active session belongs to, and which data it can access.  This creates a whole collection of bugs around guessing and iterating contact ids.
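One hedge against that class of bugs is to make tenant scoping part of every data access path.  A hypothetical sketch, reusing the contacts table above:

```python
def fetch_contact(conn, session_client_id, contact_id):
    # Scope every lookup to the session's client; a guessed or iterated
    # contact id that belongs to another tenant simply returns nothing.
    return conn.execute(
        "SELECT id, name FROM contacts WHERE id = ? AND client_id = ?",
        (contact_id, session_client_id),
    ).fetchone()
```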

Single-tenant systems don’t have these types of security problems.  No matter how badly a system is secured, each instance can only leak data for a single client.  Systems in industries with heavy penalties for leaking data, like Healthcare and Education, tend to be more single-tenant.  Single-tenant models make audits easier and reduce overall company risk.

Scalability

Scalability concerns come in after complexity and security because they fall into the “good problems to have” category.  Scaling problems are a sign of product-market fit and paying customers.  Being able to go internet scale and process 1 million events a second is nice, but it is meaningless without customers.

Single-tenant systems scale poorly.  Each client needs separate servers, databases, caches, and other resources.  There are no economies or efficiencies of scale.  The smallest, lowest-powered machines are generally far more powerful than any single client needs.  Worse, usage patterns mean that these resources will mostly eat money and sit idle.

Finally, all of those machines have to be maintained.  That’s not a big deal with 10 clients, or even 100.  With 100,000 clients, completely separate stacks would require teams of people to maintain.

Multi-tenant models scale much better because the clients share resources.  Cloud services make it easy to add another server to a pool, and large pools make the impact of adding clients negligible.  Adding database nodes is more difficult, but the principle holds – serving dozens to hundreds of clients on a single database allows the SaaS to minimize wasted resources and keeps teams smaller.

Consistent Performance

Consistent Performance, also known as the Noisy Neighbor Problem, comes up as a negative side effect of multi-tenant systems.

Perfectly even load distribution is impossible.  At any given moment, some clients will have greater needs than others.  Whichever servers and databases those clients are on will run hotter than others.  Other clients will experience worse performance than normal because there are fewer resources available on the server.

Bursty and compute-intensive SaaS will feel these problems more than SaaS with a regular cadence.  For example, a URL-shortening service will have a long tail of links that rarely, if ever, get hit, while some links suddenly go viral and suck up massive amounts of resources.  At the other extreme, a company that does End of Day processing for retail stores knows when the data processing starts, and the amount of sales in any one store is limited by the number of registers.

Single-tenant systems don’t have these problems because there are no neighbors sucking up resources.  But, due to their higher operating costs, they also don’t have as many extra resources available to handle bursts.

Consistent performance is rarely a driver in initial single vs multi-tenant design because the problems appear as a side effect of scale.  By the time the issue comes up, the design has been in production for years.  Instead, consistent performance becomes a major factor as designs evolve.  

Initial forays into multi-tenant design are especially vulnerable to these problems.  Multi-tenant worker pools fed from single-tenant client repositories are ripe for bursty, long-running process problems.

Fully multi-tenant systems, with large resource pools, have more resilience.  Additionally, processing layers have access to all of the data needed to orchestrate and balance between clients.

Conclusion

In this post I covered the two tenancy models, touched on why most SaaS companies start off with single-tenant designs, and walked through the major factors influencing tenancy decisions.

Single-tenant systems tend to be simpler to develop and more secure, but are more expensive to run on a per-client basis and don’t scale well.  Multi-tenant systems are harder to develop and secure, but have economic and performance advantages as they scale.  As a result, SaaS companies usually start with single-tenant designs and iterate towards multi-tenancy.

Next up, I will cover the gray dividing line between single- and multi-tenant data within a SaaS: The Tenancy Line.

Build Allies By Asking “How Did You Solve It?”

Intractable technical problems are a fact of life.  Architectures make seemingly easy use cases impossible.  Critical code won’t scale because of meaningless choices made years ago.  There are endless tech problems that defy easy and obvious solutions.

Earnest, well-meaning developers who come up with solutions that won’t work are also part of life.

After you’ve been thinking about a problem for months or years, it can be tempting to tell a developer why their solution won’t work.  Lead the developer down your chain of reasoning.  Show them how much more you’ve thought about the problem, how you’ve considered their solution, and a dozen others.  Be superior and dismissive.

Prove that the intractable problem can’t be solved until you get the developer to say “Ok!  Ok!  You’re right!  It won’t work!”

Or, you could be open to being wrong and build an ally.

Agree with the basic idea of the solution.  That seemed like a good idea to you too! Then, instead of telling them why they can’t, ask the developer how they got around the intractable problem.

“How did you solve it?” is interested and hopeful.  It tells the other person “I have spent time thinking about this problem too.  Here’s where I got stuck.  I’d love to hear how you think we can solve this problem.”

There’s almost never a solution.  Most of the time the other person had no idea that the architecture won’t support it, that the message size is too big, or any of the other subtle technical reasons why the solution won’t work.

Asking about the solution, instead of preaching the problem, puts the two of you on the same side.  It makes you allies against the problem and sidesteps the teacher/student power dynamic of “That won’t work.”

Sometimes, a fresh perspective does produce a solution to the intractable problem.  Since you avoided staking a claim that there was no solution, you won’t feel the sting of being wrong.  You won’t end up defensive, and can embrace the developer’s solution.

“How did you solve it” builds allies. 
“That won’t work” harms you both.

Testing For BitRot – 2022

Welcome back to regular Wednesday posting!

Starting next week I will have new content every week, mostly on Wednesdays.

This blog covers scaling SaaS software: common gotchas, anti-patterns, and performance and scaling strategies.  I cover the technical and social aspects of these problems because it is often harder to convince people than to implement code.

Thank you for reading!

Testing for bitrot

Hello, it has been a while!

This is a short and sweet post to remind you of who I am, what to expect from this blog, and test to see if everything is still working.

I’m Jeffrey Sherman, and this blog is about software performance and scaling. Along the way I’ll also talk a lot about development team management, which is often far more important and more effective than writing code.

My current goal is to publish one article per week, mostly on Wednesday.

If that sounds good you should absolutely subscribe to get this blog as an email. For those who have already subscribed, thank you for letting me into your inbox!

Racism in, Racism out

Most developers I know assume that software can’t be racist unless they actively make it racist.  After all, on the Internet, no one knows you’re a dog.

To paraphrase Charles Babbage: Racism In, Racism Out.  When your inputs are racist, your outputs will be racist.  Even if you didn’t do anything racist.

Years ago, I worked for one of the largest residential mortgage brokers in the US and often tried to impress upon my fellow developers the need to be aware of the past so that we could try to reduce the racism flowing through our code.

The conversations usually fell flat.  They would thank me for the interesting history lesson and walk away assured that since they weren’t racist, and weren’t coding anything racist, there was no racism in the software.

Housing in the US has a long and sordid history of racism.

Here are two quick examples of how race and racism leaks into mortgage software. The first is pretty blunt, the second is subtle.

Colorblind Code Not Allowed

For residential mortgages, the US Government requires asking borrowers’ race and gender.  The government uses the data to find racism (and sexism) in lending.  This data is how we know that Black borrowers pay higher rates and get rejected for loans more often.  The data paints a depressing picture: racism is prevalent in the mortgage industry.  You have to add race to your code, and it is a good bet that some of your users will exploit that data to discriminate.

Pricing Is Based Off Racist History

As a part of the appraisal process, the appraiser will find “comparable” houses nearby to validate the price.  In areas where the value of homes has been depressed by racist history, like redlining, comparables act as a racist anchor.  Using comparables is like saying “houses in this neighborhood are worth less than other neighborhoods because 60 years ago racists decided that predominantly Black neighborhoods are worth less, and we have decided to continue the process.”

Many states ban asking about salary history because it reinforces past discrimination.  There are companies out there pushing back against the use of comps.  As a developer you won’t be able to choose the company’s risk models, but with a little work, you can code up alternative models and make better data available.

Don’t be Passive

You have an obligation to understand your inputs.  You may not be able to sanitize them, but understanding is a vital first step.  Google the history of your industry to find where racism has come from in the past, and think about how your code makes it easier or harder for history to repeat itself.

Black Lives Matter.  

Links vs Tags, a Rabbit Hole

This tweet from Tyler Tringas sent me down a rabbit hole.

When designing a system, what are the differences between bidirectional links and tags?

Tags Build Value Asynchronously

The most obvious difference is that tags are asynchronous and become more useful as more items are tagged.  Tags can return results with as few as one member and grow over time.  Links require two items, can’t be created until both items exist, and will never contain more than those two items.

Links are static, while Tags are living documents.  Links will be the same when you come back to them, tags will be different over time.

Tags are Clusters, Links are Paths

Tags can quickly provide a cluster of related items, but don’t offer guidance on what to read next.  Links highlight specific related items and branch outward.  Readers can swim in a pool of tags, or follow a path of links.

Tags are great if you want more on the topic, links help you find the next topic.

Links are Curated

To create a link, you have to know that the other item exists.  Your ability to create links will always be constrained by your knowledge of previously published items, or your time and desire to search out related content.  Tags are a shot in the dark.  They work regardless of your knowledge of the past.  As a result, links are a more curated source.

Tags are Context

Tag names are context.  If you add a “business” tag to a bunch of articles, someone else will know why you grouped the articles together.  Links do not retain any context, later users (including yourself in 6 months) will need to examine both items to know why you linked them.

Bidirectional Links are more DB Intensive

Assuming your database is set up something like this:
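As a minimal sketch (the table and column names are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items     (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE links     (item_id INTEGER, linked_item_id INTEGER);
CREATE TABLE tags      (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE item_tags (item_id INTEGER, tag_id INTEGER);
""")

# A bidirectional link between items 1 and 2 costs two rows,
# one per direction.
conn.executemany("INSERT INTO links VALUES (?, ?)", [(1, 2), (2, 1)])

# Tagging item 1 costs one row for the tag itself (paid once per tag)
# plus one row per tagged item.
conn.execute("INSERT INTO tags VALUES (1, 'business')")
conn.execute("INSERT INTO item_tags VALUES (1, 1)")
```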

Bidirectional links require 2 rows for each entry.

Tags require 1 row per entry plus 1 row for the tag.

Big O says that 2N and N + 1 are both O(n).  Anyone who has worked on an overwhelmed database will tell you that 2 insertions can be way more than twice as expensive as 1.

Conclusion

As a practical matter, tags are more friendly to casual content creation and casual users.

Links are better when created by subject matter experts and consumed by people trying to learn an entire topic.

Amazon’s Elastic Load Balancer is a Strangler

The Strangler is an extremely effective technique for phasing out legacy systems over time.  Instead of spending months getting a new system up to parity with the current system so that clients can use it, you place The Strangler between the original system and the clients.  

The Strangler passes any request that the new system can’t handle on to the legacy system.  Over time, the new system handles more and more, the legacy system does less and less. Most importantly, your clients won’t have to do any migration work and won’t notice a thing as the legacy system fades away.

A common objection to setting up a Strangler is that it is Yet Another Thing that your overloaded team needs to build.  Writing a request proxy on top of rewriting the original system!  Who has time?

Except, AWS customers already have a fully featured Strangler on their shelf.  The Elastic Load Balancer (ELB) is a tool that takes incoming requests and forwards them on to another server.

The only requirement is that your clients access your application through a DNS name.

With an afternoon’s worth of effort you can set up a Strangler for your legacy application.

You no longer need to get the new system up to feature parity for clients to start using it!  Instead, new features get routed to the new server, while old ones stay with the legacy system.  When you do have time or a business reason to replace an existing feature the release is nothing more than a config change.
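As a rough sketch of the routing, an Application Load Balancer listener rule can forward one feature’s path to the new system while the listener’s default action keeps everything else on the legacy servers.  This uses boto3, and the ARNs and the /reports/* path are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical rule: send the new reports feature to the new system's
# target group. Every unmatched request falls through to the listener's
# default action, which still points at the legacy target group.
elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:...",  # placeholder
    Priority=10,
    Conditions=[{"Field": "path-pattern", "Values": ["/reports/*"]}],
    Actions=[{
        "Type": "forward",
        "TargetGroupArn": "arn:aws:elasticloadbalancing:...",  # placeholder
    }],
)
```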

Getting a new system up to parity with the legacy system is a long process with little business value.  The Strangler lets your new system leverage the legacy system, and you don’t even have to let your clients know about the migration.  The Strangler is your Best Alternative to a Total Rewrite!

Probing Questions For Executives Approving A Rewrite

A while back, I wrote a guide for developers preparing for a system rewrite.  This time I’m talking to the other side of the table: executives being asked to approve a system rewrite.

Here are three probing questions to help you and your developers dig into the realities of a rewrite and gain confidence in the project:

  1. What are the client’s pains, and how does the rewrite solve them?
  2. How will you support current users during the rewrite?
  3. What lessons have you learned from the original system, and how will you prevent repeating the same mistakes?

What are the client’s pains, and how does the rewrite solve them?

There is no business value in a “next generation” of your software.  Have your developers put in the work to understand how the current system causes client pain?

Do they know the frustrating pain of needing features that can’t be delivered?  The maddening pain of redoing work destroyed by a bug?  The teeth-grinding pain of slow systems that steal minutes with deadlines hours away?

Which pains are they addressing, and how much of the system do they need to rewrite to give clients some relief?

Pain provides focus and a metric for success. Without a specific pain goal, the project will metastasize and grow into “replace everything”. 

How will you support current users during the rewrite?

Feature freeze is the most common answer, but it should not be acceptable.  Neither is patching only critical bugs.

One solution that has worked well for me: the Senior proposing the rewrite gets to architect and oversee the new solution, but all of the coding will be done by the juniors.  The Senior will be responsible for fixing bugs and implementing new features on the legacy system.

With that much skin in the game, and no juniors doing junior things, you’ll be amazed at how quickly the legacy system stabilizes.  I have seen this be so successful that the rewrite was canceled and the work written off as a morale-boosting learning experience.

What lessons have you learned from the original system, and how will you prevent repeating the same mistakes?

This is really a question of ownership.  Are your developers ready to acknowledge and own their mistakes?

Are their expectations for the future realistic, or do they expect the shiny new technology to prevent them from making the same mistakes?

Going from monoliths to microservices, microservices to serverless, or serverless back to a monolith won’t help if your developers are repeatedly implementing a fundamentally flawed design.

Rewrites can destroy your company

Companies have been destroyed by endlessly stalled rewrites.

Work with your developers to answer these questions.  It’s likely that you’ll find a much less risky path than the full rewrite.  If you do agree to the rewrite, asking questions will give you confidence that your developers have a realistic plan that they can deliver.

Questions To Ask Before Automating

The prospect of automating manual tasks sings a siren song to most developers.  And like a siren’s song, it often leads straight into disaster.  The best of intentions often leave companies with code that’s more expensive to maintain and less useful than the human labor it replaced.  Reports and tasks become a leaky faucet for productivity.

Here are six questions to ask yourself, or a developer, before dancing to the automation music:

How often is the task likely to change?

Weekly Business Intelligence reports change monthly, monthly ones change every quarter, and quarterly ones change every year.  They are never stable enough to be worth automating by an outside developer.  This is why BI tools that let non-technical users semi-automate reports are a five-billion-dollar industry.

On the other hand, regulatory and compliance reports are likely to be stable for years and make great targets.

If a task won’t be executed at least 10 times between changes, it probably won’t be worth automating.

How long is the task likely to continue?

Some tasks are likely to continue “forever”.  Archiving old client data, scrubbing databases, and other client onboarding/offboarding tasks fall into this category.

Some tasks are never going to come up again.

If a task won’t be executed at least 5 more times, it probably won’t be worth automating.

How much human effort does the task consume, and for whom?

You can automate turning on the office lights in the morning with a motion detector, but it won’t pay off in terms of time saved from flipping a switch.

How much of an interruption is doing the task?  Turning on the lights on your way in the door isn’t an interruption for anyone.  Phone support manually resetting a user password isn’t an interruption, but having the CFO process refunds for clients is a giant interruption.

Even if the reset and the refund are both a single button click that takes 15 seconds, pulling the CFO away is a much bigger deal.  Also, the context switch for the CFO will be measured in minutes, because she’s not processing refunds all day long.

Use a sliding scale based on time and title.  For entry-level staff, don’t automate until the task takes more than an hour per person per day.  For the C-Suite, anything over 5 minutes per day is a good target.

How much lag can automation save your clients?

Clients don’t care how long the task takes; they care about the lag between asking and receiving.  It doesn’t matter that processing a refund only takes 5 minutes if your team only processes refunds once a week.

If the client lag is more than a day, consider automating.

Is the Task a real process, or are you cleaning up the effects of a bug?

Software bugs can do all sorts of terrible things to your data and processes, but after the first couple of times, the damage becomes predictable and you’ll get better at fixing it.

Automating the fix is one way of fixing the bug.  That’s how bugs become features.

If you don’t want to make the bug an official part of your software, don’t automate the fix.

How common and expensive are mistakes?

Mistakes are inevitable when humans are manually performing routine tasks.  

Mistakes are inevitable when developers first automate a routine task.  Assume that developer mistakes will equal one instance of a manual mistake.  

For an automation to save money, you have to expect it to prevent at least 2 manual errors.

As an equation:

[Cost to Automate] + [Cost of a mistake] < [Cost of a mistake] * [Frequency of mistakes]
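As a quick worked instance of the inequality (the numbers are made up):

```python
def worth_automating(cost_to_automate, cost_of_mistake, mistakes_prevented):
    # Automation pays off only if the mistakes it prevents cover its build
    # cost plus the one mistake we assume the developers make automating it.
    return cost_to_automate + cost_of_mistake < cost_of_mistake * mistakes_prevented

# $1,000 to automate, $500 per mistake: preventing 4 mistakes clears
# the bar, preventing 2 does not.
print(worth_automating(1000, 500, 4))  # True  (1500 < 2000)
print(worth_automating(1000, 500, 2))  # False (1500 >= 1000)
```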

Because the cost of mistakes is relatively easy to quantify, tasks with expensive mistakes are usually automated early on.

Conclusion

Developers always want to automate things; sometimes it pays off, sometimes it’s a mistake.

If you ask these six questions before automating you’re much more likely to make the right choice:

  1. How often is the task likely to change?
  2. How long is the task likely to continue?
  3. How much human effort does the task consume, and for whom?
  4. How much lag can automation save your clients?
  5. Is the Task a real process, or are you cleaning up the effects of a bug?
  6. How common and expensive are mistakes?