I am super excited to announce that my latest article has been published on leaddev.com!
It’s a great discussion of the tension between what you can do, and what is effective.
Please check out The dangers of pulling rank as a Staff+ engineer!
Imagine getting this whopper in an interview or a take home test:
The United States has been conducting a census once a decade for over 200 years.
Imagine you can iterate the data at a family level, with the family data being whatever format/object is easiest for you.
Find the family with the longest fibonacci sequence of children.
The most fundamental issue is that it’s not clear what the answer looks like. In fact, the 4 of us had 3 different interpretations of what the answer would look like.
Is the question looking for children’s ages going forward?
That would be an age sequence of 0, 1, 1, 2, 3, 5, etc
Or a newborn, a pair of 1-year-old twins, a 2-year-old, a 3-year-old, a 5-year-old, etc.
Or is it looking for children born in the sequence? (This is the inverse of the first answer)
A 6-year-old, 5-year-old twins, a 3-year-old, and a newborn
Or is it asking about the age gap between children?
In that case you’d be hunting for twins (gap of 0), a gap of 1 year, a second gap of 1 year, a gap of 2 years, etc.
There are so many ways to be the family fibonacci.
These are fairly straightforward computer problems with meaningless mathematics sprinkled on top, asked by people who won’t know the implications of any of the 3 answers.
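To be fair to the question, once you commit to one interpretation the code itself is trivial. Here is a minimal sketch assuming the age-gap reading, with each family represented as a list of children’s ages; the function names are mine, not part of the question.

```python
from typing import Iterable, List, Optional

def fib():
    """Yield the Fibonacci sequence 0, 1, 1, 2, 3, 5, ..."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def fib_run_length(ages: List[int]) -> int:
    """How many age gaps (children sorted youngest to oldest) match the
    start of the Fibonacci sequence."""
    ages = sorted(ages)
    gaps = [older - younger for younger, older in zip(ages, ages[1:])]
    run = 0
    for gap, expected in zip(gaps, fib()):
        if gap != expected:
            break
        run += 1
    return run

def family_fibonacci(families: Iterable[List[int]]) -> Optional[List[int]]:
    """Return the family whose age gaps follow Fibonacci the longest."""
    return max(families, key=fib_run_length, default=None)
```

Pick a different interpretation and fib_run_length changes completely, which is exactly the problem.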
If you are presented with this question in an interview, the correct answer is to thank the interviewer for their time, wish them the best of luck in their search, and end the interview.
In my last article, You Won’t Pay Me to Rescue Your Legacy System, I talked about my original attempt at specializing, and why it didn’t work. I bumbled along until I lucked into a client that helped me understand when Legacy System Rescue becomes an Expensive Problem.
Rather than Legacy System Rescue, I was hired to do “keep the lights on” work. The company had a 3 developer team working on a next generation system, all I had to do was to keep things running as smoothly as possible until they delivered.
The legacy system was buckling under the weight of its current customers. Potential customers were waiting in line to give them money and had to be turned away. Active customers were churning because the platform couldn’t keep up.
That’s when I realized – Legacy System Rescue may grudgingly get a single developer, but Scaling gets three developers working on the solution and one more to keep the lights on. Scaling is an Expensive Problem because failing to scale churns existing customers and turns away new ones.
Over 10 months I iteratively rescued the legacy system by fixing bugs and removing choke points. After more than 50 developer-months of investment, the next generation system was completely scrapped.
The Lesson – Companies won’t pay to rescue a legacy system, but they’ll gladly pay 4x to scale up and meet demand.
When I first started consulting, I tried to specialize in Legacy System Rescue. I quickly learned that this is terrible positioning because Legacy System Rescue isn’t an Expensive Problem. Jonathan Stark defines an Expensive Problem as a problem that someone would like to spend a lot of money on to solve right now.
Legacy System Rescue is certainly a Big Problem. Everyone agrees that a buggy system that makes development slow and painful is bad. Errors in production are bad. Spending time and resources to mitigate production outages is bad. But there is no immediacy. There’s no reason to spend a lot of money right now instead of waiting until the next feature ships, or the next quarter. Letting things go just a little bit longer is usually why the system needs a rescue.
Hiring someone like me to come in, analyze the codebase, and find a way to untangle the mess is a lot of work. Fixing bugs and making it easy to add new features is a low leverage situation. It takes a lot of time from highly skilled developers. Highly skilled developers in low leverage situations make Legacy System Rescue an Expensive Solution. It will probably pay off for the company, but no one department is going to get enough value from fixing the legacy system to cover the costs. The ROI gets worse when you factor in the resentment of the developers. Bringing in an outsider to judge their work and dictate fundamental changes doesn’t fill people with joy.
Combine the two and you have a Tragedy of the Commons – a Big Problem that requires an Expensive Solution. What you don’t have is a business case to spend a lot of money, right now, to fix things.
You won’t pay me to rescue your legacy system because paying a lot, right now, for the solution is worth less to you than living with the problem.
Over the past few months I have been ruminating on SaaS Tenancy Models and how they drive architectural decisions. I hope you’ve enjoyed the series as I’ve scratched my itch.
Here is a roundup of the 7 articles, in case you missed any of the parts or need a handy index to what I’m sure is the most in-depth discussion of SaaS Tenancy Models ever written.
Part 1 – An introduction to SaaS Tenancy Models
Part 2 – An addendum to the introduction
Part 3 – How growth and scale drive tenancy model changes
Part 4 – Regaining Effective Single Tenancy through Cell Isolation
Part 5 – Why your job service should be Multi-Tenant even if your model is Single Tenant
Part 6 – Whose data is it anyway, why you need to separate your SaaS’s data from your clients
Part 7 – 3 Signs your resource allocation model is working against you
3 Signs Your Resource Allocation Model Is Working Against You
After 6 posts on SaaS Tenancy Models, I want to bring it back to some concrete examples. When your SaaS has a Single Tenant model, clients expect to allocate all the resources they need, whenever they want. When every client is entitled to the entire resource pool, no client gets a great customer experience.
Here are 3 signs your Resource Allocation Model is working against you:
The first sign is the classic “noisy neighbor” problem. Each client tries to claim all the shared resources needed to do their work. This isn’t much of a problem when none of the clients needs a significant percentage of the pool. When a large client comes along, it drains the pool and leaves your small clients flopping like fish out of water.
The second sign is that having multiple large clients in a cell affects stability, and the short term solution is to migrate some of them to another cell. Large clients can impact performance, but they should not be able to impact stability. Moving clients around buys you time, but it also forces you to focus on smaller, less profitable clients.
The third sign is advice that often pops up on SaaS message boards: don’t try to run your job during the day; schedule it to run in the evening so it is ready for the morning. When clients start posting workarounds to your problems, it’s a clear sign of frustration. Your clients are noticing that performance varies by the time of day. They are building mental models of your platform and concluding that you have load and scale issues. By being helpful to each other, your clients are advertising your problems.
Conclusion
These 3 issues have the same root cause: your SaaS’s operational data is mixed in with client data. If you have any of these three problems, the time has come to separate your data from the clients’.
Fixing these problems won’t be easy or quick!
The good news is that you can separate the data and change your resource allocation model in an iterative fashion. Start by pushing your job service across the tenancy line.
Get value and regain control one incremental step at a time, and never do a rewrite!
The tenancy line is a useful construct for separating SaaS data from client data. When you have few clients, separating the data may not be worth the effort of having multiple data stores. As your system grows, ensuring that client data is separated from the SaaS data becomes as critical as ensuring the clients’ data remains separate from each other.
Company data is everything operational. Was an action successful? Does it need to be retried? How many jobs are running right now? This data is extremely important to make sure that client jobs are run efficiently, but it’s not relevant to clients. Clients care about how long a job takes to complete, not about your concurrency, load shaping, or retry rate.
While the data is nearly meaningless to your clients, it is useful to you. It becomes more useful in aggregate. It has synergy. A random failure for one client becomes a pattern when you can see across all clients. When operational data is stored in logically separated per-client databases, you quickly lose the ability to query it in aggregate. This is when it becomes important to separate operational data from client data.
Pull the operational data from a single client into a multi-tenant repository for the SaaS, and suddenly you can see what’s happening system wide. Instead of only seeing what’s happening to a single client, you see the system.
Once you can see the system, you can shape it. See this article for a discussion on how.
Even if that visibility isn’t reason enough, extracting operational data is usually its own reward.
Operational data is usually high velocity – tracking a job’s progress involves updating the status with every state change. If your operational store is the same as the client store, tracking progress conflicts with the actual work.
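As a rough illustration of the split, here is a sketch in which high-velocity status changes go to a SaaS-owned operational store while only the finished result lands in the client’s schema. The store interfaces and table names are hypothetical, not from the article.

```python
from datetime import datetime, timezone

class JobTracker:
    """Routes high-velocity job status writes to the SaaS-owned operational
    store so they never compete with work happening in the client's store."""

    def __init__(self, operational_store, client_store):
        self.ops = operational_store    # multi-tenant, owned by the SaaS
        self.client = client_store      # a single client's schema

    def record_status(self, job_id: str, tenant_id: str, status: str) -> None:
        # Every state change (queued, running, retrying, failed, done) lands
        # on the SaaS side of the tenancy line.
        self.ops.insert("job_status", {
            "job_id": job_id,
            "tenant_id": tenant_id,
            "status": status,
            "at": datetime.now(timezone.utc),
        })

    def record_result(self, job_id: str, result: dict) -> None:
        # Only the finished output is written into the client's own store.
        self.client.insert("job_results", {"job_id": job_id, **result})
```

With the status table living on the SaaS side of the tenancy line, the cross-client pattern spotting described above becomes a single-store query.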
This post has been a long time coming – I wrote the idea down in my first list of potential posts, and I wrote a draft way back in 2019!
It is also the first time I can say that the content has been approved by my employer, since they published it on their website.
It’s a great read, and I hope you enjoy it:
Dropdooms! How Binding an Unbounded Reference Table Can Kill Your UI’s Performance
A Jobs Service is a very common service for SaaS companies. It provides a way to run work on a schedule, on demand, and independent of human activity. Often, everything that isn’t done through the website is done by a Job Service.
I have never worked at a SaaS without some version of a Job Service, usually homegrown and built off a database instead of a queue. They usually have descriptive and funny names – Task Processor, Crons, Crontabulous, Maestro, Batch Processor and of course Polite Batch Jobs.
Starting early in the SaaS’s life, they also evolve and grow with the SaaS, creating problems as they migrate from Single Tenant to a logically shared environment.
In a single tenant model, provisioning a Job Service with a pool of workers is fairly straightforward. Jobs are generated and put onto a queue (and not a database!)
The Job Service takes jobs off of the queue and fans them out to the worker pool. This is simple and works well because the Queue handles the complexities of tracking and retrying jobs.
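To show how little machinery this stage needs, here is a sketch of the single tenant shape using Python’s standard library Queue as a stand-in for a real message broker; the worker count and job payloads are placeholders.

```python
import queue
import threading

job_queue = queue.Queue()              # stand-in for a real message broker

def worker() -> None:
    while True:
        job = job_queue.get()          # block until a job is available
        try:
            job()                      # each job is just a callable here
        finally:
            job_queue.task_done()      # a real queue would ack/retry instead

# Fan the work out to a small fixed pool for the single tenant.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

# Producers enqueue jobs; the pool drains them in FIFO order.
job_queue.put(lambda: print("nightly report"))
job_queue.put(lambda: print("invoice run"))
job_queue.join()                       # wait for everything queued so far
```

A real broker would handle the tracking and retries that task_done only gestures at; the shape of the code stays this small.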
After an initial infrastructure consolidation the Job Service might look like this:
Multiple clients exist on a single database cluster, each with their own logically separate schema.
The CRUD service has become a pool of servers that can act on behalf of any client.
There is still only 1 queue and 1 Job Service; Workers can act on behalf of any client, just like the CRUD servers.
Jobs get added haphazardly, and processed in a FIFO manner.
This model is much more resource efficient – sharing workers allows you to size the pool to keep things busy.
But this design is a disaster from a Noisy Neighbor standpoint.
Because the Queue is FIFO, the Job Service has no visibility into the client composition of the pending jobs, and a large client can easily starve a small one of resources by adding hundreds or thousands of jobs to the queue. The large client will see progress as the jobs are processed, but nothing happens for the small client until the large job finishes.
Things get even worse if the Queue and Job Service are Global instead of Cell based. A global queue feeding a global worker pool that works on clients spread across multiple database clusters will naturally cause database cluster hot spots. Performance will degrade for everyone on the cluster while the workers do massive jobs for a few large clients.
You can add bandaids like limiting the number of jobs per client and moving excess work onto overflow queues. This will help smaller clients somewhat, but natural hotspots will still occur.
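Here is a sketch of what such a bandaid might look like, assuming a fixed per-client cap and a single overflow queue; the cap value and class names are illustrative, not from the article.

```python
from collections import defaultdict

MAX_PENDING_JOBS_PER_CLIENT = 25       # assumed cap; tune per platform

class CappedEnqueuer:
    """Divert a client's excess jobs to an overflow queue so a single large
    client cannot fill the main FIFO queue by itself."""

    def __init__(self, main_queue, overflow_queue):
        self.main = main_queue
        self.overflow = overflow_queue
        self.pending = defaultdict(int)          # pending jobs per client

    def enqueue(self, client_id: str, job) -> None:
        if self.pending[client_id] < MAX_PENDING_JOBS_PER_CLIENT:
            self.pending[client_id] += 1
            self.main.put((client_id, job))
        else:
            self.overflow.put((client_id, job))

    def job_finished(self, client_id: str) -> None:
        self.pending[client_id] -= 1
        # Promoting overflow work back onto the main queue as slots free up
        # is left out of this sketch.
```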
The Job Service needs to evolve from being Logically Separated into a Multi-Tenant service.
It needs to know how many jobs each client has pending, how long the jobs are taking, and how hot the database clusters are running so that it can operate a priority queue instead of FIFO.
The Jobs Service needs to move across the Tenancy Line
With Logically Separated infrastructure, clients share the underlying hardware, but the data and services all behave as if there is only one client at a time. As a result, each client can regulate its own behavior but has no visibility into the infrastructure as a whole.
To stop acting like a Single Tenant service, the Jobs Service needs to cross the line into Multi-Tenancy.
This change is conceptually simple, but has a lot of subtle implications.
In the original model, workloads are random, driven by when jobs are added to the queue. When a hotspot emerges, there’s not much the service can do without manual intervention. When there’s a noisy neighbor, you can’t do much to stop them from starving smaller clients because you don’t know where those clients are in the queue.
With a Multi-Tenant job service, you can control resources across cells and the entire platform. Small clients can be protected by moving jobs up in priority based on how many recent jobs they have completed.
Jobs will finish faster as worker loads can be managed across cells, preventing hotspots.
Overall throughput will rise, smaller client performance will improve dramatically, and large clients will see more consistent execution times.
The original design used a single simple queue. Every client adds jobs directly to the queue, and the Job Service’s responsibility is to take work, pass it to a worker, and mark the job as complete. If there’s a failure, the queue will time the job out and put the work back on the queue.
A FIFO queue prioritizes by insertion order and doesn’t have any mechanism for reordering. The Job Service will have to build prioritization logic and find a way to integrate it with a queuing mechanism. Do not give in to temptation and turn your database into a queue!
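One plausible shape for that prioritization logic, sketched here with Python’s heapq standing in for the real queuing mechanism; the scoring weights and inputs (recent completions per client, cluster load) are my assumptions, not a prescription.

```python
import heapq
import itertools

class TenantAwareQueue:
    """A priority queue that favors clients with few recent completions and
    avoids hot database clusters, instead of strict FIFO order."""

    def __init__(self, recent_completions, cluster_load):
        self._heap = []
        self._order = itertools.count()   # tie-breaker keeps FIFO among equals
        self.recent = recent_completions  # client_id -> jobs finished recently
        self.load = cluster_load          # cluster_id -> utilization, 0.0 to 1.0

    def push(self, client_id: str, cluster_id: str, job) -> None:
        # Lower score wins: starved clients and cool clusters get served first.
        score = self.recent.get(client_id, 0) + 10 * self.load.get(cluster_id, 0.0)
        heapq.heappush(self._heap, (score, next(self._order), client_id, job))

    def pop(self):
        _, _, client_id, job = heapq.heappop(self._heap)
        return client_id, job
```

The hard part is keeping those inputs fresh as jobs complete and clusters heat up; this only shows the shape of the priority key.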
Pushing the Jobs Service across the Tenancy Line is a major coming of age step in the evolution of a SaaS company.
It trades significant development resources and complexity for consistent execution and a solution to the Noisy Neighbor Problem. The SaaS benefits from the synergy this creates: better resource utilization and reduced database hotspotting.
Once a SaaS has enough clients to warrant the change, making the Jobs Processor Multi-Tenant is a major step forward.
This is part 4 in a series on SaaS Tenancy Models. Parts 1, 2, and 3.
SaaS companies are often approached by potential clients who want their instance to be completely separate from any other client. Sometimes the request is driven by legal requirements (primarily healthcare and defense), sometimes it is a desire for enhanced security.
Often, running a Multi-Tenant service with a single client will satisfy the client’s needs. Clients are usually willing to pay for the privilege of having their account run Single Tenant, making it a potentially lucrative option for a SaaS.
A Cell is an independent instance of a SaaS’s software setup. This is different from having software running in multiple datacenters or even multiple continents. If the services talk to each other, they are in the same cell regardless of physical location.
Cells can differ with the number and power of servers and databases. Cells can even have entirely different caching options depending on need.
The 3 most common Cell setups are Production, Staging (or Test), and Local.
Cell architecture comes with a few distinct properties:
In part 3 I covered the difficulties in operating in a true Single Tenant model at scale. A Cell with a single client effectively recreates the Single Tenancy experience.
Few clients want this level of isolation, but those that need it are prepared to pay for the extra infrastructure costs of an additional Cell.
For a SaaS without global services, a Cell model enables a mix of clients on logically separated Multi-Tenant infrastructure and clients with effectively Single Tenant infrastructure. This allows the company to pursue clients with Single Tenant needs, and the higher price point they offer.
The catch is that Single Tenant Cells can’t exist in an architecture with global services. If there is a single service that must have access to all client data, Single Tenant Cells are out.
If you are enjoying this series, consider subscribing to my mailing list (https://shermanonsoftware.com/subscribe/) so that you don’t miss an installment!