Iterative Delivery is a uniquely powerful method for adding value to a SaaS. Other than iterative, there are no respectably named ways to deliver features, reskins, updates, and bug fixes. Big bangs, waterfalls, and quarterly releases fill your customers' hearts with dread.
Look at 5 antonyms for Iterative Delivery:
Erratic Delivery
Infrequent Delivery
Irregular Delivery
Overwhelming Delivery
Sporadic Delivery
If your customers used these terms to describe updates to your SaaS, would they be wrong?
Iterative Delivery is about delivering small pieces of value to your customers so often that they know you’re improving the Service, but so small that they barely notice the changes.
Don’t be overwhelming, erratic or infrequent - be iterative and delight your customers.
The Chestburster is an antipattern that occurs when transitioning from a monolith to services.
The team sees an opportunity to extract a small piece of functionality from the monolith into a new service, but the monolith is the only place that handles security, permissions, and composition.
Because the new service can’t face clients directly, the Chestburster hides behind the monolith, hoping to burst through at some later point.
The Chestburster begins as the inverse of the Strangler pattern: the monolith delegates to the new service instead of the new service delegating to the monolith.
Why it’s appealing
The Chestburster’s appeal is that it gets the New Service up and running quickly. This looks like progress! The legacy code is extracted, possibly rewritten, and maybe better.
Why it fails
There is no business case for building the functionality the new service needs to burst through the monolith. The functionality has already been rewritten into a new service. How do you go back now and ask for time to address security and the other missing pieces? Worse, the missing pieces are usually outside of the team's control; security is one area you want to leave to the experts.
Even if you get past all the problems on your side, you’ve created new composition complexities for the client. Now the client has to create a new connection to the Chestburster and handle routing themselves. Can you make your clients update? Should you?
Remember The Strangler
If you want to break apart a monolith, it’s always a good idea to start with a Strangler. If you can’t set up a strangle on your existing monolith, you aren’t ready to start breaking it apart.
That doesn’t mean you’re stuck with the current functionality!
If you have the time and resources to extract the code into a new service, you have the time and resources to decouple the code inside of the monolith. When the time comes to decompose into services, you’ll be ready.
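To make "set up a strangle" concrete, here is a minimal sketch of the idea at the routing layer, assuming a hypothetical /reports path being carved out; in a real system this logic usually lives in a load balancer or API gateway rather than in application code.

```python
# A minimal strangler facade: a few paths route to the new service,
# everything else still goes to the monolith. Paths and URLs are
# hypothetical placeholders.
MONOLITH_URL = "http://monolith.internal"
NEW_SERVICE_URL = "http://reports.internal"

# Routes that have been strangled out of the monolith so far.
STRANGLED_PREFIXES = ("/reports",)

def upstream_for(path: str) -> str:
    """Pick the upstream that should handle this request path."""
    if path.startswith(STRANGLED_PREFIXES):
        return NEW_SERVICE_URL
    return MONOLITH_URL

assert upstream_for("/reports/daily") == NEW_SERVICE_URL
assert upstream_for("/billing/invoice") == MONOLITH_URL
```

The important part is that the routing decision sits in front of the monolith, so functionality can move behind it without clients noticing.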
Conclusion
The Chestburster gives the illusion of quick progress, but it quickly stalls as the team runs into problems they can't control. Overcoming the technical hurdles doesn't guarantee that clients will ever update their integration.
Success in legacy system replacement comes from integrating first and moving functionality second. With the Chestburster you move functionality first and probably never burst through.
When you have a creaky monolith, the obvious first step is to build new functionality outside the monolith. Working on a greenfield, free of the monolith's constraining design, bugs, and even programming language, is highly appealing.
There is a tendency to wander those verdant green fields for months on end and forget that you need to connect that new functionality back to the monolith’s muddy brown field.
Eventually, management loses patience with the project and pushes the team to wrap up. Integration at this point can take months! Worse, because the new project wasn't talking to the monolith, most of the work tends to be a duplication of what's in the monolith. Written much better, to be sure! But without value to the client.
Integration is where greenfield projects die. You have to bring two systems together: the monolith, which is difficult to work with, and the greenfield, which is intentionally unlike the monolith. And you have to do it under pressure while delivering value.
Questions to Ask
When I start working with a team building outside their monolith, integration is the number one issue on my mind.
I push the team to deliver new functionality for the client as early as possible. Here are 3 starting questions I typically ask:
What new functionality are you building? Not what functionality do you need to build, but which parts of it are new for the client?
How are you going to integrate the new feature into the monolith’s existing workflows?
What features do you need to duplicate from the monolith? Can you change the monolith instead? You have to work in the monolith sooner or later.
First, Create the Seam
I don’t look for the smallest or easiest feature. I look for the smallest seam in the monolith.
For the feature to get used, the monolith must use it. The biggest blocker, the most important thing, is creating a seam in the monolith for the new feature!
A seam is where your feature will be inserted into the workflow. It might be a new function in a procedural straightaway, an adapter in your OOP, or even a strangler at your load balancer.
The important part is knowing where and how your feature will fit into the seam.
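As a rough illustration, here is what an OOP-style seam might look like. The names (TaxCalculator, checkout) are invented for the example; the point is that the workflow now calls an interface at the insertion point instead of inlined logic.

```python
# A sketch of a seam inside the monolith: the workflow calls a small
# interface at the point where the new feature will eventually plug in.
from typing import Protocol

class TaxCalculator(Protocol):
    def tax_for(self, order_total: float) -> float: ...

class LegacyTaxCalculator:
    """The current monolith behavior, unchanged."""
    def tax_for(self, order_total: float) -> float:
        return order_total * 0.08

def checkout(order_total: float, taxes: TaxCalculator) -> float:
    # The seam: checkout no longer cares where the tax logic lives.
    return order_total + taxes.tax_for(order_total)

print(checkout(100.0, LegacyTaxCalculator()))  # 108.0
```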
Second, Change the Monolith
Once you have a seam, you have a place to start modifying the monolith to support the feature. This is critical to prevent spending time recreating existing functionality.
Instead of recreating functionality, refactor the seam to provide it to your new service.
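Continuing the hypothetical seam above, refactoring might mean the monolith passes context it already owns (customer lookup, permissions) through the seam, so the new service never has to rebuild it.

```python
# The monolith hands the seam data it already knows, instead of the new
# service re-implementing lookup and permissions. CustomerContext and its
# fields are hypothetical.
from dataclasses import dataclass

@dataclass
class CustomerContext:
    customer_id: str
    region: str        # already resolved by the monolith
    tax_exempt: bool   # already enforced by the monolith's permission system

def checkout(order_total: float, ctx: CustomerContext, taxes) -> float:
    if ctx.tax_exempt:
        return order_total
    # The seam passes along context the new service would otherwise recreate.
    return order_total + taxes.tax_for(order_total, ctx.region)
```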
Finally, Build Outside the Monolith
Now that the monolith has a spot for your feature in its workflow, and it can support the external service, building the feature is easy. Drop it right in!
Now, the moment your external service can say “Hello World!”, it is talking to the monolith. It is in production, and even if you don’t finish it 100%, the parts you do finish will still be adding value. Odds are, since your team is delivering, management will be happy to let you go right on adding features and delivering value.
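For completeness, here is a hedged sketch of how the external service might drop into that seam: a thin adapter in the monolith satisfies the seam's interface by calling the new service. The URL and payload shape are placeholders, not a real API.

```python
# An adapter that satisfies the seam by calling the external service.
import json
import urllib.request

class RemoteTaxCalculator:
    def __init__(self, base_url: str = "http://tax-service.internal"):
        self.base_url = base_url  # hypothetical internal endpoint

    def tax_for(self, order_total: float, region: str) -> float:
        req = urllib.request.Request(
            f"{self.base_url}/tax",
            data=json.dumps({"total": order_total, "region": region}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return float(json.load(resp)["tax"])

# checkout(100.0, ctx, RemoteTaxCalculator())  # same seam, new home for the logic
```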
Conclusion
Starting with a seam lets you develop outside the monolith while still being in production with the first release. No working in a silo for months at a time. No recreating functionality.
It delivers faster, partially by doing less work, partially by enabling iterations.
Over the past few months I have been ruminating on SaaS Tenancy Models and how they drive architectural decisions. I hope you’ve enjoyed the series as I’ve scratched my itch.
Here is a roundup of the 7 articles, in case you missed any of the parts or need a handy index to what I'm sure is the most in-depth discussion of SaaS Tenancy Models ever written.
3 Signs Your Resource Allocation Model Is Working Against You
After 6 posts on SaaS Tenancy Models, I want to bring it back to some concrete examples. When your SaaS has a Single Tenant model, clients expect to allocate all the resources they need, whenever they want. When every client is entitled to the entire resource pool, no client gets a great customer experience.
Here are 3 signs your Resource Allocation Model is working against you:
Large clients cause small clients' work to stall
You have to rebalance the mix of clients in a cell for stability
Run your job at night for best performance
Large clients cause small clients' work to stall
This is a classic “noisy neighbor” problem. Each client tries to claim all the shared resources needed to do their work. This isn’t much of a problem when none of the clients need a significant percentage of the pool. When a large client comes along, it drains the pool, and leaves your small clients flopping like fish out of water.
You have to rebalance the mix of clients in a cell for stability
When having multiple large clients in a cell affects stability, the short term solution is to migrate some clients to another cell. Large clients can impact performance, but they should not be able to impact stability. Moving clients around buys you time, but it also forces you to focus on smaller, less profitable clients.
Run your job at night for best performance
This is advice that often pops up on SaaS message boards. Don’t try to run your job during the day, schedule it to run in the evening so it is ready for the morning. When clients start posting workarounds to your problems, it’s a clear sign of frustration. Your clients are noticing that performance varies by the time of day. They are building mental models of your platform and deciding you have load and scale issues. By being helpful to each other, your clients are advertising your problems.
The tenancy line is a useful construct for separating SaaS data from client data. When you have few clients, separating the data may not be worth the effort of having multiple data stores. As your system grows, ensuring that client data is separated from the SaaS data becomes as critical as ensuring the clients’ data remains separate from each other.
Company data is everything operational. Was an action successful? Does it need to be retried? How many jobs are running right now? This data is extremely important to make sure that client jobs are run efficiently, but it’s not relevant to clients. Clients care about how long a job takes to complete, not about your concurrency, load shaping, or retry rate.
While the data is nearly meaningless to your clients, it is useful to you. It becomes more useful in aggregate. It has synergy. A random failure for one client becomes a pattern when you can see across all clients. When operational data is stored in logically separated databases, you quickly lose the ability to look at it across clients. This is when it becomes important to separate operational data from client data.
Pull the operational data from a single client into a multi-tenant repository for the SaaS, and suddenly you can see what’s happening system wide. Instead of only seeing what’s happening to a single client, you see the system.
If visibility isn’t enough, extracting operational data is usually its own reward.
Operational data is usually high velocity - tracking a job’s progress involves updating the status with every state change. If your operational store is the same as the client store, tracking progress conflicts with the actual work.
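As a minimal sketch of keeping the two kinds of data apart, imagine a job tracker that writes status changes to a SaaS-wide operational store and only the finished result to the client's store. The dict-backed stores below are stand-ins for real databases.

```python
import time

class JobTracker:
    def __init__(self, ops_store: dict, client_store: dict):
        self.ops_store = ops_store        # multi-tenant, owned by the SaaS
        self.client_store = client_store  # logically separated client data

    def update_status(self, client_id: str, job_id: str, status: str) -> None:
        # High-velocity status writes land in the operational store only.
        self.ops_store[(client_id, job_id)] = {"status": status, "at": time.time()}

    def save_result(self, client_id: str, job_id: str, result: object) -> None:
        # Only the data the client actually cares about crosses the line.
        self.client_store.setdefault(client_id, {})[job_id] = result

ops, clients = {}, {}
tracker = JobTracker(ops, clients)
tracker.update_status("acme", "job-1", "running")
tracker.save_result("acme", "job-1", {"rows_processed": 1200})
tracker.update_status("acme", "job-1", "succeeded")
```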
A Jobs Service is a very common service for SaaS companies. It provides a way to run work on a schedule, on demand, and independent of human activity. Often, everything that isn’t done through the website is done by a Job Service.
I have never worked at a SaaS without some version of a Job Service, usually homegrown and built off a database instead of a queue. They usually have descriptive and funny names - Task Processor, Crons, Crontabulous, Maestro, Batch Processor and of course Polite Batch Jobs.
Starting early in the SaaS’s life, they also evolve and grow with the SaaS, creating problems as they migrate from Single Tenant to a logically shared environment.
Single Tenant Job Service
In a single tenant model, provisioning a Job Service with a pool of workers is fairly straightforward. Jobs are generated and put onto a queue (and not a database!).
The Job Service takes jobs off of the queue and fans them out to the worker pool. This is simple and works well because the Queue handles the complexities of tracking and retrying jobs.
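A toy version of that single tenant setup might look like the sketch below, with Python's in-process Queue standing in for a real message queue that would also handle retries and visibility timeouts.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue

jobs: Queue = Queue()
for job_id in range(10):
    jobs.put({"id": job_id, "payload": f"work-{job_id}"})

def worker() -> None:
    while True:
        try:
            job = jobs.get_nowait()
        except Empty:
            return  # queue drained, worker exits
        print(f"processing job {job['id']}")  # real work would happen here

# Fan the queue out to a small worker pool.
with ThreadPoolExecutor(max_workers=3) as pool:
    for _ in range(3):
        pool.submit(worker)
```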
Once the SaaS moves to a logically shared environment, multiple clients exist on a single database cluster, each with their own logically separate schema.
The CRUD service has become a pool of servers that can act on behalf of any client.
There is still only 1 queue and 1 Job Service; workers can act on behalf of any client, just like the CRUD servers.
Jobs get added haphazardly and processed in a FIFO manner.
This model is much more resource efficient - sharing workers allows you to size the pool to keep things busy.
But this design is a disaster from a Noisy Neighbor standpoint.
Because the Queue is FIFO, the Job Service has no visibility into the client composition of the pending jobs, and a large client can easily starve a small one of resources by adding hundreds or thousands of jobs to the queue. The large client will see progress as the jobs are processed, but nothing happens for the small client until the large job finishes.
Things get even worse if the Queue and Job Service are Global instead of Cell based. A global queue feeding a global worker pool that works on clients spread across multiple database clusters will naturally cause database cluster hot spots. Performance will degrade for everyone on the cluster while the workers do massive jobs for a few large clients.
You can add bandaids like limiting the number of jobs per client and moving excess work onto overflow queues. This will help smaller clients somewhat, but natural hotspots will still occur.
Cross The Tenancy Line - Become Multi-Tenant
The Job Service needs to evolve from being Logically Separated into a Multi-Tenant service.
It needs to know how many jobs each client has pending, how long the jobs are taking, and how hot the database clusters are running so that it can operate a priority queue instead of FIFO.
The Jobs Service needs to move across the Tenancy Line.
What is the Tenancy Line?
With Logically Separated infrastructure, clients share the underlying infrastructure, but the data and services all behave as if there is only one client at a time. As a result, each client can regulate its own behavior but has no visibility into the infrastructure as a whole.
To stop acting like a Single Tenant service, the Jobs Service needs to cross the line into Multi-Tenancy.
This change is conceptually simple, but has a lot of subtle implications.
The Service can control load across clients
In the original model, workloads are effectively random, driven by when jobs are added to the queue. When a hotspot emerges, there's not much the service can do without manual intervention. When there's a noisy neighbor, you can't do much to stop them from starving smaller clients because you don't know where those clients are in the queue.
With a Multi-Tenant job service, you can control resources across cells and the entire platform. Small clients can be protected by moving jobs up in priority based on how many recent jobs they have completed.
Jobs will finish faster as worker loads can be managed across cells, preventing hotspots.
Overall throughput will rise, smaller client performance will improve dramatically, and large clients will see more consistent execution times.
The Job Service Becomes a Queue
The original design used a single simple queue. Every client adds jobs directly to the queue, and the Job Service’s responsibility is to take work, pass it to a worker, and mark the job as complete. If there’s a failure, the queue will time the job out and put the work back on the queue.
A FIFO queue prioritizes by insertion order and doesn't have any mechanism for reordering. The Job Service will have to build prioritization logic and find a way to integrate it with a queuing mechanism. Do not give in to temptation and turn your database into a queue!
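To make the idea concrete, here is a minimal sketch of fair-share selection across clients, assuming hypothetical in-memory structures in place of whatever store backs the real queue. The point is that the next job comes from the client with the least recent work, not from the front of a single FIFO line.

```python
from collections import defaultdict, deque

pending = defaultdict(deque)           # client_id -> pending jobs
recent_completions = defaultdict(int)  # client_id -> jobs completed recently

def enqueue(client_id, job):
    pending[client_id].append(job)

def next_job():
    """Fair-share selection instead of FIFO."""
    candidates = [c for c, q in pending.items() if q]
    if not candidates:
        return None
    # The client with the fewest recent completions goes first.
    client = min(candidates, key=lambda c: recent_completions[c])
    recent_completions[client] += 1
    return pending[client].popleft()

for i in range(100):
    enqueue("big-client", {"id": f"big-{i}"})
enqueue("small-client", {"id": "small-0"})

first, second = next_job(), next_job()
print(first["id"], second["id"])  # big-0 small-0: the small client isn't stuck behind 100 big jobs
```

A real implementation would also weigh job duration and database cluster load, but the selection step is where the Job Service stops being a passive consumer of a FIFO queue.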
Conclusion
Pushing the Jobs Service across the Tenancy Line is a major coming of age step in the evolution of a SaaS company.
It trades significant development resources and complexity for consistent execution and a solution to the Noisy Neighbor Problem. The SaaS benefits from the synergy this creates with better resource utilization and reduced database hotspotting.
Once a SaaS has enough clients to warrant the change, making the Jobs Service Multi-Tenant is a major step forward.
This is part 4 in a series on SaaS Tenancy Models. Parts 1, 2, and 3.
SaaS companies are often approached by potential clients who want their instance to be completely separate from any other client. Sometimes the request is driven by legal requirements (primarily healthcare and defense), sometimes it is a desire for enhanced security.
Often, running a Multi-Tenant service with a single client will satisfy the client's needs. Clients are frequently willing to pay for the privilege of having their account run Single Tenant, making it a potentially lucrative option for a SaaS.
What is a Cell?
A Cell is an independent instance of a SaaS’ software setup. This is different from having software running in multiple datacenters or even multiple continents. If the services talk to each other, they are in the same cell regardless of physical location.
Cells can differ in the number and power of their servers and databases. Cells can even have entirely different caching options depending on need.
The 3 most common Cell setups are Production, Staging (or Test), and Local.
Cell Properties
Cell architecture comes with a few distinct properties:
Cell structures allow a SaaS to grow internationally and offer clients low latency and localized data policies (think GDPR). Latency from the US to Europe, Asia and South America is noticeable and degrades the client experience.
Clients exist in 1 cell at a time. They can migrate, but they can't exist in multiple cells (see the sketch after this list).
Generally speaking, Cells cannot be part of a disaster recovery plan. Switching clients between Cells usually involves copying the database, and that can't be done if the client's original Cell is down.
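As a rough sketch of the one-client-one-cell property, a registry might map each client to its cell and route requests there. The cell names and URLs below are hypothetical.

```python
CELLS = {
    "us-east-cell-a": "https://cell-a.us-east.example.com",
    "eu-west-cell-b": "https://cell-b.eu-west.example.com",
    "gov-cell": "https://gov.example.com",  # single client, effectively Single Tenant
}

CLIENT_TO_CELL = {
    "acme": "us-east-cell-a",
    "globex": "eu-west-cell-b",
    "gov-contractor": "gov-cell",
}

def endpoint_for(client_id: str) -> str:
    """Every client maps to exactly one cell; migration means changing this mapping."""
    return CELLS[CLIENT_TO_CELL[client_id]]

print(endpoint_for("acme"))  # https://cell-a.us-east.example.com
```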
Cell Isolation as a Single Tenant Option
In part 3 I covered the difficulties in operating in a true Single Tenant model at scale. A Cell with a single client effectively recreates the Single Tenancy experience.
Few clients want this level of isolation, but those that need it are prepared to pay for the extra infrastructure costs of an additional Cell.
Conclusion
For SaaS without global services, a Cell model enables a mix of clients on logically separated Multi-Tenant infrastructure and clients with effectively Single Tenant infrastructure. This allows the company to pursue clients with Single Tenant needs, and the higher price point they offer.
The catch is that Single Tenant Cells can’t exist in an architecture with global services. If there is a single service that must have access to all client data, Single Tenant Cells are out.
In the first post on SaaS Tenancy Models, I introduced the two idealized models - Single and Multi-Tenant. Many SaaS companies start off as Single Tenant by default rather than by strategy, and migrate towards increasingly multi-tenant models under the influence of 4 main factors - complexity, security, scalability, and consistent performance.
After publishing, I realized that I left out an important fifth factor, synergy.
Synergy
In the context of this series, synergy is the increased value to the client that comes from mixing the client's data with other clients' data. A SaaS may even become a platform if the synergies become more valuable to the clients than the original service.
Another aspect of synergy is that the clients only gain the extra value so long as they remain customers of the SaaS. When clients churn, the SaaS usually retains the extra value, even after deleting the client’s data. This organically strengthens client lock in and increases the SaaS value over time. The existing data set becomes ever more valuable, making it increasingly difficult for clients to leave.
Some types of businesses, like retargeting ad buyers, create a lot of value for their clients by mixing client data. Pooling that data builds larger consumer profiles, which makes ad purchases more effective for every client.
On the other hand, a traditional CRM, or a no-code service like Zapier, would be very hard-pressed to increase client value by mixing client data. Having the same physical person in multiple client instances of a CRM doesn't open a lot of avenues; what could you offer - tracking which clients a contact responds to? No-code services may mix client data as part of bulk operations, but that doesn't add value to the clients.
Sometimes there might be potential synergy, like in Healthcare and Education, but it would be unethical and illegal to mix the data.
Not All Factors Are Client Facing
Two of the factors, complexity and scalability, are generally invisible to clients. When clients do notice them, it is in a negative light:
Why do new features take so long to develop?
Why are bugs so difficult to resolve?
Why does the client experience get worse as usage grows?
A SaaS never wants a client asking these questions.
Security, Consistent Performance and Synergy are discussion points with clients.
Many SaaS companies can address Security and Consistent Performance concerns through configuration isolation.
Synergy is a highly marketable service differentiator and generally not negotiable.
Simplified Drawings
As much as possible I’m going to treat and draw things as 2-tier systems rather than N-tier. As long as the principles are similar, I’ll default to simplified 2-tier diagrams over N-tier or microservice diagrams.
Next Time
Coming up I’ll be breaking down single to multi-tenant transformations.
Why a SaaS would want the transformation, what the tradeoffs are, and what the potential pitfalls are.
Please subscribe to my mailing list to make sure you don’t miss out!