I am super excited to announce that my latest article has been published on leaddev.com!
It’s a great discussion of the tension between what you can do, and what is effective.
Please check out The dangers of pulling rank as a Staff+ engineer!
Imagine getting this whopper in an interview or a take home test:
The most fundamental issue is that it’s not clear what the answer looks like. In fact, the 4 of us had 3 different interpretations of what the answer would look like.
Is the question looking for children’s ages going forward?
That would be an age sequence of 0, 1, 1, 2, 3, 5, etc
Or a newborn, a pair of 1 year old twins, a 2 year old, 3 year old, 5 year old, etc
Or is it looking for children born in the sequence? (This is the inverse of the first answer)
A 6 year old, 5 year old twins, a 3 year old and a newborn
Or is it asking about the age gap between children?
In that case you’d be hunting for Twins (gap of 0), a gap of 1 year, a second gap of 1 year, a gap of 2 years, etc.
There are so many ways to be the family fibonacci.
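To make the ambiguity concrete, here is a minimal Python sketch of two of the interpretations above: ages that follow the sequence, and age gaps that follow the sequence. The function names are made up for illustration, and the sketch assumes the sequence starts at 0 and that ages are listed youngest first.

```python
from itertools import accumulate, islice


def fibonacci():
    """Yield the Fibonacci sequence starting 0, 1, 1, 2, 3, 5, ..."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


def ages_follow_sequence(children):
    """The children's ages are the Fibonacci numbers themselves."""
    return list(islice(fibonacci(), children))


def gaps_follow_sequence(children):
    """The gaps between consecutive children are the Fibonacci numbers.

    Ages are listed youngest first, starting from a newborn.
    """
    gaps = islice(fibonacci(), children - 1)
    return list(accumulate(gaps, initial=0))


print(ages_follow_sequence(6))  # [0, 1, 1, 2, 3, 5]
print(gaps_follow_sequence(5))  # [0, 0, 1, 2, 4]
```

Same question, two different families, and neither answer is obviously the one the interviewer wants.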
Fairly straightforward computer problems with meaningless mathematics sprinkled on top. Being asked by people who won’t know the implications of any of the 3 answers.
If you are presented with this question in an interview, the correct answer is to thank the interviewer for their time, wish them the best of luck in their search, and end the interview.
The tenancy line is a useful construct for separating SaaS data from client data. When you have few clients, separating the data may not be worth the effort of having multiple data stores. As your system grows, ensuring that client data is separated from the SaaS data becomes as critical as ensuring the clients’ data remains separate from each other.
Company data is everything operational. Was an action successful? Does it need to be retried? How many jobs are running right now? This data is extremely important to make sure that client jobs are run efficiently, but it’s not relevant to clients. Clients care about how long a job takes to complete, not about your concurrency, load shaping, or retry rate.
While the data is nearly meaningless to your clients, it is useful to you. It becomes more useful in aggregate. It has synergy. A random failure for one client becomes a pattern when you can see across all clients. When operational data is stored in logically separated databases you quickly lose the ability to check the data. This is when it becomes important to separate operational data from clients.
Pull the operational data from a single client into a multi-tenant repository for the SaaS, and suddenly you can see what’s happening system wide. Instead of only seeing what’s happening to a single client, you see the system.
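As a rough sketch of the difference, assume each client's operational records have been copied into one shared list. The client names, job names, and record fields below are all hypothetical.

```python
from collections import Counter

# Hypothetical operational records pulled from each client's data store.
job_events = [
    {"client": "acme", "job": "export", "status": "failed", "error": "timeout"},
    {"client": "acme", "job": "import", "status": "ok", "error": None},
    {"client": "globex", "job": "export", "status": "failed", "error": "timeout"},
    {"client": "initech", "job": "export", "status": "failed", "error": "timeout"},
    {"client": "initech", "job": "import", "status": "ok", "error": None},
]


def failures_by_client(events):
    """Per-client view: each client shows a single, seemingly random failure."""
    return Counter(e["client"] for e in events if e["status"] == "failed")


def failures_by_cause(events):
    """System-wide view: count failures by (job, error) across all clients."""
    return Counter(
        (e["job"], e["error"]) for e in events if e["status"] == "failed"
    )


print(failures_by_client(job_events))
print(failures_by_cause(job_events))
```

Looked at per client, there is one failure each. Looked at across clients, every failure is the same export timeout, which is the pattern you could not see before.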
Once you can see the system, you can shape it. See this article for a discussion on how.
If visibility isn’t enough, extracting operational data is usually its own reward.
Operational data is usually high velocity – tracking a job’s progress involves updating the status with every state change. If your operational store is the same as the client store, tracking progress conflicts with the actual work.
This post has been a long time coming – I wrote the idea down in my first list of potential posts, and I wrote a draft way back in 2019!
It is also the first time I can say that the content has been approved by my employer since they published it on their website.
It’s a great read, and I hope you enjoy it:
SaaS companies are often approached by potential clients who want their instance to be completely separate from any other client. Sometimes the request is driven by legal requirements (primarily healthcare and defense), sometimes it is a desire for enhanced security.
Often, running a Multi-Tenant service with a single client will satisfy the client’s needs. Clients are often willing to pay for the privilege of having their account run Single Tenant, making it a potentially lucrative option for a SaaS.
A Cell is an independent instance of a SaaS’ software setup. This is different from having software running in multiple datacenters or even multiple continents. If the services talk to each other, they are in the same cell regardless of physical location.
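One way to picture the definition is as a graph problem: services are nodes, "talks to" is an edge, and each connected group is a Cell. The sketch below assumes that talking through an intermediate service also counts, and the service names are invented for illustration.

```python
def find_cells(services, connections):
    """Group services into cells: services that talk to each other share a cell."""
    neighbors = {s: set() for s in services}
    for a, b in connections:
        neighbors[a].add(b)
        neighbors[b].add(a)

    cells, seen = [], set()
    for start in services:
        if start in seen:
            continue
        # Walk everything reachable from this service.
        cell, stack = set(), [start]
        while stack:
            service = stack.pop()
            if service in cell:
                continue
            cell.add(service)
            stack.extend(neighbors[service] - cell)
        seen |= cell
        cells.append(cell)
    return cells


services = ["web-us", "web-eu", "db-us", "web-staging", "db-staging"]
connections = [
    ("web-us", "db-us"),
    ("web-eu", "db-us"),  # different continent, same database
    ("web-staging", "db-staging"),
]
cells = find_cells(services, connections)
print(len(cells))  # 2
```

The US and EU web servers land in the same Cell because they share a database, regardless of where they physically run.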
Cells can differ in the number and power of servers and databases. Cells can even have entirely different caching options depending on need.
The 3 most common Cell setups are Production, Staging (or Test), and Local.
Cell architecture comes with a few distinct properties:
In part 3 I covered the difficulties in operating in a true Single Tenant model at scale. A Cell with a single client effectively recreates the Single Tenancy experience.
Few clients want this level of isolation, but those that need it are prepared to pay for the extra infrastructure costs of an additional Cell.
For SaaS without global services, a Cell model enables a mix of clients on logically separated Multi-Tenant infrastructure and clients with effectively Single Tenant infrastructure. This allows the company to pursue clients with Single Tenant needs, and the higher price point they offer.
The catch is that Single Tenant Cells can’t exist in an architecture with global services. If there is a single service that must have access to all client data, Single Tenant Cells are out.
If you are enjoying this series, consider subscribing to my mailing list (https://shermanonsoftware.com/subscribe/) so that you don’t miss an installment!
For SaaS with a pure Single Tenant model, infrastructure consolidation usually drives the first two steps towards a Multi-Tenant model. The two steps convert the front end servers to be Multi-Tenant and switch the client databases from physical to logical isolation. They are usually done nearly simultaneously as a SaaS grows beyond a handful of clients, when infrastructure costs skyrocket and things become unmanageable.
Considering the 5 factors laid out in the introduction and addendum (complexity, security, scalability, consistent performance, and synergy), this move greatly increases scalability at the cost of increased complexity, decreased security, and opening the door to consistent performance problems. Synergy is not immediately impacted, but these changes make adding Synergy at a later date much easier.
Why is this such an early move when it has 3 negative factors and only 1 positive? Because pure Single Tenant designs have nearly insurmountable scalability problems, and these two changes are the fastest, most obvious and most cost effective solution.
Shifting from Single Tenant servers and databases to Multi-Tenant slightly increases software complexity in exchange for massively decreasing platform complexity.
The web servers need to be able to understand which client a request is for, usually through sub domains like client.mySaaS.com, and use that knowledge to validate the user and connect to the correct database to retrieve data.
The difficult and risky part here is making sure that valid sessions stay associated with the correct account.
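A minimal sketch of that lookup, assuming the client is identified by a single sub domain label and that the session already records which client it was issued for. The registry contents, the base domain constant, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantDatabase:
    host: str
    schema: str
    user: str


# Hypothetical registry; a real system would load this from secured config.
TENANT_DATABASES = {
    "acme": TenantDatabase(host="db-pool-1", schema="acme", user="acme_app"),
    "globex": TenantDatabase(host="db-pool-1", schema="globex", user="globex_app"),
}

BASE_DOMAIN = "mysaas.com"


def tenant_from_host(host_header):
    """Extract the client name from a host such as client.mySaaS.com."""
    host = host_header.split(":")[0].lower()  # drop any port, ignore case
    suffix = "." + BASE_DOMAIN
    if not host.endswith(suffix):
        raise ValueError(f"unexpected host: {host_header}")
    tenant = host[: -len(suffix)]
    if not tenant or "." in tenant:
        raise ValueError(f"cannot determine client from host: {host_header}")
    return tenant


def database_for_request(host_header, session_tenant):
    """Pick the client's database, refusing sessions issued for another client."""
    tenant = tenant_from_host(host_header)
    if session_tenant != tenant:
        raise PermissionError("session does not belong to this client")
    try:
        return TENANT_DATABASES[tenant]
    except KeyError:
        raise LookupError(f"unknown client: {tenant}") from None


print(database_for_request("acme.mySaaS.com", "acme").schema)  # acme
```

The session check comes before the database lookup on purpose: a valid session for one client should never be enough to reach another client's schema.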
Database server consolidation tends to be less tricky. Most database servers support multiple schemas with their own credentials and logical isolation. Logical separation provides unique connection settings for the web servers. Individual client logins are restricted to the client’s schema and the SaaS developers do not need to treat logical and physical separation any differently.
The biggest database problems with a many-to-many design crop up during migrations. Inevitably, web and database changes will be incompatible between versions. Some SaaS models require all clients on the same version, which limits compatibility issues to the release window (which itself can take days), while other models allow clients to be on different versions for years.
The general solution to the problem of long lived versions is to stand up a pool of web and database servers on the new version, migrate clients to the new pool, and update request routing.
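Sketched in Python, the pool approach might look like the following. The pool addresses, the in-memory routing table, and the `migrate_data` callback are assumptions for illustration; a real routing table would live somewhere durable.

```python
# Hypothetical pools: each pool runs one version of the web and database servers.
POOLS = {"v1": "pool-v1.internal", "v2": "pool-v2.internal"}

# Which pool each client currently lives on; every client starts on v1.
client_pool = {"acme": "v1", "globex": "v1"}


def migrate_client(client, target_pool, migrate_data):
    """Move one client to the new pool, then update routing.

    migrate_data is a caller-supplied function that copies and upgrades the
    client's data. Routing only changes if it completes without raising.
    """
    if target_pool not in POOLS:
        raise ValueError(f"unknown pool: {target_pool}")
    source_pool = client_pool[client]
    migrate_data(client, source_pool, target_pool)
    client_pool[client] = target_pool


def route(client):
    """Return the address of the pool that should serve this client's requests."""
    return POOLS[client_pool[client]]


migrate_client("acme", "v2", migrate_data=lambda client, source, target: None)
print(route("acme"))    # pool-v2.internal
print(route("globex"))  # pool-v1.internal
```

Because routing is per client, clients can sit on different versions for as long as the SaaS model allows.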
The biggest risk around these changes is database secret handling: every server can now connect to every database. Compromising a single server becomes a vector for exposing data from multiple clients. This risk can be limited by proxy layers that keep database connections away from public facing web servers. Still, a compromised server is now a risk to multiple clients.
Changing from physical to logical database separation is less risky. Each client will still be logically separated with their own schema, and permissioning should make it impossible to do queries across multiple clients.
Scalability is the goal of Multi-Tenant Infrastructure Consolidation.
In addition to helping the SaaS, the consolidation will also help clients. Shared server pools will increase stability and uptime by providing access to a much larger group of active servers. The client also benefits from having more servers and more slack, making it much easier for the SaaS to absorb bursts in client activity.
Likewise, running multiple clients on larger database clusters generally increases uptime and provides slack for bursts and spikes.
These changes only impact response times when the single tenant setup would have been overwhelmed. The minimum response times don’t change, but the maximum response times get lower and occur less frequently.
The flip side to the tenancy change is the introduction of the Noisy Neighbor problem. This mostly impacts the database layer and occurs when large clients overwhelm the database servers and drown out resources for smaller clients.
This can be especially frustrating to clients because it can happen at any time, last for an unknown period, and there’s no warning or notification. Things “get slow” and there are no guarantees about how often clients are impacted, notice, or complain.
There is no direct Synergy impact from changing the web and database servers.
A SaaS starting from a pure Single Tenant model is not pursuing Synergy, otherwise the initial model would have been Multi-Tenant.
Placing distinct client schemas onto a single server does open the door to future Synergy work. Working with data in SQL across different schemas on the same server is much easier than working across physical servers. The work would still require changing the security model and writing quite a bit of code. There is now a doorway if the SaaS has a reason to walk through.
As discussed in the introduction, a SaaS may begin with a purely Single Tenant model for several reasons. High infrastructure bills and poor resource utilization will quickly drive an Infrastructure Consolidation to Multi-Tenant servers and logically separated databases.
The exceptions to this rule are SaaS that have few very large clients or clients with high security requirements. These SaaS will have to price and market themselves accordingly.
Infrastructure Consolidation is an early driver away from a pure Single Tenant model to Multi-Tenancy. The change is mostly positive for clients, but does add security and client satisfaction risks.
If you are enjoying this series, please subscribe to my mailing list so that you don’t miss an installment!
In the first post on SaaS Tenancy Models, I introduced the two idealized models – Single and Multi-Tenant. Many SaaS companies start off as Single Tenant by default, rather than by strategy, and migrate towards increasingly multi-tenant models under the influence of 4 main factors – complexity, security, scalability, and consistent performance.
After publishing, I realized that I left out an important fifth factor, synergy.
In the context of this series, synergy is the increased value to the client as a result of mixing the client’s data with other clients. A SaaS may even become a platform if the synergies become more valuable to the clients than the original service.
Another aspect of synergy is that the clients only gain the extra value so long as they remain customers of the SaaS. When clients churn, the SaaS usually retains the extra value, even after deleting the client’s data. This organically strengthens client lock in and increases the SaaS value over time. The existing data set becomes ever more valuable, making it increasingly difficult for clients to leave.
Some types of businesses, like retargeting ad buyers, create a lot of value for their clients by mixing client data. Ad buyers increase effectiveness of their ad purchases by building larger consumer profiles. This makes the ad purchases more effective for all clients.
On the other hand, a traditional CRM, or a codeless service like Zapier, would be very hard pressed to increase client value by mixing client data. Having the same physical person in multiple client instances in a CRM doesn’t open a lot of avenues; what could you offer – track which clients a contact responds to? No code services may mix client data as part of bulk operations, but that doesn’t add value to the clients.
Sometimes there might be potential synergy, like in Healthcare and Education, but it would be unethical and illegal to mix the data.
Two of the factors, complexity and scalability, are generally invisible to clients. When complexity and scalability are noticed, it is negative:
A SaaS never wants a client asking these questions.
Security, Consistent Performance and Synergy are discussion points with clients.
Many SaaS companies can adjust Security concerns and Consistent Performance through configuration isolation.
Synergy is a highly marketable service differentiator and generally not negotiable.
As much as possible I’m going to treat and draw things as 2-tier systems rather than N-tier. As long as the principles are similar, I’ll default to simplified 2-tier diagrams over N-tier or microservice diagrams.
Coming up I’ll be breaking down single to multi-tenant transformations.
Why a SaaS would want the transformation, what are the tradeoffs, and what are the potential pitfalls.
Please subscribe to my mailing list to make sure you don’t miss out!
The prospect of automating manual tasks emits a siren song to most developers. Like a siren, the call often leads you straight into disaster. Best intentions often end up leaving companies with code that’s more expensive to maintain and less useful than human labor. Reports and tasks become a leaky faucet for productivity.
Here are six questions to ask yourself, or a developer, before dancing to the automation music:
Weekly Business Intelligence reports change monthly, monthly ones change every quarter, and quarterly ones change every year. They are never stable enough to be worth automating by an outside developer. This is why BI tools that let non-technical users semi-automate reports are a 5 billion dollar industry.
On the other hand, regulatory and compliance reports are likely to be stable for years and make great targets.
If a task won’t be executed at least 10 times between changes, it probably won’t be worth automating.
Some tasks are likely to continue “forever”. Archiving old client data, scrubbing databases and other client onboarding/offloading tasks fall into this category.
Some tasks are never going to come up again.
If a task won’t be executed at least 5 more times, it probably won’t be worth automating.
You can automate turning on the office lights in the morning with a motion detector, but it won’t pay off in terms of time saved from flipping a switch.
How much of an interruption is doing the task? Turning on the lights on your way in the door isn’t an interruption for anyone. Phone support manually resetting a user password isn’t an interruption, but having the CFO process refunds for clients is a giant interruption.
Even if the reset and refund are both a single button click that takes 15 seconds, pulling the CFO away is a much bigger deal. Also the context switch for the CFO will be measured in minutes because she’s not processing refunds all day long.
Use a sliding scale based on time and title. For entry level, don’t automate until the task takes more than an hour per person per day. For the C-Suite anything over 5 minutes per day is a good target.
Clients don’t care how long the task takes, they care about the lag between asking and receiving. It doesn’t matter that processing a refund only takes 5 minutes if your team only processes refunds once a week.
If the client lag is more than a day, consider automating.
Software bugs can do all sorts of terrible things to your data and process, but after the first couple of times, the damage becomes predictable and you’ll get better at fixing the damage.
Automating the fix is one way of fixing the bug. That’s how bugs become features.
If you don’t want to make the bug an official part of your software, don’t automate the fix.
Mistakes are inevitable when humans are manually performing routine tasks.
Mistakes are inevitable when developers first automate a routine task. Assume that developer mistakes will equal one instance of a manual mistake.
For an automation to save money you have to expect to prevent at least 2 manual errors.
As an equation:
[Cost to Automate] + [Cost of a mistake] < [Cost of a mistake] * [Frequency of mistakes]
Because the cost of mistakes is relatively easy to quantify, tasks with expensive mistakes are usually automated early on.
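The inequality can be written as a small Python check. This is only a sketch: it reads "frequency of mistakes" as the number of manual mistakes you expect over the period you care about, and the dollar figures in the examples are invented.

```python
def automation_pays_off(cost_to_automate, cost_of_mistake, expected_manual_mistakes):
    """Apply the inequality above.

    The automation is assumed to cause one mistake of its own, so it only
    pays off when building it plus that one mistake costs less than the
    manual mistakes it prevents.
    """
    cost_with_automation = cost_to_automate + cost_of_mistake
    cost_without_automation = cost_of_mistake * expected_manual_mistakes
    return cost_with_automation < cost_without_automation


# Expensive mistakes: $2,000 to automate, $5,000 per mistake, 3 expected.
print(automation_pays_off(2_000, 5_000, 3))  # True

# Cheap mistakes: same automation cost, but mistakes only cost $500.
print(automation_pays_off(2_000, 500, 3))  # False
```

With only one expected manual mistake the check can never pass, which matches the rule of needing to prevent at least 2 manual errors.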
Developers always want to automate things. Sometimes it pays off, sometimes it’s a mistake.
If you ask these six questions before automating you’re much more likely to make the right choice:
By the time a company gets to midsize, there will be a pool of people to perform ad hoc, automatable tasks. Most of these tasks involve manually generating reports for someone else. To someone who writes software, this can seem infuriatingly inefficient and wasteful. A sign of tech debt and bad software!
Counterintuitively, it is extremely difficult to automate reports and actually save the company time or money.
Imagine Alice, the CEO, has Bob, in accounting, make a weekly report by hand pulling numbers from 3 different systems and plugging the data into a spreadsheet. Alice doesn’t care how Bob generates the report, so Carol, a developer, decides to help Bob by automating the report.
To keep the numbers simple, Bob earns 75k/year (~$36/hr) and Carol earns 150k/yr (~$72/hr). Let’s say it takes Bob 1 hour/week to generate the report, making it cost $1,872/year.
All the data is in databases, and Bob is doing specific steps in a specific order. He’s miserable because making the report is dull, error prone, repetitive and endless. Worst of all, the report is meaningless to Bob! He’s generating it for Alice, because he’s capable, understands what the numbers mean, and has access. Carol is Bob’s friend and wants to help!
Carol’s time is more valuable than Bob’s, so she can only spend 3 days automating the report for the company to break even. That includes testing, formatting, and deployment.
Let’s assume Carol is able to automate the report on time. The report shows up in Alice’s inbox every week, same as before, but Bob is able to work on other things. Break out the champagne!
Except, Alice needs changes to the report twice a year. When Bob was generating the report she could just tell him and it would take maybe 30 minutes to change the process because Bob works in accounting and understands all of the numbers. There was no lag between the ask and the update; Bob would make the updated report next time.
Now that Carol has automated the report, Alice will have to talk to Bob, and Bob will have to talk to Carol. Alice can’t ask Carol directly because Carol doesn’t work in accounting and doesn’t know what the numbers mean. Additionally, Carol doesn’t spend her days working on reports so she’ll have to get up to speed, talk to Bob and make the change. Let’s say it takes one full 8hr work day.
As the CEO, Alice’s changes are important, but probably not important enough to pull Carol out of her current work. It usually takes 2 report cycles to get to the top of her list.
Now the automated report costs $1,152/yr to maintain, and it’s not the exact report that the CEO wants for 4 weeks of the year, or about 8% of the time. 4 out of date reports easily outweigh the $720 in savings, and Alice will probably have Bob manually generate the report when the code is out of date. Finally, the sunk cost of building the software will result in Carol having to maintain the software going forward, even though the software generates no real value.
The conflict between “can be automated” and “can’t be done profitably” often results in DBAs becoming a report generation team. It’s also the impetus behind business intelligence tools: find a way to get developer style automation without using a developer.
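For anyone who wants to check the math, here is the arithmetic from the example as a short Python script. The one assumption is a 40 hour week for 52 weeks, which is what the rounded hourly rates imply. Like the example, it ignores the small cost of Bob's own 30 minute changes.

```python
HOURS_PER_YEAR = 40 * 52  # assumed: 2,080 working hours per year

bob_hourly = round(75_000 / HOURS_PER_YEAR)     # 36
carol_hourly = round(150_000 / HOURS_PER_YEAR)  # 72

# Bob spends 1 hour a week, every week, on the manual report.
manual_cost_per_year = bob_hourly * 1 * 52

# Hours Carol can spend before automation costs more than a year of Bob's time.
break_even_hours = manual_cost_per_year / carol_hourly

# Two changes a year, one 8 hour day of Carol's time each.
maintenance_per_year = carol_hourly * 8 * 2
savings_per_year = manual_cost_per_year - maintenance_per_year

# Each change waits 2 report cycles, so 4 weekly reports a year are out of date.
stale_weeks = 2 * 2
stale_fraction = stale_weeks / 52

print(f"Manual report: ${manual_cost_per_year:,}/yr")                          # $1,872/yr
print(f"Break even: {break_even_hours:.0f} hours of Carol's time")             # 26 hours
print(f"Maintenance: ${maintenance_per_year:,}/yr")                            # $1,152/yr
print(f"Savings: ${savings_per_year:,}/yr")                                    # $720/yr
print(f"Out of date: {stale_weeks} weeks, {stale_fraction:.1%} of the year")   # 4 weeks, 7.7%
```

26 hours is just over 3 working days, which is where Carol's 3 day budget comes from.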
I’ll go over a developer’s “should I automate this task” framework next time.
It was the best of times, it was the worst of times. The company hired contractors to do maintenance work so that the full time developers could write a new system. The company hired new developers to write a new system and stuck the original developers with the maintenance work.
When software has been bankrupted by tech debt, a common strategy is to start over. Just like financial bankruptcy, the declaration is a way to buy time while continuing to run the business. The software can’t be turned off; someone has to keep the old system running while the new system is built. Depending on your faith and trust in your current employees, you will be tempted to either hire “short term” contractors to keep things running, or hire a new team to write the new system.
The two tactics play out differently, but the end result is the same: the new system will fail.
Contractors fail because they won’t have the historical context for the code. They may know how it works today, but not how it got there. Without deep business knowledge their actions are limited to superficial bug fixes and tactical features. The job is to keep the lights on, not to push back on requests or ask Who, What, Where, When, Why.
Since the contractors are keeping the lights on for you while current employees build a new system, they aren’t going to care about the long term quality of the system. It’s already bad, it’s already going away, so why spend the extra time refactoring?
Contractors won’t be able to stabilize the system and buy your employees enough time to build the new system. Equally bad, sometimes the contractors *will* stabilize the old system which becomes an excuse to cancel the new system and not restructure your debts.
The flip side is to leave the current team in place and hire a new team to write the new system. This is super attractive when you suspect that the original team will recreate the same mistakes that led to disaster.
This tactic fails because the new team won’t have the historical context for the code. Documentation and details will be superficial and lack all of the critical edge cases. All of which delays the project. Meanwhile, your original team will become extremely demoralized.
If they couldn’t keep things running before, wait until you tell them their whole job is to keep things running. The most capable members will quickly leave.
You’ll end up pulling the new developers onto the old team and praying they don’t see it as bait-and-switch.
Both tactics fail because the teams are divided between old vs new. In the end it doesn’t matter who is on which team, because you need old *and* new to be successful.
Keeping the team together ensures that everyone has context into how the system works and has skin in the game.
Instead of splitting old vs new, find a way to split the system in half. Have each team own half the old system responsibilities, and one of the two new systems. This gives everyone a stake in keeping the old system alive, a chance to work on the new system, and reduces the business risk of either team failing.
Once you find a way to split the responsibilities in half, you might even find a way to make iterative improvements. It’s your Best Alternative to a Total Rewrite!