The Only Client Experience

The only client experience you can offer is the one you have today.  Yesterday’s experience is a memory, and tomorrow is a promise.

Delivering iterative, baby-step improvements to your clients’ experience makes the experience better today, makes tomorrow’s promise more believable, and builds memories of things improving.

Promising a moonshot keeps the only experience your clients have exactly the same; and if it were a good one, you wouldn’t need a moonshot.

Building Your Way Out Of A Monolith – Create A Seam

Why Build Outside The Monolith

When you have a creaky monolith, the obvious first step is to build new functionality outside the monolith.  Working in a greenfield, free of the monolith’s constraining design, bugs, and even programming language, is highly appealing.

There is a tendency to wander those verdant green fields for months on end and forget that you need to connect that new functionality back to the monolith’s muddy brown field.

Eventually, management loses patience with the project and pushes the team to wrap up.  Integration at this point can take months!  Worse, because the new project wasn’t talking to the monolith, most of the work tends to duplicate what’s already in the monolith.  Written much better, to be sure, but without value to the client.

Integration is where greenfield projects die.  You have to bring two systems together: the monolith, which is difficult to work with, and the greenfield, which is intentionally unlike the monolith.  And you have to do it under pressure, while delivering value.

Questions to Ask

When I start working with a team building outside their monolith, integration is the number one issue on my mind.

I push the team to deliver new functionality for the client as early as possible.  Here are 3 starting questions I typically ask:

  1. What new functionality are you building?  Not what functionality you need to build, but which parts of it are new for the client?
  2. How are you going to integrate the new feature into the monolith’s existing workflows?
  3. What features do you need to duplicate from the monolith?  Can you change the monolith instead?  You have to work in the monolith sooner or later.

First Create the Seam

I don’t look for the smallest or easiest feature.  I look for the smallest seam in the monolith.

For the feature to get used, the monolith must use it.  The biggest blocker, the most important thing, is creating a seam in the monolith for the new feature!

A seam is where your feature will be inserted into the workflow.  It might be a new function call in a procedural straightaway, an adapter in your OOP design, or even a strangler at your load balancer.

The important part is knowing where and how your feature will fit into the seam. 
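
To make that concrete, here is a minimal sketch of a seam in Python.  Everything in it is hypothetical: the order workflow, the check_fraud function, and the URL of the new external service.  The point is that the seam is a single new call site in the existing workflow, with a fallback so the monolith keeps working even when the new service isn’t finished.

import requests

# Hypothetical URL of the new service being built outside the monolith.
FRAUD_SERVICE_URL = "http://fraud-service.internal/api/v1/score"

def check_fraud(order):
    """The seam: one function the monolith calls at a fixed point in its workflow."""
    try:
        response = requests.post(
            FRAUD_SERVICE_URL,
            json={"order_id": order["id"], "total": order["total"]},
            timeout=2,
        )
        response.raise_for_status()
        return response.json()["approved"]
    except requests.RequestException:
        # The new service isn't reachable yet; keep the monolith's old behavior.
        return True

def process_order(order):
    # ...existing monolith code runs as before...
    if not check_fraud(order):  # the only new line in the existing workflow
        raise ValueError("order rejected by fraud check")
    # ...existing monolith code continues...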

Second Change The Monolith

Once you have a seam, you have a place to start modifying the monolith to support the feature.  This is critical to prevent spending time recreating existing functionality.

Instead of recreating functionality, refactor the seam to provide it to your new service.
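
As a sketch of what that refactoring might look like (the endpoint, names, and data are assumptions on my part, using Flask for illustration): instead of re-implementing customer lookup in the new service, wrap the monolith’s existing lookup in a small internal endpoint the new service can call.

from flask import Flask, abort, jsonify

app = Flask(__name__)

# Stand-in data; in the real monolith this lives behind existing lookup code.
CUSTOMERS = {42: {"id": 42, "name": "Example Co", "plan": "enterprise"}}

def get_customer(customer_id):
    # Stand-in for the monolith's existing, battle-tested lookup logic.
    return CUSTOMERS.get(customer_id)

@app.route("/internal/customers/<int:customer_id>")
def customer_api(customer_id):
    """Expose the existing logic to the new service instead of duplicating it."""
    customer = get_customer(customer_id)
    if customer is None:
        abort(404)
    return jsonify(customer)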

Finally Build Outside the Monolith

Now that the monolith has a spot for your feature in its workflow, and it can support the external service, building the feature is easy.  Drop it right in!

Now, the moment your external service can say “Hello World!”, it is talking to the monolith.  It is in production, and even if you don’t finish it 100%, the parts you do finish will still be adding value.  Odds are, since your team is delivering, management will be happy to let you go right on adding features and delivering value.
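
For illustration, the external service can start as little more than a stub behind the seam; the endpoint and scoring rule below are hypothetical, matching the earlier sketch.  The moment it responds, it is in production and inside the monolith’s workflow.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/v1/score", methods=["POST"])
def score():
    """Barely more than 'Hello World', but already wired into the monolith's seam."""
    order = request.get_json()
    # Trivial first rule; real scoring logic can be added iteratively.
    approved = order.get("total", 0) < 10_000
    return jsonify({"approved": approved})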

Conclusion

Starting with a seam lets you develop outside the monolith while still being in production with the first release.  No working in a silo for months at a time.  No recreating functionality.

It delivers faster, partially by doing less work, partially by enabling iterations.

Scaling is Legacy System Rescue That Pays 4x

In my last article, You Won’t Pay Me to Rescue Your Legacy System, I talked about my original attempt at specializing, and why it didn’t work.  I bumbled along until I lucked into a client that helped me understand when Legacy System Rescue becomes an Expensive Problem.

Rather than Legacy System Rescue, I was hired to do “keep the lights on” work.  The company had a three-developer team working on a next generation system; all I had to do was keep things running as smoothly as possible until they delivered.

The legacy system was buckling under the weight of their current customers.  Potential customers were waiting in line to give them money and had to be turned away.  Active customers were churning because the platform couldn’t keep up.

That’s when I realized: Legacy System Rescue may grudgingly get a single developer, but Scaling gets three developers to scale the system and one to keep the lights on.  Scaling is an Expensive Problem because failing to scale churns existing customers and turns away new ones.

Over 10 months I iteratively rescued the legacy system by fixing bugs and removing choke points.  In the end, after more than 50 developer-months of investment, the next generation system was completely scrapped.

The Lesson - Companies won’t pay to rescue a legacy system, but they’ll gladly pay 4x to scale up and meet demand.

You Won’t Pay Me to Rescue Your Legacy System

When I first started consulting, I tried to specialize in Legacy System Rescue.  I quickly learned that this is terrible positioning because Legacy System Rescue isn’t an Expensive Problem.  Jonathan Stark defines an Expensive Problem as one that someone would like to spend a lot of money to solve right now.

Legacy System Rescue is certainly a Big Problem.  Everyone agrees that a buggy system that makes development slow and painful is bad.  Errors in production are bad.  Spending time and resources to mitigate production outages is bad.  But there is no immediacy.  There’s no reason to spend a lot of money right now instead of waiting until the next feature ships, or the next quarter.  Letting things go just a little bit longer is usually why the system needs a rescue.

Hiring someone like me to come in, analyze the codebase, and find a way to untangle the mess is a lot of work.  Fixing bugs and making it easy to add new features is a low leverage situation.  It takes a lot of time from highly skilled developers.  Highly skilled developers in low leverage situations make Legacy System Rescue an Expensive Solution.  It will probably pay off for the company, but no one department is going to get enough value from fixing the legacy system to cover the costs.  The ROI gets worse when you factor in the resentment of the developers.  Bringing in an outsider to judge their work and dictate fundamental changes doesn’t fill people with joy.

Combine the two and you have a Tragedy of the Commons - a Big Problem that requires an Expensive Solution.  What you don’t have is a business case to spend a lot of money, right now, to fix things.

You won’t pay me to rescue your legacy system because paying a lot, right now, for the solution is worth less to you than living with the problem.

A Guide to Prepare for a System Rewrite

I often talk about the Best Alternative To A Total Rewrite: the idea that you should know the alternatives to a rewrite before making a decision.

Even when there is no alternative to a rewrite, you still need to work through the common problems that cause rewrite projects to fail.

These open-ended questions are designed to guide your preparations:

How will a rewrite solve your users’ pain?

Where is the current system failing your users?  How will a rewrite fix those problems? You and your fellow developers may hate the current codebase, but that isn’t a compelling argument.

Clients don’t want the Next Generation of your system.

Are there really no incremental improvements?

Now that you know what user pain you are trying to solve, is there really no way to resolve the problem in the current codebase?  

If the system is slow, is there really nothing you can do to make it faster?  Maybe a rewrite would make things so fast that users won’t notice any lag. But if a refactor would reduce frustrating lag by 50% in a month, you should refactor.

Once you've focused on your client's problems, reducing pain today is preferable to fixing the problem 6 months from now.

Do you have to recreate the existing system before adding new features?

After bugs and latency, the number one issue with old codebases is that it’s hard to add new features.  The original code was designed to expand in one way, and now it needs to expand differently. A rewrite will let you design a system to meet those new requirements.

What about a new system that is nothing but new features?  Can you find a way to make a new system that is a client of the old system?

A rewrite requires you to recreate all the features of the existing system before you can add the new features that create new business value.

How will you ensure that the new system doesn’t have the same problems as the original system?

The forces that caused the original system to go off the rails are going to work against the rewrite too.  

Management didn't want to spend time on unit tests?  They still won’t.  

Constantly changing business requirements?  Just wait until the demands aren’t constrained by the existing code.

Inadequate resources, logging, metrics and other tools?  A rewrite won’t make any of that better.

After a brief honeymoon a rewrite will face more pressure than the original system.  You need a plan to keep all of those external forces at bay.

What’s the release plan?

Replacement is not a release plan.  How is the rewrite getting into production?  Can you find a way to run the two systems in parallel?  Can you replace things a piece at a time?
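
One hedged sketch of replacing things a piece at a time: a thin routing layer in front of both systems that sends only the migrated paths to the rewrite.  The hostnames and path list below are made up for illustration.

# Hypothetical routing shim: only migrated paths go to the new system.
LEGACY_BASE = "http://legacy.internal"
REWRITE_BASE = "http://rewrite.internal"
MIGRATED_PREFIXES = ("/reports", "/exports")  # pieces already replaced

def pick_backend(path):
    """Return the base URL that should serve this request path."""
    if path.startswith(MIGRATED_PREFIXES):
        return REWRITE_BASE
    return LEGACY_BASE

assert pick_backend("/reports/monthly") == REWRITE_BASE
assert pick_backend("/billing") == LEGACY_BASE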

You will have to release the new system eventually. "When it's done" takes forever.

Conclusion

When you propose a rewrite, you need to be able to answer these questions for your teammates and managers.  Bringing the questions and answers up in your initial pitch will show that you’ve thought things through.

Don’t proceed until you can answer all five questions!  You just might happen to discover an alternative to a total rewrite.  Even if you don’t use it, you can move more confidently once you can speak to the alternatives.

Your Database is not a queue – A Live Example

A while ago I wrote an article, Your Database is not a Queue, where I talked about this common SaaS scaling anti-pattern. At the time I said:

Using a database as a queue is a natural and organic part of any growing system.  It’s an expedient use of the tools you have on hand. It’s also a subtle mistake that will consume hundreds of thousands of dollars in developer time and countless headaches for the rest of your business.  Let’s walk down the easy path into this mess, and how to carve a way out.

Today I have a live example of a SaaS company, layerci.com, proudly embracing the anti-pattern. In this article I will compare my descriptions with theirs, and point out the expensive and time-consuming problems they will face down the road.

None of this is to hate on layerci.com. An expedient solution that gets your product to market is worth infinitely more than a philosophically correct solution that delays giving value to your clients. My goal is to understand how SaaS companies get themselves into this situation, and to offer paths out of the hole.

What's the same

In my article I described a system evolving out of reporting needs. Here is layerci's version of the problem:

We hit it quickly at LayerCI - we needed to keep the viewers of a test run's page and the github API notified about a run as it progressed.

I described an accidental queue, while layerci is building one explicitly:

CREATE TYPE ci_job_status AS ENUM ('new', 'initializing', 'initialized', 'running', 'success', 'error');

CREATE TABLE ci_jobs(
	id SERIAL, 
	repository varchar(256), 
	status ci_job_status, 
	status_change_time timestamp
);

/*on API call*/
INSERT INTO ci_jobs(repository, status, status_change_time) VALUES ('https://github.com/colinchartier/layerci-color-test', 'new', NOW());

I suggested that after you have an explicit, atomic queue, your next scaling problem is failures. Layerci punts on this point:

As a database, Postgres has very good persistence guarantees - It's easy to query "dead" jobs with, e.g., SELECT * FROM ci_jobs WHERE status='initializing' AND NOW() - status_change_time > '1 hour'::interval to handle workers crashing or hanging.

What's different

There are a couple of differences between the two scenarios. They aren't material to my point, so I'll give them a quick summary:

  • My system imagines multiple job types, while layerci sticks to a single process type
  • layerci is doing some slick leveraging of PostgreSQL to alleviate the need for a Process Manager. This greatly reduces the work needed to get the system running.

What's the problem?

The main problem with layerci's solution is the amount of developer time spent designing it. For a small startup, the time and effort invested in a home-grown solution would almost certainly be better spent developing new features or talking with clients.

It's the failures

From a technical perspective, the biggest problem is the lack of failure handling. layerci punts on retries:

As a database, Postgres has very good persistence guarantees - It's easy to query "dead" jobs with, e.g., SELECT * FROM ci_jobs WHERE status='initializing' AND NOW() - status_change_time > '1 hour'::interval to handle workers crashing or hanging.

Handling failures is a lot of work, and something you get for free as part of a queue.

Without retries and poison queue handling, these failures will immediately impact layerci's clients and require manual human intervention. You can add failure support, but that's throwing good developer time after bad. Queues give you great support out of the box.
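
To show what “out of the box” means here, a minimal sketch using Celery as an example task queue. The broker URL, task name, and error type are my assumptions, not anything layerci uses; the point is that retries with backoff are a couple of keyword arguments rather than hand-rolled SQL.

from celery import Celery

app = Celery("ci", broker="redis://localhost:6379/0")  # assumed Redis broker

class TransientError(Exception):
    """Stand-in for a recoverable failure, e.g. a worker losing a connection."""

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def run_ci_job(self, job_id):
    try:
        execute_job(job_id)  # hypothetical function that does the real work
    except TransientError as exc:
        # The queue tracks attempts and re-delivers the message; once retries
        # are exhausted the failure surfaces instead of silently dying.
        raise self.retry(exc=exc)

def execute_job(job_id):
    ...  # placeholder for the actual CI work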

Monitoring should not be an afterthought

In addition to not handling failure, layerci's solution doesn't handle monitoring either:

Since jobs are defined in SQL, it's easy to generate graphql and protobuf representations of them (i.e., to provide APIs that checks the run status.)

This means that initially you'll be running blind on a solution with no retries. This is the "Our customers will tell us when there's a problem" school of monitoring. That's betting your client relationships on perfect software with no hiccups. I don't like those odds.

Scaling databases is expensive

The design uses a single, ever-growing jobs table, ci_jobs, which will store a row for every job forever. The article points out PostgreSQL's amazing ability to scale, which could keep you ahead of the curve. But database scaling is the most expensive piece of any cloud application stack.

Why pay to scale databases to support quick inserts, updates, and triggers on a million-row table? The database is your permanent record; a queue is ephemeral.

Conclusion

No judgement if you build a queue into your database to get your product to market. layerci has a clever solution, but it is incomplete. By the time you get it working at scale in production, you will have squandered tons of developer resources on a system that is more expensive to run than out-of-the-box solutions.

Do you have a queue in your database? Read my original article for suggestions on how to get out of the hole without doing a total rewrite.

Is your situation unique? I'd love to hear more about it!

Four ways Scaling Bugs are Different

Scaling Bugs don’t really exist; you will never find “unable to scale” in your logs.  Scaling bugs are timing, concurrency, and reliability bugs that emerge as your system scales.  Today I’m going to show you 4 signs that your system is being plagued by scaling bugs, and 4 things you can do to buy time and minimize your clients’ pain.

Scaling bugs boil down to “something that used to be reliable is no longer reliable, and your code doesn’t handle the failure gracefully”.  That means they are going to appear in the oldest parts of your codebase, be inconsistent and bursty, and hit your most valuable clients the hardest.

Scaling Bugs appear in older, stable, parts of your codebase

The oldest parts of your codebase are typically the most stable; that’s how they managed to get old.  But the code was also written with lower performance needs and higher reliability expectations.

Reliability bugs can lie dormant for years, emerging where you least expect them.  I once spent an entire week finding a bug deep in code that hadn’t changed in 10 years.  As long as there were no problems, everything was fine, but a database connection hiccup in one specific function would cause a cascading failure across a distributed task being processed on over 30 servers.

Database connectivity is ridiculously stable these days; you can have hundreds of servers and go weeks without an issue.  Unless your databases are overloaded, that is, and that’s exactly when the bug struck.

Scaling Bugs Are Inconsistent

Sometimes the system has trouble, sometimes things are fine.  Even more perplexing, scaling bugs occur regardless of the multi-threading or statefulness of your code.

This makes scaling bugs difficult to find, since you’ll never be able to reproduce them locally.  They won’t appear for a single test execution, only when you have hundreds or thousands of events happening simultaneously.

Even if your code is single-threaded and stateless, your system is multi-process and has state.  A serverless design still has scaling bottlenecks at the persistence layer.

Scaling Bugs Are Bursty

Bursty means that the bugs appear in clusters, usually in ever-increasing numbers at ever-shorter intervals.  Initially the error crops up once every few weeks and does minimal damage, so it gets documented as low priority and never worked on.  As your platform scales, though, the error starts popping up 5 at a time every few days, then dozens of times a day.  Eventually the low-priority, low-impact bug becomes an extremely expensive support problem.

Scaling Bugs Hit Your Most Valuable Clients Hardest

Which are the clients with the most contacts in a CRM?  Which are the ones with the most emails? The most traffic and activity?

The same ones paying the most for the privilege of pushing your platform to the limit.

The impact of scaling bugs falls mostly on your most valuable clients, which makes the potential damage high in dollar terms.

Four ways to buy time

These tactics aren’t solutions; they are ways to buy time to transform your system into one that operates at scale.  I’ll cover some scaling tactics in a future post!

Throw money at the problem

There’s never a better time to throw money at a problem than the early stages of scaling problems!  More clients + larger clients = more dollars available.

Increase the number of servers, upgrade the databases, and increase your network throughput.  If you have a multi-tenant setup, add shards and decrease the number of customers running on the same hardware.

If throwing money at the problem helps, then you know you have scaling problems.  You can also get a rough estimate of the time-for-money runway.  If the improved infrastructure doesn’t help, you can downgrade everything and stop spending the extra money.

Keep your Error Rate Low

It’s common for the first time you notice a scaling bug to be when it causes a cascading system failure.  However, it’s rare for that to be the first time the bug manifested itself.  Resolving those rare, low-priority bugs is key to keeping catastrophic scaling bugs at bay.

I once worked on a system that ran at over 1 million events per second (100 billion/day).  We had a saying: The nice thing about this system is that something that’s 1 in a million happens 60 times a minute.  The only known error we let stand: Servers would always fail to process the first event after a restart.

Retries

As load and scale increase, transient errors become more common.  Take a design cue from RESTful systems and add retry logic.  Most modern databases support upsert operations, which go a long way towards making it safe to retry inserts.
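
Here is a small sketch of the idea, assuming PostgreSQL and psycopg2; the table, columns, and backoff policy are illustrative.  The upsert makes the write idempotent, so retrying after a transient error can’t create duplicates.

import time
import psycopg2

UPSERT_SQL = """
    INSERT INTO events (event_id, payload)
    VALUES (%s, %s)
    ON CONFLICT (event_id) DO UPDATE SET payload = EXCLUDED.payload;
"""

def record_event(conn, event_id, payload, attempts=3):
    """Write an event, retrying transient failures; the upsert makes retries safe."""
    for attempt in range(1, attempts + 1):
        try:
            with conn, conn.cursor() as cur:  # commit on success, roll back on error
                cur.execute(UPSERT_SQL, (event_id, payload))
            return
        except psycopg2.OperationalError:
            if attempt == attempts:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying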

Asynchronous Processing

Most actions don’t need to be processed synchronously.  Switching to asynchronous processing makes many scaling bugs disappear for a while because the apparent processing time greatly improves.  You still have to do the processing work, and the overall latency of your system may increase. Slowly and reliably processing everything successfully is greatly preferable to praying that everything processes quickly.
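
As a rough sketch (the task name, broker, and report logic are hypothetical, using Celery for illustration), the switch often looks like replacing a blocking call in the request handler with an enqueue, and letting a worker do the slow part.

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumed Redis broker

@app.task
def generate_report(account_id):
    ...  # the slow work that used to run inside the web request

def handle_report_request(account_id):
    # Before: generate_report(account_id) blocked the request for seconds.
    # After: enqueue and return immediately; a worker processes it reliably.
    generate_report.delay(account_id)
    return {"status": "queued"}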

Congratulations!  You Have Scaling Problems!

Scaling bugs only hit systems that get used.  Take solace in the fact that you have something people want to use.

The techniques in this article will help you buy enough time to work up a plan to scale your system.  Analyze your scaling pain points to gain insight into which parts of your system are most useful to your clients and prioritize your refactoring accordingly.

Remember that there are always ways to scale your current system without resorting to a total rewrite!
