A Tale of Two Failures

It was the best of times, it was the worst of times.  The company hired contractors to do maintenance work so that the full time developers can write a new system, the company hired new developers to write a new system stuck original developers with the maintenance work.

When software has been bankrupted by tech debt, a common strategy is to start over.  Just like financial bankruptcy, the declaration is a way to buy time while continuing to run the business.  The software can’t be turned off, someone has to keep the old system running while the new system is built. Depending on your faith and trust in your current employees you will be tempted to either hire “short term” contractors to keep things running, or hire a new team to write the new system.

The two tactics play out differently, but the end result is the same, the new system will fail.

Why Maintenance Contractors Fail

Contractors fail because they won’t have the historical context for the code.  They may know how it works today, but not how it got there. Without deep business knowledge their actions are limited to superficial bug fixes and tactical features.  The job is to keep the lights on, not to push back on requests or ask Who, What, Where, When, Why.

Since the contractors are keeping the lights on for you while current employees build a new system, they aren’t going to care about the long term quality of the system.  It’s already bad, it’s already going away, why spend the extra time refactoring.

Contractors won’t be able to stabilize the system and buy your employees enough time to build the new system.  Equally bad, sometimes the contractors *will* stabilize the old system which becomes an excuse to cancel the new system and not restructure your debts.

Why Hiring a New Team Fails Too

The flip side is to leave the current team in place and hire a new team to write the new system.  This is super attractive when you suspect that the original team will recreate the same mistakes that led to disaster.

This tactic fails because the new team won’t have the historical context for the code.  Documentation and details will be superficial and lack all of the critical edge cases. All of which delays the project.  Meanwhile, your original team will become extremely demoralized.  

If they couldn’t keep things running before, wait until you tell them their whole job is to keep things running.  The most capable members will quickly leave and you’ll have to pull developers from the new team onto the old system.

You’ll end up pulling the new developers onto the old team and praying they don’t see it as bait-and-switch.

Don’t split on Old vs New

Both tactics fail because the teams are divided between old vs new.  In the end it doesn’t matter who is on which team, because you need old *and* new to be successful.

Keeping the team together ensures that everyone has context into how the system works and has skin in the game.

Instead of splitting old vs new, find a way to split the system in half.  Have each team own half the old system responsibilities, and one of the two new systems.  This gives everyone a stake in keeping the old system alive, a chance to work on the new system, and reduces the business risk of either team failing.

Once you find a way to split the responsibilities in half, you might even find a way to make iterative improvements, it's your Best Alternative to a Total Rewrite!

A Guide to Prepare for a System Rewrite

I often talk about the Best Alternative To A Total Rewrite.  The idea that you should know the alternatives to a rewrite before making a decision.

Even when there is no alternative to a rewrite you still need to work through the common problems that cause the project to fail.

These open ended questions are designed to guide your preparations:

How will a rewrite solve your users’ pain?

Where is the current system failing your users?  How will a rewrite fix those problems? You and your fellow developers may hate the current codebase, but that isn’t a compelling argument.

Clients don’t want the Next Generation of your system.

Are there really no incremental improvements?

Now that you know what user pain you are trying to solve, is there really no way to resolve the problem in the current codebase?  

If the system is slow, is there really nothing you can do to make it faster?  Maybe a rewrite would make things so fast that users won’t notice any lag. But if a refactor would reduce frustrating lag by 50% in a month, you should refactor.

Once you've focused on your client's problems, reducing pain today is preferable to fixing the problem 6 months from now.

Do you have to recreate the existing system before adding new features?

After bugs and latency, the number one issue with old codebases is that it’s hard to add new features.  The original code was designed to expand in one way, and now it needs to expand differently. A rewrite will let you design a system to meet those new requirements.

What about a new system that is nothing but new features?  Can you find a way to make a new system that is a client of the old system?

A rewrite requires you have to recreate all the features of the existing system before adding the new features that create new business value.

How will you ensure that the new system doesn’t have the same problems as the original system?

The forces that caused the original system to go off the rails are going to work against the rewrite too.  

Management didn't want to spend time on unit tests?  They still won’t.  

Constantly changing business requirements?  Just wait until the demands aren’t constrained by the existing code.

Inadequate resources, logging, metrics and other tools?  A rewrite won’t make any of that better.

After a brief honeymoon a rewrite will face more pressure than the original system.  You need a plan to keep all of those external forces at bay.

What’s the release plan?

Replacement is not a release plan.  How is the rewrite getting into production?  Can you find a way to run the two systems in parallel?  Can you replace things a piece at a time?

You will have to release the new system eventually. "When it's done" takes forever.

Conclusion

When you propose a rewrite, you need to be able to answer these questions to your teammates and managers.  Bringing the questions and answers up in your initial pitch will show that you’ve thought things through.

Don’t proceed until you can answer all five questions!  You just might happen to discover an alternative to a total rewrite.  Even if you don’t use it, you can move more confidently once you can speak to the alternatives.

Dead Database Pixel Tracking

Pixel Tracking is a common Marketing SaaS activity used to track page loads.  Today I am going to try and tie several earlier posts together and show how to evolve a frustrating Pixel Tracking architecture into one that can survive database outages.

Pixel Tracking events are synchronously written to the database.  A job processor uses the database as a queue to find updates, and farms out processing tasks.

Designed to Punish Users

This design is governed by database performance.  As the load ramps up, users are going to notice lagging page loads.  Worse, each event recorded will have to be processed, tripling the database load.

Designed to Scale

You can relieve the pressure on the user by making your Pixel Tracking asynchronous.  Moving away from using your database as a queue is more complicated, but critical for scaling.  Finally, using Topics makes it easy to expand the types of processing tasks your platform supports.

Users are now completely insulated from scale and processing issues.

Dead Database Design

There is no database in the final design because it is no longer relevant to the users’ interactions with your services.  The performance is the same whether your database is at 0% or 100% load.  

The performance is the same if your database falls over and you have to switch to a hot standby or even restore from a backup.

With a bit of effort your SaaS could have a database fall over on the run up to Black Friday and recover without data loss or clients noticing. If you are using SNS/SQS on AWS the queue defaults are over 100,000 events! It may take a while to chew through the queues, but the data won't disappear.

When your Pixel Tracking is causing your users headaches, going asynchronous is your Best Alternative to a Total Rewrite.

Putting Your Reputation on the Hook

This is a true story of how an inappropriate software design put a SaaS company’s reputation on the hook for their client’s bugs.  All names have been changed.

Once upon a time I consulted for Chicago Freight Brokers, a freight brokerage that was struggling with the performance of their in house systems.  CFB had an internal service, Express, that talked to 5 different transportation SaaS companies, added “special sauce” business value, and helped the brokers match trucks with loads.  Express was an MVP that was being crushed under the weight of its own success.

One of SaaS companies Express integrated with was ChicagoTMS.  ChicagoTMS, was a Transportation Management Service that built loads of freight and spoke to CFB’s accounting and insurance systems.  When a broker hit save on ChicagoTMS’s website, they would push the data into Express using a Webhook. One day I got asked to jump on a call, “ChicagoTMS is unbearably slow, it takes 30 seconds to save a load!  And they say it’s our fault! Can you talk to them?”

I jumped onto the call expecting that something had been lost in translation, after all, how could CFB’s scaling issues have any impact on ChicagoTMS’s web performance.

An hour later, I understood.  ChicagoTMS’s save operation looked like this:

The Webhook was called Synchronously as part of the save process, meaning users had to wait for the Webhook to complete successfully before they could continue their work.  CFB’s scaling problems were causing ChicagoTMS’s performance problems!

Fortunately, the culprit was a missing database index and the save time quickly dropped from 30 seconds to 2 seconds.  Unfortunately for ChicagoTMS, their design made them responsible for their client’s performance issues. ChicagoTMS’s reputation took a hit from CFB’s simple mistake.

A Better Design

Webhooks are convenient, but they are not a reliable or guaranteed delivery mechanism.  

A better design would have been for ChicagoTMS to make their Webhooks asynchronous.  This would have kept ChicagoTMS’s services fast and responsive, regardless of whether CFB made mistakes, did a lot of processing of the message, or was overwhelmed by a spike in load.

Asynchronous Webhooks would have given the users a much better experience and saved ChicagoTMS a lot of repetitional damage with their clients.

TAKE Your Reputation oFF the Hook!

Webhook performance will always be outside your control.  When you call a client synchronously you put your reputation in the client’s hands.  Your users won’t understand or care who exactly is responsible, if it’s your website, you get the blame.  

Go asynchronous and take your reputation off the hook!

I wouldn’t have done it this way.

I confess to having said, “I wouldn’t have done it this way.”  The phrase seems like a polite way of trashing the current system architecture while implying that you know the correct design.

You might get a snicker and feel smart and clever, but you’re creating problems for yourself and your project.

It puts the original programmers on the defensive.  

You may be politely trashing their design, but you’re still trashing their design.  Instead of listening to you describe your superior solution, the original programmers are going to be thinking of arguments defending their work.

The Business People Don’t Care

The managers and executives in the room didn’t care about how the original programmers designed the system.  They don’t care about how you would have designed the system. They want to know what you’re going to do about their business problems.

It leaves the listener questioning your intent.  

Are you changing the design because you don’t like it, or because it needs to be done?  You never want anyone wondering if you are proposing a refactor or rewrite because of style.

Your way might not be possible

I once used PostgreSQL as a noSql system because all the other AWS options had row size limits that were too small.  For years after new developers would tell me that I should have used different technologies. I would explain the size constraint, and more often than not, show that AWS still had constraints that would prevent us from using the technology.

Instead of trashing the original design, try these two approaches instead:

Talk about the business problem with the current design.

“The current design won’t scale to the levels we need.”  Maybe it won’t scale because it was a bad design, maybe it was a massively successful MVP.  Either way, you need to replace it to go forward.

Be positive

“The current design was a great way to get started.”  If the original design was an abject failure, no one would be asking you to rebuild or expand it.  Acknowledge that, whatever its shortcomings, the original design moved the business forward.

Acknowledge your ignorance

“I don’t know what the original requirements were, but the current design isn’t a good fit for our needs”.  It’s useful to know why past choices were made so that you don’t miss requirements, and people are much more likely to tell you if you’re the first to admit you don’t know.

“I wouldn’t have done it this way” is an old developer cliche.  The developers who inherited your work are probably saying it about you right now.

Making Link Tracking Scale – Part 2 Edge Caching

In Part 1 of Making Link Tracking Scale I showed how switching event recording from synchronous to asynchronous processing creates a superior, faster and more consistent user experience.  In Part 2, I will discuss how Link Tracking scaling issues are governed by Long Tails, and how to overcome the initial burst using edge caching and tools like Memcache and Redis.

The Long Tails of Link Tracking

When your client sends an email campaign, publishes new content your link tracker will experience a giant burst of activity, which will quickly decay like this:

To illustrate with some numbers, imagine an email blast that results in 100,000 link tracking events.  80% of those will occur in the first hour.  

In our original design from Part 1, that would 22 URL lookups, and 22 inserts per second:

For simplicity, pretend that inserts and selects produce similar db load.  Your system would need to support 44 events/s to avoid slowdowns and frustrating your clients.

The asynchronous model:

Reduces the load to 22 URL lookups, and a controllable number of inserts.  Again for simplicity let’s go with 8 inserts/s, for a total of 30 events/s.  That’s a 1/3 reduction in load!

But, your system is still looking up the Original URL 22 times/s.  That’s a lot of unnecessary db load.

Edge Caching The Original URL

The Original URL is static data that can be cached on the web server instead of loaded from the database for each event.  Instead, each server would retrieve the Original URL from the db once, store it in memory, and reuse it as needed.

This effectively drops the lookup rate from 22 events/s to 0 events/s, reducing the db load to 8 events/s, a 55% drop!  Combined with the asynchronous processing improvements from Part 1, that’s an 80% reduction in max database load.

Edge Caching on the servers works for a while, but as your clients expand the number of URLs you’ll need to keep track of won’t fit in server memory.  At that point you’ll need to add in tools like Memcached or Redis.  Like web servers, these tools are a lot cheaper than scaling your database.

Consistent Load on the Database

The great thing about this design is that you can keep the db load consistent, regardless of the incoming traffic.  Whether the load is 44 events/s or 100 events/s you control the rate of asynchronous processing. So long as you have room on your servers for an internal queue, or if you use an external queue like RabbitMQ or SQS you can delay processing the events.

Scaling questions become discussions about cost and how quickly your clients need to see results.

Conclusion

Caching static data is a great way to reduce database load.  You can use prebuilt libraries like Guava for Java, cacheout for Python, or dozens of others.  You can also leverage distributed cache systems like Memcached and Redis. While there’s no such thing as a free lunch, web servers and distributed caches are much much cheaper to scale than databases. 

You’ll save money and deliver a superior experience to your clients and their users!

It’s Not Sabotage, They’re Drowning

As you gain visibility into system problems, you'll often experience push back like this:

I refactored the code from untested and untestable, to testable with 40% test coverage. The senior architect is refusing to merge because the test coverage is to low.

Me: I instrumented the save process. Saving takes between 10-25s, with the average around 16s!
Other Developer: That's crazy! Wait! How much time does the metric system add?
Me: About 500ms.
OD: That's over a 3% increase! You have to turn the metrics off!

Me: Good news, I can make the system 15% faster and support jobs about 6x our current max. Bad news, it uses short lived DB table, which will increase disk usage by up to 1GB/client.
Different Developer: You need to find another way, we can't have that much additional disk usage.
Me: What about all those orphaned short lived tables lingering about due to bugs and errors? Would cleaning those up get us enough space?
DD: I don't know, we don't have any visibility into the magnitude of that problem. You're going to have to find another solution.

I used to believe that this kind of push back was intentional sabotage. That the people responsible for creating the problems were threatened by exposure.

It took me many years to learn that most of the time, it's not sabotage, it's drowning people sinking the lifeboat in an attempt to save themselves.

When you add visibility to a system, the numbers are always bad. That's why you're putting in the effort to add visibility to an existing system. When the initial steps towards fixing your problems make that number worse, that's when the fear of drowning sets in.

I don't have a solution for dealing with your colleagues' reactions. Just remember, they aren't trying to sabotage you, they're drowning, and you're manning the lifeboat.

If you've got similar stories of metrics bringing pushback, I'd love to hear from you!

Making Link Tracking Scale – Part 1 Asynchronous Processing

Link Tracking is a core activity for Marketing and CRM SaaS companies.  Link Tracking often an early system bottleneck, one that creates a lousy user experience and frustrates your clients.  In this article I’m going to show a common synchronous design, discuss why it fails to scale, and show how to overcome scaling issues by making the design asynchronous.

What Is Link Tracking?

Link Tracking allows you to track who clicks on a link.  This lets you measure the effectiveness of your marketing, learn what offers appeal to which clients, and generally track user engagement.  When you see a link starting with fb.me or lnkd.in, those are tracking links for Facebook and LinkedIn.

Instead of having a link go to original target, the link is changed to a tracking link.  The system will track 3 pieces of data: which client, which user, and what url, and then redirect the user’s browser to the original link.

A Simple Synchronous Design

Here’s what that looks like as a sequence diagram. 

There are 2 trips to the database, first to discover what the original link is, and a second to record the click.  After all of that is done, the original link is returned to the user and their browser is redirected to the actual content they are looking for.

Best case on a cloud host like AWS the Server + Database time will be about 10ms.  That time will be dwarfed by the 50-100ms from general network latency getting to AWS, through the ELB and to the server.

This design is simple, speedy, and works well enough for your early days.

Why Synchronous Processing Fails to Scale

Link Tracking events tend to be spikey - there’s an email blast, an article is published, or some tweet goes viral.  Instead of 150,000 events/day uniformly spread over 2 events/s, your system will suddenly be hit with 100 events/s, or even 10,000/s.  Looking up the URL and recording the event will spike from 10ms to 1s or even 10s.

While your system records the event, the user waits.  And waits. Often closing the browser tab without ever seeing your content.

Upgrading the database’s hardware is an expensive way to buy time, but it’ll work for a while.  Eventually though, you’ll have to go asynchronous.

How Asynchronous Scales

With Asynchronous Processing, it becomes the responsibility of the Server to remember the Link Tracking event and process it later.  Depending on your tech stack this can be done a lot of different ways including Threads, Callbacks, external queues and other forms of buffering the data until it can be processed.

The important part, from a scaling perspective, is that the user is redirected to the original URL as quickly as possible.

The user doesn’t care about Link Tracking, and with Asynchronous Processing you won’t make your users wait while you write to the database.

Making the event processing asynchronous is an important first step towards making a scalable system. In part 2 I will discuss how caching the URLs will improve the design further.

Three Ways To Refactor a Legacy System – A Cheesy Analogy

Software is immortal, but systems age. They reach maximum capacity and can't scale to support additional clients. They get twisted into knots as your business evolves in ways the system wasn't designed to support.

Without constant vigilance you end up with a system that your developers hate to work on and your clients find frustrating. You realize the current system is holding your business back and ask for options.

The most common answer, unfortunately, is the "6 month rewrite", also known as "a big bang." Just give your developers 6 months and they will produce a new system that does all of the good things from the old system, and none of the bad.

The "6 month rewrite" almost never works and often leaves your company in a worse situation because of all the wasted time and resources. I'm going to try and explain why with a very cheesy analogy, and suggest 2 much more effective strategies.

A very cheesy analogy

Imagine, that this piece of string cheese is your system:

"A 6 month rewrite" or "big bang" is the idea that your developers are going to shove the whole thing in their mouths and chew the whole log.

You won't really see any progress during the system mastication, but you'll be able to see the developer's jaws chewing furiously.

6 months is a long time to have developers working on one and only one thing. Especially when the chewing takes longer than expected and you reach the 9, 12, 18 month point. If you stop you'll be left with this:

Your original system. Worse for the wear and tear, but fundamentally, the original system that is restricting your business.

It's the worst of all worlds, you get no value unless the whole cheese is chewed, and you loose all the potential value if you stop!

Cut it up INTO small pieces

A great strategy when your system is failing due to scaling issues is to cut it up and refactor small pieces. Scaling issues include your system not being fast enough, unable to handle enough clients, or unable to handle large clients.

You can analyze which of these pieces are responsible for the bottlenecks in your system and tackle just those pieces:

And if you have to stop work on a single piece?

Your potential loses are much smaller.

Steel threads

When your system has been tied in knots due to changing requirements, replacing individual pieces won't help. Instead, try peeling off small end-to-end slices, creating stand alone pieces that work the way your business works now:

This is the "steel thread" or "tracer bullet" model for refactoring a system.

It allows you to try small, quick, ways to build a new system. Each thread adds immediate value as it is completed. You don't run the risk of having a large body of work that isn't helping your clients.

Like the "small pieces" strategy, you can stop and start without much loss.

Conclusion

6 month rewrites are risky and likely to fail and leave you with nothing of value from your investment of time and resources. Small piece and steel thread strategies offer ways to quickly get incremental value into your client's hands, and greatly reduce the risk of wasted work. They're your best alternative to a total rewrite!

Your Database is not a queue – A Live Example

A while ago I wrote an article, Your Database is not a Queue, where I talked about this common SaaS scaling anti-pattern. At the time I said:

Using a database as a queue is a natural and organic part of any growing system.  It’s an expedient use of the tools you have on hand. It’s also a subtle mistake that will consume hundreds of thousands of dollars in developer time and countless headaches for the rest of your business.  Let’s walk down the easy path into this mess, and how to carve a way out.

Today I have a live example of a SaaS company, layerci.com, proudly embracing the anti-pattern. In this article I will compare my descriptions with theirs, and point out expensive and time consuming problems they will face down the road.

None of this is to hate on layerci.com. An expedient solution that gets your product to market is worth infinitely more than a philosophically correct solution that delays giving value to your clients. My goal is to understand how SaaS companies get themselves into this situation, and offer paths our of the hole.

What's the same

In my article I described a system evolving out of reporting, layerci's problem:

We hit it quickly at LayerCI - we needed to keep the viewers of a test run's page and the github API notified about a run as it progressed.

I described an accidental queue, while layerci is building one explicitly:

CREATE TYPE ci_job_status AS ENUM ('new', 'initializing', 'initialized', 'running', 'success', 'error');

CREATE TABLE ci_jobs(
	id SERIAL, 
	repository varchar(256), 
	status ci_job_status, 
	status_change_time timestamp
);

/*on API call*/
INSERT INTO ci_job_status(repository, status, status_change_time) VALUES ('https://github.com/colinchartier/layerci-color-test', 'new', NOW());

I suggested that after you have an explicit, atomic, queue your next scaling problem is with failures. Layerci punts on this point:

As a database, Postgres has very good persistence guarantees - It's easy to query "dead" jobs with, e.g., SELECT * FROM ci_jobs WHERE status='initializing' AND NOW() - status_change_time > '1 hour'::interval to handle workers crashing or hanging.

What's different

There are a couple of differences between the two scenarios. They aren't material towards my point so I'll give them a quick summary:

  • My system imagines multiple job types, layerci is sticking to a single process type
  • layerci is doing some slick leveraging of PostgreSQL to alleviate the need for a Process Manager. This greatly reduces the amount of work needed to make the system work.

What's the problem?

The main problem with layerci's solution is the amount of developer time spent designing the solution. As a small startup, the time and effort invested in their home grown solution would almost certainly have been better spent developing new features or talking with clients.

It's the failures

From a technical perspective, the biggest problem is lack of failure handling. layerci punts on retries:

As a database, Postgres has very good persistence guarantees - It's easy to query "dead" jobs with, e.g., SELECT * FROM ci_jobs WHERE status='initializing' AND NOW() - status_change_time > '1 hour'::interval to handle workers crashing or hanging.

Handling failures is a lot of work, and something you get for free as part of a queue.

Without retries and poison queue handling, these failures will immediately impact layerci's clients and require manual human intervention. You can add failure support, but that's throwing good developer time after bad. Queues give you great support out of the box.

Monitoring should not be an afterthought

In addition to not handling failure, layerci's solution doesn't handle monitoring either:

Since jobs are defined in SQL, it's easy to generate graphql and protobuf representations of them (i.e., to provide APIs that checks the run status.)

This means that initially you'll be running blind on a solution with no retries. This is the "Our customers will tell us when there's a problem" school of monitoring. That's betting your client relationships on perfect software with no hiccups. I don't like those odds.

SCALING Databases is expensive

The design uses a single, ever growing jobs table ci_jobs, which will store a row for every job forever. The article points out postgreSQL's amazing ability to scale, which could keep you ahead of the curve forever. Database scaling is the most expensive piece in any cloud application stack.

Why pay to scale databases to support quick inserts, updates and triggers on a million row table? The database is your permanent record, a queue is ephemeral.

Conclusion

No judgement if you build a queue into your database to get your product to market. layerci has a clever solution, but it is incomplete, and by the time you get it to work at scale in production you will have squandered tons of developer resources to get a system that is more expensive to run than out of the box solutions.

Do you have a queue in your database? Read my original article for suggestions on how to get out of the hole without doing a total rewrite.

Is your situation unique? I'd love to hear more about it!

Site Footer