Making Link Tracking Scale – Part 1 Asynchronous Processing

Link Tracking is a core activity for Marketing and CRM SaaS companies.  Link Tracking often an early system bottleneck, one that creates a lousy user experience and frustrates your clients.  In this article I’m going to show a common synchronous design, discuss why it fails to scale, and show how to overcome scaling issues by making the design asynchronous.

What Is Link Tracking?

Link Tracking allows you to track who clicks on a link.  This lets you measure the effectiveness of your marketing, learn what offers appeal to which clients, and generally track user engagement.  When you see a link starting with fb.me or lnkd.in, those are tracking links for Facebook and LinkedIn.

Instead of having a link go to original target, the link is changed to a tracking link.  The system will track 3 pieces of data: which client, which user, and what url, and then redirect the user’s browser to the original link.

A Simple Synchronous Design

Here’s what that looks like as a sequence diagram. 

There are 2 trips to the database, first to discover what the original link is, and a second to record the click.  After all of that is done, the original link is returned to the user and their browser is redirected to the actual content they are looking for.

Best case on a cloud host like AWS the Server + Database time will be about 10ms.  That time will be dwarfed by the 50-100ms from general network latency getting to AWS, through the ELB and to the server.

This design is simple, speedy, and works well enough for your early days.

Why Synchronous Processing Fails to Scale

Link Tracking events tend to be spikey – there’s an email blast, an article is published, or some tweet goes viral.  Instead of 150,000 events/day uniformly spread over 2 events/s, your system will suddenly be hit with 100 events/s, or even 10,000/s.  Looking up the URL and recording the event will spike from 10ms to 1s or even 10s.

While your system records the event, the user waits.  And waits. Often closing the browser tab without ever seeing your content.

Upgrading the database’s hardware is an expensive way to buy time, but it’ll work for a while.  Eventually though, you’ll have to go asynchronous.

How Asynchronous Scales

With Asynchronous Processing, it becomes the responsibility of the Server to remember the Link Tracking event and process it later.  Depending on your tech stack this can be done a lot of different ways including Threads, Callbacks, external queues and other forms of buffering the data until it can be processed.

The important part, from a scaling perspective, is that the user is redirected to the original URL as quickly as possible.

The user doesn’t care about Link Tracking, and with Asynchronous Processing you won’t make your users wait while you write to the database.

Making the event processing asynchronous is an important first step towards making a scalable system. In part 2 I will discuss how caching the URLs will improve the design further.

Your Database is not a queue – A Live Example

A while ago I wrote an article, Your Database is not a Queue, where I talked about this common SaaS scaling anti-pattern. At the time I said:

Using a database as a queue is a natural and organic part of any growing system.  It’s an expedient use of the tools you have on hand. It’s also a subtle mistake that will consume hundreds of thousands of dollars in developer time and countless headaches for the rest of your business.  Let’s walk down the easy path into this mess, and how to carve a way out.

Today I have a live example of a SaaS company, layerci.com, proudly embracing the anti-pattern. In this article I will compare my descriptions with theirs, and point out expensive and time consuming problems they will face down the road.

None of this is to hate on layerci.com. An expedient solution that gets your product to market is worth infinitely more than a philosophically correct solution that delays giving value to your clients. My goal is to understand how SaaS companies get themselves into this situation, and offer paths our of the hole.

What’s the same

In my article I described a system evolving out of reporting, layerci’s problem:

We hit it quickly at LayerCI – we needed to keep the viewers of a test run’s page and the github API notified about a run as it progressed.

I described an accidental queue, while layerci is building one explicitly:

CREATE TYPE ci_job_status AS ENUM ('new', 'initializing', 'initialized', 'running', 'success', 'error');

CREATE TABLE ci_jobs(
	id SERIAL, 
	repository varchar(256), 
	status ci_job_status, 
	status_change_time timestamp
);

/*on API call*/
INSERT INTO ci_job_status(repository, status, status_change_time) VALUES ('https://github.com/colinchartier/layerci-color-test', 'new', NOW());

I suggested that after you have an explicit, atomic, queue your next scaling problem is with failures. Layerci punts on this point:

As a database, Postgres has very good persistence guarantees – It’s easy to query “dead” jobs with, e.g., SELECT * FROM ci_jobs WHERE status='initializing' AND NOW() - status_change_time > '1 hour'::interval to handle workers crashing or hanging.

What’s different

There are a couple of differences between the two scenarios. They aren’t material towards my point so I’ll give them a quick summary:

  • My system imagines multiple job types, layerci is sticking to a single process type
  • layerci is doing some slick leveraging of PostgreSQL to alleviate the need for a Process Manager. This greatly reduces the amount of work needed to make the system work.

What’s the problem?

The main problem with layerci’s solution is the amount of developer time spent designing the solution. As a small startup, the time and effort invested in their home grown solution would almost certainly have been better spent developing new features or talking with clients.

It’s the failures

From a technical perspective, the biggest problem is lack of failure handling. layerci punts on retries:

As a database, Postgres has very good persistence guarantees – It’s easy to query “dead” jobs with, e.g., SELECT * FROM ci_jobs WHERE status='initializing' AND NOW() - status_change_time > '1 hour'::interval to handle workers crashing or hanging.

Handling failures is a lot of work, and something you get for free as part of a queue.

Without retries and poison queue handling, these failures will immediately impact layerci’s clients and require manual human intervention. You can add failure support, but that’s throwing good developer time after bad. Queues give you great support out of the box.

Monitoring should not be an afterthought

In addition to not handling failure, layerci’s solution doesn’t handle monitoring either:

Since jobs are defined in SQL, it’s easy to generate graphql and protobuf representations of them (i.e., to provide APIs that checks the run status.)

This means that initially you’ll be running blind on a solution with no retries. This is the “Our customers will tell us when there’s a problem” school of monitoring. That’s betting your client relationships on perfect software with no hiccups. I don’t like those odds.

SCALING Databases is expensive

The design uses a single, ever growing jobs table ci_jobs, which will store a row for every job forever. The article points out postgreSQL’s amazing ability to scale, which could keep you ahead of the curve forever. Database scaling is the most expensive piece in any cloud application stack.

Why pay to scale databases to support quick inserts, updates and triggers on a million row table? The database is your permanent record, a queue is ephemeral.

Conclusion

No judgement if you build a queue into your database to get your product to market. layerci has a clever solution, but it is incomplete, and by the time you get it to work at scale in production you will have squandered tons of developer resources to get a system that is more expensive to run than out of the box solutions.

Do you have a queue in your database? Read my original article for suggestions on how to get out of the hole without doing a total rewrite.

Is your situation unique? I’d love to hear more about it!

What does Go Asynchronous mean?

In an earlier post I suggested Asynchronous Processing as a way to buy time to handle scaling bugs.  Remembering my friend and his comment “assume I have a hammer, a screwdriver, and a database”, today’s post will explain Synchronous versus Asynchronous processing and discuss how asynchronous processing will help your software scale.

Processing: Synchronous versus Asynchronous

Synchronous Explained

Synchronous processing means that each step starts, does some action, and then starts the next step.  Eventually the last action completes and returns, and so on back.

A basic synchronous web request looks like this:

A user clicks save and the browser tells the server to save the data.  The server tells the database. The database returns OK, then the server returns OK, and the browser shows a Save Successful message.

Simple to understand, but when you are having scaling problems, sometimes that save time can go from 100ms to 10s.  It’s a horrible user experience and unnecessary wait!

Asynchronous Explained

Asynchronous Processing gives a superior user experience by returning to the browser immediately. The actual save will be processed later. This makes things more complex because the request has been decoupled from the processing.

The user is now insulated from scaling issues.  It doesn’t matter if the save takes 100ms or 10s, the user gets a consistent experience.

In an asynchronous model, the user doesn’t get notified that the save was successful.  For most cases this is fine, the user shouldn’t be worried about whether their actions are succeeding, the client should be able to assume success.

The client being able to assume success does not mean your system can assume success!  Your system still needs to handle failures, exceptions and retries! You just don’t need to drag the user into it. Since you no longer have a direct path from request through processing, asynchronous operations can be harder to reason about and debug.

For instances where “blind” asynchronous isn’t acceptable you need a polling mechanism so that the user can check on the status.

How Asynchronous Processing Helps Systems to Scale

With synchronous processing your system must process all of the incoming activity and events as they occur, or your clients will experience random, intermittent, failures.

Synchronous scaling results in numerous business problems:

  • It runs up infrastructure costs. The only way to protect service level agreements is by greatly over provisioning your system so that there is significant excess capacity.
  • It creates repetitional problems. Clients can easily impact each other with cyclical behavior.  Morning email blasts, hourly advertising spending rates, and Black Friday are some examples.
  • You never know how much improvement you’ll get out of the next fix.  As your system scales you will always be rate-limited by a single bottleneck.  If your system is limited to 100 events/s because your database can only handle 100 events/s, doubling the hardware might get you to 200 events/s, or you might discover that your servers can only handle 120 events/s. 
  • You don’t have control over your system’s load.  The processing rate is set by your clients instead of your architecture. There is no way to relieve pressure on your system without a failure.

Asynchronous processing gives you options:

  • You can protect your service level agreements by pushing incoming events onto queues and acknowledging the event instantly.  Whether it takes 100ms, 1s, or 10 minutes to complete processing, your system is living up to its service level agreements.
  • After quickly acknowledging the event, you can control the rate at which the queued events are processed at a client level.  This makes it difficult for your large clients to starve out the smalls ones.
  • Asynchronous architecture forces you to loosely couple your system’s components. Each piece becomes easy to load test in isolation, giving you’ll have a pretty good idea about how much a fix will actually help. It also makes small iterations much more effective.  Instead of spending 2x to double your databases when your servers can only support another 20%, you can increase spending 20% to match your server’s max capacity. Loosely coupled components can also be worked on by different teams at the same time, making it much easier to scale your system.
  • You regain control over system load.  Instead of everything, all at once, you can set expectations.  If clients want faster processing guarantees, you can now not only provide them, but charge accordingly.

Conclusion

Shifting from synchronous to asynchronous processing will require some refactoring of your current system, but it’s one of the most effective ways to overcome scaling problems.  You can be highly tactical with your implementation efforts and apply asynchronous techniques at your current bottlenecks to rapidly give your system breathing room.  

If your developers are ready to give up on your current system, propose one or two spots to make asynchronous. You will get your clients some relief while rebuilding your team’s confidence and ability to iterate. It’s your best alternative to a total rewrite!

Creating Alternatives To Rewrites By Using Topics

I’ve been explaining what queues are, and why using a database as a queue is a bad idea, today I’m going to expand your toolkit, explain how Topics create alternatives to rewrites, and give a concrete example.

Topics allow multiple queues to register for incoming messages.  That means instead of publishing a message onto a queue, you publish onto zero or more queues at once, and there is no impact on the publisher.  One consumer, no consumer, 100 consumers, you publish one message onto a topic.

All of these situations require the same effort and resources from your publisher.

For a SaaS company with services running off queues, Topics give your developers the ability to create new services that run side-by-side with your existing infrastructure.  New functionality off of your existing infrastructure, without doing a rewrite! How does that work?

Adding a new consumer means adding another Queue to the Topic. 

No code changes for any existing services.  This is extremely valuable when the existing services are poorly documented and difficult to test.

You can test new versions of your code through end-to-end tests.  

Since you can now create two sets of the data, you can run the new version in parallel with the current version and compare the results.  Rinse, repeat until you feel confident in sending your customers results from the new system.

It’s not ideal, but you’ll sleep a whole lot easier at night knowing that the original code and original output remains untouched.

New uses for your message flow have no impact on your existing services.  

Consuming data becomes “loosely coupled”.  Freed from potentially impacting the existing, difficult, code, new reports, monitoring and other ideas become feasible and exciting instead of dread inducing.  New uses don’t even have to be in the same programming language!

A concrete example; How Topics can be used to create monitoring on a legacy system:

I worked for a company that was processing jobs off of a queue.  This was an older system that had evolved over a decade and was a mess of spaghetti code.  It mostly worked, but was not designed for any kind of observability. Because jobs like hourly reports would run, rerun, and even retry, knowing whether a specific hourly report completed successfully was a major support headache.

When challenged to improve the situation the lead developer would shrug and say that nothing could be done with the current code.  Instead, he had a plan to do a full rewrite of the scheduler system with logging, tests, and observability baked in. The rewrite would take 6 months.  The flaws, bugs and angry customers weren’t quite enough to justify a developer spending 6 months developing a new system. Especially since the new system wouldn’t add value until it was complete.  The company didn’t have the resources for a rewrite, but it did have me.

The original system was using SQS on AWS as the queue.  We changed the scheduler code to use AWS’s Topic service, SNS, instead.  We had SNS write incoming messages to the original SQS queue, and called it a release.

We now had the option and ability to add new services without any further disruption or risk to the original job processor.

We created a new service with the creative name Task Monitor, created a new SQS queue and added it as a listener to SNS.  Task Monitor maintained a list of active tasks. It would read messages off a queue and create an entry in an in memory list.  Every 5 minutes it would iterate the list and check the status of the task against the database and remove completed tasks.

Surviving tasks were added to “Over 5 min” list, “Over 10 min list”, etc and the data was exposed via a simple web api framework.  Anything over 45 minutes resulted in an alert being generated.

We now had visibility into which tasks were slipping through the cracks, and with the pattern exposed we were quickly able to fix the bugs.  Client complaints ceased (about scheduled reports anyway), which reduced the support load by about 60% of one developer. With almost 3 additional developer days per week we were able to start knocking out some long delayed features and refactoring.

All of these changes were created by a simple change of a call to SQS to a call to SNS.  We didn’t need to dive deep into the legacy system to add monitoring and instrumentation.

The additional cost and load of using Topics is negligible, but they create amazingly powerful opportunities, especially for legacy systems that are difficult to refactor.

When your developers say that there’s no way to improve a queue based system without rewriting it, look into Topics.  They’re your Best Alternative to a Total Rewrite.