Making Link Tracking Scale – Part 1 Asynchronous Processing

Link Tracking is a core activity for Marketing and CRM SaaS companies.  Link Tracking often an early system bottleneck, one that creates a lousy user experience and frustrates your clients.  In this article I’m going to show a common synchronous design, discuss why it fails to scale, and show how to overcome scaling issues by making the design asynchronous.

What Is Link Tracking?

Link Tracking allows you to track who clicks on a link.  This lets you measure the effectiveness of your marketing, learn what offers appeal to which clients, and generally track user engagement.  When you see a link starting with fb.me or lnkd.in, those are tracking links for Facebook and LinkedIn.

Instead of having a link go to original target, the link is changed to a tracking link.  The system will track 3 pieces of data: which client, which user, and what url, and then redirect the user’s browser to the original link.

A Simple Synchronous Design

Here’s what that looks like as a sequence diagram. 

There are 2 trips to the database, first to discover what the original link is, and a second to record the click.  After all of that is done, the original link is returned to the user and their browser is redirected to the actual content they are looking for.

Best case on a cloud host like AWS the Server + Database time will be about 10ms.  That time will be dwarfed by the 50-100ms from general network latency getting to AWS, through the ELB and to the server.

This design is simple, speedy, and works well enough for your early days.

Why Synchronous Processing Fails to Scale

Link Tracking events tend to be spikey – there’s an email blast, an article is published, or some tweet goes viral.  Instead of 150,000 events/day uniformly spread over 2 events/s, your system will suddenly be hit with 100 events/s, or even 10,000/s.  Looking up the URL and recording the event will spike from 10ms to 1s or even 10s.

While your system records the event, the user waits.  And waits. Often closing the browser tab without ever seeing your content.

Upgrading the database’s hardware is an expensive way to buy time, but it’ll work for a while.  Eventually though, you’ll have to go asynchronous.

How Asynchronous Scales

With Asynchronous Processing, it becomes the responsibility of the Server to remember the Link Tracking event and process it later.  Depending on your tech stack this can be done a lot of different ways including Threads, Callbacks, external queues and other forms of buffering the data until it can be processed.

The important part, from a scaling perspective, is that the user is redirected to the original URL as quickly as possible.

The user doesn’t care about Link Tracking, and with Asynchronous Processing you won’t make your users wait while you write to the database.

Making the event processing asynchronous is an important first step towards making a scalable system. In part 2 I will discuss how caching the URLs will improve the design further.

What is a Queue Anyway?

Several readers wrote in response to my article Your Database is not a Queue to tell me that they don’t know what a queue is, or why they would use one.  As one reader put it, “assume I have 3 tools, a hammer, a screwdriver and a database”. Fair enough, this week I want to talk about what a message queue is, what features it offers, and why it is a superior solution for batch processing.

To keep things as concrete as possible, I’ll use a real world example, Nightly Report Generation, and AWS technologies.

Your customers want to know how things are going.  If your SaaS integrates with a shopping cart, how many sales did you do?  If you do marketing, how many potential customers did you reach? How many leads were generated?  Emails sent? Whatever your service, you should be letting your customers know that you’re killing it for them.

Most SaaS companies have some form of RESTful API for customers, and initially you can ask clients to help themselves and generate reports on demand.  But as you grow, on demand reporting becomes to slow. Code that worked for a client with 200 customer shopping carts a day may be to slow at 2,000 or 20,000 carts.  These are great problems to have!

To meet your client’s needs, you need a system to generate reports overnight.  It’s not client facing, so it doesn’t need to be RESTful, and it’s not driven by client activity, so it won’t run itself.

Enter queues!

For this article we will use AWS’s SQS, which stands for Simple Queue Service.  There are lots of other wonderful open source solutions like Rabbit MQ, but you are probably already using AWS, and SQS is cheap and fully managed.

What does a queue do?

  • A queue gives you a list of messages that can only be accessed in order.
  • A queue guarantees that one, and only one, consumer can see a message at a time.
  • A queue guarantees that messages do not get lost, if the consumer takes a message from the queue, it has a certain amount of time to tell the queue that the message is done, or the queue will assume something has gone wrong and make the message available again.
  • A queue offers a Dead Letter Queue, if something goes wrong to many times the message is removed and placed on the Dead Letter Queue so that the rest of the list can be processed

As a practical matter all that boils down to:

  • You can run multiple instances of the report generator without worrying about missing a client, or sending the same report twice.  You get to skip the early concurrency and scaling problems you’d encounter if you wrote your own code, or tried to use a database.
  • When your code has a bug in an edge case, you’ll still be able to generate all of the reports that don’t hit the edge case.
  • You can alert on failures, see which reports failed, and after you fix the bug, you can click of a button put them back into the queue to run again.
  • Because SQS is managed by AWS you get monitoring and alerts out of the box
  • Because SQS is managed by AWS you don’t have to worry about the queue’s scale or performance.
  • You can add API endpoints to generate “on demand” messages and put them on the queue.  This means that the “standard” report messages, and client generated “on demand” report requests follow the same process and you don’t need to build and maintain separate reporting pipelines for internal and external requests.

Everything is tradeoffs, and queues don’t solve all problems well.  What kinds of problems are you going to have with SQS?

  • Priority/Re-sorting existing messages – You have 10,000 messages on the queue and an important client needs a report NOW!  Prioritized messages, or changing the order of messages on the queue is not something SQS supports. You can have a “fast lane” queue that gets high priority messages, but it’s clunky.
  • Random Access – You have 10,000 messages and want to know where in the queue a specific client’s report is?  Queues let you add at the end and take from the front, that’s it. If you need to know what’s in the middle you’ll need to maintain that information in a separate system.
  • Random Deleting – Probably not important for reports, but for cases like bulk importing, if a client changes her mind and wants to cancel a job, you can’t reach into the middle of the queue and remove the message.
  • Process order is not guaranteed – If you have a 3 part task, and you cannot start Task 2 until Task 1 finishes, you cannot add all 3 tasks onto the queue at once.  It is highly likely that a second worker will come along and start Task 2 while Task 1 is still in process. Instead you will need to have Task 1 add Task 2 onto the queue when it finishes.

None of these problems will crop up in your early iterations, and they are great problems to have!  They are signs that your SaaS is growing to meet your client’s needs, you and your clients are thriving!

To bring it full circle, what if you already have a home grown system for running reports?  How can you get on the path to queues? See my article on about common ways into the DB as a Queue mess, and some suggestions on how to get out.