Messaging systems don’t change the fundamental tradeoffs involved in data loading, but they do add a lot of opinionated weight to the choices.
First, messaging and streaming have a lot of potential meanings and implementations. To overgeneralize, there are queues, where each message should be read once, topics, where each client should receive each message, and broadcast, where there are no guarantees about anything.
Queues and topics work as polling loops within your software. Broadcast messaging comes directly to the machine at the network level; the abstractions are highly dependent on implementation.
For the rest of this article I’m going to cover queues. I’ll cover topics and broadcast in future articles.
Loading Patterns with Queues
The main feature of a message queue is that you want each message to be processed exactly once. This requires that clients claim messages when taking them off the queue, that they notify the queue when messages have been processed, and a visibility timeout, after which a claimed but unacknowledged message will reappear at the head of the queue.
A client has to connect to a queue, ask for some number of messages, and indicate how long it will wait for that maximum. With AWS's SQS, for example, a common configuration is 10 messages with a 30-second wait: the client will block until 10 messages are available, or it has been waiting for 30 seconds.
Set the message request too big and you will wait while the queue fills. Set the timeout too long and the first few messages will get stale while the client waits. You need to balance the time cost of polling the queue against the cost of having your workers idle.
The main concern, however, is visibility timeouts. If your client grabs 10 messages and has a 30-second timeout, it absolutely must finish all 10 in under 30 seconds. After 30 seconds, the unacknowledged messages will be released to the next waiting worker, and those messages will be processed twice.
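The claim/acknowledge/redeliver cycle can be sketched as a toy in-memory queue. This is not any real broker's API, just an illustration of the visibility-timeout semantics described above:

```python
import time

class VisibilityQueue:
    """Toy in-memory sketch of visibility-timeout semantics.
    Not a real broker -- just illustrates claim / ack / redelivery."""

    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self._messages = []   # bodies waiting to be claimed
        self._in_flight = {}  # body -> deadline for acknowledgement

    def send(self, body):
        self._messages.append(body)

    def receive(self, now=None):
        """Claim the next message; it becomes invisible until its deadline."""
        now = time.monotonic() if now is None else now
        # Release any claimed-but-unacked messages whose deadline has passed.
        for body, deadline in list(self._in_flight.items()):
            if now >= deadline:
                del self._in_flight[body]
                self._messages.insert(0, body)  # back to the head of the queue
        if not self._messages:
            return None
        body = self._messages.pop(0)
        self._in_flight[body] = now + self.visibility_timeout
        return body

    def ack(self, body):
        """Worker finished: remove the message for good."""
        self._in_flight.pop(body, None)
```

A slow worker sees the failure mode directly: if it has not acked by the deadline, the same message is handed to the next caller of `receive`.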
Optimizing worker counts, polling settings, and timeouts requires maximizing execution consistency. When working at scale, how long an action takes matters much less than how consistent the timing is.
The Four Patterns of Data Loading are all about tradeoffs.

If your system is drinking from a firehose, you need to push everything towards a Pre-Cache model. Pre-Cache pushes data loading out of the critical path, which gives the most consistent timing for message processing.
Having fully Pre-Cached queue processors is rarely possible; there is too much data, and it changes too often. Read Through Caching is the practical alternative.
Link Tracking Example
Link Tracking, recording when someone clicks on a link, is a common activity for Marketing SaaS. I’m going to use a simplified version and walk through each of the data patterns.
At a high level, we have a queue of events with 4 pieces of data that describe the click event: Customer Name, URL, Email, and Timestamp.
We want to process each event exactly once, as quickly as possible, and we don't care about the order messages are processed in. However, we can't insert the events directly; we need to normalize Customer Name, URL, and Email into customerId, urlId, and emailId.

Lazy Load

The nice thing about the Lazy Load pattern is that it starts quickly and is simple to understand.
Unfortunately, every event requires 3 trips to the database to normalize the data. This adds load to the database and has highly variable processing time.
It’s a fine place to start, but a terrible place to stay.
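A minimal sketch of the Lazy Load worker, using a hypothetical `FakeDb` stand-in so the database round trips are countable:

```python
class FakeDb:
    """Stand-in for a real database; counts round trips for illustration."""
    def __init__(self):
        self.tables = {"customers": {}, "urls": {}, "emails": {}}
        self.clicks = []
        self.round_trips = 0

    def get_or_create(self, table, value):
        self.round_trips += 1          # every lookup is a network round trip
        rows = self.tables[table]
        if value not in rows:
            rows[value] = len(rows) + 1
        return rows[value]

    def insert_click(self, customer_id, url_id, email_id, ts):
        self.round_trips += 1
        self.clicks.append((customer_id, url_id, email_id, ts))

def lazy_load_insert(db, event):
    """Lazy Load: one query per field, on demand, in the critical path."""
    customer_id = db.get_or_create("customers", event["customer_name"])
    url_id = db.get_or_create("urls", event["url"])
    email_id = db.get_or_create("emails", event["email"])
    db.insert_click(customer_id, url_id, email_id, event["timestamp"])
```

Every event costs three reads plus the insert, and any of the four trips can stall, which is where the variable processing time comes from.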
Pre-Fetch

The Pre-Fetch pattern separates fetching the data into a separate step from execution. In this example, the worker would be attempting to normalize multiple events at the same time. Once an event has been fully normalized, it is inserted into the database.
Pre-Fetch adds a lot of complexity because the workers now have internal concurrency in addition to working in parallel with each other. This might be worth it if the data were being composed from different resources in a micro-service architecture and the three calls could be made in parallel. In this example, though, you would be pounding the database.
Pre-Cache

The great thing about a Pre-Cache model is that there are no reads against the database during processing. It is read-only during init and write-only from the queue. That will really help the database scale!
Pre-Cache is impractical in this use case though. We might know the full set of customers at startup, but urls and emails are open-ended. Pure Pre-Cache setups require that all of the data be known before starting. That can work for things like file imports and trading systems, but not link tracking.
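A sketch of why: with everything loaded into plain dicts at init, processing is pure lookups, and any value that wasn't known at startup simply fails. The cache shape here is an assumption for illustration:

```python
def precache_worker(cache, event):
    """Pre-Cache: pure dict lookups, no database reads during processing.
    Raises KeyError for any value not loaded at startup -- the pattern's
    core limitation for open-ended data like urls and emails."""
    return (
        cache["customers"][event["customer_name"]],
        cache["urls"][event["url"]],
        cache["emails"][event["email"]],
        event["timestamp"],
    )
```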
Read Through Cache

The Read Through Cache Pattern is likely to be the best option for our use case. It is more flexible than the Pre-Cache Pattern and much friendlier to the database than Lazy Loading or Pre-Fetching. It pushes complexity to the cache where there are lots of great solutions. Most languages have internal caching mechanics, and external caches like memcached and redis are widely supported.
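The core of a read-through cache fits in a few lines: check locally first, fall through to the database only on a miss, and remember the answer. The `db_lookup` callable is a hypothetical stand-in for a real query:

```python
class ReadThroughCache:
    """Read-through cache sketch: serve hits locally, fetch misses from
    the database exactly once, then cache the result."""

    def __init__(self, db_lookup):
        self._db_lookup = db_lookup  # hypothetical (table, value) -> id query
        self._cache = {}
        self.db_hits = 0             # instrumentation for illustration

    def get(self, table, value):
        key = (table, value)
        if key not in self._cache:
            self.db_hits += 1
            self._cache[key] = self._db_lookup(table, value)
        return self._cache[key]
```

Repeated customers, urls, and emails are served from memory, so database load approaches the Pre-Cache ideal while still handling values never seen before. In production you would also want an eviction policy (something like an LRU bound) so the cache doesn't grow without limit.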
Conclusion
Reading from message queues is a common problem and usually requires composing data from a database. What's the best way to load the data? As always, the right data loading pattern depends on your conditions and assumptions.
Lazy Loading is always a decent place to start because it is simple. Performance, cost, and scaling constraints will push your design towards Read Through Caching and Pre-Caching. Pre-Fetch models are likely to be more effort than they are worth because they add a second source of concurrency and complexity.
Remember, requirements change over time, and so will the best solution to your problem!