Incremental ETL Patterns For SaaS Developers

jeffpsherman

7 months ago

An ETL (extract, transform, load) is a programmer’s way of saying “I’m moving data from one system to another, and I have to do a bunch of things to make the data fit in the other system.” ETLs tend to get written by database administrators, business intelligence developers, and back-office programmers. Developers who work with long-lived processes that take 8-12 hours, often run overnight, and usually need to be finished first thing in the morning have different habits than SaaS developers.

SaaS developers live in the web development world – massive numbers of short-lived processes. User latency is measured in milliseconds, and some amount of failure is inevitable and acceptable.

Because SaaS developers are unlikely to be woken up at 3am to restart a ten-hour process that needs to be finished by 9am, they often aren’t aware of how incremental ETLs are structured. Four principles define that structure:

The code is procedural: a series of specific steps, executed in a specific order.
Every step begins by loading state from persistence and ends by writing state to persistence.
Every step is idempotent and deterministic, and state is immutable.
Every step can be run manually so long as the previous step has completed successfully.

The Code Is Procedural

This isn’t about the code; it is about the system running the ETL. The code might be written in an object-oriented language like Python or Java, or in a procedural language like SQL. The system running the ETL, on the other hand, is designed to execute a series of steps, with retry logic and built-in alerting.

Or, it can be as simple as a shell script. No retries, no alerting. Just a series of commands and echo statements that you will look at when you remember, or when someone tells you things are broken.

What an ETL is not is a highly stateful piece of software that runs a complicated process. Procedural steps allow the process to resume close to where it failed. ETL runtimes are measured in hours; in the best case, starting over only costs you a day.
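As a rough sketch of "resume close to where it failed", a runner can record each completed step in a small state file and skip those steps on the next run. The function name `run_pipeline` and the JSON state file are assumptions made for illustration; they do not describe any particular job-running product.

```python
import json
from pathlib import Path


def run_pipeline(steps, state_file):
    """Run (name, func) steps in order, skipping steps already recorded as done.

    `steps` is a list of (name, callable) pairs. `state_file` is a hypothetical
    JSON file holding the names of the steps that finished successfully.
    """
    state_path = Path(state_file)
    done = json.loads(state_path.read_text()) if state_path.exists() else []

    for name, func in steps:
        if name in done:
            print(f"skip  {name} (already completed)")
            continue
        print(f"start {name}")
        func()  # an exception here stops the run; earlier steps stay recorded
        done.append(name)
        state_path.write_text(json.dumps(done))  # persist progress after each step
        print(f"done  {name}")
    return done
```

If the third of five steps raises, the next call to `run_pipeline` with the same state file skips the first two and starts again at the third. Deleting the state file forces a full rerun.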

Every Step Begins By Loading State From Persistence And Ends By Writing State To Persistence

Persistence is a file, database, or any other durable system that will let you load or write the data over and over again.

Each task stands alone because each task can fail. The output from the last task is the input for the next. When something goes wrong, you can examine the inputs and outputs. When a simple transformation process that reads from FileA and writes to FileB stops on line 35, you can be pretty sure that there is something unexpected in line 36 of FileA.

Fix the input, fix the code, and resume. When the input is easily accessible you can often fix it manually, get the process moving, and then come back and fix the code. Remember, you are often doing it at 3am and racing the clock.
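The FileA-to-FileB case might look like the sketch below. It streams the input one line at a time, so when a line fails the output holds everything before it and the error names the offending input line. `transform_file` and `StepError` are hypothetical names chosen for this example.

```python
class StepError(Exception):
    """Raised when one input line cannot be transformed."""


def transform_file(src, dst, transform):
    """Read `src` line by line, apply `transform`, and write results to `dst`.

    Returns the number of lines written. On failure, `dst` contains the lines
    that succeeded and the exception reports the input line that did not.
    """
    written = 0
    with open(src) as infile, open(dst, "w") as outfile:
        for line_no, line in enumerate(infile, start=1):
            try:
                result = transform(line.rstrip("\n"))
            except Exception as exc:
                raise StepError(f"{src}: line {line_no} failed: {line!r}") from exc
            outfile.write(result + "\n")
            written += 1
    return written
```

Because the input file is only read, never modified by the step itself, you can open it, correct the bad line by hand, and call `transform_file` again with the same arguments.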

Every Step Is Idempotent And Deterministic, State Is Immutable

Every step needs to be tried and retried as many times as it takes to succeed. This means it can’t change or overwrite the original input, and it needs to produce the same output each time. The step ends by writing the results to persistence. So long as the step is run in isolation, with no process manager turning the outputs into inputs, it is idempotent.

You don’t need it to be perfect to rerun the job. You can fix one issue and run it again to see if it blows up again. Yes, in production, because it is 3am and you can’t drag the data to a staging environment. But with immutable state, deterministic runs, and idempotency, even testing in production at 3am is safe.
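One common way to get these properties, shown here as a sketch and not as the only approach, is to treat the input as read-only, build the whole output first, write it to a temporary file, and rename it into place. The name `run_idempotent_step` and the `.tmp` suffix are assumptions for illustration.

```python
import os
from pathlib import Path


def run_idempotent_step(src, dst, transform):
    """Rebuild `dst` from `src` without ever modifying `src`.

    The output is written to a temporary file and then renamed over `dst`,
    so a failed run leaves any previous output untouched and a repeated run
    produces the same file again.
    """
    src_path, dst_path = Path(src), Path(dst)
    tmp_path = dst_path.with_name(dst_path.name + ".tmp")

    rows = src_path.read_text().splitlines()  # input is only ever read
    output = "".join(f"{transform(row)}\n" for row in rows)

    tmp_path.write_text(output)
    os.replace(tmp_path, dst_path)  # swap the finished file into place
    return dst_path
```

The step is only deterministic if `transform` is. A transform that stamps the current time or reads from a live system would produce different output on each retry.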

Every Step Can Be Run Manually

The ETL is a series of steps, and those steps can be run over and over again until they are correct. Each step runs in isolation. Job-running software is helpful, but when there is a problem, you can run a step yourself.

Everything is laid out for you. You can examine how far the process went and where it died. There is no global state to worry about, no disabling parts of the process to get to where it failed.
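A manual entry point can be as small as the sketch below: pick a step by name, confirm the previous step's output exists, and run only that step. The step names, file names, and the `--workdir` flag are all hypothetical, and the step body is left as a placeholder.

```python
import argparse
from pathlib import Path

# Hypothetical step table: step name -> (input file or None, output file).
STEPS = {
    "extract": (None, "raw.csv"),
    "transform": ("raw.csv", "clean.csv"),
    "load": ("clean.csv", "load.done"),
}


def check_ready(step, workdir):
    """A step is ready when it has no input or its input file already exists."""
    src, _ = STEPS[step]
    return src is None or (Path(workdir) / src).exists()


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run one ETL step by hand.")
    parser.add_argument("step", choices=list(STEPS))
    parser.add_argument("--workdir", default=".")
    args = parser.parse_args(argv)

    if not check_ready(args.step, args.workdir):
        print(f"cannot run {args.step}: input {STEPS[args.step][0]} is missing")
        return 1

    print(f"running {args.step} in {args.workdir}")
    # A real step would do its work here and write STEPS[args.step][1].
    return 0
```

Wired to a script, this could be invoked as something like `python etl.py transform --workdir /data/tonight`, which refuses to start if `raw.csv` is not there yet.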

How ETL Thinking Helps SaaS Developers

Outside of the database and business intelligence teams, SaaS developers rarely build ETLs. Which is a shame because ETL thinking makes it much easier to assemble complex workflows. Work incrementally, work idempotently, and your work will be powerful and easy to debug.
