Writing a run book can be your first iterative step towards mitigating recurring problems. Recurring problems can drain productivity, yet they often go unfixed because the immediate issue is elsewhere.
For example, background worker systems rarely fail on their own. Instead, some unique situation will cause the workers to get stuck, the controller to get confused, or the queue to be poisoned. Each time, there are really two issues: the bespoke issue that broke the background processes, and the recovery of the background worker system.
Since each new failure is unique, there is a tendency to treat the background system recovery as a unique problem too. This increases recovery time and prevents you from learning from past mistakes. Because the bug isn’t in the background system itself, there is often no motivation to spend time on the code. Fix the bug, restart the system, and move on with your day.
Enter the run book.
Write down the steps needed to mitigate the problem. This is for humans, so it can be an open-ended description of what to look for. It won’t be very programmatic at first.
Once you have it, keep iterating. Add code snippets, descriptions, and flow logic.
As you iterate, you will notice that some parts of the process can be scripted, or even automated.
Iteration after iteration, more and more of the run book will become code, which makes it easier to code up the remaining pieces.
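Here is a minimal sketch of what a run book that is part prose, part code might look like, using the background worker example. Everything in it is hypothetical: the `Step` class, the `run` function, the step wording, and the `pause_queue` placeholder are illustrations, not part of any real queue system. Manual steps print their instructions and wait for a human, while scripted steps just run.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    title: str
    instructions: str  # what a human should do or look for
    automation: Optional[Callable[[], None]] = None  # set once the step is scripted


def run(steps, confirm=input, out=print):
    """Walk the run book in order; return the titles of steps still done by hand."""
    manual = []
    for number, step in enumerate(steps, start=1):
        out(f"Step {number}: {step.title}")
        if step.automation is not None:
            # This step has already been turned into code, so just run it.
            step.automation()
            out("  done (automated)")
        else:
            # Still a human step: show the instructions and wait for confirmation.
            out(f"  {step.instructions}")
            confirm("  Press Enter when finished: ")
            manual.append(step.title)
    return manual


def pause_queue():
    # Placeholder (assumption): a real version would call whatever
    # your own queue system provides for pausing work.
    print("  (pretend) queue paused")


# Hypothetical run book: one step is scripted, two are still manual.
RUN_BOOK = [
    Step("Pause the queue", "Stop new jobs from being picked up.", automation=pause_queue),
    Step("Find the poisoned job", "Look in the worker logs for the job that keeps failing."),
    Step("Restart the workers", "Restart the stuck workers and confirm they pick up jobs."),
]
```

Calling `run(RUN_BOOK)` in a terminal walks you through the recovery and returns the steps that are still manual. That list is your to-do list for the next iteration: each time you attach an `automation` function to a step, it gets shorter.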
Will you be able to iterate the recurring problem out of existence? Maybe, maybe not. But with a run book and a plan, you will make progress instead of waiting for the next outage to wreck your day.