Run to A Runbook

Giving users the ability to define their own searches, data segmentation and processes creates a lot of value for a SaaS.  The User Defined Parts of the codebase are also always going to contain the most “interesting” performance and scaling problems as users assemble the pieces in beautiful, powerful and mind-boggling ways.

It’s not a bug, it’s performance

Performance bugs aren’t traditional bugs.  The code does come up with the right answer, eventually.  But when your clients think your system is slow, they don’t care why.  Whether the system does too much work, can’t run in parallel, or lets the customer shoot themselves in the foot, it’s all bugs to your clients.

You need to care about why, because you are the one who gets to make things better.

Run to a Performance Runbook

A performance runbook can be nothing more than a list of tips and tricks for dealing with issues in User Generated land.  Because the problems aren’t bugs, they won’t leave obvious errors in the logs.  They require developing specialized techniques, tools, and pattern matching.

By writing down your debugging techniques, a runbook will help you diagnose problems faster.

Reduce Everyone’s Mental Load

Performance issues manifest everywhere in a tech stack.  The issues that a client is noticing are often far removed from the bottleneck.  

Having a centralized place to document issue triaging reduces the mental load on everyone in your organization.  Where do we start looking?  What’s that query?  A runbook helps you with those first common steps.

Support gets help with common trouble areas and basic solutions.  Listening to a client explain an issue and not being able to do anything but escalate is demoralizing for everyone involved.  Every issue support can fix improves the experience for the client and support.  Even something as simple as improving the questions support asks the client will pay off in time saved.

When senior support and developers are called in, they know that all the common solutions have been tried.  The basic questions have been asked and the data gathered.  They can skip the basics and move on to the more powerful tools and queries, saving everyone’s time.  New diagnoses and solutions go into the runbook, making support more powerful.

The common questions and common solutions become automation targets.  You can proactively tell a client that they’re using the system “wrong”, and send them help and training materials.  The best support solutions are when you reach out to the client before they even realize they have a problem.
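As a minimal sketch of what an automation target might look like, assume the runbook has shown that saved searches with too many filters are a common cause of slowness.  The `SavedSearch` record, the `MAX_FILTERS_PER_SEARCH` threshold, and the sample data are all hypothetical; your own runbook entries would supply the real rule.

```python
from dataclasses import dataclass

# Hypothetical threshold: the point at which a saved search tends to get slow.
# The real number would come from your own runbook entries.
MAX_FILTERS_PER_SEARCH = 25


@dataclass
class SavedSearch:
    client: str
    name: str
    filter_count: int


def searches_needing_outreach(searches, max_filters=MAX_FILTERS_PER_SEARCH):
    """Return the saved searches that exceed the runbook's filter threshold."""
    return [s for s in searches if s.filter_count > max_filters]


# Made-up sample data, for illustration only.
searches = [
    SavedSearch("acme", "all open tickets", 3),
    SavedSearch("acme", "everything, everywhere", 40),
]

for search in searches_needing_outreach(searches):
    print(f"Reach out to {search.client}: '{search.name}' has {search.filter_count} filters")
```

A check like this could run on a schedule and feed a support queue, so the client hears from you before they notice the slowdown.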

6 Questions To Start A Runbook

Common solutions to common problems?  Training?  Proactive alerting?  Sounds great, but daunting.

Runbooks are living documents.  The days when they were printed and bound into manuals ended decades ago.

Start small.

Talk to the developer who fixed the last issue:

  1. What did they look for in the logs?  
  2. What queries did they run?  
  3. What did they find? 
  4. How did they resolve the issue?

Write down the answers.  Repeat every time there’s a performance issue.

After a few incidents, patterns should emerge.
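One way to make those patterns visible, sketched under the assumption that each incident is recorded as the four answers above.  The `Incident` record, the `common_findings` helper, and the sample entries are hypothetical; a shared document or spreadsheet would work just as well.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    log_clue: str     # What did they look for in the logs?
    queries_run: str  # What queries did they run?
    finding: str      # What did they find?
    resolution: str   # How did they resolve the issue?


def common_findings(incidents, min_count=2):
    """Return (finding, count) pairs seen at least min_count times, most common first."""
    counts = Counter(incident.finding for incident in incidents)
    return [(finding, n) for finding, n in counts.most_common() if n >= min_count]


# Made-up entries, for illustration only.
incidents = [
    Incident("slow request warnings", "top queries by duration",
             "unbounded date range", "added a default date limit"),
    Incident("timeouts on export", "row counts per client",
             "unbounded date range", "asked client to narrow the report"),
    Incident("queue backlog", "jobs per client",
             "one client scheduling hourly rebuilds", "moved rebuilds to nightly"),
]

for finding, count in common_findings(incidents):
    print(f"{count}x {finding}")
```

Anything that shows up more than once is a candidate for a runbook entry, a support tool, or a fix to the underlying system.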

Bring what you’ve got to your support managers and ask:

  1. Could support have done any of the investigative work?
  2. If support had the answer, could they have resolved the issue? 

Train support on what they can do, and create tools for the useful things support can’t do on their own.

Every time a problem gets escalated, that’s a chance to iterate and improve.

Conclusion - Runbooks Help Everyone

Building a performance runbook sounds a lot like accepting performance problems and working on mitigation.

Instead, it is about surfacing the performance problems faster, finding the commonalities, and fixing the underlying system.

Along the way the runbook improves the client experience, empowers support, and reduces the support load on developers.

Everyone wins when you run to a runbook!

You Won’t Pay Me to Rescue Your Legacy System

When I first started consulting, I tried to specialize in Legacy System Rescue.  I quickly learned that this is terrible positioning because Legacy System Rescue isn’t an Expensive Problem.  Jonathan Stark defines an Expensive Problem as a problem that someone would like to spend a lot of money to solve right now.

Legacy System Rescue is certainly a Big Problem.  Everyone agrees that a buggy system that makes development slow and painful is bad.  Errors in production are bad.  Spending time and resources to mitigate production outages is bad.  But there is no immediacy.  There’s no reason to spend a lot of money right now instead of waiting until the next feature ships, or the next quarter.  Letting things go just a little bit longer is usually why the system needs a rescue.

Hiring someone like me to come in, analyze the codebase and find a way to untangle the mess is a lot of work.  Fixing bugs and making it easy to add new features is a low leverage situation.  It takes a lot of time by highly skilled developers.  Highly skilled developers in low leverage situations make Legacy System Rescue an Expensive Solution.  It will probably pay off for the company, but no one department is going to get enough value from fixing the legacy system to cover the costs.  The ROI gets worse when you factor in the resentments of the developers.  Bringing in an outsider to judge their work and dictate fundamental changes doesn’t fill people with joy.

Combine the two and you have a Tragedy of the Commons - a Big Problem that requires an Expensive Solution.  What you don’t have is a business case to spend a lot of money, right now, to fix things.

You won’t pay me to rescue your legacy system because paying a lot, right now, for the solution is worth less to you than living with the problem.

Talking ’bout My Generation

In the world of SaaS software, nothing is as sure a sign that a project will fail as calling it The Next Generation.  No one wants the next generation of your software! Your clients don’t want The Next Generation, they want the service that they’re paying for.  If they wanted something revolutionarily different, they wouldn’t be your clients. Your internal users don’t want The Next Generation, they want tools that help them do their job.

The Next Generation is something created by developers to sugar coat a full rewrite of the existing software.  The Next Generation has no business value. If you need The Next Generation of your software in order to write a new reporting module, it’s a sign you’ve given up on the current system.

When developers pitch The Next Generation they are being lazy.  It is an admission that the developers do not understand the business or care about client needs.  The Next Generation has no release plan other than replacement, and any theoretical client value comes at the end of the project.  At the same time it consumes resources that could have been used for incremental features or code improvements.

The Next Generation starts with great fanfare, then goes silent for 6 months.  Then the team starts growing! Not because of success, but because of the sunk cost fallacy.  The company has spent so much and things are so close. Good money after bad, critical months after lost months.  Around 12 months, management starts to micromanage. At 18 months the project is declared a failure. The team lead leaves.  Often the managers are not far behind. After millions of dollars spent on The Next Generation, you’re still where you started.

In SaaS, clients are buying This Generation.  If your developers are done with This Generation, you don’t need The Next Generation, you need to find your Best Alternative to a Total Rewrite!
