State, Persistence, And Cold Restarts

In a SaaS, state exists in many places, some of them outside of your control.  If you don’t know where to look, you will never truly understand your system.

State You Can Control

As systems grow in complexity, state creeps into an ever-growing number of places:

  • Databases
  • Caches
  • Running software
  • Config files
  • Shell scripts
  • Flat files

Do your caches have state?  If they went down and it impacted anything other than performance, then they have state.
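One way to make that test concrete is a cache that can be wiped at any moment without changing any answers. The sketch below is only an illustration: `ReadThroughCache`, `load_from_source`, and the sample customer data are hypothetical names, not part of any particular system.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """A cache whose loss costs only performance, never correctness."""

    def __init__(self, load_from_source: Callable[[str], str]) -> None:
        # The authoritative store (for example, a database lookup).
        self._load = load_from_source
        self._entries: Dict[str, str] = {}

    def get(self, key: str) -> str:
        # On a miss, fall back to the source of truth and remember the answer.
        if key not in self._entries:
            self._entries[key] = self._load(key)
        return self._entries[key]

    def wipe(self) -> None:
        # Simulates the cache going down and coming back empty.
        self._entries.clear()


# Hypothetical source of truth standing in for a real database.
database = {"customer-1": "gold plan"}
cache = ReadThroughCache(lambda key: database[key])

before = cache.get("customer-1")
cache.wipe()
after = cache.get("customer-1")
```

If `before` and `after` ever differ, or the second call fails outright, the cache was holding something the rest of the system cannot reproduce, which is to say it has state.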

Do your shell scripts have state?  If they reference specific customers then they certainly do.

The list goes on.

Some of your state has backups, like databases, source code, and some config.  Your only backups for running software and caches are more instances of running software and caches.
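A small inventory can make the gaps visible. The following is a minimal sketch, not a recommended tool: `StateLocation`, `INVENTORY`, and `lost_on_cold_restart` are hypothetical names, and the backup flags are illustrative assumptions (the flag for flat files in particular is a guess; yours may well be backed up).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StateLocation:
    name: str
    has_backup: bool  # True only if it can be restored onto brand-new servers


# Hypothetical inventory; every flag here is an assumption for illustration.
INVENTORY: List[StateLocation] = [
    StateLocation("databases", has_backup=True),
    StateLocation("source code", has_backup=True),
    StateLocation("config files", has_backup=True),
    StateLocation("caches", has_backup=False),
    StateLocation("running software", has_backup=False),
    StateLocation("flat files", has_backup=False),
]


def lost_on_cold_restart(inventory: List[StateLocation]) -> List[str]:
    """Return the names of state locations with nothing to restore from."""
    return [loc.name for loc in inventory if not loc.has_backup]


at_risk = lost_on_cold_restart(INVENTORY)
```

With the flags above, `at_risk` contains the caches, the running software, and the flat files. The useful part is not the code but the act of filling in the list honestly.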

State You Can’t Control

State also creeps into things that aren’t software and that you can’t control:

  • Tribal knowledge
  • Manual procedures
  • Anything in someone’s head that isn’t documented

People are a critical, and critically overlooked, part of any complex system.  The most stateful people aren’t the programmers.  The operations and support people who interact with customers hold the most state and have the least backup.

Disaster Recovery and Cold Start

A cold restart is the hardest kind of disaster recovery.  You have all new server instances, whatever is in source control, and whatever is in your backups.

Things that live in people’s heads are even harder to recover.  Which repos are important?  Which services are supposed to run?  What does the system even look like?

Depending on your setup, you may be OK, or you may never recover.

I was at GuaranteedRate when it acquired Discover’s Home Loan operations.  As part of the acquisition we got their mortgage software and their developers, but no state.  I watched the team of developers that created the system spend a year trying to do a cold restart.  They failed.

Consider State Now, Before It Gets Away From You

State grows organically with systems, and keeping it under control requires effort.  Beyond the fear of disaster recovery, knowing where state exists in your system is key to maintenance and growth over time.

Once you start looking for state in your systems, you’ll find it everywhere.
