State, Persistence, And Cold Restarts

In a SaaS, state exists in many places, some of them outside of your control.  If you don’t know where to look, you will never truly understand your system.

State You Can Control

As systems grow in complexity, state creeps into an ever-growing number of places:

  • Databases
  • Caches
  • Running Software
  • Config Files
  • Shell scripts
  • Flat files

Do your caches have state?  If they went down and it impacted anything other than performance, then they have state.
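One way to put that question to the test, as a minimal sketch in Python: in a cache-aside read the cache only holds copies of data whose source of truth lives elsewhere, so losing it costs time but not correctness.  The names `SlowStore` and `CachedReader` are hypothetical stand-ins for whatever database and cache you actually run.

```python
class SlowStore:
    """Stand-in for the system of record (hypothetical, e.g. a database)."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.reads = 0  # counts how often we had to go to the source

    def get(self, key):
        self.reads += 1
        return self._rows[key]


class CachedReader:
    """Cache-aside reader: the cache only ever holds copies of store data."""

    def __init__(self, store):
        self._store = store
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._store.get(key)
        return self._cache[key]

    def lose_cache(self):
        """Simulate the cache going down and coming back empty."""
        self._cache.clear()


store = SlowStore({"plan:acme": "enterprise"})
reader = CachedReader(store)

before = reader.get("plan:acme")
reader.lose_cache()
after = reader.get("plan:acme")

# Same answer before and after the outage; only the slow reads increased.
print(before == after, store.reads)
```

If your real cache cannot pass the equivalent of `lose_cache()` without changing answers, it is holding state that exists nowhere else.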

Do your shell scripts have state?  If they reference specific customers then they certainly do.

The list goes on.

Some of your state has backups, like databases, source code, and some config.  Your only backups for running software and caches are more instances of running software and caches.

State You Can’t Control

State also creeps into things that aren’t software and that you can’t control:

  • Tribal knowledge
  • Manual Procedures
  • Anything in someone’s head that isn’t documented

People are a critical, and critically overlooked, part of any complex system.  The most stateful people aren’t the programmers.  The operations and support people who interact with customers have the most state and the least backup.

Disaster Recovery and Cold Start

A cold restart is the hardest kind of disaster recovery.  You have all new server instances, whatever is in source control, and whatever is in your backups.

Things that live in people’s heads are even harder to recover.  Which repos are important, which services are supposed to run, what does the system even look like?

Depending on your setup you may be ok, or you may never recover.

I was at GuaranteedRate when it acquired Discover's Home Loan operations.  As part of the acquisition we got their mortgage software and their developers, but no state.  I watched the team of developers that created the system spend a year trying to do a cold restart.  They failed.

Consider State Now, Before It Gets Away From You

State grows organically with systems and keeping it under control requires effort.  Beyond the fear of disaster recovery, knowing where state exists in your system is key to maintenance and growth over time.

Once you start looking for state in your systems, you’ll find it everywhere.

The Technical Problems You Should Solve At A Midsize SaaS

20 Things You Shouldn’t Build At A Midsize SaaS was about technical problems developers at a midsize SaaS shouldn’t try to solve.  Is midsize SaaS development all glue work?  Have all the problems been solved?

Of course not.  Midsize SaaS is the garden of Refactoring, Scaling, and Performance.

To make it sound fancy: The deep work for developers at a midsize SaaS is designing solutions for emergent architectural problems.

A Midsize SaaS Has Different Problems Than A Startup SaaS

Startups have unproven theories about what customers want.  They need to get features out as quickly as possible to test theory against reality.  Worrying about multiple data centers, global latency, or the performance of features customers can’t see, is a waste of time.

At a startup you should write good code, find product market fit, and not worry about how the system will perform when you have 10,000 paying customers.

Once you have thousands of paying customers, that’s when architectural gardening kicks in.

How To Support What Customers Want

The startup phase will leave you with a valuable product and an almost random set of assumptions.  You get to puzzle out the assumptions, compare them with reality, and choose solutions.

If your systems are in the United States, and all of your customers are in the United States, you will have different architecture needs than if your customers are globally distributed.  Linear and exponential scaling produce different problems and require different solutions.

You need to identify which problems you have, and iterate towards standard solutions.  Standard solutions are critical because they make your competitive advantage, the differences that are valuable, shine through.  You can’t find the valuable unique differences when everything in your system is bespoke.

Conclusion

The deep work at a midsize SaaS is identifying emerging problems and iterating towards standard solutions.  It is pathfinding from wherever the startup phase has left you towards known destinations.  Moving towards known standard solutions makes it easier to find and improve valuable differentiators.  Building unique versions of everything makes everything harder without adding value for your customers.

Things I Learned By Helping To Program A Clavinet Clone

Over the past few months I have helped my brother-in-law create a Clavinet Clone for musical keyboards.  Those of you who know me are undoubtedly laughing at the image of me creating a musical instrument because I have no rhythm, no timing, no ear, and generally no musical ability of any kind.

But I do have over 20 years of programming experience, and I haven’t worked with someone who knows “just enough to be dangerous” in a long, long time.  So I was surprised by some of the lessons.

Source Control Is Not Intuitive

Older source control systems like CVS were pretty simple, but didn’t support distributed development.  Git, on the other hand, is complicated.  Each time we did a handoff, I would make a branch, develop, and merge back to main.

When I handed it back, my brother-in-law would copy the file, append “-new” and continue.  Every handover resulted in new copies of the files.  Sometimes entire directories would be copied and get “-old” added to the name.

Branching, merging, and visual diffs may be second nature to full time developers, but everyone else is going to need lessons.

IDEs Will Spoil You

The code is in Kontakt scripting language, which is an internally designed and developed scripting language.  I would do my development in Sublime Text, then switch over to the UI to compile and test.

This got old after the 3rd syntax error.  A 15 second code-to-compile loop wasn’t bad back in 1992; now I just wanted syntax highlighting.

Custom Scripting Languages Make Everything Harder

Custom scripting languages make everything harder.  They make it harder on the implementer because they have to build and maintain a custom scripting language.  They make it harder on users because they have to learn a new language.

The result is usually a language with minimal functionality, limited documentation, and few examples.  The biggest time waster is the lack of a testing framework.

With all that said, the Kontakt scripting language may not have been a mistake.  The same code and controls work on physical and virtual keyboards.  Getting code to work consistently on multiple hardware platforms is no small feat.

If You Can’t Have Tests, Commit Continuously

Continuous committing and diffs are a poor way to handle regressions, but they are better than nothing.  If you’re ever in this situation, make your commits as small and targeted as possible.  Regressions will still be frustratingly common and tedious, but at least you will be able to step back through your changes.

Learning Things Sounds Great!

You can buy the best sounding Clavinet Clone here!

Full disclosure: I helped program it, it belongs to my brother-in-law, and I do not get a commission.

You’ve Built A Bad Implementation Of Other Software, Now What?

The immediate response to 20 Things You Shouldn’t Build At A Midsize SaaS was, what do you do if you HAVE built some of these things?  Great question!

First, don’t panic.  We’ve all been there.  I have built some of these things myself - multiple CSV writers and parsers, a DateTime library, and an ORM.  We are all human, and we all make mistakes from time to time.

Whether you built, or are responsible for, a bad implementation of other software, what do you do now?  It depends on which of 3 buckets the implementation falls into.

It works and you don’t need to touch it

CSV parsers and writers get written a lot because they are very simple.  Yes, there are edge cases when it comes to escaping strings and Unicode, but mostly once you have it working, it’s “fine”.

If you don’t need to touch it, don’t!  Let sleeping dogs lie.  Ideally put a comment in the code that the next dev should replace it with a standard library instead of expanding the code.

It works ok, but it’s incomplete

This is the biggest bucket.  Implementations of DSLs, ORMs, and Frameworks usually fall into this group - the software works ok, but it is missing key features and robustness.  Observability is usually extremely lacking.

This makes the software low priority tech debt.  There’s opportunity cost from continuing to use it, a cost to replacing it, but no direct development costs.

Generally speaking, this is the kind of thing that gets ignored until you need a missing key feature.  Then the debate between continuing with the local implementation and replacing it resumes.

The best way to approach these situations is to look at the interfaces and requirements for the replacement software, and converge the local implementation over time.  This combines regular maintenance with replacement work.   As a bonus, it lowers the cognitive load on developers by making the local implementation work more like standard offerings.

It requires continuous maintenance

Some kinds of software require continuous maintenance.  When you write your own, you have to do all the maintenance yourself.

Scripting languages are a wonderful source of Halting Problem issues.  Your customers will be excellent at creating scripts that will never exit and will consume resources until the server restarts.  The more successful you are, the worse the problem will become.

I saw a SaaS with 2 full time developers working to close all of the ways that customers were accidentally creating infinite loops with their homegrown scripting language.
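One common mitigation, sketched here in Python under loud assumptions, is to stop trying to detect infinite loops and instead give every customer script a hard time budget.  The helper name `run_with_budget` is hypothetical, and the "customer script" is Python source only so the example is self-contained; a homegrown language would launch its own interpreter instead.

```python
import subprocess
import sys


def run_with_budget(script_source, seconds):
    """Run script text in a child process with a wall-clock limit.

    Returns "ok", "failed", or "timed out".
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", script_source],
            capture_output=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired.
        return "timed out"
    return "ok" if result.returncode == 0 else "failed"


print(run_with_budget("print('hello')", seconds=5))
print(run_with_budget("while True: pass", seconds=1))
```

A time limit only covers one resource.  Memory, disk, and network use would need their own limits, which is part of why this kind of software needs continuous maintenance.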

I worked at a company using Objective-C which had extended Apple’s compiler.  This was back in the days of installed, on-prem software.  Every time Apple had a release, we had to warn customers not to upgrade until we could re-extend the compiler, recompile, and ship the code.

When you encounter implementations that require continuous maintenance, you have to start working to remove them immediately.  The longer they remain, the more entrenched they become.  The sunk cost fallacy will work against you in strange and insidious ways.

Iterative removal is the only real option in these cases.

No matter which bucket the badly implemented software falls into, the first step is always to recognize the problem.  If there’s already an implementation, you shouldn’t write your own without a good reason.

20 Things You Shouldn’t Build At A Midsize SaaS

I have seen developers build a lot of unnecessary and counterproductive pieces of software over the years.  Generally, developers at small to midsize SaaS companies shouldn’t build any software that doesn’t directly help them deliver a service to their customers.

Whether it was the zero interest rate period, bad management, or hubris, developers spent a lot of company money on projects that never made sense given their employer’s goals and size.  I have seen custom implementations of every type of software on this list.  None of it worked better than open source, and none offered a competitive advantage.

If you find yourself developing or managing any of these twenty types of projects, stop and seriously consider what you are doing.

  1. Scripting languages
  2. Compiler extensions
  3. Transpilers
  4. Database extensions
  5. Databases
  6. DSLs
  7. ORMs
  8. Queues
  9. Background work schedulers
  10. GraphQL
  11. Stateful REST
  12. Frontend Frameworks
  13. Backend Frameworks
  14. Servers
  15. Dependency Injectors
  16. CSV writers or parsers
  17. Cryptography Implementations
  18. Logging Libraries
  19. DateTime libraries
  20. Anything from “First principles”

There are always exceptions.  If building this software gives you some competitive advantage, go ahead.  In general, anyone suggesting these projects is biting off more than they can chew and doesn’t fully understand the problem they are trying to solve.

Most often things start out as a quick hack - “I’ll just concatenate these strings with a comma, it will be faster than finding a full CSV library.”  Soon you’re implementing custom separators and string escaping.
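As a small illustration of how quickly the quick hack breaks, here is a sketch in Python comparing string concatenation with the standard library `csv` module.  The sample row is made up for the example.

```python
import csv
import io

row = ["Acme, Inc.", 'He said "hi"', "line one\nline two"]

# The quick hack: breaks as soon as a field contains a comma, quote, or newline.
naive = ",".join(row)

# The standard library handles quoting and escaping for you.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)
encoded = buffer.getvalue()

# Round trip: the library version recovers the original three fields.
decoded = next(csv.reader(io.StringIO(encoded, newline="")))

print(len(naive.split(",")))  # 4 pieces, not 3: the hack corrupted the row
print(decoded == row)
```

The library call is no longer than the hack, and it already covers the custom separators and escaping you would otherwise end up writing.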

If your company has built its own implementations, don’t despair.  Iterate towards a better library!

Learning When To Stop Developing A Project

Robert Moses used lies and trickery to ensure that his projects were completed.  He loved to start projects and let pride, politics, and sunk costs pull them to completion.  Software development is notorious for grinding on and delivering projects years late that don’t solve the original problem.

SaaS project deadlines are artificial, so usually the only thing that can stop a project is completion.  Even projects with no developers will shamble on, zombie-like, eating a bit of everyone’s brain as they stumble through the code.

When you find yourself confronted with a long lived, poorly defined project, start asking questions:

  1. Was there a time element to the project, and has it passed?
  2. Have the assumptions behind the project changed?
  3. Have the company’s goals changed?

Was there a time element to the project, and has it passed?

Calendars are cyclical; it’s technically never too late or too early to get ready for your industry’s high volume period.  But if you see a project to scale for Black Friday, in January, there’s a good chance that you don’t need to finish it.

Your company got through Black Friday without the project, why do you need it now?

Have the assumptions behind the project changed?

I have been involved with many “if we switch from technology X to Y, we can save a lot of money” type projects.  Less than half produced significant cost savings.  In most of those cases we knew early in the project that the savings wouldn’t materialize.  The projects kept going anyway.

It is much easier to rationalize “the cost savings may not be there, but technology Y is better” than to work out if the new tech justifies the project on technical merits.

Have the company’s goals changed?

Long running, nebulous projects run the risk of having the company’s goals change underneath them.  I am often asked to “increase performance” and “scale the system”.  But when the customer profile changes, I am often scaling the wrong part of the system.

Learning To Question Is The First Step

Before you can stop a project, you have to question the project.  Question the timing, the assumptions, and the company’s goals.

If the deadline has passed, the assumptions no longer hold, or the goals have changed, it might be time to stop development.

Abandoned Wells And Other Dangers In Your Codebase

Giant, long lived, codebases are full of code that should never be used for anything new.  Not only shouldn’t the code be used for anything new, it shouldn’t be used at all; but if it ain’t broke, don’t fix it.

The brownfield codebase gets littered with mineshafts, wells, and other hazards from products that were discontinued or never made it into production.  Will the shaft collapse if you start digging?  Maybe.  Are those abandoned wells the right size to swallow new developers?  Absolutely.

How do you move forward when your codebase has become a hazardous environment?

When a developer hangs the UI for your biggest customers because a reporting function is slow, is that the UI developer’s fault for not testing well enough, the report developer’s fault for not adding a warning about the function’s performance, or a sign of a dangerous codebase?

Once your codebase becomes hazardous it is everyone’s responsibility to be cautious, and everyone’s responsibility to warn of danger.  Develop as a team and watch each other’s backs, otherwise you might end up stuck in a well.

The Anna Karenina Principle of Scaling Software Systems

The Anna Karenina Principle says “All happy families are alike; each unhappy family is unhappy in its own way.”  The same is true for software scalability.  All scalable systems are alike; each unscalable system is unscalable in its own way.

Scalable Systems Scale Linearly

Scaling software systems means doing more work with more resources.  As software scales, various state management issues will require ever more resources.  In the beginning, processing 2x more requests will require less than 2x more resources.  Over time, the ratio will become 1-to-1 and continue to slide.  2x more requests will require 4x more resources.

So long as the ratio stays constant as load grows, resource use is linear and the system is scalable.  When the ratio itself keeps growing with load, for example when n times more requests require n^2 times more resources, growth is superlinear and the system is no longer scalable.

Scalable Systems Are Fault Tolerant

A system with 5 nines will process 99.999% of all events correctly.  Failures are 1 in 100,000.  A system doing 1 million events per second will have 10 failures every second.
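The arithmetic behind availability figures is easy to check.  This is a minimal sketch in Python; the function name `failures_per_second` is made up for the example, and it uses exact fractions to avoid floating point rounding.

```python
from fractions import Fraction


def failures_per_second(nines, events_per_second):
    """Expected failures per second for a success rate with `nines` nines.

    Three nines is 99.9%, so the failure rate is 1 in 10**3, and so on.
    """
    return Fraction(events_per_second, 10 ** nines)


for nines in (3, 4, 5, 6):
    rate = failures_per_second(nines, 1_000_000)
    print(f"{nines} nines at 1,000,000 events/s: {rate} failures per second")
```

Each extra nine divides the failure count by ten, and each tenfold increase in traffic multiplies it right back.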

Scalable systems have defined ways of handling failure - they will return an error and make the caller handle it, they will do retries, they may even fail silently.  What they won’t do is crash or error out.
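As one sketch of a defined failure path, here is a bounded retry helper in Python that hands the final error back to the caller instead of crashing.  The names `call_with_retries` and `flaky` are hypothetical, and real systems usually add backoff and retry only errors known to be transient.

```python
import time


def call_with_retries(operation, attempts=3, delay_seconds=0.0):
    """Call `operation` up to `attempts` times.

    Returns (True, result) on success, or (False, last_exception) once the
    attempts are used up, so the caller decides what to do with the failure.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return True, operation()
        except Exception as error:  # broad on purpose for the sketch
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    return False, last_error


calls = {"count": 0}


def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "done"


outcome = call_with_retries(flaky)
print(outcome, calls["count"])
```

Retries mean the same operation may run more than once, so they are only safe when the operation can tolerate being repeated.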

Scalable Systems Are Idempotent

Exactly once delivery is nearly impossible.  As a system scales, it becomes inevitable that some events will be processed multiple times.

Scalable systems are idempotent and indifferent to how many times an event is processed.  It doesn’t matter how many times an event enters the system, once, twice, 100 times, it is all the same.

Non-idempotent systems are much easier to build, but they will fail in all kinds of different ways as customers and networks send in duplicate events.
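A minimal sketch of idempotent processing in Python, assuming every event carries a unique id that survives redelivery.  The `Ledger` class and the `event_id` field are hypothetical names for illustration.

```python
class Ledger:
    """Toy event processor that credits accounts.

    Remembering which event ids have been applied makes processing
    idempotent: replaying an event changes nothing.
    """

    def __init__(self):
        self.balances = {}
        self._applied = set()

    def apply(self, event_id, account, amount):
        if event_id in self._applied:
            return False  # duplicate delivery, already handled
        self.balances[account] = self.balances.get(account, 0) + amount
        self._applied.add(event_id)
        return True


ledger = Ledger()
for _ in range(100):
    ledger.apply("evt-1", "acme", 50)

print(ledger.balances)  # {'acme': 50}
```

In a real system the record of applied ids and the balance change would have to be stored atomically, for example in the same database transaction; this in-memory version only shows the shape of the idea.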

Scalable Systems Have Minimal Human Interaction

Humans are expensive and do not scale.  The more human intervention a system requires, the less scalable it is.

Scalable systems do not require humans to deploy code, scale up or down, or watch the logs.

Conclusion

Remember, all scalable systems are alike:

  1. Linear Scaling
  2. Fault Tolerant
  3. Idempotent
  4. Require Minimal Human Interaction

Systems that are not scalable are not scalable in their own unique way.

If you don’t know why your system isn’t scalable, these 4 points are a great place to start looking.

Musketeering Makes Problems Intractable

Musketeering is lumping multiple difficult problems together to present a giant, intractable, disaster.

The name comes from the famous slogan: All for one, and one for all!  Each musketeer supports the group, and the group supports each musketeer.  When multiple problems form as one, they become impossible to defeat.

Most developers have faced a classic Three Musketeer problem with legacy code:

  1. The code is full of bugs
  2. Unit testing is nearly impossible
  3. Touching anything can have unknown side effects

Each of these issues is fixable on its own; together they bring development to a halt.

Why can’t you fix the bugs?  Because testing is nearly impossible and everything you touch has side effects.

Why can’t you write tests?  Because the code is tightly coupled, which produces side effects.  Also, it is full of bugs so we don’t know what the correct functionality is.

Why can’t you reduce side effects?  Because the code is buggy and there are no tests.  If you can’t separate the concerns, you can’t make progress.

Why SaaS Projects With “New” Or “Next” In The Name Are Likely To Fail

Replacing useful software is hard.  Naming things is hard.  Somehow naming projects to replace software seems easy.  

If your system is called [X], the replacement project is called New [X] or Next Generation [X].

These names aren’t “easy”, they are a sign that your project is poorly thought out and likely doomed.  This is especially true in SaaS where your customers are paying for this generation.

New [X] is a sign that your project is inward looking and hasn’t considered your customers’ needs.  Do they need “New” or do they need specific features?

New [X] invites scope creep: new idea + New [X] = features in New [X].  Does the new idea fit in with the purpose of New [X]?  Well, it’s new!

New [X] turns up the pressure on the release.  It’s new!  We couldn’t ship the features to customers incrementally, we have to do a grand reveal and make a splash!

A poorly considered project, with lots of scope creep, and the pressure of making a splash combine to doom New [X].

It is completely fine to replace useful software with a better design and new technology.  Even if you do it iteratively, the end goal is a new version of the same system.  And it turns out that there is a simple naming system you can use!

If your system is called [X], the next iteration of the system is called [X][int++].
Your customers don’t care if you replace MySaaSBackend with MySaaSBackend2.  They shouldn’t even notice.
