Blog Posts

Real Life Testing In Production – Hotel Opening Disaster

This past week I stayed at a brand new resort hotel during opening week.  Opening week was a major disaster for the hotel, culminating with them giving away free alcoholic drinks to all guests.  For a week.

This post isn’t about all of the problems; it is about testing in production, and how much worse things could have been.  You see, the resort will eventually be 9 buildings, and they opened with 3.  As a firm believer in iteration I commend them, because their opening would have been so much worse if they had waited.

Eventually, the resort will be nine 4-story buildings in a U shape around a water park.  3 buildings were complete and receiving guests, the exteriors of 3 were complete, and 3 were being constructed.  Too late to learn any architectural lessons from the early guests, but not too late to learn a lot about the hotel’s operations.

First up - you need maintenance staff on site, because our first room had no hot water.  Maybe we were the first guests, maybe the system was installed wrong and broke.  Whatever the reason, there was no hot water, and no one who could diagnose and fix the issue.

You could have someone walk through with a checklist to test for things like hot water.  Or you could let your customers do it!  You could have maintenance staff on site, or you could literally file a ticket and wait for someone to show up and fix the problem.

The hotel wasn’t full, so we were moved to another room.  Another new room!

This one had no cold water in the bathroom sink.  Fortunately I could debug this issue!  The shutoff valve was closed; I opened it.  Yes, I flipped a switch to turn on a feature.

Another opening week oops - the hotel had purchased cone coffee makers and basket coffee filters.

But the ultimate config issue wasn't no hot water, or no cold water, or the wrong shaped coffee filters.  The ultimate config issue was no liquor license.  At a resort hotel.  For over a week.  They had a fully stocked bar, 2 bartenders, and 4 wait staff.  But no liquor license means no alcohol sales.  They resolved the “alcohol permission issue” by giving it away.

All of these issues would have been so much worse with 3 times more guests.  They could have caught the quality issues with better testing.  They could have avoided giving away thousands in alcohol by waiting until they were ready.  There were many, many ways the hotel could have opened better.

But at least they iterated!  By opening with ⅓ of their rooms available, they were able to limit the fallout from testing in production.

SaaS Code Does Not Have a Final State

I was making a diplomatic comment about some software I was refactoring, “Regardless of the original programmer’s vision, the original code doesn’t work with the final state.”  I realized that my version of the code won’t be the final state either.  When you work in SaaS, code doesn’t have a final state.

SaaS code has history; it was written to solve a problem.  It may have evolved to solve the problem better, it may have evolved because the problem changed.


SaaS code has a present; it is what it is.  It solves a problem for some customers.  The code might be amazing, but frustrates customers because the problem has changed.  The code could be terrible but delights customers because it perfectly solves a static problem.

The code may have a future.  The problem can change, the implementation can change.

What SaaS code doesn’t have is a final state.  Until you delete it, you can never look at a piece of code and say that won’t be changed again.

Don’t make the mistake of thinking that once you refactor some code, the code will be in its final state.  The Service in SaaS will change over time, and your code will change with it.

The Never Rewrite Podcast, Episode Ninety-Seven, Fulfilling Friday – A Tattoo Scam?

Isaac's plan for a full arm sleeve gains the attention of a famous tattoo artist. He just needs to act quickly and send money to secure the appointment…

Watch on YouTube or listen to it at Spotify, Apple Podcasts, or your favorite podcast app, and let us know if you have ever been involved in a rewrite. We would love to have you on the show to discuss your experience!

Sherman’s Law of Doomed Software Projects

Sherman’s Law of Doomed Software Projects - Software projects with the word “Next” or “New” in the name are doomed.

Sherman’s Law only applies to the internal name used by the people working on the software itself.  Names used in marketing or external communication don’t count.

Projects with the word “Next” in the title are doomed because no one wants the next version of your software.  Paying customers aren’t paying today because of the promise of a Next generation of the software.  Internal customers want software that helps them do their job today.  They are using the software today, because it solves a problem today.

Could there be future customers out there who have the problem your current software solves, but who will only buy if you release the next generation?  Sure, but if they’re willing to wait the problem isn’t that pressing and they probably won’t buy the next generation either.

Projects with the word “New” in the title are doomed because new is temporary and muddles project goals.  The new version starts off with clear goals and business value, but time is the enemy.  Whatever gets released is new, regardless of the goals and value in the final product.

If “next” and “new” are names for doomed projects, what names can you use?

You’ve got an existing product or service, and you need to do major work on it.  At the end, you’ll have the “next” generation, or “new” version of the product or service.  Speak to the reason you need to do the work.

If you discover that your software is fundamentally insecure, you don’t need a “Next Generation” project, you need “Maximum Security”.  If your system is slow, don’t start a “New” version that is faster, you need project “Lightspeed”.

I once wrote a piece of software named Polaris.  When the time came for major work, which name would focus the team and drive alignment better - Next Generation Polaris, or Maximum Security Polaris?  New Polaris, or Lightspeed Polaris?

Don’t doom your project, don’t use “Next” or “New”!

New Is Temporary

“New” is a temporary adjective; one that will disappear when the original disappears.

This is especially true when applied to software.

The “New UI” becomes just the UI.
The “New Reports” become the reports.
Any “New Experience” will fade into the experience.

Your current customers won’t remember “new”.  Customers that join after the release will never know about “new” because they never experienced the “old”.

The only ones who know, or care, about “new”, or “old” are the people who built and maintain the code.

“New” versions of existing services aren’t new, they’re the same service, with the same limitations.  Truly new experiences have new names that speak to customer value.

If you are talking about “New Service”, you’re not talking to the customer, you’re talking to yourself.  New is temporary; take the time to figure out what you're really building before it becomes just the current version of what you had before.

The Never Rewrite Podcast, Episode Ninety-Six, Inverting the Testing Pyramid – Testing Infrastructure Changes ft. Rob Gonnella

Infrastructure changes aren't captured by unit tests, so how do you test them before going into production? In this episode Isaac and I welcome back guest Rob Gonnella to discuss testing infrastructure changes. Rob shares his experiences in developing local testing environments, engaging developers, and identifying bugs through end-to-end tests.

If you're thinking about changing things outside of your code and wondering how to test, this is the episode for you!

Watch on YouTube or listen to it at Spotify, Apple Podcasts, or your favorite podcast app, and let us know if you have ever been involved in a rewrite. We would love to have you on the show to discuss your experience!

Rewrites Have Two Teams – Team Rewrite and Team Maintenance.  Join Team Maintenance

When management agrees to rewrite software, they inevitably split the existing team in two - Team Rewrite and Team Maintenance.  Team Rewrite is in charge of creating a brand new system that recreates everything useful and good about the legacy system.  Team Maintenance is in charge of maintaining the legacy system until Team Rewrite completes the rewrite.  Everyone wants to be on Team Rewrite and no one wants to stay on Team Maintenance; everyone is wrong.

The benefits of Team Rewrite are obvious: you get to write new code without all the horrors of the legacy system.  You’ll use the latest technology!  You’ll do things the right way!  Your work won’t be in production, so you won’t have production incidents!  No angry customers!  The list goes on.

The benefits of Team Maintenance aren’t clear.  You get to work in the horrible legacy system.  The system that is so bad, so unfixable, that management has agreed to a rewrite.  Plus you’ll be responsible for incidents and outages!  Why would anyone join Team Maintenance?

Because Team Maintenance is both a flight to safety and an opportunity. 

It’s not glamorous, but the people maintaining the legacy system are critical.  The members of Team Rewrite contribute to theoretical future value, while Team Maintenance’s work supports customers today.  If things go badly with the rewrite, and they usually do, the entire rewrite team can be fired.  The maintenance team remains critical so long as the legacy system lives.

Team Maintenance is also two kinds of opportunity.  

First is the opportunity to clean up the legacy system.  The legacy system doesn’t have to continue to be terrible.  It’s likely that most or all of the people who said it couldn’t be fixed are now on Team Rewrite.  Everyone left is committed to working on it for the duration.  It’s a great time to clean things up; not to save the system, but for your own sakes.  Very quickly the legacy system won’t be horrible.  It may never be great, but you can work on it without sobbing.

The second opportunity is new features.

Customer needs don't stop just because there’s a rewrite going on.  Management will limit the new features so that the rewrite isn’t trying to hit a moving target.  You’ll only be working on the most important, most critical, and most impactful features.  And you’ll get to work on the critical features, because you’re Team Maintenance.

Does this mean you join Team Maintenance rooting for Team Rewrite to fail?  Not at all!  If Team Rewrite succeeds that’s also great for you.  You’ve shown that you’re a selfless team player - you took on the work that no one else wanted!  Seize the opportunities that come along and you’ll show that you can be trusted to improve your team and deliver critical features.

When a rewrite comes along, your best move is to join the maintenance team.  Volunteering for Team Maintenance is safer, comes with more opportunity, and brings you to management’s attention.  Rewrites usually fail and Team Rewrite leaves the company.  Join Team Maintenance for career growth!

Messaging Patterns: What They Are, When To Use Them

Whether you use Kafka, RabbitMQ, or even SMS, messaging infrastructure is neutral about what you are sending and why.  It is up to you, the developer, to decide on the contract between the producer and consumer of your messages.

There are 2 main considerations:

  1. Are your messages Structured (predefined) or Ad Hoc?
  2. Are your messages Ignorable or is Processing Expected?

It is critical that you set appropriate expectations for your messaging structure.  Otherwise, you’ll end up with the Command-Event Anti-Pattern.

Event Pattern - Ad Hoc & Ignorable

The Event Pattern has the least structure and fewest guarantees: You publish events in any format you want, and they may or may not get consumed.

The publisher has no expectations about whether the consumer cares about the event.  The consumer has minimal expectations about the event’s structure or data.

Generally, the only time to use the Event Pattern is with logging.  

Logs are minimally structured.  You want enough structure for your logs to get consumed by your observability platform, but not enough that it is difficult to add logs in the code.

Log messages may or may not be consumed.  Most logging systems determine whether to log based on severity settings.  In production, ERROR will almost always be on, while DEBUG will almost always be off.  Those are run time decisions though; the code doesn’t have any expectations.
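As a rough illustration of the run time decision described above, here is a minimal sketch using Python's standard `logging` module.  The logger name `checkout`, the message text, and the in-memory stream (standing in for an observability platform) are all assumptions made up for this example.

```python
import io
import logging

# In-memory stream standing in for an observability platform (assumption).
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))

# "checkout" is a hypothetical logger name.
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.propagate = False  # keep this example's output confined to our stream

# The run time decision: production-style severity setting.
logger.setLevel(logging.ERROR)

# The publishing code emits both events without knowing which will be consumed.
logger.debug("cart contents: %s", ["sku-1", "sku-2"])
logger.error("payment gateway timed out")

captured = stream.getvalue()
print(captured)
```

The calling code is identical for both messages; only the severity setting decides that the DEBUG event is dropped and the ERROR event is kept.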

State Change Pattern - Structured & Ignorable

With the State Change Pattern each event represents a change to the state of the system.  The messages are highly structured so that the consumer understands how the state just changed.  However, there is no guarantee that anyone will consume the message, and no guarantees about what the consumers will do about the change.

The State Change Pattern is extremely powerful, and difficult to do correctly.

The largest State Change Messaging platforms publish market data from stock exchanges.  Each message is either a new order, a canceled order, or an execution (trade).  Trading software uses the data to determine current prices, build books, and do everything else needed to help stock traders make decisions.  

The stock market (the publisher) doesn’t have any expectations about what the consumer (the trader) on the other end will do about each message.

A more technical example of State Change is database replication.  The primary database publishes change events (called binlogs in MySQL) and the replica database servers consume the messages to stay in sync.  From the primary server’s perspective it doesn’t matter if there are 0 replicas or 100.  Or if the replicas are only doing partial replication.  The primary server will still publish all changes.
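To make the State Change Pattern more concrete, here is a small in-memory sketch loosely modeled on the market data example.  It is not how any real exchange feed or replication protocol works; `OrderEvent`, `Publisher`, and `OpenOrderBook` are hypothetical names invented for illustration.  The point is that the messages are strictly structured, while the publisher behaves the same with zero subscribers or many.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class OrderEvent:
    """A highly structured state change message (hypothetical schema)."""
    kind: str       # "new", "cancel", or "execution"
    order_id: int
    price: float
    quantity: int


class Publisher:
    """Publishes every change, whether or not anyone is listening."""

    def __init__(self) -> None:
        self.subscribers: List[Callable[[OrderEvent], None]] = []
        self.published = 0

    def subscribe(self, callback: Callable[[OrderEvent], None]) -> None:
        self.subscribers.append(callback)

    def publish(self, event: OrderEvent) -> None:
        self.published += 1
        for callback in self.subscribers:
            callback(event)


class OpenOrderBook:
    """One possible consumer: tracks the open quantity of each order."""

    def __init__(self) -> None:
        self.open_orders: Dict[int, int] = {}

    def apply(self, event: OrderEvent) -> None:
        if event.kind == "new":
            self.open_orders[event.order_id] = event.quantity
        elif event.kind == "cancel":
            self.open_orders.pop(event.order_id, None)
        elif event.kind == "execution":
            remaining = self.open_orders.get(event.order_id, 0) - event.quantity
            if remaining > 0:
                self.open_orders[event.order_id] = remaining
            else:
                self.open_orders.pop(event.order_id, None)


events = [
    OrderEvent("new", 1, 10.0, 100),
    OrderEvent("new", 2, 10.5, 50),
    OrderEvent("execution", 1, 10.0, 40),
    OrderEvent("cancel", 2, 10.5, 50),
]

# A publisher with zero consumers still publishes every change.
lonely = Publisher()
for event in events:
    lonely.publish(event)

# The same stream with one consumer that chooses to build a book.
exchange = Publisher()
book = OpenOrderBook()
exchange.subscribe(book.apply)
for event in events:
    exchange.publish(event)

print(lonely.published, book.open_orders)
```

Another consumer could ignore cancels entirely or only count executions; the publisher neither knows nor cares.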

Command Pattern - Structured with Processing Expected

In the Command Pattern, or RPC (Remote Procedure Call), each message represents an attempt to run a command or execute work.  The important difference from the Event Pattern and State Change Pattern is that the Command Pattern has expectations about the consumer’s behavior.

The publisher has the expectation that all of the messages will be processed by the consumers.  Some implementations allow the publisher to know about the consumers and direct specific messages to specific consumers, but that isn’t a requirement.

Background workers, control planes, and job queues are some of the places you would use the command pattern.
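A job queue is probably the easiest of these to sketch.  The example below uses Python's standard `queue` and `threading` modules with a single worker; the `ResizeImage` command and its fields are hypothetical.  The `jobs.join()` call is where the publisher's expectation shows up: it blocks until every command has been marked as processed.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ResizeImage:
    """A structured command (hypothetical): the publisher expects it to run."""
    image_id: int
    width: int


jobs = queue.Queue()
processed = []


def worker() -> None:
    while True:
        command = jobs.get()
        try:
            if command is None:  # sentinel value: shut the worker down
                return
            # Stand-in for the real work of executing the command.
            processed.append((command.image_id, command.width))
        finally:
            jobs.task_done()  # acknowledge every item taken off the queue


thread = threading.Thread(target=worker)
thread.start()

for image_id in (1, 2, 3):
    jobs.put(ResizeImage(image_id, width=640))

jobs.join()     # blocks until every command so far has been processed
jobs.put(None)  # ask the worker to stop
thread.join()

print(processed)
```

Contrast this with the logging example: a dropped DEBUG event is normal, but a command that never gets acknowledged is a failure.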

Infrastructure - Ad Hoc with Processing Expected

The final quadrant, Infrastructure, describes messaging platforms themselves.  The publisher can send whatever they want, and the platform will process it.

Because this pattern describes messaging infrastructure, there are few uses for it ON messaging infrastructure.  Having RabbitMQ tunnel through Kafka might be an interesting project, but it wouldn’t be very useful.
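The four quadrants described above can be summarized as a lookup on the two considerations.  This is only an illustrative sketch; the enum and function names are made up for this example and don't come from any messaging library.

```python
from enum import Enum


class Structure(Enum):
    AD_HOC = "ad hoc"
    STRUCTURED = "structured"


class Expectation(Enum):
    IGNORABLE = "ignorable"
    PROCESSING_EXPECTED = "processing expected"


# One pattern per quadrant, as named in this post.
PATTERNS = {
    (Structure.AD_HOC, Expectation.IGNORABLE): "Event",
    (Structure.STRUCTURED, Expectation.IGNORABLE): "State Change",
    (Structure.STRUCTURED, Expectation.PROCESSING_EXPECTED): "Command",
    (Structure.AD_HOC, Expectation.PROCESSING_EXPECTED): "Infrastructure",
}


def classify(structure: Structure, expectation: Expectation) -> str:
    """Return the pattern name for a stream's structure and expectation."""
    return PATTERNS[(structure, expectation)]


print(classify(Structure.STRUCTURED, Expectation.IGNORABLE))
```

If you can't answer both questions for a given stream, that is a hint the stream is drifting toward the anti-pattern described next.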

Beware The Command-Event Anti-Pattern

If you aren’t intentional about your messaging pattern, you will inevitably end up with the Command-Event Anti-Pattern.  This is when you have multiple, loosely defined, message structures, some of which place processing expectations on the consumer.

The Command-Event Anti-Pattern makes it easy for incorrect messages to clog up the system.  It creates confusion about which messages can be ignored, and which must be processed.  You will have a muddled mess and a long hard transition to separate your message types.

Conclusion - Be Intentional About Your Messages

Remember, the messaging infrastructure will accept any structure, or no structure.  It is concerned with delivery, not processing.  So long as every consumer gets every message that it is supposed to, your infrastructure is working properly.

It is up to you, the developer, to add expectations.

How much structure do your messages need?  Can they be skipped?  Depends on what problem you are trying to solve!  If you go forward without deciding you’ll end up with a mess known as the Command-Event anti-pattern.

If you’ve got a mess, you can fix it iteratively!  Never try a rewrite!  Iteratively separate your messages onto new, problem specific, streams.
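One way to picture the iterative separation is a small router that moves one message type at a time onto its own stream and leaves everything else on the legacy stream.  This is a sketch under assumptions, not a recipe for any particular broker: the message types, stream names, and dictionary-based messages are all hypothetical.

```python
from collections import defaultdict

# Message types that have already been moved to dedicated streams.
# Each iteration of the cleanup adds one entry here (names are hypothetical).
MIGRATED = {
    "resize_image": "commands.resize_image",
}

LEGACY_STREAM = "legacy.mixed"


def route(message: dict, migrated: dict = MIGRATED) -> str:
    """Pick the destination stream; unmigrated messages stay on legacy."""
    return migrated.get(message.get("type"), LEGACY_STREAM)


mixed = [
    {"type": "resize_image", "image_id": 1},
    {"type": "user_logged_in", "user_id": 7},
    {"type": "resize_image", "image_id": 2},
    {"note": "free-form event with no type"},
]

# Iteration one: only resize_image commands have their own stream.
streams = defaultdict(list)
for message in mixed:
    streams[route(message)].append(message)

# Iteration two: migrate one more type by adding a single mapping.
next_step = {**MIGRATED, "user_logged_in": "state.user_sessions"}
streams_after = defaultdict(list)
for message in mixed:
    streams_after[route(message, next_step)].append(message)

print({name: len(msgs) for name, msgs in streams_after.items()})
```

Each step is small and reversible, and the legacy stream shrinks until only the truly ad hoc events are left on it.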

The Never Rewrite Podcast, Episode Ninety-Five, We’re Writing a Book!

Never Rewrite is going analog! With 2000 minutes over the last two years, we think we've got plenty of content to coalesce into a book. If you have a rewrite story to share, now's your chance to get forever immortalized in the podcast hosted by two randos.

Watch on YouTube or listen to it at Spotify, Apple Podcasts, or your favorite podcast app, and let us know if you have ever been involved in a rewrite. We would love to have you on the show to discuss your experience!

Announcing Never Rewrite: The Low Risk Guide To Modernizing Your Legacy Software

Isaac and I are excited to announce that we are taking our podcast to a book!

Never Rewrite: The Low Risk Guide To Modernizing Your Legacy Software is for everyone who maintains, or relies on, systems that you hate.

Systems that are poorly designed, unreliable, untestable, won’t scale, or make you tear your hair out.  Where every change creates 2 new bugs.  Systems that are so bad that they can’t be fixed.  

Where it seems like the only option is to start over from scratch.

When it seems that you need a rewrite, you need Never Rewrite: The Low Risk Guide To Modernizing Your Legacy Software.

Because no one actually wants a rewrite -

Developers don’t want a rewrite, they want to work with well designed and tested code that doesn’t fight them every step of the way.  They want to take pride and get joy from their work.

Managers don’t want a rewrite, they want the people they manage to be happy, for bug reports to be few, and for work to be delivered at a consistent pace.

Leadership doesn’t want a rewrite, they want to empower their people, have reliable systems, and consistent delivery.

If no one wants a rewrite, why does it seem like a reasonable solution?  Because it seems like there are no other options.

We’re here to tell you that there is another way -

Never Rewrite will teach you how to modernize legacy software without a rewrite.  The book will show you how rewrites destroy teams, increase turnover, and prevent growth.  We will show you how to escape a rewrite in progress.

Along the way we share real stories of rewrites: from Sonos’ recent $500 million debacle, to companies that paused new features for years, to projects that cost millions and never made it into production.  Our case studies are full of lost opportunity, wasted money, and derailed careers.

If you’d like to learn more, please sign up to our mailing list and be the first to hear about our progress!
