How I Will Talk You Out Of A Rewrite

If you come to me saying “we need a rewrite”, I will run you through a lot of “why” questions to discover if you actually need a rewrite.  I am basically trying to answer these three questions:

  1. What will the new system do differently?
  2. Why can’t you do that in the current system?
  3. Why can’t you build the new things separately?

Answering these questions yourself will help you think through whether you really need to rewrite your existing system.

Let’s examine each more closely.

What Will The New System Do Differently

What will be different about the new system?

For this exercise you have to ignore code quality, bugs, and developer happiness.  Those are all important, but corporate reality hasn’t changed.  The same forces that resulted in low quality code, endless bugs, and developer unhappiness are still there.  A rewrite might get you temporary relief, but over the long term, the same negative forces will be at work on the new system.

So, what is different about the new system?  The differences could be technical, like a new programming language, framework, or architecture.  Maybe you need to support a new line of business and the existing system can’t stretch to cover the needs.

Get clear on what will be different, and write it down.  These are your New Things.

Why Can’t You Do It In The Current System

Now that you are clear on what New Things you need, why can’t you build the New Things into the existing system?

We are still putting aside issues like the existing code quality, bugs, and developer happiness.  If those forces are all that is stopping you from doing the new work in the existing system, I have bad news: those forces will wreck the new system as well.  You don’t need new software, you need to change how you write software.  Stop now, while you only have one impossible system to maintain.

Other than the forces that cause you to write bad software, why can’t you use the current system?  Get clear.  Write it down.  These are your Forcing Functions.

Why Can’t You Build The New Things Separately?

At this point we know what the New Things are and we know the Forcing Functions that prevent you from extending the current system.  Why can’t the New Things live alongside the existing system?

A rewrite requires rewriting all of your existing system.  Building New Things in a new system because of Forcing Functions only requires building the New Things.  Why can’t you do that?

By this point office politics are out.  Office politics can’t overcome Forcing Functions.

Quality and bugs are also out, because there is no existing code to consider.

Get clear.  Write it down.  This is your Reasoning.

Now, take your Reasoning, backed by the Forcing Functions, and you have explained how getting the New Things requires a rewrite.  If your Reasoning can convince your coworkers, then I’m sorry, you do actually need a rewrite.  

If not, it is time to talk about alternatives.

What Happens After I Talk You Out Of A Rewrite

Most rewrites are driven by development culture issues, not the software itself.  This brings us back to code quality, bugs, and developer happiness.  A rewrite won’t fix any of those issues.

The good news is that you can fix all of them without a rewrite.  Even better news is that fixing them will only take about as much effort as you think a rewrite would take.  The bad news is that your culture is pushing against making the fixes.

Take it one step at a time, and keep delivering.

Sherman’s Law of Doomed Software Projects

Sherman’s Law of Doomed Software Projects - Software projects with the word “Next” or “New” in the name are doomed.

Sherman’s Law only applies to the internal name used by the people working on the software itself.  Names used in marketing or external communication don’t count.

Projects with the word “Next” in the title are doomed because no one wants the next version of your software.  Paying customers aren’t paying today because of the promise of a Next generation of the software.  Internal customers want software that helps them do their job today.  They are using the software today, because it solves a problem today.

Could there be future customers out there who have the problem your current software solves, but who will only buy if you release the next generation?  Sure, but if they’re willing to wait the problem isn’t that pressing and they probably won’t buy the next generation either.

Projects with the word “New” in the title are doomed because new is temporary and muddles project goals.  The new version starts off with clear goals and business value, but time is the enemy.  Whatever gets released is new, regardless of the goals and value in the final product.

If “next” and “new” are names for doomed projects, what names can you use?

You’ve got an existing product or service, and you need to do major work on it.  At the end, you’ll have the “next” generation, or “new” version of the product or service.  Speak to the reason you need to do the work.

If you discover that your software is fundamentally insecure, you don’t need a “Next Generation” project, you need “Maximum Security”.  If your system is slow, don’t start a “New” version that is faster, you need project “Lightspeed”.

I once wrote a piece of software named Polaris.  When the time came for major work, which name would focus the team and drive alignment better - Next Generation Polaris, or Maximum Security Polaris?  New Polaris, or Lightspeed Polaris?

Don’t doom your project, don’t use “Next” or “New”!

New Is Temporary

“New” is a temporary adjective; one that will disappear when the original disappears.

This is especially true when applied to software.

The “New UI” becomes just the UI.
The “New Reports” become the reports.
Any “New Experience” will fade into the experience.

Your current customers won’t remember “new”.  Customers that join after the release will never know about “new” because they never experienced the “old”.

The only ones who know, or care, about “new”, or “old” are the people who built and maintain the code.

“New” versions of existing services aren’t new, they’re the same service, with the same limitations.  Truly new experiences have new names that speak to customer value.

If you are talking about “New Service”, you’re not talking to the customer, you’re talking to yourself.  New is temporary.  Take the time to figure out what you’re really building before it becomes just the current version of what you had before.

Rewrites Have Two Teams – Team Rewrite and Team Maintenance.  Join Team Maintenance

When management agrees to rewrite software, they inevitably split the existing team in two - Team Rewrite and Team Maintenance.  Team Rewrite is in charge of creating a brand new system that recreates everything useful and good about the legacy system.  Team Maintenance is in charge of maintaining the legacy system until Team Rewrite completes the rewrite.  Everyone wants to be on Team Rewrite and no one wants to stay on Team Maintenance; everyone is wrong.

The benefits of Team Rewrite are obvious: you get to write new code without all the horrors of the legacy system.  You’ll use the latest technology!  You’ll do things the right way!  Your work won’t be in production, so you won’t have production incidents!  No angry customers!  The list goes on.

The benefits of Team Maintenance aren’t clear.  You get to work in the horrible legacy system.  The system that is so bad, so unfixable, that management has agreed to a rewrite.  Plus you’ll be responsible for incidents and outages!  Why would anyone join Team Maintenance?

Because Team Maintenance is both a flight to safety and an opportunity. 

It’s not glamorous, but the people maintaining the legacy system are critical.  The members of Team Rewrite contribute to theoretical future value; Team Maintenance’s work supports customers today.  If things go badly with the rewrite, and they usually do, the entire rewrite team can be fired.  The maintenance team remains critical so long as the legacy system lives.

Team Maintenance is also two kinds of opportunity.  

First is the opportunity to clean up the legacy system.  The legacy system doesn’t have to continue to be terrible.  It’s likely that most or all of the people who said it couldn’t be fixed are now on Team Rewrite.  Everyone left is committed to working on it for the duration.  It’s a great time to clean things up; not to save the system, but for your own sakes.  Very quickly the legacy system won’t be horrible.  It may never be great, but you can work on it without sobbing.

The second opportunity is new features.

Customer needs don't stop just because there’s a rewrite going on.  Management will limit the new features so that the rewrite isn’t trying to hit a moving target.  You’ll only be working on the most important, most critical, and most impactful features.  And you’ll get to work on the critical features, because you’re Team Maintenance.

Does this mean you join Team Maintenance rooting for Team Rewrite to fail?  Not at all!  If Team Rewrite succeeds that’s also great for you.  You’ve shown that you’re a selfless team player - you took on the work that no one else wanted!  Seize the opportunities that come along and you’ll show that you can be trusted to improve your team and deliver critical features.

When a rewrite comes along, your best move is to join the maintenance team.  Volunteering for Team Maintenance is safer, comes with more opportunity, and brings you to management’s attention.  Rewrites usually fail and Team Rewrite leaves the company.  Join Team Maintenance for career growth!

State, Persistence, And Cold Restarts

In a SaaS, state exists in many places, some of them outside of your control.  If you don’t know where to look, you will never truly understand your system.

State You Can Control

As systems grow in complexity, state creeps into an ever growing number of places:

  • Databases
  • Caches
  • Running Software
  • Config Files
  • Shell scripts
  • Flat files

Do your caches have state?  If they went down and it impacted anything other than performance, then they have state.

Do your shell scripts have state?  If they reference specific customers then they certainly do.

The list goes on.

Some of your state has backups, like databases, source code, and some config.  Your only backups for running software and caches are more instances of running software and caches.
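
The cache question above can be sketched in a few lines.  This is a minimal illustration, not a real caching setup: plain dictionaries stand in for the database and the cache, and the function and key names are hypothetical.

```python
# Hypothetical stand-ins: plain dicts play the database and the cache.
database = {"customer:1": "Acme"}
cache = {}

def read_through(key):
    """Cache-aside read: the cache only speeds things up."""
    if key in cache:
        return cache[key]
    value = database.get(key)  # the database is the source of truth
    if value is not None:
        cache[key] = value
    return value

def record_login(key):
    """Counter kept only in the cache: this is state with no backup."""
    cache[key] = cache.get(key, 0) + 1
    return cache[key]

read_through("customer:1")
record_login("logins:1")
record_login("logins:1")

cache.clear()  # simulate the cache going down and coming back empty

after_restart_name = read_through("customer:1")  # recoverable from the database
after_restart_logins = cache.get("logins:1")     # gone, nothing to rebuild it from

print(after_restart_name, after_restart_logins)
```

Losing the cache costs the first function some speed.  It costs the second function data, which is the sign that the cache is holding state.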

State You Can’t Control

State also creeps into things that aren’t software, and you can’t control:

  • Tribal knowledge
  • Manual Procedures
  • Anything in someone’s head that isn’t documented

People are a critical, and critically overlooked, part of any complex system.  The most stateful people aren’t the programmers.  It’s the operations and support people who interact with customers that have the most state and the least backup.

Disaster Recovery and Cold Start

A cold restart is the hardest kind of disaster recovery.  You have all new server instances, whatever is in source control, and whatever is in your backups.

Things that live in people’s heads are even harder to recover.  Which repos are important, which services are supposed to run, what does the system even look like?

Depending on your setup you may be OK, or you may never recover.

I was at GuaranteedRate when it acquired Discover's Home Loan operations.  As part of the acquisition we got their mortgage software and their developers, but no state.  I watched the team of developers that created the system spend a year trying to do a cold restart.  They failed.

Consider State Now, Before It Gets Away From You

State grows organically with systems and keeping it under control requires effort.  Beyond the fear of disaster recovery, knowing where state exists in your system is key to maintenance and growth over time.

Once you start looking for state in your systems, you’ll find it everywhere.

The Technical Problems You Should Solve At A Midsize SaaS

20 Things You Shouldn’t Build At A Midsize SaaS was about technical problems developers at a midsize SaaS shouldn’t try to solve.  Is midsize SaaS development all glue work?  Have all the problems been solved?

Of course not.  Midsize SaaS is the garden of Refactoring, Scaling, and Performance.

To make it sound fancy: The deep work for developers at a midsize SaaS is designing solutions for emergent architectural problems.

A Midsize SaaS Has Different Problems Than A Startup SaaS

Startups have unproven theories about what customers want.  They need to get features out as quickly as possible to test theory against reality.  Worrying about multiple data centers, global latency, or the performance of features customers can’t see, is a waste of time.

At a startup you should write good code, find product market fit, and not worry about how the system will perform when you have 10,000 paying customers.

Once you have thousands of paying customers, that’s when architectural gardening kicks in.

How To Support What Customers Want

The startup phase will leave you with a valuable product and an almost random set of assumptions.  You get to puzzle out the assumptions, compare them to reality, and choose solutions.

If your systems are in the United States, and all of your customers are in the United States, you will have different architecture needs than if your customers are globally distributed.  Linear and exponential scaling produce different problems and require different solutions.

You need to identify which problems you have, and iterate towards standard solutions.  Standard solutions are critical because they make your competitive advantage, the differences that are valuable, shine through.  You can’t find the valuable unique differences when everything in your system is bespoke.

Conclusion

The deep work at a midsize SaaS is identifying emerging problems and iterating towards standard solutions.  It is pathfinding from wherever the startup phase has left you towards known destinations.  Moving towards known standard solutions makes it easier to find and improve valuable differentiators.  Building unique versions of everything makes everything harder without adding value to your customers.

How To Say No As A Staff Engineer

I say “no” to tech plans.  A lot of tech plans.  Most of the time, if you ask my opinion about a tech plan, I’m going to tell you not to do whatever you’re planning.  If product owners are involved the odds of me saying no go way up.  And yet, few people dread asking me about their plans.

Staff Engineers can’t just say “no” and leave it at that.  The power of no comes with responsibility.  When I say “no”, I have to explain why, offer alternatives, and make myself available to work through emergent issues.  When product owners are involved it is my responsibility, as a Staff Engineer, to help separate the product goals, which are fine, from the solution, which got a “no”.

Let’s break it down.

Explain Why

My expertise is in Scaling and Performance.  I say no when the solution won’t scale, won’t perform well, or has problematic edge cases.  Often, something that works fine for the average customer won’t work well for the largest 10% of customers.

When I say no to a design that won’t scale, I speak to real numbers.  “That’s fine for a table with 1,000 rows.  We have customers with a million rows in that table.  They’ll be waiting minutes for a response from the backend.”
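
To make that kind of statement concrete, here is a back-of-the-envelope sketch.  The per-row cost is an assumption chosen for illustration, not a measurement from any real system, and the function name is hypothetical.

```python
# Assumed cost of processing one row, in microseconds (hypothetical figure).
MICROSECONDS_PER_ROW = 100

def response_seconds(rows):
    """Estimated response time when the backend touches every row once."""
    return rows * MICROSECONDS_PER_ROW / 1_000_000

small_table = response_seconds(1_000)      # the average customer
large_table = response_seconds(1_000_000)  # the largest customers

print(f"1,000 rows:     {small_table:.1f} seconds")
print(f"1,000,000 rows: {large_table:.1f} seconds")
```

Under this assumption the average customer waits about a tenth of a second, while the million-row customer waits about 100 seconds.  The design is the same in both cases; only the data size changed.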

Offer Alternatives

Decades of experience doesn’t make me right, but it does give me a lot of experience to draw on.  There were specific problems in the design that caused me to say no.  Those problems are solvable.  The solutions may not be appealing, but they exist.

If you have data centers all over the world, a central database full of global data is going to be a problem.  There are lots of alternatives depending on why you wanted global data in the first place.  Alternatives are a conversation, not prescription.

Be Available For Emergent Issues

Now that I’m part of the solution, I’m responsible for helping with emergent issues.  As the project continues, new constraints will emerge and new requirements will be discovered.  Some of the emergent issues are going to impact Scale and Performance.

Things change and the old recommendations may become invalid.  Staff Engineering recommendations have to be living documents.

Saying No Should Be More Work!

It would be much easier to say “Looks good to me!”  If I pretend to be on board I’m no longer responsible.  I have sabotaged the asker, but they’ll be too busy fighting endless fires to realize that I could have stopped everything from going wrong.

For Staff Engineers, saying no to a proposal comes with an implied offer to help.

20 Things You Shouldn’t Build At A Midsize SaaS

I have seen developers build a lot of unnecessary and counterproductive pieces of software over the years.  Generally, developers at small to midsize SaaS companies shouldn’t build any software that doesn’t directly help them deliver a service to their customers.

Whether it was the zero interest rate period, bad management, or hubris, developers spent a lot of company money on projects that never made sense given their employer’s goals and size.  I have seen custom implementations of every type of software on this list.  None of it worked better than open source, and none offered a competitive advantage.

If you find yourself developing or managing any of these twenty types of projects, stop and seriously consider what you are doing.

  1. Scripting languages
  2. Compiler extensions
  3. Transpilers
  4. Database extensions
  5. Databases
  6. DSLs
  7. ORMs
  8. Queues
  9. Background work schedulers
  10. GraphQL
  11. Stateful REST
  12. Frontend Frameworks
  13. Backend Frameworks
  14. Servers
  15. Dependency Injectors
  16. CSV writers or parsers
  17. Cryptography Implementations
  18. Logging Libraries
  19. DateTime libraries
  20. Anything from “First principles”

There are always exceptions.  If building this software gives you a competitive advantage, go ahead.  In general, anyone suggesting these projects is biting off more than they can chew and doesn’t fully understand the problem they are trying to solve.

Most often things start out as a quick hack - “I’ll just concatenate these strings with a comma, it will be faster than finding a full CSV library.”  Soon you’re implementing custom separators and string escaping.
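
As a minimal sketch of how the quick hack goes wrong, compare naive comma concatenation with Python’s standard csv module.  The sample row is made up for illustration.

```python
import csv
import io

# Hypothetical row: one field contains a comma, another contains quotes.
row = ["Acme, Inc.", 'The "best" widgets', "42"]

# The quick hack: concatenate the fields with commas.
naive_line = ",".join(row)

# Reading the naive line back splits "Acme, Inc." into two fields.
naive_fields = next(csv.reader(io.StringIO(naive_line)))

# The library version: quoting and escaping are handled for you.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
library_line = buffer.getvalue()
library_fields = next(csv.reader(io.StringIO(library_line)))

print(len(row), len(naive_fields))  # field counts no longer match
print(library_fields == row)        # the library round trip preserves the row
```

The hack works until the first customer name with a comma in it.  After that, every fix you add is a piece of a CSV library you are now maintaining.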

If your company has done their own implementations don’t despair, iterate towards a better library!

Robert Moses On Completing Projects

“Once you sink that first stake, they’ll never make you pull it”
Robert Moses, The Power Broker by Robert A. Caro, page 207

Robert Moses built most of the highways, parkways, bridges, tunnels, and parks in and around New York City.  His most effective tool was to lie about the costs so that he could start the project.  Once the project was started he knew that the approver’s refusal to admit a mistake and the sunk cost fallacy would allow the project to continue to completion.

Moses’s method is a clear moral hazard - start a project and the approver has to help complete it, or admit that they made a mistake by giving the initial approval.

Does the project cost more than expected?  Keep going.

Will the project take longer than expected?  Keep going.

The project no longer makes sense?  Learned that the project won’t solve the problem?  That the solution isn’t cost effective?  Keep going.

Depending on which article you read, 45-90% of software projects are late, cost 15-50% more than expected, and have a 20% chance of not delivering the expected value.

I once joked about a company that “started projects on time” but never had the resources to finish anything.  Unfinished projects drain resources and make it harder to finish any other project.

Because as Robert Moses knew, once you start, projects keep going.

Do You Need Permission To Write Quality Code?

A developer recently confided in me, “I wanted to write good code on this project, but when I asked my manager he said we needed to do whatever it took to hit our delivery date.”  He was explaining why he shoved his changes into the existing giant, untestable functions instead of refactoring and writing tests.

He asked his manager for permission to write quality code, when he didn’t get it he wrote shit code, and he missed the delivery date.

Asking permission to write quality code is the same faulty thinking that leads managers to skip testing: the idea that low quality is faster.  Low quality code is just low quality; developers won’t create it faster than high quality code.

I wrote lousy code in college because I had no idea what I was doing.  I wrote lousy code for my first year as a professional programmer because I still had no idea what I was doing.  After being on the job for about 18 months I knew I was writing lousy code but I did it anyway because that’s what I was used to doing.

After about 18 months I started copying the more effective developers and writing higher quality code.  My velocity increased with quality.  By the end of my second year as a professional programmer I was writing high quality code because that was the fastest way for me to deliver results.

When you ask for permission, you are asking for someone else to take responsibility for the decision.  If developers ever ask for permission it is a sign that they don’t believe in quality, or worse, that they don’t believe that their manager believes.

If you’re a manager and ever get asked for permission to “do it right”, ask yourself, “what has gone wrong, and how do I fix it?”
