Cheerfully Going To Fail In Production

This weekend I cheerfully went to lose at a fencing tournament.

That’s not negativity. I’ve been fencing off and on for 30 years, and I just started up again after a multi-year hiatus.  I fenced terribly, which I already knew from practice.  What I didn’t know, and what practice couldn’t tell me, is where I would break down when I fenced for real.

I was just competent enough at the things I knew to watch for that my opponents destroyed me in ways I didn’t even remember to practice.

What does this have to do with software and my mantra of Never Rewrite?

After years of not doing maintenance, I am a legacy system.  Practice is literally a testing environment.  If I had spent months working on my known problems, when I finally got to production, I would still have been destroyed by my blind spots.

You can choose to spend months rewriting only to have your system fail when you finally make it to production.

Or, you can choose to cheerfully go to production with what you’ve got so that you can see what breaks and iterate.

Acceptable Beer Bellies in your codebase

How do Beer Bellies begin?

The panel is fully functional, when you push the button a light turns on and the elevator comes.  It is also obviously wrong - the top button is flush with the mount, and the bottom button sticks out.

I found this sad Beer Belly Elevator Panel at a high-end resort and wondered how it happened.

Certainly whoever installed the mismatched button knew it was wrong.  Did the tech not care?  Was using the wrong button the only way to get the panel repaired?  Was the plan to come back and fix it when the right parts came in?

The hotel maintenance staff had to sign off on it.  Did they care about the quality of the repair?  Were they only able to give a binary assessment of “working” or “not working”?

Did the hotel manager not care?  Were they told to keep costs down?  It isn’t broken now; it would be a waste to fix something that isn’t broken.

Quality vs Letting Your Gut Hang Out

Employees at the hotel see the mismatched panel every day.  It is a constant reminder that letting things slide, just a little, is acceptable at this hotel.

When you let consistency and quality slide because something works, you’re creating beer bellies in your codebase.

One small button at a time until everyone sees that this is acceptable here.

So long as a light turns on when you hit the button does it matter if the light is green, red or blue?  Does it matter if the light is in the center or on the edge?

But I’m running a SaaS, not a Hotel

Your SaaS may not maintain elevator panels, but your codebase is probably full of beer bellies.

“It works, we’ll clean it up on the next release” bellies.

“This is a hack” bellies.

“This is the legacy version, we’re migrating off of it” bellies.

When you let sad little beer bellies into your codebase, your employees see exactly what you find acceptable.

Is This A Quality Panel?

I recently stepped into an elevator and saw this panel:

The panel was clean, made of high-quality materials, and everything worked.

Quality is about more than functionality.  Does this look like a quality panel?

Everything works!

Push a button and it lights up!

Sure, it might light up red, green or blue. 
And the light might be around the edge or in the center. 
And some of the buttons are flush with the mount, while others extend out; but that doesn't impact the light turning on.

There are 12 possible button implementations, and 5 of them appear randomly.

But when you push the button, the light turns on!

What does that have to do with SaaS Scaling?

No matter how excellent any individual endpoint implementation is, having an API with endpoints that work differently decreases the overall quality of your product.

Having a UI with mismatched widgets and styles increases the user’s cognitive load and decreases quality, even when the differences don’t change any functionality.

Consistency during the scaleup period can be difficult as multiple new teams spin up, but it’s critically important if you want a quality product.

2022 Year In Review

It’s the time of year to look back, reflect, and highlight some interesting pieces you might have missed.

By The Numbers

I set a personal goal of publishing every week, or at least 50 articles for 2022.
All told, I published 32 articles - 64% of my goal.

I may not have hit my goal, but I'm very proud of the pieces I did write!

Top 3 Articles From 2022

These were the most popular articles that I wrote in 2022:

  1. My series of articles on SaaS Tenancy Models
  2. The Opposite Of Iterative Delivery
  3. The Chestburster Anti-Pattern

Top 3 Articles In 2022

These were the three most popular articles in 2022 that I wrote in past years:

  1. Links VS Tags, A Rabbit Hole
  2. The 5 Ws For Developers
  3. Your Database is Not A Queue

Two Published Articles

I became a published author this year, with 2 pieces on leaddev.com.

  1. The Dangers Of Pulling Rank as a Staff Engineer
  2. How to Break the "Get Me Everything Cycle"

Goals For 2023

I'm going to spend more time writing in 2023. My goal is to publish 100 pieces, approximately 2 per week.

Happy Holidays and I'll be back in the new year!

Do You Punish Customers For Loyalty?

Does your Customer’s experience with your service get better over time?

Does it get worse?

SaaS software often punishes long term clients in subtle and frustrating ways.

Do your CRM customer screens show a decade of buying history?

How many emails can a contact open before you can’t open the contact?

Do marketing campaigns, contact lists, and tags accumulate over the years?

Do database inserts slow down as you write the 10 millionth row into a log table?

There are countless ways to punish customers for staying with you for years.  It’s not a startup problem; it sneaks in as you become a scaleup.  The flood of new customers blinds you to the slow leak as your most loyal customers churn.

When your longest-standing customers complain about performance more than your largest, chances are your software is punishing them for being loyal.

Chipping Away

You have a goal, you know what it means, and what it implies.

You also know what’s blocking your progress.

It’s time to iterate!

Iterating Against the Blockers

Ask yourself:

  • What’s the first step?
  • What if all I needed was a tool?
  • How will I know if it’s working?

These three primary questions will help you chip away at the Blockers.

You want progress, not victory.  When you keep iterating away at the Blockers, eventually achieving your goals becomes easy.

Besides the questions, the main constraint to keep in mind is that each iteration needs to leave the system in a state where you can walk away from the project.

Unused functionality?  Totally fine.  

Refactored code with no new usage?  Great!

Tools that work but don’t do anything useful?  One day.

Nothing that requires keeping two pieces of code in sync.

Nothing that would prevent other developers from evolving existing code.

Whatever you do, it must go into the system with each iteration.

Example: Continuing from the Async Processing Blockers

I want to make my API asynchronous so that client size doesn’t impact the API’s responsiveness.  But, I can’t make the API asynchronous because:

  • I don’t have a system to queue the requests.
  • I don’t have a system to process the requests off of a queue.
  • I don’t have a way to make the processing visible to the client.

Attempting to do all of this work in one giant step is a recipe for a project that gets delivered 6 months to a year late.

I’m going to hand wave and declare that we are using SQS, AWS’s native queuing system.  This makes setting up the queue trivial and reliable.

I don’t have a system to queue the requests.

What’s the first step?

Write a data model and serializer.  What am I even going to write onto this queue?

What if all I needed was a tool?

Instead of worrying about a system, create a command line tool in your existing codebase to push data to SQS.  It won’t be reliable, it won’t have logging, and it won’t have visibility.

But you’re the only one using it, so that’s fine.

How will I know it’s working?

Manually!  AWS has great observability.  You don’t need to do anything.

Combining your first step (a data model), your tool, and AWS observability, you’ll be able to push data onto a queue and view what got sent.

The data model will be wrong and the tool will not be production ready!

That’s ok because no existing functionality is blocked or broken.  Getting interrupted doesn’t create risk, which means you can work even if you only have a little time.
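As a sketch of what that first step plus tool might look like: everything here is illustrative, not a prescribed design — the `ImportRequest` fields, the function names, and the queue URL are placeholders I’m inventing, and the data model is expected to be wrong on the first pass.

```python
import json
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class ImportRequest:
    """First guess at what a queued request needs -- expect this to be wrong."""
    client_id: str
    payload: dict
    request_id: str
    requested_at: str


def serialize(client_id: str, payload: dict) -> str:
    """Turn a raw request into the JSON body we'll push onto the queue."""
    req = ImportRequest(
        client_id=client_id,
        payload=payload,
        request_id=str(uuid.uuid4()),
        requested_at=datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(req))


def push(queue_url: str, client_id: str, payload: dict) -> None:
    """Dev-only helper: no retries, no logging, no visibility."""
    import boto3  # imported here so the serializer works without AWS installed

    sqs = boto3.client("sqs")
    sqs.send_message(QueueUrl=queue_url, MessageBody=serialize(client_id, payload))
```

The boto3 import lives inside `push()` so you can exercise the serializer without AWS credentials; call `push()` from a REPL or a tiny argparse wrapper when you want to watch messages land in SQS.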

I don’t have a system to process the requests off of a queue.

What’s the first step?  

Write a data model and deserializer.  What data do I need to be on the queue in order to recreate the event I need to process?

What if all I needed was a tool?

Create a tool to pull the message off the queue and deserialize it.  Send the result to a data validator.  (You’re accepting customer requests from an API; you’d better have a data validator.)

How will I know it’s working?

Manually!  AWS has great observability.  You don’t need to do anything.

Combining the three gets you the ability to manually pull data off the queue, deserialize it, and validate it.

You can do this before, after, or during your work to get the data onto a queue.  It’s not production ready, but it also doesn’t create risk.
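A matching sketch for the pull side, again with invented names and a deliberately naive validator (the required fields mirror whatever your serializer writes; adjust to your own data model):

```python
import json

# Fields the processor needs in order to recreate the event -- a first guess.
REQUIRED_FIELDS = {"client_id", "payload", "request_id", "requested_at"}


def deserialize(body: str) -> dict:
    """Parse the raw SQS message body back into an event dict."""
    return json.loads(body)


def validate(event: dict) -> list:
    """Return a list of problems; an empty list means the event is usable."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - event.keys())]
    if "payload" in event and not isinstance(event["payload"], dict):
        errors.append("payload must be an object")
    return errors


def pull_one(queue_url: str):
    """Dev-only tool: pull one message, deserialize, validate, print the result."""
    import boto3  # imported here so deserialize/validate work without AWS installed

    sqs = boto3.client("sqs")
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
    messages = resp.get("Messages", [])
    if not messages:
        return None
    event = deserialize(messages[0]["Body"])
    print(validate(event) or "valid")
    return event
```

Note that `pull_one` doesn’t delete the message; during this phase, leaving cleanup manual (or to the SQS console) keeps the tool harmless.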

I don’t have a way to make the processing visible to the client.

What’s the first step? 

What does visibility look like to the client?  Where does the data go in your UI?  What would you want to know?

What if all I needed was a tool?

Make an endpoint that calls AWS and returns the data you think you need.

How will I know it’s working?

Manually!  Compare what your endpoint tells you with what AWS tells you.  Don’t start until you have tools for adding and removing events from the queue.

Combining the three gets you an endpoint that tells you about the queue.

The endpoint should be safe to deploy to production; the queue there is always empty.
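The core of that endpoint might look like the sketch below — SQS’s `get_queue_attributes` call is real, but the response shape is my guess at what a client-facing status payload could contain, and the function names are hypothetical:

```python
def summarize(attrs: dict) -> dict:
    """Shape raw SQS attributes (string values) into a client-facing status."""
    return {
        "pending": int(attrs.get("ApproximateNumberOfMessages", 0)),
        "in_flight": int(attrs.get("ApproximateNumberOfMessagesNotVisible", 0)),
    }


def queue_status(queue_url: str) -> dict:
    """Fetch queue counts from AWS; wire this up behind a status endpoint."""
    import boto3  # imported here so summarize() works without AWS installed

    sqs = boto3.client("sqs")
    resp = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )
    return summarize(resp["Attributes"])
```

Comparing `queue_status()` against the SQS console is exactly the manual check described above.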

Conclusion

Iterating allows you to chip away at your blockers until there’s nothing stopping you.

Apply the three questions:

  • What’s the first step?
  • What if all I needed was a tool?
  • How will I know if it’s working?

Always keep the system in a state where you can walk away from the project.

Keep iterating against your blockers, and you’ll be amazed at how soon you’ll achieve your goals!

The Blockers

Implications, defining what has to be true for your goal to succeed, are the hardest step.

Step 4, The Blockers, is relatively easy.

What’s Stopping You?

What is stopping you and your team from making your implications true?

“I’m not working on it” and “there aren’t enough hours in the day” aren’t valid answers.

The Implications are too big, too complex and too scary to tackle head on.  If they weren’t, you wouldn’t be stuck.

Defining the blockers is about naming the steps you can’t take; the chasms you can’t cross.

Continuing with the Async Processing Example

Continuing with the example of using asynchronous processing to make API response time consistent for customers of any size.

Why can’t I make the API asynchronous? 

Three big reasons:

  1. I don’t have a system to queue the requests.
  2. I don’t have a system to process the requests off of a queue.
  3. I don’t have a way to make the processing visible to the client.

The reasons create a giant requirement loop.  I can’t build a job system until I have a queue to read off of, and I can’t queue the work until I can process it.  I also need endpoints and UI to keep the customer informed of how the processing is going, if anything got rejected, and a way to remediate failures.

That’s a lot of work!

Even dropping customer observability, adding queueing and processing is a big step.

Big steps are scary, and not iterative.

Next Steps

Before going on to the next step, pick your 2-3 most important Implications, and write down 3 Blockers for each.

I recommend you don’t write down more than 9 Blockers before moving on to Step 5 - Weakening The Blockers.

The Implications Of Your Characteristics

Part Three of my series on iterative delivery 

You have a goal, you have characteristics, now it’s time to ponder the implications.

The implications are things that would have to be true in order for your characteristic to be achieved.  They are levers you can pull in order to make progress.

Let’s work through an example

In part 2 I suggested that a Characteristic of a system that can support clients of any size is that API endpoints respond to requests in the same amount of time for clients of any size.

What would need to happen to an existing SaaS in order to make that true?

  • The API couldn’t do direct processing in a synchronous manner.  It would have to queue the client request for async processing.  Adding a message to a queue should be consistent regardless of how large the client, or the queue, becomes.
  • For data requests, endpoints would need to be well indexed and have constrained results.
  • Offer async requests for large, open ended, data.  An async request works by quickly returning a token, which can then be used to poll for the results.
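The token-and-poll flow in that last bullet can be sketched with an in-memory store — a stand-in for real persistence, and every name and endpoint path here is hypothetical:

```python
import uuid

# Naive in-memory job store; a real system would persist this.
_jobs = {}


def submit(query: dict) -> str:
    """POST /reports (hypothetical): accept the request, return a token immediately."""
    token = str(uuid.uuid4())
    _jobs[token] = {"status": "pending", "result": None, "query": query}
    return token


def poll(token: str) -> dict:
    """GET /reports/<token> (hypothetical): client polls until status is "done"."""
    return _jobs.get(token, {"status": "unknown"})


def complete(token: str, result: list) -> None:
    """Called by the background worker when processing finishes."""
    if token in _jobs:
        _jobs[token].update(status="done", result=result)
```

The point is the shape of the contract: `submit` is fast no matter how large the client, and the expensive work happens elsewhere before `complete` flips the status.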

Implications are measurable

How much processing can be done asynchronously today?  How much could be done last month?

How many data endpoints allow requests that will perform slowly?  Is that number going up or down?

How robust is the async data request system?  Does it exist at all?

Implications are Levers

Progress on implications pushes your service towards its goal.  Sometimes a little progress will result in lots of movement, sometimes a lot of progress will barely be noticeable.

Speaking to Implications

It is important that you can speak to how the implications drive progress towards your goal.

Asynchronous processing lets your service remain responsive.  It doesn’t mean you can process the data in a timely manner yet.  It sets the stage for parallel processing and other methods of handling large loads.

Next Steps

Before continuing on, try to come up with 3 implications for your most important characteristics.

You’ll want a good selection of implications for the next part - blockers.

We will explore what’s preventing you from moving your system in the direction it needs to go.

Defining Iterative Characteristics

Part 2 of my series on Iterative Delivery

Congratulations, you have a goal!  Now what?

Now, it is time to write down what your goal means to you.  What measurable things will be true about your system by the time you achieve your goal?  What measurable things will be false?

Picking measurable characteristics allows you to iterate towards the goal.

Each one should be 1-2 sentences; short enough that you can still explain them quickly and long enough to remove most ambiguity.

For example, if you are attracting clients that are 10x the size you are used to supporting, a good goal would be something like: We should be able to support clients of any size!

Your characteristics would be something like:

  • Pages in the UI render the same for any size client.  The site should never be ponderous or slow.
  • API endpoints respond to requests in the same amount of time for clients of any size.
  • Backend processes always complete fast enough that the customer doesn’t notice them.

Like goal setting, you don’t need to know how to achieve your characteristics.  They still don’t need to be achievable!

You need a general idea of how you will measure the characteristics, but you don’t have to measure or set up measurement tools before proceeding:

Pages in the UI render the same for any size client.  The site should never be ponderous or slow.

Page rendering time can be measured with tools like Chrome’s Lighthouse and network logs.

You should come up with 3-10 characteristics before moving on to the next step, implications.
