In “Yes, Software Execution Has Variation,” I laid out a dozen places where successful software will have variation:

There are at least a dozen places for variation in a single, successful, RESTful POST:

1. Time to establish a connection between client and server
2. Time to send the data between client and server
3. Time for the event to go up the OSI stack from physical to application layer
4. Time for the application to process the event. This is affected by CPU, memory, thread model, etc.
5. Time for the event to go back down the OSI stack
6. Time needed to connect to the database
7. Time database needs to perform the action
8. Network time to and from the database
9. Time for the event to go back up the OSI stack
10. Time for the application to process the database’s response and prepare a response to the client
11. Time for the event to go back down the OSI stack
12. Time to send the response to the client

Let’s say that your SaaS has an SLA that all calls should return to the customer within 300ms. Looking at the system metrics, you see that your endpoint meets the SLA 95% of the time. What to make of the remaining 5%?
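To make that 95% concrete, here is a minimal sketch that treats each of the twelve steps as a noisy stage, sums them into a request latency, and measures how many requests land within the SLA. The per-stage means and spreads are made-up placeholders, not measurements, and the names `STAGES`, `simulate_request`, and `sla_compliance` are hypothetical.

```python
import random

SLA_MS = 300  # the SLA from the text

# Hypothetical (name, mean ms, spread ms) for the twelve steps in the list.
# These numbers are illustrative assumptions, not measurements.
STAGES = [
    ("connect client to server", 40, 15),
    ("send data to server", 20, 8),
    ("up the OSI stack", 2, 1),
    ("application processes event", 30, 12),
    ("down the OSI stack", 2, 1),
    ("connect to database", 15, 8),
    ("database performs action", 60, 30),
    ("network to and from database", 10, 5),
    ("up the OSI stack", 2, 1),
    ("application prepares response", 20, 8),
    ("down the OSI stack", 2, 1),
    ("send response to client", 20, 8),
]


def simulate_request(rng):
    """Total latency of one request: the sum of twelve noisy stage times."""
    return sum(max(0.0, rng.gauss(mean, spread)) for _, mean, spread in STAGES)


def sla_compliance(latencies_ms, sla_ms=SLA_MS):
    """Fraction of requests that returned within the SLA."""
    within = sum(1 for t in latencies_ms if t <= sla_ms)
    return within / len(latencies_ms)


rng = random.Random(42)  # fixed seed so repeated runs give the same numbers
latencies = [simulate_request(rng) for _ in range(10_000)]
print(f"fastest {min(latencies):.0f}ms, slowest {max(latencies):.0f}ms")
print(f"within SLA: {sla_compliance(latencies):.1%}")
```

Even with every stage behaving normally, the total varies from request to request. A compliance percentage alone says nothing about why the slow requests were slow.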

Common Cause vs Special Cause Variation

Common Causes are issues due to the nature of the system and will continue until the system is changed. In software, Special Causes are bugs and hiccups and will appear and disappear at random.

For our RESTful POST that needs to return in under 300ms, a source of Common Cause could be the physical distance between the client and server. If you are running on AWS in US-EAST-1, it is physically impossible to meet the SLA for customers in Asia. Round trip to places like Seoul and Tokyo is at least 450ms!

Requests from Asia will fall outside the SLA 100% of the time. The only solutions are to change the SLA or stand up an instance of your system closer to your Asian customers. You must change the system.
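The check behind that conclusion is simple arithmetic: if the best-case round trip for a region already exceeds the SLA, no software fix can help. A small sketch, assuming a hypothetical table of best-case round-trip times. The 450ms values are the figure quoted above; the other entries are placeholders.

```python
SLA_MS = 300  # the SLA from the text

# Hypothetical best-case round-trip times (ms) from each customer region to a
# single deployment. 450 is the figure quoted in the text; the rest are
# placeholders, not measurements.
BEST_CASE_ROUND_TRIP_MS = {
    "us-east": 20,
    "europe": 90,
    "seoul": 450,
    "tokyo": 450,
}


def unreachable_regions(best_case_ms, sla_ms=SLA_MS):
    """Regions whose best-case round trip already exceeds the SLA."""
    return sorted(region for region, ms in best_case_ms.items() if ms > sla_ms)


print(unreachable_regions(BEST_CASE_ROUND_TRIP_MS))  # ['seoul', 'tokyo']
```

Any region on that list is a Common Cause: it will miss the SLA until the deployment or the SLA itself changes.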

An example of a Special Cause could be an overloaded database. Some requests will be fine, others will break the SLA. The issue will go away once load decreases. The answer may be to change the system by increasing the size of the databases, or to keep the system the same and change the software to make the database insert asynchronous. The software change decouples the SLA from database performance.
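One way the asynchronous insert might look is sketched below with Python's `asyncio`. The handler puts the record on an in-memory queue and responds immediately, while a background worker performs the slow insert. `slow_insert`, `db_writer`, and `handle_post` are hypothetical names, and the 200ms sleep stands in for an overloaded database.

```python
import asyncio


async def slow_insert(record):
    """Stand-in for an insert against an overloaded database."""
    await asyncio.sleep(0.2)  # hypothetical 200ms insert
    return record


async def db_writer(queue, written):
    """Background worker: drains the queue and performs the inserts."""
    while True:
        record = await queue.get()
        written.append(await slow_insert(record))
        queue.task_done()


async def handle_post(queue, record):
    """Request handler: enqueue the write and respond without waiting for it."""
    queue.put_nowait(record)
    return {"status": "accepted"}


async def main():
    queue = asyncio.Queue()
    written = []
    worker = asyncio.create_task(db_writer(queue, written))

    loop = asyncio.get_running_loop()
    start = loop.time()
    response = await handle_post(queue, {"id": 1})
    handler_seconds = loop.time() - start

    # Wait for the background insert only so this demo can exit cleanly.
    await queue.join()
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return response, handler_seconds, written


response, handler_seconds, written = asyncio.run(main())
print(response, f"handler took {handler_seconds * 1000:.1f}ms", written)
```

The handler's response time no longer includes the insert. The cost is that the client is acknowledged before the data is stored, so a real system would need a durable queue and a plan for failed inserts.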

Deming’s Path Of Frustration

Fixing all the Special Causes won’t solve all the problems.

Knowing that 5% of your requests break the SLA doesn’t tell you anything about Common vs Special Cause. Developers can fix some of the problems with software, but some can only be fixed by changing the system architecture.

Until you analyze and determine the source of your variation, you’ll be stuck on the Path Of Frustration, pouring ever more resources into ever smaller gains.
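A rough first pass at that analysis is to split the violations by some attribute, such as customer region, and look at the violation rate per segment. The sketch below assumes hypothetical `(region, latency)` records and a 99% cutoff chosen only for illustration. It is a triage aid, not a full statistical treatment.

```python
from collections import defaultdict

SLA_MS = 300  # the SLA from the text


def violation_rate_by_segment(requests, sla_ms=SLA_MS):
    """Map each segment to the fraction of its requests that broke the SLA."""
    totals = defaultdict(int)
    violations = defaultdict(int)
    for segment, latency_ms in requests:
        totals[segment] += 1
        if latency_ms > sla_ms:
            violations[segment] += 1
    return {segment: violations[segment] / totals[segment] for segment in totals}


def triage(rates, always_threshold=0.99):
    """Label each segment by how consistently it breaks the SLA.

    A segment that (almost) always fails suggests a Common Cause; one that
    fails only some of the time suggests a Special Cause. The threshold is an
    assumption for this sketch.
    """
    labels = {}
    for segment, rate in rates.items():
        if rate >= always_threshold:
            labels[segment] = "common cause candidate"
        elif rate > 0:
            labels[segment] = "special cause candidate"
        else:
            labels[segment] = "within SLA"
    return labels


# Hypothetical (region, latency in ms) records, not real measurements.
requests = [
    ("us-east", 120), ("us-east", 140), ("us-east", 520), ("us-east", 130),
    ("tokyo", 470), ("tokyo", 455), ("tokyo", 610),
    ("europe", 210), ("europe", 190),
]
rates = violation_rate_by_segment(requests)
print(triage(rates))
```

Here `tokyo` breaks the SLA on every request and is flagged as a Common Cause candidate, while `us-east` fails one request in four and is flagged as a Special Cause candidate. The first points to the architecture, the second to the software.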

Sherman On Software.

Common Cause Vs Special Cause in Software Variation

jeffpsherman
