Common Cause Vs Special Cause in Software Variation

In "Yes, Software Execution Has Variation," I laid out a dozen places where successful software will have variation:

There are at least a dozen places for variation in a single, successful, RESTful POST:

1. Time to establish a connection between client and server
2. Time to send the data between client and server
3. Time for the event to go up the OSI stack from physical to application layer
4. Time for the application to process the event.  This is affected by CPU, memory, thread model, etc.
5. Time for the event to go back down the OSI stack
6. Time needed to connect to the database
7. Time database needs to perform the action
8. Network time to and from the database
9. Time for the event to go back up the OSI stack
10. Time for the application to process the database’s response and prepare a response to the client
11. Time for the event to go back down the OSI stack
12. Time to send the response to the client
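The twelve stages above can be sketched as a simple simulation in which each request's latency is the sum of twelve independently varying stage times. The ranges below are invented for illustration only (they are not measurements from any real system), and the names `STAGE_RANGES_MS` and `simulate` are hypothetical.

```python
import random

# Hypothetical (low, high) latency ranges in milliseconds for each stage.
# These numbers are illustrative assumptions, not measurements.
STAGE_RANGES_MS = [
    ("connect client to server", 5, 40),
    ("send request data", 5, 30),
    ("up the OSI stack", 0.1, 1),
    ("application processes event", 5, 50),
    ("down the OSI stack", 0.1, 1),
    ("connect to database", 1, 10),
    ("database performs action", 5, 80),
    ("network to and from database", 1, 10),
    ("up the OSI stack", 0.1, 1),
    ("application prepares response", 2, 20),
    ("down the OSI stack", 0.1, 1),
    ("send response to client", 5, 30),
]


def simulate_request(rng):
    """Total latency of one request: the sum of one draw per stage."""
    return sum(rng.uniform(low, high) for _, low, high in STAGE_RANGES_MS)


def simulate(n, seed=0):
    """Simulate n successful requests with a reproducible random stream."""
    rng = random.Random(seed)
    return [simulate_request(rng) for _ in range(n)]


samples = simulate(1000)
print(f"fastest={min(samples):.1f}ms slowest={max(samples):.1f}ms")
```

Even with every request succeeding, the fastest and slowest totals differ, which is the variation the list describes.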

    Let’s say that your SaaS has an SLA that all calls should return to the customer within 300ms.  Looking at the system metrics, you see that your endpoint meets the SLA 95% of the time.  What to make of the remaining 5%?
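As a minimal sketch of how a figure like 95% might be computed, assume you already have a list of response times in milliseconds. The function names and the sample data are hypothetical.

```python
def sla_compliance(latencies_ms, sla_ms=300.0):
    """Fraction of requests that returned within the SLA."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for t in latencies_ms if t <= sla_ms)
    return within / len(latencies_ms)


def sla_breaches(latencies_ms, sla_ms=300.0):
    """The individual response times that exceeded the SLA."""
    return [t for t in latencies_ms if t > sla_ms]


# Made-up sample: 19 requests at 120ms and one at 410ms.
observed = [120.0] * 19 + [410.0]
print(f"{sla_compliance(observed):.0%} within SLA, breaches: {sla_breaches(observed)}")
```

The percentage alone says nothing about why the breaches happened, which is the question the rest of the post takes up.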

    Common Cause vs Special Cause Variation

    Common Causes are issues due to the nature of the system and will continue until the system is changed.  Special Causes, in software, are bugs and hiccups that appear and disappear at random.
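One conventional way to separate the two in statistical process control is an individuals control chart: points inside the limits are treated as Common Cause variation, points outside as candidates for a Special Cause. The post does not prescribe this method; the sketch below is an assumption about how it could be applied to response times, using the standard individuals-chart rule of mean plus or minus 2.66 times the average moving range. The function names and data are hypothetical.

```python
from statistics import mean


def control_limits(baseline_ms):
    """Individuals-chart limits: mean +/- 2.66 * average moving range."""
    if len(baseline_ms) < 2:
        raise ValueError("need at least two baseline samples")
    moving_ranges = [abs(b - a) for a, b in zip(baseline_ms, baseline_ms[1:])]
    center = mean(baseline_ms)
    spread = 2.66 * mean(moving_ranges)
    return center - spread, center + spread


def special_cause_candidates(samples_ms, limits):
    """(index, value) pairs for samples that fall outside the limits."""
    lower, upper = limits
    return [(i, x) for i, x in enumerate(samples_ms) if x < lower or x > upper]


# Made-up baseline from a period believed to be stable.
baseline = [200, 210, 190, 205, 195, 200, 208, 192]
limits = control_limits(baseline)
print(special_cause_candidates([198, 204, 450, 201], limits))
```

The chart only makes sense for a reasonably homogeneous stream of requests, for example one endpoint and one client region at a time.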

    For our RESTful POST that needs to return in under 300ms, a source of Common Cause could be the physical distance between the client and server.  If you are running on AWS in us-east-1 (Northern Virginia), it is physically impossible to meet the SLA for customers in Asia.  The round trip to places like Seoul and Tokyo is at least 450ms!

    Requests from Asia will fall outside the SLA 100% of the time.  The only solutions are to change the SLA or stand up an instance of your system closer to your Asian customers.  You must change the system.
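The arithmetic behind that conclusion can be written as a latency budget: subtract the unavoidable network time from the SLA and see what is left for the application and database. This sketch uses only the 300ms SLA and the 450ms round trip quoted above; the function names are hypothetical.

```python
SLA_MS = 300.0  # the SLA used throughout the post


def remaining_budget_ms(network_floor_ms, sla_ms=SLA_MS):
    """Time left for all other work after the unavoidable network time."""
    return sla_ms - network_floor_ms


def sla_is_reachable(network_floor_ms, sla_ms=SLA_MS):
    """True only if some budget remains after the network floor."""
    return remaining_budget_ms(network_floor_ms, sla_ms) > 0


# The 450ms round trip quoted for Seoul and Tokyo.
print(remaining_budget_ms(450.0))  # -150.0
print(sla_is_reachable(450.0))     # False
```

A negative budget means no amount of application tuning can help; only changing the system (or the SLA) moves the number.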

    An example of a Special Cause could be an overloaded database.  Some requests will be fine, others will break the SLA.  The issue will go away once load decreases.  The answer may be to change the system by increasing the size of the databases, or to keep the system the same and change the software to make the database insert asynchronous.  The software change decouples the SLA from database performance.
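A minimal sketch of the asynchronous insert, assuming a Python service built on `asyncio`: the handler puts the record on an in-memory queue and responds immediately, while a background task performs the slow write. `slow_insert`, `handle_post`, and the in-memory `store` are hypothetical stand-ins, not a real database client.

```python
import asyncio
import contextlib


async def slow_insert(store, record):
    """Stand-in for an insert into an overloaded database (hypothetical)."""
    await asyncio.sleep(0.2)  # simulated slow write
    store.append(record)


async def writer(queue, store):
    """Background task: drain the queue and perform the inserts."""
    while True:
        record = await queue.get()
        try:
            await slow_insert(store, record)
        finally:
            queue.task_done()


async def handle_post(queue, record):
    """Request handler: enqueue the write and respond without waiting."""
    queue.put_nowait(record)
    return {"status": "accepted"}


async def main():
    queue = asyncio.Queue()
    store = []
    worker = asyncio.create_task(writer(queue, store))

    loop = asyncio.get_running_loop()
    start = loop.time()
    response = await handle_post(queue, {"id": 1})
    elapsed = loop.time() - start  # time the client would wait

    await queue.join()  # wait for the background insert to finish
    worker.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await worker
    return response, elapsed, store


response, elapsed, store = asyncio.run(main())
print(response, f"handler took {elapsed * 1000:.2f}ms", store)
```

The trade-off is that the client is acknowledged before the write is durable, so a real implementation would need a persistent queue or another way to survive a crash between the response and the insert.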

    Deming’s Path Of Frustration

    Fixing all the Special Causes won’t solve all the problems.

    Knowing that 5% of your requests break the SLA doesn’t tell you anything about Common vs Special Cause.  Developers can fix some of the problems with software, but some can only be fixed by changing the system architecture.

    Until you analyze and determine the source of your variation, you’ll be stuck on the Path Of Frustration, pouring ever more resources into ever smaller gains.
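One possible first step in that analysis is to break the SLA breaches down by a dimension such as client region. A segment that breaches nearly every time points toward a Common Cause, while scattered breaches in an otherwise healthy segment point toward a Special Cause. The record layout and field names below are assumptions made for illustration.

```python
from collections import defaultdict


def breach_rate_by(requests, key, sla_ms=300.0):
    """Fraction of requests over the SLA, grouped by the given field."""
    totals = defaultdict(int)
    breaches = defaultdict(int)
    for request in requests:
        group = request[key]
        totals[group] += 1
        if request["latency_ms"] > sla_ms:
            breaches[group] += 1
    return {group: breaches[group] / totals[group] for group in totals}


# Made-up request log.
log = [
    {"region": "us", "latency_ms": 120},
    {"region": "us", "latency_ms": 140},
    {"region": "us", "latency_ms": 350},  # sporadic breach
    {"region": "us", "latency_ms": 130},
    {"region": "asia", "latency_ms": 470},
    {"region": "asia", "latency_ms": 455},
]
print(breach_rate_by(log, "region"))
```

In this invented data the Asian requests breach every time, as in the distance example above, while the single US breach is the kind of intermittent problem worth investigating as a Special Cause.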
