What Does This Button Do?

I regularly point out problems with physical buttons.  Physical buttons help visualize the conceptual problems behind buttons on websites, without the distractions of technology.

Today’s example comes courtesy of a minivan I rented in Texas.  As a car, it was fine.  As a UX platform, it was full of reused visual elements with different actions.

As an example, here is a closeup of the door panel.

This looks like pretty standard door stuff.  Two doors, two handles, two buttons.

The front set (right side) performs like most other car door handles.

Sometimes the door button will toggle lock state, sometimes it only locks the doors.  Likewise, some door handles will automatically unlock the car, and others will only allow you to open the door if the car is unlocked.

Pretty standard stuff.

The rear set of controls is for the minivan’s sliding door.  They work entirely differently.

The sliding door is automatic.  The handle and the button both engage the mechanism.  

If you pull the handle, you should then let go, because the car door will open itself.  The handle is also a toggle: if the door is open and you pull the handle, the door will close.  At no time should you use human muscle to open or close the door.  This is different from the front door, which requires human muscle to both open and close.

The rear button is also a door open and close toggle.  While the front button performs an idempotent “lock only” action, pressing the rear button will always change the state of the door.

Same Element, Multiple Actions

There are 2 buttons that look the same, are placed the same, and do different things.  The 2 handles also look the same, are placed the same, and do different things.

The sliding door handle is redundant; it does the same thing as the button.  It also introduces a major point of failure because it invites humans to use muscle against a self-powered mechanism.  To illustrate the fragility, I took a minivan taxi home after leaving Texas and the driver shouted not to touch the door handle.  He explained that people pull hard on the handle and break the door mechanism.

Design Failures And Standard Resolutions

In summary, there are 2 design failures here:

  1. Using the same visual element to do different actions.  When buttons and handles do different things, they should look different.
  2. Using a design that encourages users to do the wrong thing.  All door handles, except minivan sliding doors, are meant to open doors by pulling.  Since the sliding door doesn’t want people to pull hard, it shouldn’t use a standard door handle.

Pushing My Buttons

After a few mistakes I stopped opening the sliding door when I meant to lock the car.  I stopped using the sliding door handle entirely.  I quickly learned the UX, but I never escaped the extra friction.  My eyes would still lock onto the buttons and I would have to think about what they did before taking action.

Like the minivan, UIs are littered with inconsistent buttons, multiple ways to do the same thing, and the ability to do “the wrong thing” because your expectations don’t match what the designer was thinking.  Users will learn the system, but the extra friction will never disappear.

20 Things You Shouldn’t Build At A Midsize SaaS

I have seen developers build a lot of unnecessary and counterproductive pieces of software over the years.  Generally, developers at small to midsize SaaS companies shouldn’t build any software that doesn’t directly help them deliver a service to their customers.

Whether it was the zero interest rate period, bad management, or hubris, developers spent a lot of company money on projects that never made sense given their employer’s goals and size.  I have seen custom implementations of every type of software on this list.  None of it worked better than open source, and none offered a competitive advantage.

If you find yourself developing or managing any of these twenty types of projects, stop and seriously consider what you are doing.

  1. Scripting languages
  2. Compiler extensions
  3. Transpilers
  4. Database extensions
  5. Databases
  6. DSLs
  7. ORMs
  8. Queues
  9. Background work schedulers
  10. GraphQL
  11. Stateful REST
  12. Frontend Frameworks
  13. Backend Frameworks
  14. Servers
  15. Dependency Injectors
  16. CSV writers or parsers
  17. Cryptography Implementations
  18. Logging Libraries
  19. DateTime libraries
  20. Anything from “First principles”

There are always exceptions: if building this software has some competitive advantage, go ahead.  In general, anyone suggesting these projects is biting off more than they can chew and doesn’t fully understand the problem they are trying to solve.

Most often things start out as a quick hack - “I’ll just concatenate these strings with a comma, it will be faster than finding a full CSV library.”  Soon you’re implementing custom separators and string escaping.
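The drift is easy to reproduce.  Here is a minimal Python sketch using the standard library’s csv module; the field values are invented for illustration:

```python
import csv
import io

# The "quick hack": works until a field contains a comma or a quote.
def naive_row(fields):
    return ",".join(fields)

# The standard library already handles quoting and escaping correctly.
def proper_row(fields):
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue().rstrip("\r\n")

fields = ["Acme, Inc.", 'said "hi"', "plain"]
print(naive_row(fields))   # the embedded comma silently creates a 4th column
print(proper_row(fields))  # correctly quoted, still 3 columns
```

Round-tripping `proper_row` through `csv.reader` returns the original three fields; the naive version does not, and that is exactly the moment the custom separator and escaping work begins.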

If your company has built its own implementations, don’t despair: iterate towards a better library!

Musketeering Makes Problems Intractable

Musketeering is lumping multiple difficult problems together to present a giant, intractable disaster.

The name comes from the famous slogan: “All for one, and one for all!”  Each musketeer supports the group, and the group supports each musketeer.  When multiple problems band together as one, they become impossible to defeat.

Most developers have faced a classic Three Musketeer problem with legacy code:

  1. The code is full of bugs
  2. Unit testing is nearly impossible
  3. Touching anything can have unknown side effects

Each of these issues is fixable on its own; together they bring development to a halt.

Why can’t you fix the bugs?  Because testing is nearly impossible and everything you touch has side effects.

Why can’t you write tests?  Because the code is tightly coupled, which produces side effects.  Also, it is full of bugs so we don’t know what the correct functionality is.

Why can’t you reduce side effects?  Because the code is buggy and there are no tests.  If you can’t separate the concerns, you can’t make progress.

Multiple Queues Vs Prioritized Queues For SaaS Background Workers

Every SaaS has background workers: processes that sync data between platforms, aggregate stats, run billing, send alerts, and a million other things that don’t happen through direct user interaction.  Every SaaS also has limited resources: servers, databases, caches, lambdas, and other infrastructure all cost money.

Most SaaS go through three main phases as they mature and discover that queues are harder than they seem:

On Demand -> Homegrown database as a Queue -> 3rd party queue software

Whether driven by a database or a proper queue, the same high-level system emerges: producers put jobs on a queue, and a pool of workers pulls jobs off and processes them.

Enter The Dream Of Priority

Because resources are limited and some jobs, and some customers, are more important than others, the idea of a Priority Queue will emerge.  

There are hours of work on the queue, and an important batch of jobs needs to be done now!  If only some jobs could move to the front of the line.

A Priority Queue seems like a great solution.  The processes that create work can set a priority, add it to the queue, and sorting will take care of everything!

Unfortunately, what happens is that only high priority work gets processed.  This is known as resource starvation. Low priority jobs sit at the back of the queue and wait until there are no high priority jobs.  
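Starvation is easy to demonstrate with a toy simulation.  A Python sketch, assuming lower numbers mean higher priority and that new high priority work keeps arriving:

```python
import heapq

# Toy priority queue of (priority, job) pairs; lower number = more important.
pq = [(1, "high-1"), (9, "low-1")]
heapq.heapify(pq)

processed = []
for tick in range(5):
    # A new high priority job arrives on every tick...
    heapq.heappush(pq, (1, f"high-{tick + 2}"))
    # ...and the worker always takes the most important job available.
    _, job = heapq.heappop(pq)
    processed.append(job)

print(processed)  # only high priority jobs ran
print(pq)         # "low-1" is still waiting, and always will be
```

As long as high priority work keeps arriving faster than it is drained, "low-1" never reaches the front of the line.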

Priority Queues only give you two options: add enough workers to prevent queue starvation, or make the priority algorithm more complex.  Since resources are limited, engineers start getting creative and work on algorithms involving priority and age.

There is a much simpler solution.

Multiple Queues Have Priority Without Starvation

A Multiple Queue system is a prioritized collection of simple FIFO queues.  Each queue also has a dedicated worker pool, which prevents starvation.  The key difference is that the workers will check the other queues when theirs is empty.

When all the queues have work, they behave like independent queues.

When the priority queue is empty, the priority workers pick up low priority tasks.

Multiple Queues solve priority in SaaS friendly ways:

  • No resource starvation.  Some customers may be more important, but no customer ever gets ignored.
  • No wasted resources.  High priority workers never sit idle waiting for high priority work.

Multiple Queues Push Complexity To Workers

Instead of having a complex Priority Queue, Multiple Queue systems push some complexity to the workers.  Workers have to know which queue is their primary, and the order of the remaining queues.
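That knowledge is just an ordered list of fallbacks.  A minimal sketch using Python’s standard library queues; the function name is my own:

```python
from queue import Empty, Queue

def next_job(primary, fallbacks):
    """Drain the worker's primary queue first, then the others in order."""
    for q in [primary, *fallbacks]:
        try:
            return q.get_nowait()
        except Empty:
            continue
    return None  # every queue is empty

high, low = Queue(), Queue()
low.put("low-1")
# The high priority worker's own queue is empty, so it picks up low priority
# work instead of sitting idle.
print(next_job(high, [low]))
```

A real worker loop would block with a timeout instead of spinning on `get_nowait`, but the priority-then-fallback ordering is the whole trick.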

Multiple Queues Have More Adjustment Options

Priority Queues only have 2 adjustment options: add more workers or adjust the algorithm.  Multiple Queues allow much finer grained controls:

  • The number of different queues.  Adding a new, super duper high priority queue would not require a code change.
  • The size of the worker pool for each queue.  Queues do not need to have the same size pools, and can be adjusted dynamically.
  • The relative priority of the queues.  Priority becomes config, not an algorithm.

Conclusion - For SaaS Workers, Use Multiple Queues

Background workers are a common, critical, feature for most SaaS companies.  Resource constraints make it impossible to run all background jobs as soon as they are created.  Some jobs will have different priorities, which will require implementing either Priority Queues or Multiple Queues.  Priority Queues sound like the correct answer because they describe the problem, but create resource starvation and ever increasing complexity.  Multiple Queues are simple, safe from starvation, and much more effective for SaaS use cases.

Common Cause Vs Special Cause in Software Variation

In Yes, Software Execution Has Variation, I laid out a dozen places where successful software will have variation:

There are at least a dozen places for variation in a single, successful, RESTful POST:

1. Time to establish a connection between client and server
2. Time to send the data between client and server
3. Time for the event to go up the OSI stack from physical to application layer
4. Time for the application to process the event.  This is impacted by CPU, Memory, Thread models, etc.
5. Time for the event to go back down the OSI stack
6. Time needed to connect to the database
7. Time database needs to perform the action
8. Network time to and from the database
9. Time for the event to go back up the OSI stack
10. Time for the application to process the database’s response and prepare a response to the client
11. Time for the event to go back down the OSI stack
12. Time to send the response to the client

    Let’s say that your SaaS has an SLA that all calls should return to the customer within 300ms.  Looking at the system metrics, you see that your endpoint meets the SLA 95% of the time.  What to make of the remaining 5%?

    Common Cause vs Special Cause Variation

    Common Causes are issues due to the nature of the system and will continue until the system is changed.  In software, Special Causes are bugs and hiccups and will appear and disappear at random.

    For our RESTful POST that needs to return in under 300ms, a source of Common Cause could be the physical distance between the client and server.  If you are running on AWS in US-EAST-1, it is physically impossible to meet the SLA for customers in Asia.  Round trip to places like Seoul and Tokyo is at least 450ms!

    Requests from Asia will fall outside the SLA 100% of the time.  The only solutions are to change the SLA or stand up an instance of your system closer to your Asian customers.  You must change the system.

    An example of a Special Cause could be an overloaded database.  Some requests will be fine, others will break the SLA.  The issue will go away once load decreases.  The answer may be to change the system by increasing the size of the databases.  Or keep the system the same and change the software to make the database insert asynchronous.  The software change decouples the SLA from database performance.

    Deming’s Path Of Frustration

    Fixing all the Special Causes won’t solve all the problems.

    Knowing that 5% of your requests break the SLA doesn’t tell you anything about Common vs Special Cause.  Developers can fix some of the problems with software, but some can only be fixed by changing the system architecture.

    Until you analyze and determine the source of your variation, you’ll be stuck on the Path Of Frustration.  Pouring ever more resources into ever smaller gains.

    How Running Off A Messaging Queue Impacts Data Loading Strategies

    Messaging systems don’t change the fundamental tradeoffs involved in data loading, but they do add a lot of opinionated weight to the choices. 

    First, messaging and streaming have a lot of potential meanings and implementations.  To overgeneralize, there are queues, where each message should be read once, topics, where each client should receive each message, and broadcast, where there are no guarantees about anything.

    Queues and topics work as polling loops within your software.  Broadcast messaging comes directly to the machine at the network level; the abstractions are highly dependent on implementation.

    For the rest of this article I’m going to cover queues.  I’ll cover topics and broadcast in future articles.

    Loading Patterns with Queues

    The main feature of a message queue is that you want each message to be processed exactly once.  This requires that clients claim messages when taking them off the queue, notify the queue when messages have been processed, and honor a client timeout, after which a claimed message will reappear at the head of the queue.

    A client has to connect to a queue, ask for some number of messages, and indicate how long it will wait for the max number.  For example, with AWS’s SQS a client can request up to 10 messages and long-poll for up to 20 seconds.  The client will block until 10 messages are available, or it has been waiting that long.

    Set the message request too big and you will wait while the queue fills.  Set the timeout too long and the first few messages will get stale while the client waits.  You need to balance the time cost of polling the queue against the cost of having your workers idle.

    The main concern however, is visibility timeouts.  If your client grabs 10 messages and has a 30 second timeout it absolutely must finish all 10 in under 30 seconds.  After 30 seconds the unacked messages will be released to the next waiting worker and your message will be processed twice.
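The claim/ack/timeout cycle, and the double-processing failure mode, can be sketched with a toy in-memory queue.  Real brokers like SQS implement this server-side; the class below is only illustrative:

```python
import time

class VisibilityQueue:
    """Toy queue: claimed messages reappear if not acked before the timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.ready = []       # unclaimed messages
        self.in_flight = {}   # message -> deadline

    def put(self, msg):
        self.ready.append(msg)

    def claim(self, now=None):
        now = time.monotonic() if now is None else now
        # Release any claimed messages whose deadline has passed.
        for msg, deadline in list(self.in_flight.items()):
            if now >= deadline:
                del self.in_flight[msg]
                self.ready.insert(0, msg)  # back to the head of the queue
        if not self.ready:
            return None
        msg = self.ready.pop(0)
        self.in_flight[msg] = now + self.timeout
        return msg

    def ack(self, msg):
        self.in_flight.pop(msg, None)

q = VisibilityQueue(timeout=30)
q.put("event-1")
q.claim(now=0)          # worker A claims the message but never acks it
print(q.claim(now=31))  # timeout expired: "event-1" again, processed twice
```

The fix is never clever code in the worker; it is keeping the batch size small enough that processing reliably finishes inside the visibility timeout.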

    Optimizing worker counts, polling settings, and timeouts requires maximizing execution consistency.  When working at scale, how long an action takes matters much less than how consistent the timing is.

    The Four Patterns of Data Loading are all about tradeoffs.

    If your system is drinking from a firehose, you need to push everything towards a Pre-Cache model.  Pre-Cache pushes data loading out of the critical path, which gives the most consistent timing for message processing.

    Having fully Pre-Cached queue processors is rarely possible: there is too much data and it changes too often.  Read Through Caching is the practical alternative.

    Link Tracking Example

    Link Tracking, recording when someone clicks on a link, is a common activity for Marketing SaaS.  I’m going to use a simplified version and walk through each of the data patterns.

    At a high level, we have a queue of events with 4 pieces of data that describe the click event: Customer Name, URL, Email, and Timestamp.  

    We want to process each event exactly once, as quickly as possible, and we don’t care about the order messages are processed.  However, we can’t insert directly, we need to normalize Customer Name, URL, and Email into customerId, urlId, and emailId.

    Lazy Load

    The nice thing about the Lazy Load pattern is that it starts quickly and is simple to understand.

    Unfortunately, every event requires 3 trips to the database to normalize the data.  This adds load to the database and has highly variable processing time.

    It’s a fine place to start, but a terrible place to stay.

    Pre-Fetch

    The Pre-Fetch pattern separates fetching the data into a separate step from execution.  In this example, the worker would be attempting to normalize multiple events at the same time.  Once an event has been fully normalized, it is inserted into the database.

    Pre-Fetch adds a lot of complexity because the workers now have internal concurrency in addition to working in parallel with each other.  This might be worth it if the data were being composed from different resources in a micro-service architecture and the three calls could be done in parallel.  In this example though, you would be pounding the database.

    Pre-Cache

    The great thing about a Pre-Cache model is that there are no reads against the database during processing.  It is read-only during init and write-only while processing the queue.  That will really help the database scale!

    Pre-Cache is impractical in this use case though.  We might know the full set of customers at startup, but urls and emails are open ended.  Pure Pre-Cache setups require that all of the data be known before starting.  That can work for things like file imports and trading systems,  but not link tracking.

    Read Through Cache

    The Read Through Cache Pattern is likely to be the best option for our use case.  It is more flexible than the Pre-Cache Pattern and much friendlier to the database than Lazy Loading or Pre-Fetching.  It pushes complexity to the cache where there are lots of great solutions.  Most languages have internal caching mechanics, and external caches like memcached and redis are widely supported.

    Conclusion

    Reading from message queues is a common problem and usually requires composing data from a database.  What’s the best way to load the data?  As always, the right data loading pattern depends on your conditions and assumptions.

    Lazy Loading is always a decent place to start because it is simple.  Performance, cost, and scaling constraints will push your design towards Read Through Caching and Pre-Caching.  Pre-Fetch models are likely to be more effort than they are worth because they add a second source of concurrency and complexity.

    Remember, requirements change over time, and so will the best solution to your problem!

    Tradeoffs with the Four Patterns Of Data Loading

    The Four Patterns Of Data Loading are about two main tradeoffs: simplicity for performance, and freshness for execution consistency.

    This may seem odd because the quadrants are defined by loading and caching strategies, not simplicity, performance or execution consistency.

    Simple or Performant

    The decision to use caching is about trading simplicity for performance.  You can simply load the data every time you need it.  If you’re using MySQL on AWS, a basic query will take about 2ms to return.  The pattern is very simple and self contained: load data when needed.

    Caching, saving data for reuse, improves performance by reducing the time it takes to use the data again.  In exchange, you have to think about your code and determine:

    • Will I use the data again?
    • Is the data likely to change in the DB while I have it cached?
    • If the data does change, do I want to use the latest version or the version that the process has been using so far?
    • How much server memory will I need for the cache?

    Example - Adding a Tag to a Contact

    Imagine a simple operation, adding a tag to a contact.  The tag is a string and the contact is represented by an email address.  You need to transform the tag and email into ids and store them in a normalized database table.  For simplicity's sake, let’s say all DB operations take 2ms.

    There are 3 DB operations:

    1. Load the contactId based on email
    2. Load the tagId based on tag
    3. Insert into contact_tags

    With the On Demand access pattern, we do each action every time.  This requires 3 trips to the DB for 6ms.

    Similarly, with the Pre-Load pattern, we spend 2ms pre-loading the tagId, and each operation takes 4ms.

    Using a Read Through Cache, we store the tagId after the first load.  The first operation takes 6ms and each additional operation takes 4ms.

    Finally, with the Pre-Cache pattern, we spend 2ms pre-loading the data and each operation takes 4ms.
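The arithmetic behind those walkthroughs can be sketched directly; `DB_MS` is the assumed 2ms round trip, and one operation means adding one tag to one contact:

```python
DB_MS = 2  # assumed cost of one database round trip, in ms

def on_demand(tags, contacts):
    ops = tags * contacts
    return 0, ops * 3 * DB_MS        # (init, exec): 3 trips per operation

def read_through(tags, contacts):
    ops = tags * contacts
    misses = tags * 3 * DB_MS        # one cache miss per distinct tag
    hits = (ops - tags) * 2 * DB_MS  # cached tagId: 2 trips per operation
    return 0, misses + hits

print(on_demand(10, 10))    # (0, 600)
print(read_through(1, 10))  # (0, 42)
```

These match the Exec columns in the table below.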

                        1 Tag, 1 Contact    1 Tag, 10 Contacts    10 Tags, 10 Contacts
                        Init    Exec        Init    Exec          Init    Exec
    On Demand           0ms     6ms         0ms     60ms          0ms     600ms
    Pre-Load            2ms     4ms         20ms    40ms          200ms   400ms
    Read Through Cache  0ms     6ms         0ms     42ms          0ms     420ms
    Pre-Cache           2ms     4ms         2ms     40ms          20ms    400ms

    Freshness or Execution Consistency

    The next tradeoff to consider is the value of fresh data vs execution time consistency.  This goes beyond questions of caching; it also affects whether you can use the Pre-Load strategy at all.  A big advantage of the Pre-Load and Pre-Cache strategies is that the execution time is lower and less variable.

    Stock trading software is designed to pre-load as much data as possible and can spend minutes initializing so that the actual buying and selling happens in microseconds.  Similarly, internet ad networks like Google’s demand responses in 100ms or less.  Having consistent execution times in each piece of your software makes it much easier to monitor performance for signs of trouble.

    Security software and reporting sit on the other end of the spectrum.  It doesn’t matter if a user had permission 5 minutes ago and everyone hates waiting for report data to update.  In these cases the variance for each response is much less important than getting the most recent data.

    Some data never changes once it has been created.  In the example above of adding a tag to a contact, both tagId and contactId will never change during your program’s execution.  Generally, anything with ‘id’ in the name is safe to cache.  On the other hand, counts, permissions, and timestamps change all the time.

    Strategies can be good for some situations and terrible for others.  Sometimes it depends on expectations vs money.

                        Ids and static data   Permissions                  Counts and Reporting
    On Demand           Bad                   Good                         Good, until it doesn’t scale
    Pre-Load            Good                  It depends on time elapsed   It depends on time and money
    Read Through Cache  Good                  It depends on time elapsed   It depends on time and money
    Pre-Cache           Best                  Bad                          It depends on time and money

    Conclusion

    The “right” data loading pattern is a moving target.  Remember that in the beginning load is low and there are continuous changes.  Simplicity is always a great choice when there isn’t enough scale to justify complexity.

    As software matures, two tradeoffs emerge: Simplicity vs Complexity and Freshness vs Consistency.

    You’re changing the software for a reason.  When you consider the tradeoffs it should become clear which patterns will help solve your problem.

    Fixing All The Bugs Won’t Solve All The Problems – Deming’s Path Of Frustration

    A program of improvement sets off with enthusiasm, exhortations, revival meetings, posts, pledges.  Quality becomes a religion.  Quality as measured by results of inspection at final audit shows at first dramatic improvement, better and better by the month.  Everyone expects the path of improvement to continue along the dotted line.

    Instead, success grinds to a halt.  At best, the curve levels off.  It may even turn upward.  Despondence sets in.  The management naturally become worried.  They plead, beg, beseech, implore, pray, entreat, supplicate heads of the organizations involved in production and assembly, with taunts, harassment, threats, based on the awful truth that if there be not substantial improvement, and soon, we shall be out of business.

    W. Edwards Deming, Out of the Crisis, page 276

    In software, as in manufacturing, some problems occur due to bugs or “special causes”, and some are “common cause” due to the nature of the system’s design and implementation.  Fixing bugs is removing special causes.  Removing bugs greatly improves software quality, but it won’t impact “common cause” issues.

    Some “common cause” software performance issues I have encountered:

    • The software is “in the cloud”, but really it is in one data center in the US.  As a result the software is slow and laggy for customers in Europe and Asia.
    • The software runs slowly because the hardware is underprovisioned.
    • The software runs slowly because large amounts of unnecessary data are being sent to the users.
    • The software runs slowly because of inefficient data access patterns.

    Even with no bugs, “common cause” issues can result in low quality software.  

    The way off of Deming’s Path Of Frustration is to attack system design and implementation issues with the same fervor used to fight bugs.

    Reduce Long Term Customer Churn From Data Growth

    Customer Relations are ever growing; a CRM must provide the Management as well.  Punishing customers for loyalty is the inevitable consequence of not acknowledging and preparing for a long term relationship with your customers.  Destroying your customer's long term experience increases churn and kills your net-dollar retention.

    Saving long term customers requires work and planning from the frontend to the persistence layer.

    Here is one strategy for each layer to help get you started:

    1. Frontend: Dashboards over pagination.
    2. Backend: Historical limits by default.  
    3. Persistence: Data partitioning.  

    Dashboards Over Pagination

    New customers can see all of their contacts on one page.  Same for their marketing campaigns, tags, and deals.  That changes very quickly for customers who are using your system.  When customers can’t see all of their data on a single page, the solution isn’t pagination, you need dashboards.

    When your customers go to their contacts, you could show them “Page 1 of 10”, or you could show them a dashboard with contact activity.  New contact counts, engagement rates, contacts who need a follow up.

    Pagination is about databases; it is a terrible interface for getting things done.  Make your CRM about management and workflows with dashboards.

    Historic Limits By Default

    Add sensible defaults to everything to limit the amount of data searched.  Default to showing the last 6 months or 1 year of a customer’s interactions.  Keep the older data accessible, but don’t waste the user’s time loading history.

    Put another way, how many extra seconds should you make customers wait to find out if a contact opened an email 3 years ago?  The correct answer is 0 seconds, and also 0 milliseconds; update your APIs accordingly.
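One way to enforce the limit is a default date window at the API layer.  A hypothetical sketch; the table and column names are invented:

```python
from datetime import datetime, timedelta, timezone

DEFAULT_WINDOW = timedelta(days=180)  # assumed "last 6 months" default

def interactions_query(contact_id, since=None, now=None):
    """Build a query that only scans recent history unless the caller opts out."""
    now = now or datetime.now(timezone.utc)
    since = since or now - DEFAULT_WINDOW
    sql = "SELECT * FROM interactions WHERE contact_id = %s AND created_at >= %s"
    return sql, (contact_id, since)

sql, params = interactions_query(contact_id=42)
```

Callers who genuinely need the 3-year-old email open can pass `since` explicitly; everyone else gets the fast path by default.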

    Data partitioning

    Event logs keep track of every email open, click, and website visit.  Because events rarely get deleted, these tables grow and grow with your customers.  Over time, the amount of data in the tables will cause queries to become slow.  The more successful your customers get, the worse your database problems become!

    Partitioning directs the database to create different logical tables based on a key while presenting a view of the combined table.  Events are date based, making them great candidates for partitioning by month or year.  The previous tip, Historic Limits By Default, places an upper bound on the data in a table scan.

    Conclusion

    Customer data accumulates over time, especially in CRMs.  Keeping customers for years requires preventing your system from punishing them for accumulating data.  Failing to address the problems will destroy the customer’s experience, increase churn, and reduce your net-dollar retention.

    These three strategies will help you start thinking about data accumulation and how customer needs change over time.  Don’t throw away your loyal customers because your systems only consider the new user experience!

    The Rewrite Release Fear

    Release Fear is a common killer of rewrite projects.  

    As a rewrite comes closer to release, higher ups will start taking an interest in the project.  These leaders won’t have context on the myriad discussions and compromises that went into the rewrite.  Instead, they focus exclusively on lost features.  Value creation gets ignored and the release gets blocked in the name of existing functionality.

    Is the missing functionality important?  Do the features outweigh the value that the rewrite brings to customers?  The team doing the work decided that these were good tradeoffs.  Loss Aversion ensures that leaders only see what was left out.

    Adding the missing functionality comes at the cost of time and the value of the rewrite.  As costs go up, each revision makes the fear worse.  I have seen multiple projects, years of programmer time, fail because they could not be released to production until they could prove that the release would be perfect.

    The alternative is to TheeSeeShip.  When you have 2 versions of the code you can have a discussion about which one should be in production.  When there is only one version, there’s nothing to discuss.

    TheeSeeShip to keep the stakes low and Release Fear won’t be a problem.
