Reducing Latency Won’t Increase Throughput Of Streaming Systems

A counterintuitive property of streaming systems is that latency has no long-term impact on throughput.  Increasing or decreasing latency will cause a short-term change, but once the system stabilizes in its steady state, the throughput will be the same as before.

How can latency and throughput, two important performance metrics, be unrelated?

Let’s define some terms

Latency is the amount of time between when a message is sent and when it is fully processed.  This includes the time spent getting the message onto the stream, waiting in the queue, and being processed.

Throughput is the number of completions in a time period.  It could be 1 million messages a second, 5 per hour, or anything else.  Throughput doesn’t include processing time; that’s part of latency.  The million messages/s could have taken 10ms or 10 minutes each to process; so long as 1 million of them finish every second, the throughput is 1 million/s.

Steady State is when the system is fully warmed up and taking on its full load.  For a streaming system, this means that it is consuming the full stream, it is producing its maximum output, and the work in progress is being added to as rapidly as it is finished.

Example

Imagine two systems that each receive 1 million events per second.  In the first system, each message takes 5 seconds from arrival to completion; in the second system, each message takes 2 seconds.

The latency is different, but the throughput is the same!

Implications beyond Latency and Throughput

Besides latency and throughput, there are three other notable differences between the two systems.

  1. Higher latency means more events in flight.  When it gets to steady state, the first system will be working on 5 million events at a time, the second system will only be working on 2 million.  This usually means that the first system will require more resources - bigger queues, more workers, a higher degree of parallelism, etc.
  2. Higher latency means slower startup.  It takes 5 seconds for events to start emerging from the first system, but only 2 seconds for the second system.
  3. Higher latency means slower shutdown.  At the other end of the lifecycle, systems with higher latency take longer to drain and safely shut down than systems with lower latency.
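
The in-flight numbers in point 1 follow directly from Little’s Law: work in progress equals arrival rate times latency.  A quick sketch, using the numbers from the example above:

```python
def in_flight(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's Law: average items in the system = arrival rate * latency."""
    return arrival_rate_per_s * latency_s

rate = 1_000_000  # events per second, the same for both systems

print(in_flight(rate, 5))  # first system:  5,000,000 events in flight
print(in_flight(rate, 2))  # second system: 2,000,000 events in flight
# Throughput in both cases is simply the arrival rate: 1,000,000 events/s
```

The arrival rate fixes the throughput; latency only controls how much work is in flight at once.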

Summary

Why doesn’t latency matter?  Because streaming systems have constrained inputs.  So long as the system has enough capacity to handle 100% of the inputs, latency doesn’t impact throughput.

Latency still controls the system requirements; slow is expensive!

Rewrite Case Study – The Three Musketeers Don’t Deliver

The Musketeer motto, “All for one, and one for all!”, was a reminder that people are stronger when they work together.  Musketering works for problems as well; bringing them together makes them stronger and harder to resolve.  This is a case study where three unrelated problems (Data Latency, Data Quality, and UX) were used to justify a full rewrite.

The All For One mentality resulted in a team working hard and delivering no customer value for 12 months.  The team was only able to deliver once it pivoted from Musketering and began attacking the problems separately.

The Problem: Slow, Ugly, and Inconsistent Reports

Our reporting pages had three problems:

  1. Data Latency - The data loaded extremely slowly.  Our biggest customers could wait up to 5 minutes for a report to load.
  2. Data Quality - The data was inconsistent with itself.  The campaign report might claim that 1,000 people opened an email while a list of everyone who opened the email would only have 995 people.
  3. UX - The reports were extremely outdated.  The UX hadn’t been refreshed in 8 years and the frontend code was written in Ember, a dead frontend framework.  The company’s overall frontend had been iteratively migrating to React for years; reports were one of the last remaining pieces.

Addressing each issue would be a major undertaking; the team decided to tackle all three at once.

Insight: The three issues have very little to do with each other.  The data being slow was unrelated to the data being inconsistent.  The outdated UX had nothing to do with the available data.

Project Plan: New Data Stream, New Data Store, New UI - In 6 Weeks

The plan called for:

  • All events to be published to Kafka
  • All data to be stored in Snowflake
  • A new backend service, written in Java, to handle requests
  • New Reports in React

At a high level the technical choices appeared reasonable.  None of the technology was new to the company.  Kafka was already in use, and the new events wouldn’t add a significant number of events to the Kafka stream.  There were several backend Java services.  More than 75% of the Ember pages had already been migrated to React.  Snowflake was available to some customers as an add-on feature.

The choices became unreasonable when all of the moving pieces were to be completed in 6 weeks.

All For One - This Is Blocked By That

The UI couldn’t proceed until the Java service was set up.  The Java service couldn’t be built until there was data in Snowflake.  Data couldn’t get into Snowflake until the events were published to Kafka.

There was a natural order to the work; there was also a 6 week timeline.

Could the UI begin development using dummy data?  Absolutely!  

Would the UI require tons of rework when it came time to integrate with the Java service?  Absolutely!

The tight coupling would rattle timelines up and down the stack as team members working in isolation made new discoveries.  Each group was working on the entire scope of the project - all of the UI, all of the Java service, all of the Snowflake data.  The project was set up to deliver everything, or nothing - All For One, One Release For All!

One For All!  After Nine Months, One Release

Nine months into the six week project, the reports were ready to be released to customers.

Data events were published to Kafka!  The Kafka stream was consumed by Snowflake!  The UI was in React!  There was a Java service acting as a middle layer between Snowflake and the React frontend.

Customers in the beta group were generally happy with the new reporting experience.  All was well, until the AWS bill came in and was 20x more than expected.

Tight Coupling And Tight Deadlines Left No Room For Changes

It turned out that Snowflake was the wrong tool for the job.  This is not Snowflake’s fault, any more than a hammer is to blame for hammering a screw.

The report needed unique IDs in time series data, which isn’t something Snowflake is designed to support.  To emulate sets, the developers added multiple array columns and performed in-memory operations to ensure that all of the array entries were unique.

This was extremely computationally expensive.
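
As an illustration only (the real schema wasn’t documented, so the table layout, column names, and data here are all invented), the workaround amounted to pulling every array column back and deduplicating in application memory:

```python
# Purely illustrative sketch; all names and data are invented.

def unique_ids(rows):
    """Merge every array column of every row and deduplicate in memory."""
    seen = set()
    for row in rows:
        for column in ("ids_part_1", "ids_part_2", "ids_part_3"):
            seen.update(row.get(column, []))
    return seen

# Each row carries several array columns because the store has no set type.
rows = [
    {"ids_part_1": [1, 2, 3], "ids_part_2": [3, 4]},
    {"ids_part_1": [4, 5], "ids_part_3": [5, 6]},
]
print(len(unique_ids(rows)))  # 6 distinct IDs, computed outside the database
```

Scanning and merging arrays like this on every report load is what made the approach so computationally expensive.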

Because the project plan was supposed to deliver in 6 weeks, the developers had to use an existing data store.  Because the project was late, there was no room to rethink the decision to use Snowflake.

Divide And Conquer

After 9 months of development, the new report had been released, and rolled back.  With Snowflake prohibitively expensive, the team needed a new plan.

Remember, the plan had called for:

  • All events to be published to Kafka
  • All data to be stored in Snowflake
  • A new backend service, written in Java, to handle requests
  • New Reports in React

Without Snowflake, there was no need to publish events to Kafka or to build a new Java service.

The team switched from Musketering to Divide And Conquer.

First, the team built new endpoints on the original service, using the original database.  It was slow, because the original system was slow.  It had inconsistencies, because the original system had inconsistencies.  The beautiful new UX was still beautiful.

Second, the team made the reports efficient and fast by adding rollup tables to the original database.  This made the report fast as well as beautiful.
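
A rollup table trades a little write-time work for cheap reads: aggregates are precomputed once instead of being recomputed on every report load.  A minimal sketch of the idea, with invented names and in-memory dicts standing in for database tables:

```python
from collections import Counter

# Invented names throughout; dicts stand in for database tables.
raw_events = [
    {"campaign": "spring_sale", "day": "2024-03-01", "type": "open"},
    {"campaign": "spring_sale", "day": "2024-03-01", "type": "open"},
    {"campaign": "spring_sale", "day": "2024-03-02", "type": "open"},
]

# Write time: maintain the rollup as events arrive (or on a schedule).
rollup = Counter(
    (e["campaign"], e["day"]) for e in raw_events if e["type"] == "open"
)

# Read time: the report is a single small lookup, not a scan of raw events.
print(rollup[("spring_sale", "2024-03-01")])  # 2
```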

Finally, the team worked on the data inconsistencies.  The inconsistencies were caused by race conditions and other concurrency issues.  There was a source of truth, and the team decided to mitigate the issue by periodically syncing against the source of truth.
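
The periodic sync is a reconciliation pass: compare the derived numbers against the source of truth and overwrite any drift.  A sketch, with invented names, echoing the 1,000-vs-995 open-count discrepancy from earlier:

```python
# Invented names; the numbers echo the open-count example above.

def reconcile(derived: dict, source_of_truth: dict) -> dict:
    """Overwrite any drifted derived values with the source of truth."""
    fixed = dict(derived)
    for key, true_value in source_of_truth.items():
        if fixed.get(key) != true_value:
            fixed[key] = true_value  # drift from race conditions corrected here
    return fixed

derived = {"campaign_opens": 1000}  # inflated by a double-counted event
truth = {"campaign_opens": 995}     # distinct people who actually opened
print(reconcile(derived, truth))    # {'campaign_opens': 995}
```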

From start to finish the pivot took 3 months.  The final version was fast, consistent, and beautiful.  The rollup tables and periodic sync increased database load by a small amount, well within the budget.

Musketering Delays Delivery

Bringing multiple problems together helps make the case for a rewrite more compelling.  It also creates unnecessary coupling within a project.  Changes create huge delays, which discourages revising the plan.

In this case almost everything written for the new reporting experience was thrown away.  Snowflake, because it was too expensive.  The Java service and Kafka events, because they existed only to feed Snowflake.  Only the decision to rewrite the 8-year-old Ember reports in React turned out to be correct.

The opposite of Musketering is Divide and Conquer.   By taking on one problem at a time the team was able to succeed in ⅓ of the time, with a much simpler and less expensive solution.

20 Things You Shouldn’t Build At A Midsize SaaS

I have seen developers build a lot of unnecessary and counterproductive pieces of software over the years.  Generally, developers at small to midsize SaaS companies shouldn’t build any software that doesn’t directly help them deliver a service to their customers.

Whether it was the zero interest rate period, bad management, or hubris, developers spent a lot of company money on projects that never made sense given their employer’s goals and size.  I have seen custom implementations of every type of software on this list.  None of it worked better than open source, and none offered a competitive advantage.

If you find yourself developing or managing any of these twenty types of projects, stop and seriously consider what you are doing.

  1. Scripting languages
  2. Compiler extensions
  3. Transpilers
  4. Database extensions
  5. Databases
  6. DSLs
  7. ORMs
  8. Queues
  9. Background work schedulers
  10. GraphQL
  11. Stateful REST
  12. Frontend Frameworks
  13. Backend Frameworks
  14. Servers
  15. Dependency Injectors
  16. CSV writers or parsers
  17. Cryptography Implementations
  18. Logging Libraries
  19. DateTime libraries
  20. Anything from “First principles”

There are always exceptions: if building this software provides some competitive advantage, go ahead.  In general, anyone suggesting these projects is biting off more than they can chew and doesn’t fully understand the problem they are trying to solve.

Most often things start out as a quick hack - “I’ll just concatenate these strings with a comma, it will be faster than finding a full CSV library.”  Soon you’re implementing custom separators and string escaping.
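
In Python, for example, the standard library’s csv module already handles the separators, quoting, and escaping that the quick hack eventually grows into:

```python
import csv
import io

row = ["Alice", 'said "hi" loudly', "likes, commas"]

# The quick hack: breaks the moment a field contains a comma or a quote.
hacked = ",".join(row)

# The library: separators, quoting, and escaping handled for free.
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue())  # Alice,"said ""hi"" loudly","likes, commas"
```

The library version round-trips cleanly; the hacked string splits “likes, commas” into two fields.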

If your company has done its own implementations, don’t despair; iterate towards a better library!

Learning When To Stop Developing A Project

Robert Moses used lies and trickery to ensure that his projects were completed.  He loved to start projects and let pride, politics, and sunk costs pull them to completion.  Software development is notorious for grinding on and delivering projects years late that don’t solve the original problem.

SaaS project deadlines are artificial; usually the only thing that can stop a project is completion.  Even projects with no developers will shamble on, zombie-like, eating a bit of everyone’s brain as they stumble through the code.

When you find yourself confronted with a long lived, poorly defined project, start asking questions:

  1. Was there a time element to the project, and has it passed?
  2. Have the assumptions behind the project changed?
  3. Have the company’s goals changed?

Was there a time element to the project, and has it passed?

Calendars are cyclical; it’s technically never too late or too early to get ready for your industry’s high volume period.  But if you see a project to scale for Black Friday, in January, there’s a good chance that you don’t need to finish it.

Your company got through Black Friday without the project, why do you need it now?

Have the assumptions behind the project changed?

I have been involved with many “if we switch from technology X to Y, we can save a lot of money” type projects.  Less than half produced significant cost savings.  In most of the cases we knew that the savings wouldn’t materialize early in the project.  The projects kept going anyway.

It is much easier to rationalize “the cost savings may not be there, but technology Y is better” than to work out if the new tech justifies the project on technical merits.

Have the company’s goals changed?

Long running, nebulous projects run the risk of the company’s goals changing out from under them.  I have been asked to “increase performance” and “scale the system”.  But when the customer profile changes, I am often scaling the wrong part of the system.

Learning To Question Is The First Step

Before you can stop a project, you have to question the project.  Question the timing, the assumptions, and the company’s goals.

If the timing has passed, the assumptions have changed, or the goals have moved, it might be time to stop development.

Skipping Tests To Deliver Faster

Managers with looming deadlines often tell developers to skip writing tests in order to deliver code faster.  This only makes sense if you don’t believe that unit tests pay off in initial development, or you view the impact of bugs as an externality.

I have encountered extremists who don’t believe that unit tests ever provide value; that’s not the case here.  Managers who want to skip writing tests to deliver faster are less extreme.  They are stating that unit tests pay for themselves over time and that they won’t provide a net benefit until after the first release.

This is not as crazy as it sounds.  Tests become more valuable over time as they allow future developers to refactor with confidence.  You can reason backwards and say that if the test is more valuable in the future, it must be less valuable today.  Right now, the test could even conceivably be worth less than the cost of writing it.  And when you need to ship, you cut things that don’t add value.

Which brings us to the second part of the argument: bugs are an externality.

If you deliver a bug ridden project on time, have you succeeded?  Sadly, the answer is usually yes.  Managers get evaluated on delivering on time; developers get evaluated on quality.  Managers can believe that tests add value to the code, but they don’t add value to the manager.

By definition, deadlines prioritize short term thinking.  Deadlines encourage managers to make short term tradeoffs at the expense of long term value creation.  When managers push to skip tests, the people who suffer most are the customers who have to use the software.  The people who suffer least are the managers who traded tests for time.

Demo Disasters Teach You More Than Triumphs

In the vein of learning more from failure than success, a disastrous demo will teach you more than a triumphant one.  Don’t get me wrong, triumphant demos are awesome!  Triumph as much as possible and then get back to work.  

Disasters teach you, or your stakeholders, that something is wrong.  Remember, the purpose of a demo is feedback, and the feedback may be that you are on the wrong path.

As a developer, you may be doing a great job at building the wrong thing.  You may discover that you don’t understand what you are building or why.  The sooner you learn that you are off track, the sooner, and more cheaply, you can get back on the right track.

But there are two sides to a demo.

Demo feedback is about more than just the presenter

I’ve walked into a dozen demos expecting triumph and congratulations only to hear “now that I’ve seen it, this isn’t going to work.”

Stakeholders get off track too.

You can build what your stakeholder asked for, understand everything about the problem, and still deliver something that won’t solve the problem.  

Your stakeholder will never know until they can see and touch it.  And they will see it, live, in your demo.

The sooner that stakeholders learn that they are off track, the sooner, and more cheaply, everyone can get back on the right track.

Most Demos Aren’t Disasters or Triumphs

Most demos are constructive.  You present, you get some feedback, you learn a little, and you move on to the next slice of work.

Disasters are no fun, but they do give everyone a chance to learn from the experience.  Demo early, demo often.  Demos are for feedback, and feedback leads to delivery.

The Purpose Of A Demo Is Feedback

Internal tech demos exist so that stakeholders can see and comment on the work being delivered.  Hopefully the demo produces smiles, cheers, and high fives.  Often, the stakeholders will ask questions and make objections that take everything off of the rails.  These can be painful moments; they are also extremely valuable.

Disastrous demos give you at least one of two great pieces of information:

You learn that you were building the wrong thing.  

Having bug free, highly performant code doesn’t matter if it is correctly doing something other than what stakeholders want.  The longer it takes to learn that you’re building the wrong thing, the more time and money you’re wasting.

It sucks to hear you spent a week or two building the wrong thing and need to scrap some work.  It sucks much harder when you’ve spent 6 months heading in the wrong direction.

You learn that you can’t speak to the business value

How you demo reflects your understanding of the software.  I’ve seen many demos where the developers have delivered the right thing, but they don’t understand why.  Software development is knowledge work; “I built what I was told” is the wrong answer.

You need to know why stakeholders want the software, and speak to that in your demo.  When you have the value wrong, the stakeholders will tear the demo apart looking for their business value.

Remember, demos are for feedback

Disastrous demos suck.  Disastrous demos are also successes - you learned something critical about the project that you didn’t know.

Don’t let disappointment distract from learning and improving.

ASCII Boobs in Your Codebase Won’t Fix Themselves

At the start of my career I worked in a codebase with a page of giant ASCII boob art that we couldn’t delete.  The identity of whoever did it was lost to time when the codebase migrated from CVS to Subversion.  The company had matured and no one would admit to it now.

We were in a heavily regulated industry, and code changes were monitored by both our internal compliance department and random checks from external auditors.  Deleting the boobs meant putting your name on the code change, and no one wanted to take the risk of having to explain why there had been ASCII boobs in the codebase.  There were no questions asked about things that didn’t change.

So the boobs stayed for years and years.

Almost every codebase has areas that embarrass developers and the company.  They persist because tackling them requires admitting that problems have existed for years and you’ve been ignoring them.

New blood is often able to shake up the status quo; they haven’t spent years pretending not to see the obvious.

When it comes to the problems in your codebase, you can be brave and admit that you’ve been ignoring fixable problems, or you can wait to get replaced with someone new.  The choice is yours.  For a while.

Collaborative Breakdown: Estimating Full Projects

Have you ever been asked to estimate a full project so that someone else can decide if it is worth pursuing?  Did the request set off alarm bells in your head?

It should!  Estimating full projects is a sign that the collaborative process has already broken down!  

Estimating full projects is a trap that prevents developers from bringing their most important skills to bear.  Instead of collaboration towards a common goal, estimation pushes toxic all-or-nothing demands:

  • The project has a value, but it isn’t being shared with you.  Instead the project owner is asking for an estimate, and it needs to be less than the project’s value.  If your estimate is above the line, you’ll get pressure to revise the estimate down.  Worse, your estimates will often be ignored and timelines will be dictated.  All so that the project will hit numbers you’ve never seen.
  • You are a professional software developer, yet you have to accept their diagnosis of the problem and their prescribed solution.  Is this project the best way to pursue the opportunity?  Could you do it faster and cheaper some other way?  Doesn’t matter; estimate this project.  Your expertise and creative input have been rejected.
  • The project will be a big bang deliverable.  You’re given a scope of work and asked how long it will take.  The asker wants the full project.  You’ll have to fight to iterate, make small releases, de-risk, or even prove the concept.
  • The project scope will always miss some requirements; the larger the project, the larger the miss.  The misses will blow out the timeline, and you will get blamed for missing the estimate.

The alternative is a collaborative process!

Instead of pressure for your estimate to come in below an unknown ceiling, you can scope last.

Instead of pressure for you to accept a project, you can work together to shape the project.

Instead of pressure for you to deliver a giant project in one perfect step, you can work together to deliver iteratively.  

You even save all of the time spent creating a project plan, estimating the plan, and deciding whether the plan is worthwhile!

Pushing back will be uncomfortable the first few times.  But the first time you have a conversation that starts with “I see this opportunity, let’s talk about how we can seize it”, it will all have been worth it.

Why SaaS Projects With “New” Or “Next” In The Name Are Likely To Fail

Replacing useful software is hard.  Naming things is hard.  Somehow naming projects to replace software seems easy.  

If your system is called [X], the replacement project is called New [X] or Next Generation [X].

These names aren’t “easy”; they are a sign that your project is poorly thought out and likely doomed.  This is especially true in SaaS, where your customers are paying for this generation.

New [X] is a sign that your project is inward looking and hasn’t considered your customers’ needs.  Do they need “New” or do they need specific features?

New [X] invites scope creep: new idea + New [X] = features in New [X].  Does the new idea fit with the purpose of New [X]?  Well, it’s new!

New [X] turns up the pressure on the release.  It’s new!  We couldn’t ship the features to customers incrementally, we have to do a grand reveal and make a splash!

A poorly considered project, with lots of scope creep, and the pressure of making a splash combine to doom New [X].

It is completely fine to replace useful software with a better design and new technology.  Even if you do it iteratively, the end goal is a new version of the same system.  And it turns out that there is a simple naming system you can use!

If your system is called [X], the next iteration of the system is called [X][int++].

Your customers don’t care if you replace MySaaSBackend with MySaaSBackend2.  They shouldn’t even notice.
