In the world of SaaS software, nothing is a surer sign that a project will fail than calling it The Next Generation. No one wants the next generation of your software! Your clients don’t want The Next Generation, they want the service that they’re paying for. If they wanted something revolutionarily different, they wouldn’t be your clients. Your internal users don’t want The Next Generation, they want tools that help them do their job.
The Next Generation is something created by developers to sugarcoat a full rewrite of the existing software. The Next Generation has no business value. If you need The Next Generation of your software in order to write a new reporting module, it’s a sign you’ve given up on the current system.
When developers pitch The Next Generation they are being lazy. It is an admission that the developers do not understand the business or care about client needs. The Next Generation has no release plan other than replacement, and any theoretical client value comes at the end of the project. At the same time it consumes resources that could have been used for incremental features or code improvements.
The Next Generation starts with great fanfare, then goes silent for 6 months. Then the team starts growing! Not because of success, but the sunk cost fallacy. The company has spent so much and things are so close. Good money after bad, critical months after lost months. Around 12 months management starts to micromanage. At 18 months the project is declared a failure. The team lead leaves. Often the managers are not far behind. After millions of dollars spent on The Next Generation, you’re still where you started.
In SaaS, clients are buying This Generation. If your developers are done with This Generation, you don’t need The Next Generation, you need to find your Best Alternative to a Total Rewrite!
Topics allow multiple queues to register for incoming messages. That means instead of publishing a message onto a queue, you publish onto zero or more queues at once, and there is no impact on the publisher. One consumer, no consumer, 100 consumers, you publish one message onto a topic.
All of these situations require the same effort and resources from your publisher.
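Here is a minimal sketch of that fan-out in Python. It uses an in-memory stand-in rather than a real topic service, and the `Topic` class and message strings are made up for illustration. The point is that the publisher makes one `publish` call whether there are zero, one, or many subscribed queues.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a topic: copies each message to every subscribed queue."""

    def __init__(self):
        self.queues = []

    def subscribe(self, queue):
        self.queues.append(queue)

    def publish(self, message):
        # The publisher makes exactly one call, however many queues are subscribed.
        for queue in self.queues:
            queue.append(message)


topic = Topic()
topic.publish("report:client-1")  # zero subscribers: nothing to deliver, no error

jobs = deque()     # the existing consumer's queue
monitor = deque()  # a new consumer added later
topic.subscribe(jobs)
topic.subscribe(monitor)
topic.publish("report:client-2")  # each subscribed queue gets its own copy
```

Adding the second queue required no change to the code that calls `publish`.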
For a SaaS company with services running off queues, Topics give your developers the ability to create new services that run side-by-side with your existing infrastructure. New functionality off of your existing infrastructure, without doing a rewrite! How does that work?
Adding a new consumer means adding another Queue to the Topic.
No code changes for any existing services. This is extremely valuable when the existing services are poorly documented and difficult to test.
You can test new versions of your code through end-to-end tests.
Since you can now create two sets of the data, you can run the new version in parallel with the current version and compare the results. Rinse, repeat until you feel confident in sending your customers results from the new system.
It’s not ideal, but you’ll sleep a whole lot easier at night knowing that the original code and original output remains untouched.
New uses for your message flow have no impact on your existing services.
Consuming data becomes “loosely coupled”. Freed from potentially impacting the existing, difficult, code, new reports, monitoring and other ideas become feasible and exciting instead of dread inducing. New uses don’t even have to be in the same programming language!
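A rough sketch of the run-in-parallel-and-compare idea, in Python. Both report functions are hypothetical stand-ins: one for the untouched legacy service, one for a new version reading the same messages from its own queue. The new version has a deliberate bug so the comparison has something to find.

```python
def legacy_report(job):
    # Hypothetical stand-in for the existing service, which stays untouched.
    return {"client": job["client"], "total": sum(job["sales"])}


def new_report(job):
    # Hypothetical new version with a deliberate bug: it truncates cents.
    return {"client": job["client"], "total": int(sum(job["sales"]))}


def compare(jobs, old, new):
    """Return the jobs whose new output differs from the legacy output."""
    return [job for job in jobs if old(job) != new(job)]


jobs = [
    {"client": "a", "sales": [10.5, 20]},
    {"client": "b", "sales": []},
]
mismatches = compare(jobs, legacy_report, new_report)
```

Customers keep receiving the legacy output while the mismatch list shrinks toward empty.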
A concrete example: how Topics can be used to create monitoring on a legacy system.
I worked for a company that was processing jobs off of a queue. This was an older system that had evolved over a decade and was a mess of spaghetti code. It mostly worked, but was not designed for any kind of observability. Because jobs like hourly reports would run, rerun, and even retry, knowing whether a specific hourly report completed successfully was a major support headache.
When challenged to improve the situation the lead developer would shrug and say that nothing could be done with the current code. Instead, he had a plan to do a full rewrite of the scheduler system with logging, tests, and observability baked in. The rewrite would take 6 months. The flaws, bugs and angry customers weren’t quite enough to justify a developer spending 6 months developing a new system. Especially since the new system wouldn’t add value until it was complete. The company didn’t have the resources for a rewrite, but it did have me.
The original system was using SQS on AWS as the queue. We changed the scheduler code to use AWS’s Topic service, SNS, instead. We had SNS write incoming messages to the original SQS queue, and called it a release.
We now had the option and ability to add new services without any further disruption or risk to the original job processor.
We created a new service with the creative name Task Monitor, created a new SQS queue and added it as a listener to SNS. Task Monitor maintained a list of active tasks. It would read messages off its queue and create an entry in an in-memory list. Every 5 minutes it would iterate the list, check the status of each task against the database, and remove completed tasks.
Surviving tasks were added to an “Over 5 min” list, an “Over 10 min” list, etc., and the data was exposed via a simple web API. Anything over 45 minutes resulted in an alert being generated.
We now had visibility into which tasks were slipping through the cracks, and with the pattern exposed we were quickly able to fix the bugs. Client complaints ceased (about scheduled reports anyway), which reduced the support load by about 60% of one developer. With almost 3 additional developer days per week we were able to start knocking out some long delayed features and refactoring.
All of these changes were made possible by simply changing a call to SQS into a call to SNS. We didn’t need to dive deep into the legacy system to add monitoring and instrumentation.
The additional cost and load of using Topics is negligible, but they create amazingly powerful opportunities, especially for legacy systems that are difficult to refactor.
When your developers say that there’s no way to improve a queue based system without rewriting it, look into Topics. They’re your Best Alternative to a Total Rewrite.
Mission critical legacy systems tend to attract two flavors of Senior Developer: The Lord and The Reformer. The Lord runs his team to ensure his usefulness to the company and his primacy in the developer hierarchy, while The Reformer focuses on rebuilding the team and business value.
If you are a manager taking on a new legacy team, or a developer on a legacy system getting a new senior developer, the good news is that you will quickly be able to tell which group your senior falls into. The difference is stark and telling within 3-6 months if you’re watching for these 5 major signs:
Both The Lord and The Reformer may point to a piece of code and say “I’m the only one who can understand this piece of code”. The Lord will say it as a point of pride and evidence of his superiority. The Reformer says it sadly, knowing that if only the Senior Developer can understand the code, then the code is bad, even if it works correctly. Over time The Reformer will shrink the inscrutable parts of the codebase through refactoring and tests. The Lord will both proclaim that nothing can be done and complain, loudly and often, about being held back by the terrible state of the code.
When a junior developer joins the team and asks about documentation, both may say “this is a legacy system and there is no documentation”. The Lord will then suggest the new developer read the code to understand, and may even smirk and suggest that the new developer write some documentation during the journey to enlightenment. The Reformer will point to the parts of the system that have been documented, discuss the best way to get familiar with the code and offer to write additional documentation as questions arise. The Reformer expands documentation with every hire; The Lord will complain that adding people is taking away from his development time.
The Lord instinctively knows that critical developers can’t be fired and works to increase the company’s dependency on him. The Reformer knows that if you can’t be fired, you also can’t be promoted. The Reformer works to ensure he can step out of his current team at a moment’s notice so that he is available to step into a bigger role. This sign is a little subtle, but consider your bus factor. If the Senior got hit by a bus, how long would it take the rest of the team to handle support? The time for Reformers is measured in days, Lords in weeks.
The Lord believes that the system couldn’t possibly function without him and speaks of his overnight heroics as a point of pride. The Reformer views anything that requires his presence as a mistake that needs correcting. The Lord will rarely fix the underlying issues that result in heroics, while The Reformer makes them a priority. As a result The Lord will spend a steady amount of time dealing with production, while The Reformer quickly oversees a very calm environment.
The Lord will take all of the new feature work “to ensure it is architected correctly” and leave all of the maintenance work to the junior developers on the team. The Reformer will take all the maintenance work and teach the rest of the team how to build new features. The Lord’s team will have an ever increasing amount of maintenance, and production issues, while The Reformer’s team will see bugs and maintenance work drop dramatically.
The good news is that you’ll know very quickly which Senior you have on your team.
The bad news is that The Reformer won’t be around long. By improving the system and the team, The Reformer makes himself unnecessary, and you’ll soon lose him to a new disaster. There is a silver lining: teams run by Reformers become attractive to other developers and you will have volunteers looking to join the new hot team.
The worst news though, is that The Lord isn’t going anywhere. His antics will actively drive away developers and managing him is your problem.
Several readers wrote in response to my article Your Database is not a Queue to tell me that they don’t know what a queue is, or why they would use one. As one reader put it, “assume I have 3 tools, a hammer, a screwdriver and a database”. Fair enough, this week I want to talk about what a message queue is, what features it offers, and why it is a superior solution for batch processing.
To keep things as concrete as possible, I’ll use a real world example, Nightly Report Generation, and AWS technologies.
Your customers want to know how things are going. If your SaaS integrates with a shopping cart, how many sales did you do? If you do marketing, how many potential customers did you reach? How many leads were generated? Emails sent? Whatever your service, you should be letting your customers know that you’re killing it for them.
Most SaaS companies have some form of RESTful API for customers, and initially you can ask clients to help themselves and generate reports on demand. But as you grow, on demand reporting becomes too slow. Code that worked for a client with 200 customer shopping carts a day may be too slow at 2,000 or 20,000 carts. These are great problems to have!
To meet your client’s needs, you need a system to generate reports overnight. It’s not client facing, so it doesn’t need to be RESTful, and it’s not driven by client activity, so it won’t run itself.
For this article we will use AWS’s SQS, which stands for Simple Queue Service. There are lots of wonderful open source alternatives like RabbitMQ, but you are probably already using AWS, and SQS is cheap and fully managed.
What does a queue do?
A queue gives you a list of messages that can only be accessed in order.
A queue guarantees that one, and only one, consumer can see a message at a time.
A queue guarantees that messages do not get lost. When a consumer takes a message from the queue, it has a certain amount of time to tell the queue that the message is done, or the queue will assume something has gone wrong and make the message available again.
A queue offers a Dead Letter Queue. If something goes wrong too many times, the message is removed and placed on the Dead Letter Queue so that the rest of the list can be processed.
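To make those guarantees concrete, here is a toy in-memory model in Python. It is not how SQS is implemented and it does not use the AWS API. The class, method names and parameters are invented for illustration, and timed-out messages simply rejoin the back of the list.

```python
import time
from collections import deque


class SimpleQueue:
    """Toy queue with a visibility timeout and a dead letter queue."""

    def __init__(self, visibility_timeout=30, max_receives=3, now=time.time):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.now = now
        self.ready = deque()    # messages waiting, oldest first
        self.in_flight = {}     # message -> deadline for calling done()
        self.counts = {}        # message -> times it has been received
        self.dead_letters = []

    def send(self, message):
        self.ready.append(message)

    def receive(self):
        self._expire()
        if not self.ready:
            return None
        message = self.ready.popleft()
        self.counts[message] = self.counts.get(message, 0) + 1
        # Hidden from other consumers until the deadline passes.
        self.in_flight[message] = self.now() + self.visibility_timeout
        return message

    def done(self, message):
        self.in_flight.pop(message, None)
        self.counts.pop(message, None)

    def _expire(self):
        for message, deadline in list(self.in_flight.items()):
            if self.now() >= deadline:
                del self.in_flight[message]
                if self.counts[message] >= self.max_receives:
                    self.dead_letters.append(message)  # stop retrying
                else:
                    self.ready.append(message)         # make it available again
```

A worker that crashes never calls `done`, so its message reappears after the timeout. A message that keeps failing ends up in `dead_letters` instead of blocking everything behind it.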
As a practical matter all that boils down to:
You can run multiple instances of the report generator without worrying about missing a client, or sending the same report twice. You get to skip the early concurrency and scaling problems you’d encounter if you wrote your own code, or tried to use a database.
When your code has a bug in an edge case, you’ll still be able to generate all of the reports that don’t hit the edge case.
You can alert on failures, see which reports failed, and after you fix the bug, you can put them back into the queue to run again with the click of a button.
Because SQS is managed by AWS you get monitoring and alerts out of the box.
Because SQS is managed by AWS you don’t have to worry about the queue’s scale or performance.
You can add API endpoints to generate “on demand” messages and put them on the queue. This means that the “standard” report messages, and client generated “on demand” report requests follow the same process and you don’t need to build and maintain separate reporting pipelines for internal and external requests.
Everything is tradeoffs, and queues don’t solve all problems well. What kinds of problems are you going to have with SQS?
Priority/Re-sorting existing messages – You have 10,000 messages on the queue and an important client needs a report NOW! Prioritized messages, or changing the order of messages on the queue is not something SQS supports. You can have a “fast lane” queue that gets high priority messages, but it’s clunky.
Random Access – You have 10,000 messages and want to know where in the queue a specific client’s report is? Queues let you add at the end and take from the front, that’s it. If you need to know what’s in the middle you’ll need to maintain that information in a separate system.
Random Deleting – Probably not important for reports, but for cases like bulk importing, if a client changes her mind and wants to cancel a job, you can’t reach into the middle of the queue and remove the message.
Process order is not guaranteed – If you have a 3 part task, and you cannot start Task 2 until Task 1 finishes, you cannot add all 3 tasks onto the queue at once. It is highly likely that a second worker will come along and start Task 2 while Task 1 is still in process. Instead you will need to have Task 1 add Task 2 onto the queue when it finishes.
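The chaining workaround for ordered tasks looks roughly like this in Python. The task names and the `NEXT_STEP` table are made up, and a `deque` stands in for the real queue.

```python
from collections import deque

queue = deque()
log = []

# Each step names the step that should follow it (None ends the chain).
NEXT_STEP = {"task1": "task2", "task2": "task3", "task3": None}


def handle(message):
    log.append(message)  # stand-in for the real work
    follow_up = NEXT_STEP[message]
    if follow_up is not None:
        queue.append(follow_up)  # enqueue only after this step has finished


queue.append("task1")  # only the first step is queued up front
while queue:
    handle(queue.popleft())
```

Because Task 2 does not exist on the queue until Task 1 completes, no second worker can pick it up early.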
None of these problems will crop up in your early iterations, and they are great problems to have! They are signs that your SaaS is growing to meet your client’s needs, and that you and your clients are thriving!
To bring it full circle, what if you already have a home grown system for running reports? How can you get on the path to queues? See my article about common ways into the DB as a Queue mess, and some suggestions on how to get out.
SaaS Scaling anti-patterns: The database as a queue
Using a database as a queue is a natural and organic part of any growing system. It’s an expedient use of the tools you have on hand. It’s also a subtle mistake that will consume hundreds of thousands of dollars in developer time and countless headaches for the rest of your business. Let’s walk down the easy path into this mess, and how to carve a way out.
No matter what your business does on the backend, your client facing platform will be some kind of web front end, which means you have web servers and a database. As your platform grows, you will have work that needs to be done, but doesn’t make sense in an API/UI format. Daily sales reports and end of day reconciliation are common examples.
The initial developer probably didn’t realize he was building a queue. The initial version would have been a single table called process which tracked client id, date and completed status. Your report generator would load a list of active client ids, iterate through them, and write done to the database.
Simple, stateful and it works.
For a while.
But some of your clients are bigger, and there are a lot of them, and the process was taking longer and longer, until it wasn’t finishing overnight. So, to gain concurrency, your developers added worker processes along with “not started” and “in process” states. Thanks to database concurrency guarantees and atomic updates, it only took a few releases to get everything working smoothly with the end-to-end processing time dropping back to something manageable.
Now the database is acting as a queue and preventing duplicate work.
There’s a list of work, a bunch of workers, and with only a few more days of developer time you can even monitor progress as they chew through the queue.
Except your devs haven’t implemented retry logic because failures are rare. If the process dies and doesn’t generate a report, then someone, usually support fielding an angry customer call, will notice and ask your developers to stop what they’re doing and restart the process. No problem, adding code to move “in process” back to “not started” after some amount of time is only a sprint’s worth of work.
Except, sometimes, for some reason, some tasks always fail. So your developers add a counter for retries, and after 5 or so, they set the state to “skip” so that the bad jobs don’t keep sucking up system resources.
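For readers who haven’t seen one, this is roughly what the accidental queue ends up looking like. It is a sketch in Python using an in-memory SQLite database. The column names are my guesses at a typical layout, and a failure is reported explicitly here rather than detected by a timeout.

```python
import sqlite3

MAX_RETRIES = 5

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE process ("
    " client_id INTEGER PRIMARY KEY,"
    " state TEXT NOT NULL DEFAULT 'not started',"
    " retries INTEGER NOT NULL DEFAULT 0)"
)
db.executemany("INSERT INTO process (client_id) VALUES (?)", [(1,), (2,)])


def claim(conn):
    """Claim one job; the state check in the UPDATE stops two workers taking it."""
    row = conn.execute(
        "SELECT client_id FROM process WHERE state = 'not started' "
        "ORDER BY client_id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    cur = conn.execute(
        "UPDATE process SET state = 'in process' "
        "WHERE client_id = ? AND state = 'not started'",
        row,
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None


def fail(conn, client_id):
    """Return a failed job to the pool, or mark it 'skip' after too many tries."""
    conn.execute(
        "UPDATE process SET retries = retries + 1, "
        "state = CASE WHEN retries + 1 >= ? THEN 'skip' ELSE 'not started' END "
        "WHERE client_id = ?",
        (MAX_RETRIES, client_id),
    )
    conn.commit()
```

Every piece of this (claiming, retrying, skipping poison jobs) is something a real queue hands you for free.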
Congratulations! For about $100,000 in precious developer time, your SaaS product has a buggy, inefficient, poorly scaling implementation of database-as-a-queue. Probably best not to even try to quantify the opportunity costs.
Solutions like SQS and RabbitMQ are available, effectively free, and take an afternoon to set up.
Instead of worrying about how you got here, a better question is how do you stop throwing good developer resources away and migrate?
Every instance is different, but I find it is easiest to work backwards.
You already have worker code to generate reports. Have your developers extend the code to accept a job from a queue like SQS in addition to the DB. In the first iteration, the developers can manually add failed jobs to the queue. Likely you already have a manual retry process; migrate that to use the queue.
Once you have the code working smoothly with a queue, you can start having the job generator write to the queue instead of the database. Something magical usually happens at this point. You’ll be amazed at how many new types of jobs your developers will want to implement once new functionality no longer requires a database migration.
Soon, you’ll be able to run your system off the db or a queue, but the db tables will be empty.
Only then do you refactor the db queues out of your codebase.
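The first step of that migration, a worker that accepts jobs from either source, can be sketched like this in Python. The function names are hypothetical. A `deque` stands in for the real queue and a plain list for the legacy table.

```python
from collections import deque


def next_job(queue, db_jobs):
    """Prefer the queue; fall back to the legacy table until it runs dry."""
    if queue:
        return queue.popleft()
    if db_jobs:
        return db_jobs.pop(0)
    return None


def run_worker(queue, db_jobs, generate_report):
    # The report code itself is unchanged; only the job source is new.
    while (job := next_job(queue, db_jobs)) is not None:
        generate_report(job)


queue = deque(["retry-7"])            # a manually re-queued failure
db_jobs = ["client-1", "client-2"]    # jobs still coming from the database
finished = []
run_worker(queue, db_jobs, finished.append)
```

Once the job generator writes only to the queue, `db_jobs` is always empty and the fallback branch can be deleted.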
Adding a proper queue system gets your team out of the hole and scratches your developers’ itch for shiny new technology. You get improved functionality after the very first sprint, and aren’t rewriting your code from scratch.
That’s your best alternative to a total rewrite. Start today!
Replacement is not a release plan; it’s a sign that you are solving developer pain instead of client pain.
Deployment gets glossed over in the pitch: First we will mimic the existing functionality. Then turn off the old system.
Since the plan is to re-implement the current functionality, your developers can start immediately! No need to talk to the clients since they won’t notice any difference until we show them all the wonderful improvements!
Developers get super excited about these kinds of rewrites because it is all about them and their pain. The plan fails because the client cares about client pain, not developer pain.
Don’t assume the client wants what you are giving them! Don’t assume they would love for you to give them more features, better code, or anything that excites your developers. A more common situation is that someone has a full-time job doing manual data extractions, transformations, and other manipulations that software could do in seconds and your developers could write in a week.
Find your client’s pain. Appeal to your developer’s sense of empathy. If they hate dealing with the system, have them imagine the low level person being kept in a pointless job. It’s a good bet that once your developers find out how their software is being used they’ll find that there’s no need for a rewrite; the clients need new tools, not replacements.
Recently, I saw some heated arguments about the optimal ratio of senior to junior developers on teams. There was much gnashing of teeth about how companies are hurting themselves by having many seniors and few, if any, junior developers. Long threads about proper ratios, code complexity, code quality, and management oversight followed.
That’s all nice in theory, but the reason Senior Developers are overrepresented on development teams is that they are *relatively* cheap compared to junior developers.
A senior developer costs about $31,000/year more than a junior developer, implying that a senior developer is 41% more productive and useful than a junior developer (or that a junior is 71% as effective as a senior developer).
We are taking a very generic view of things, but if you expect a senior developer to be twice as effective and productive as a junior developer, then the value of a senior developer would be 2x a junior, or $154,000/year.
Relative to junior developers, the average senior developer is 30% cheaper.
My personal experience in Chicago looks even worse, with 1x junior devs getting paid around $100,000/year and 5x seniors getting paid $175,000. This is effectively a 65% discount over junior devs.
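The arithmetic behind those discounts, as a small Python sketch. The $77,000 junior salary is not stated directly. It is simply half of the $154,000 figure above, and the function name is my own.

```python
def senior_discount(junior_salary, senior_salary, productivity_multiple):
    """Fraction by which a senior is underpriced relative to junior output."""
    value = junior_salary * productivity_multiple  # what the senior "should" cost
    return 1 - senior_salary / value


# Average case: $154,000 is 2x a junior, so a junior earns $77,000
# and a senior about $31,000 more.
average = senior_discount(77_000, 77_000 + 31_000, 2)

# Chicago case: a 1x junior at $100,000 versus a 5x senior at $175,000.
chicago = senior_discount(100_000, 175_000, 5)

print(f"{average:.0%} {chicago:.0%}")  # prints: 30% 65%
```

The discount grows with the productivity multiple you believe in, because salaries rise much more slowly than the multiple does.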
As a result of these huge discounts companies do the logical thing and only hire junior devs when they can’t hire enough senior devs. The economically optimal number of junior developers on a team is always 0.
But wait you say, what about non-salary costs? Does that have an impact?
Yes, but it makes the relative value of seniors worse:
The average employee receives about $30,000 in non-salary benefits (mostly healthcare). Hiring 2 juniors instead of 1 senior costs an extra $30,000/year.
Managers can only manage so many people, so adding more juniors means adding more managers, scrum masters, and other process people.
Why senior developers are relatively cheap is a good question, but not one I feel qualified to answer.
I’ve used dozens of config systems in my career, heavy XML, lightweight INI, and more “Config is easy, I’m just going to write my own” disasters than I care to count. Recently I have been working in Node and deploying to Heroku, and I found what may be my favorite config system of all time.
Node-Config is a JSON based hierarchical system that plays wonderfully with Heroku’s Environment variable based config.
What does a hierarchical system mean?
Imagine you are working on a system that can be run locally, in dev, or in production, with different databases and remote services. When running locally your config might look like this:
"dbConn": "DB Running locally on my laptop",
dev also runs on port 8080, but uses a shared database. Instead of overwriting everything, you only have to specify the deltas:
"dbConn": "Shared database location",
Node-Config uses the “NODE_ENV” magic variable to know which environment you are running, but no matter which environment you are running, config.webapp.port will be 8080.
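The layering idea itself is language-neutral, so here is a minimal sketch of it in Python. This is not Node-Config’s own code. The `merge` helper is invented, and the nested `webapp.port` key is inferred from the `config.webapp.port` lookup above.

```python
def merge(base, override):
    """Recursively layer override on top of base without mutating either."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# The default layer holds everything; the dev layer holds only the deltas.
default = {"webapp": {"port": 8080}, "dbConn": "DB Running locally on my laptop"}
dev = {"dbConn": "Shared database location"}

config = merge(default, dev)
```

The dev layer never mentions the port, so the default of 8080 shows through.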
Where does Heroku fit in?
Heroku injects config variables into the system as environment variables at run time. When using a Heroku Postgres database, they not only manage the system, but they will also automatically rotate the connection details periodically. That’s great, but it means you can’t have a static config file.
Well, Node-Config ALSO supports environment variable overrides!
You need to create a config file called custom-environment-variables.json and place it with the rest of your config. Instead of specifying values, you specify the environment variable names! Continuing our example:
PORT is Heroku’s default PORT, and HEROKU_POSTGRESQL_NAVY_URL is an environment variable.
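A sketch of how such a mapping resolves, again in Python rather than Node and not taken from Node-Config’s source. The shape of `mapping` is my assumption: it mirrors the config keys, with environment variable names where the values would be. The environment value is made up.

```python
import os


def apply_env(config, mapping, environ=os.environ):
    """Overlay config with values from the environment variables named in mapping."""
    result = dict(config)
    for key, target in mapping.items():
        if isinstance(target, dict):
            result[key] = apply_env(config.get(key, {}), target, environ)
        elif target in environ:
            result[key] = environ[target]  # a variable that is set wins
    return result


mapping = {"webapp": {"port": "PORT"}, "dbConn": "HEROKU_POSTGRESQL_NAVY_URL"}
config = {"webapp": {"port": 8080}, "dbConn": "Shared database location"}

env = {"PORT": "5000"}  # made-up value; the database variable is unset here
resolved = apply_env(config, mapping, env)
```

Variables that are not set leave the file-based value alone, which is why the same config works on a laptop and on Heroku.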
Your Heroku config dashboard will look something like this:
Node-config will pull Heroku’s config vars in at run time. No need to handle multiple config injection logic. Set it once and it will all just work like magic!
I really like the combination of Node-config with Heroku. The hierarchy layering keeps things simple, even while Heroku is rotating security keys under you!
Bonus for Visual Studio Code users
I use VS Code for my development.
If you follow the defaults, the environment and launch.json setup will look a lot like this:
launch.json, which is VS Code’s run/debug config, will look like this:
The first thing I do after creating a new branch is write the changelog. Stating what new features go in the branch helps to focus my coding and keeps the branch clean.
As I work, I reread and fix the changelog to match the implementation reality. There have been countless instances where I have started coding a second feature, realized it doesn’t fit, and used that signal to know that this branch is done and should be merged.
If you write the changelog after you code, you probably won’t write a changelog. If you do, it will be incomplete, because you won’t remember everything. Either way your branches will be less focused, and more difficult to merge.
Write the changelog first! Your fellow developers, including future you, will thank you for it!
Imagine you’ve just been hired to fix a horrible legacy system.
You’ve just been handed a giant monolith that talks to customers on the internet, accesses 4 different internal databases, handles employee and customer on-boarding, sends marketing emails, contains a proprietary task scheduling system, and has concurrency issues. Which problem do you tackle first?
Separation of concerns? Maybe the massive number of exceptions in the logs? Probably none of these.
The first thing you really need to learn is why you have been hired to fix the system in the first place. Ignore the technical problems: what are the business problems with the current system? What does the business need done so badly that they’ve hired you? Your fellow programmers care about how difficult the current system is, but the business cares about how hard it is to get what they need. Don’t confuse tech problems with business problems.
Learn the business problem that justifies your salary. Find a way to provide that value. You can fix or replace the legacy system over time.