There’s an old story about Steve Jobs holding a meeting for a product that was filled with bugs and wasn’t working for the users. He asked “what is this product supposed to do?” and when someone responded by telling him about the features, he said “then why isn’t it doing that?”
I’ve always appreciated this story because it highlights a simple and fundamental question — “is it working?”
Do we care if our software is working?
It almost seems like a trick question, because “yes, it is working” is so obviously a goal. But I’m going to suggest that we don’t always act like it.
The image below is a sample that I made to avoid stealing someone else’s, but its content represents a very standard software development lifecycle (SDLC).
There are a few issues I have with these classic SDLCs.
- The first is that it looks like a line, so it’s easy to interpret as a straight-through process where you start at the beginning and finish at the end. Some people try to work around this by drawing the SDLC as a circle to emphasize the “cycle” part of “software development lifecycle”, which is an improvement, but it can still mislead someone who hasn’t deeply internalized iterative development.
- “Maintenance” is the same size arrow as the rest. This is helpful to make things look nice for a diagram, but is misleading. A quick search shows a variety of claims about the exact percentages, but when considering the overall project cost and timelines, Maintenance is routinely estimated at over 50% of the total cost. Also, with long-lived SaaS products, the longer you expect to use software, the bigger the Maintenance category gets. If you really believe in iterative development, you might almost remove Maintenance entirely, and just point the arrow back at Planning.
- “Development” and “Testing” are shown as separate phases. Proper testing is fundamentally inseparable from Development. I believe that at the micro level, things like unit testing are deeply linked to the lines of code being written…unit testable code is a sign of well-written code. At the macro level, you almost certainly will need to develop some components to support your test plans, and structure other components to make it possible to test.
Separating “Development” from “Testing” is especially interesting. In theory you can try to isolate the phases, but you would really have to plan ahead, and that planning doesn’t always happen, even though Testing is a phase clearly shown in the SDLC.
What happens when there’s a phase that’s not even shown at all?
I asked “do we care if our software is working?”, and this is why. These standard SDLCs don’t even list “Operating” even though it’s possibly the most important phase of all. Much like the airline industry’s saying that “airplanes only make money when they’re in the air”…software only makes money when it is operating.
You might suggest that it is under “Deployment” or “Maintenance” but there is an impedance mismatch there. Deployment is usually interpreted as “get the software into prod” and Maintenance is interpreted as “fix bugs and upgrade insecure libraries”. But if “Operating” is where we make our money, why is it missing?
A long time ago, someone responsible for a product at a major consumer internet brand told me “we don’t test, we just deploy changes and wait to see if users have problems”. I still cringe when I remember that comment. Under very special circumstances it might be possible to succeed with this philosophy, but it would be irresponsible to build critical software this way.
How do we know if our software is working?
Since we (hopefully) agree that we care if our software is working, the next question is “how do we know?” Common answers are:
- we write unit tests
- we have automated integration tests
- we have a QA team
These are all fine practices, but I’m going to sidestep the costs/benefits of each. Instead I’m going to point out that they don’t technically answer the question “is the software working?” Instead, they answer a question that’s related but different: “is the software likely to work in production?”
The way to figure out if the software is actually working is by directly watching it. This usually means using words like “monitoring” or “observability” or “metrics” or “telemetry”, but ultimately it means that you should look at data produced by the software which could only be produced by successful operation.
Two side notes:
- This is why simple “bolt on” monitoring doesn’t work well. A simple “does the service return HTTP 200 or 500” check is not very helpful, because a 200 just means “the webpage loaded”, not “the user achieved the thing they wanted”.
- I’ve always been a fan of Missouri being the “show me” state. Ideally, if you want to prove a claim, you should show a person direct proof.
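To make the first side note concrete, here is a minimal sketch of the difference between a status check and an outcome check. Everything in it is hypothetical: the `Response` type, the checkout scenario, and the `order_id` and `state` fields are assumptions made up for illustration, not part of any real API.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Simplified stand-in for an HTTP response (hypothetical type)."""
    status: int
    body: dict


def status_check(resp: Response) -> bool:
    """Bolt-on check: passes whenever the server answered 200."""
    return resp.status == 200


def outcome_check(resp: Response) -> bool:
    """Outcome check: passes only if the response carries evidence that
    the order was actually placed (field names are assumptions)."""
    return (
        resp.status == 200
        and resp.body.get("order_id") is not None
        and resp.body.get("state") == "confirmed"
    )


# A page that loads fine but reports a failed checkout in its body.
broken = Response(status=200, body={"error": "payment service unavailable"})
# A page that loads and shows a confirmed order.
healthy = Response(status=200, body={"order_id": "A-1001", "state": "confirmed"})

print(status_check(broken), outcome_check(broken))
print(status_check(healthy), outcome_check(healthy))
```

The status check passes for both responses, while the outcome check passes only for the one that contains data a successful checkout would produce.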
What should we look at?
There are some classic ways of doing this which I like because they are usually straightforward to understand and to monitor.
LETS USE RED
In addition to being an excuse to add some color to this blog post, LETS USE RED is actually three helpful acronyms, each of which provides a perspective on basic monitoring.
- LETS: Latency, Errors, Traffic, Saturation
- USE: Utilization, Saturation, Errors
- RED: Rate, Errors, Duration
I do like these metrics as a starting point. These pieces of data can be good proxies for the user experience, and you should supplement these with metrics based directly on user experience. Also, these metrics are easy to capture from many services without much effort. Of course, just because something is easy to measure doesn’t mean that it’s the right or best thing to measure.
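As a rough illustration of how little is needed to get started, here is a sketch that computes RED-style numbers from a list of request records. The `Request` record, the `red_summary` function, and the choice of a nearest-rank 95th percentile for Duration are all my assumptions for the example. In practice a metrics library or your platform would usually do this for you.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    ok: bool


def red_summary(requests: List[Request], window_seconds: float) -> dict:
    """Summarize Rate, Errors, and Duration over one time window."""
    count = len(requests)
    rate = count / window_seconds
    if count == 0:
        # No traffic: there is nothing to compute errors or duration from.
        return {"rate_per_s": rate, "error_ratio": None, "p95_ms": None}

    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: smallest value with at least 95% at or below it.
    rank = math.ceil(0.95 * count)
    p95: Optional[float] = durations[rank - 1]
    return {"rate_per_s": rate, "error_ratio": errors / count, "p95_ms": p95}


sample = [
    Request(duration_ms=120.0, ok=True),
    Request(duration_ms=80.0, ok=True),
    Request(duration_ms=450.0, ok=False),
    Request(duration_ms=95.0, ok=True),
]
print(red_summary(sample, window_seconds=2.0))
print(red_summary([], window_seconds=2.0))
```

Note what the empty window returns: a rate of zero and no error or duration data at all. With no requests, these metrics cannot say whether the software works.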
What happens if you have a consumer product that only gets traffic during certain times of the day when people are awake? What happens if you run software that’s only needed during specific business hours? What happens when you want to do a deployment during low traffic hours, but you still want to know if the software is working?
The best way of proving that something is capable of doing something is to watch it do the thing. This philosophy should be applied to operations. Direct proof that a system is working helps in many situations: upgrades, new feature releases, debugging during nearly any issue, and so much more.
Using the SLA/SLO/SLI framework (service level agreements, objectives, and indicators), you should think about your indicators (SLIs) using this “show me” philosophy. If you want to be sure a use case works, write some code to exercise that use case and emit a “success” metric. You should already have metrics generated by the activity, but you should also write custom expectations against the results to prove the success. You can apply this technique to any piece of software, whether you wrote it in-house or it’s a third-party black box.
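Here is a minimal sketch of what such a probe could look like. All the names are hypothetical: `run_probe`, the `FakeStore` standing in for the system under test, and the `emit` callback standing in for whatever metrics client you use. A real probe would call your actual service and run on a schedule.

```python
import time
from typing import Any, Callable, List, Tuple


def run_probe(
    name: str,
    action: Callable[[], Any],
    expectation: Callable[[Any], bool],
    emit: Callable[[str, float], None],
) -> bool:
    """Exercise one use case, check the result, and emit success metrics."""
    start = time.monotonic()
    try:
        success = bool(expectation(action()))
    except Exception:
        # Any exception counts as a failed probe, not a crashed monitor.
        success = False
    elapsed_ms = (time.monotonic() - start) * 1000.0
    emit(f"probe.{name}.success", 1.0 if success else 0.0)
    emit(f"probe.{name}.duration_ms", elapsed_ms)
    return success


class FakeStore:
    """In-memory stand-in for the system under test (hypothetical)."""

    def __init__(self, lose_writes: bool = False) -> None:
        self._data: dict = {}
        self._lose_writes = lose_writes

    def save(self, key: str, value: str) -> None:
        if not self._lose_writes:  # the broken variant fails silently
            self._data[key] = value

    def load(self, key: str) -> Any:
        return self._data.get(key)


def save_and_load(store: FakeStore) -> Any:
    """The use case: write a record, then read it back."""
    store.save("probe-key", "probe-value")
    return store.load("probe-key")


emitted: List[Tuple[str, float]] = []


def emit(metric: str, value: float) -> None:
    emitted.append((metric, value))


def expect_value(result: Any) -> bool:
    return result == "probe-value"


ok = run_probe("save_and_load", lambda: save_and_load(FakeStore()), expect_value, emit)
bad = run_probe(
    "save_and_load",
    lambda: save_and_load(FakeStore(lose_writes=True)),
    expect_value,
    emit,
)
print(ok, bad)
```

The broken store never raises an error. It just quietly drops the write, so only the expectation against the result catches the failure. Because the probe treats the system as a callable, the same shape should work for a third-party black box.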
I know I wrote earlier that using real user data is a gold mine to use as an indicator, but I think that there’s a lot of value in generating synthetic load on your application, and monitoring based on that. As an industry we already test in production at scale. Why not use smoke tests as part of routine observability?
This technique isn’t appropriate for every use case or SLO, and there can be some challenges. But when you absolutely, positively gotta know…it does work.