What Sound Does Your Microwave Make?
Author
Marcus Held
Hi,
Do you know the feeling? You turn on a device that you use over and over again. For years. You recognize the sound it makes.
The rhythmic “whoosh whoosh whoosh”, when you start the dishwasher.
Or the familiar “click” when the oven turns on.
Your signal: everything’s working. Carry on!
But what if it suddenly sounds different?
It no longer meets your expectations?
That’s exactly what happened to me with my microwave.
I turned it on - as usual. I was expecting a “djeeehhh”.
But…
Instead, it went “DJREEEEEERREERE” 😱
Panic!
Turn it off immediately. Open it. Smell.
An internal alarm went off instantly.
Something’s not right!
And of course, the microwave was broken.
But why am I narrating this story with such fervor?
You know this effect too. And we can make it work for us!
When we develop and operate software, we use similar language. We talk about a panic when a program runs into unexpected, unrecoverable behavior. We trigger alerts when there’s an issue with our software. In the face of an error, we’d rather stop the entire process than let it continue.
We react in the same way.
However, the prerequisite for us to know something’s wrong is:
We need to know what it looks like in its normal state!
It sounds trivial.
I know.
But: Do you know how your application normally behaves? 🤔
What does your CPU graph look like spread out over the day? Are there natural peaks? Maybe at the start of the workday? How is it over the whole week?
Or what about your RED metrics (rate, errors, duration)? Is it normal for you to see errors regularly? Or is that a sign that something’s amiss?
And what about that exception in the log? Does it indicate an issue?
Time and time again, I find myself in teams that can’t answer these questions. And that costs you and your company a lot of money. Especially when something really goes wrong. You don’t know where to look. And you can’t spot the error before your customers do. It only lands on your desk when things are on fire.
Look at your application every day.
Keep an eye on the metrics. Visualize them on a dashboard. Maybe with Grafana.
Look at your logs. Aggregate and index them. For instance, with the ELK Stack.
Enrich exceptions with additional meta information and send them to a place where you can’t miss them. Maybe the “General” channel of your team chat.
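To make the last point a bit more concrete, here is a minimal sketch in Python of what enriching an exception could look like. Everything named here is my own assumption for illustration: the `build_alert` and `send_to_channel` functions, the "checkout" service, and the `order_id` field. The sender only prints the payload; in a real setup it would post to whatever chat or alerting tool you use.

```python
import json
import platform
import traceback
from datetime import datetime, timezone


def build_alert(exc, service, context):
    """Turn an exception into a structured payload with extra meta information."""
    return {
        "service": service,                                   # hypothetical service name
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when it happened (UTC)
        "host": platform.node(),                              # where it happened
        "error_type": type(exc).__name__,
        "message": str(exc),
        "stacktrace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "context": context,                                   # business context, e.g. an order id
    }


def send_to_channel(alert):
    """Placeholder sender (assumption): prints instead of posting to a chat tool."""
    print(json.dumps(alert, indent=2))


try:
    int("not a number")  # provoke an error for the demo
except ValueError as exc:
    alert = build_alert(exc, service="checkout", context={"order_id": "A-1001"})
    send_to_channel(alert)
```

The point is not the exact fields. The point is that whoever sees the message can tell what broke, where, and for whom, without digging first.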
You need to know what “normal” is to recognize the exceptional state.
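Here is one simple way to put "normal" into code. This is a rough sketch, not a recommendation for a specific alerting rule: the CPU values are made up, and the `is_unusual` function with its threshold of three standard deviations is my own assumption. It compares a new measurement against what you have seen at the same time on previous days.

```python
from statistics import mean, stdev


def is_unusual(history, current, threshold=3.0):
    """Return True if `current` lies more than `threshold` standard
    deviations away from the mean of the historical values."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # All historical values are identical: any difference is unusual
        return current != baseline
    return abs(current - baseline) / spread > threshold


# Hypothetical CPU usage (%) measured at 9:00 on previous workdays
cpu_at_nine = [41.0, 39.5, 43.2, 40.1, 42.3, 38.9, 41.7]

print(is_unusual(cpu_at_nine, 42.0))  # False: within the usual range
print(is_unusual(cpu_at_nine, 87.0))  # True: far outside the usual range
```

Without the history, 87% CPU is just a number. With it, you know it is your "DJREEEEEERREERE".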
Rule the Backend,
~ Marcus