Also, the best alerts come from looking at actual failures you had and not trying to make up "good alerts" from thin air. After you have an outage, figure out what alerts would have caught it, and implement those.
https://en.wikipedia.org/wiki/Nelson_rules
I work for a startup; we have what I think is a fairly typical setup: metrics ingested from a variety of sources, fed into industry-standard metrics/dashboard solutions, triggering escalations to humans. It's fine and I'm happy we have it, but...
The highest value source of alerting right now is one of our growth marketers who pays close attention to our CRM and product analytics tool and notices when key product funnels are underperforming.
Our next highest value signals are a handful of ad hoc alerting channels, mostly in Slack, either directly from a partner telling us that something suspicious happened on their side (think: fraud) or from in-product instrumentation sent to a channel for non-engineering visibility. Members of our business/product/operations team pay attention in these places and make decisions based on their business context.
After that, our support team is increasingly able to filter customer issues and differentiate between bugs, missing features, etc.
I know someone is going to argue that these are all a sign that we haven't instrumented the right things. Fair, but also misses the point. The decision makers in these flows don't (and won't) live in traditional alerting systems and wouldn't have helped us understand breakages without these other, ad hoc processes.
My theory is that it's relatively easy to offer a technical product that moves alerts around or that manages escalation paths. It's quite hard to design a product that surfaces detail to a non-technical export and that makes it easy to build systematic rules.
stingraycharles•1h ago
Lots of metrics are typically available, but almost all of them are noise.
Start with the business: what is important to the business ? What kind of failures are existential threats ?
Then work your way down and design your metrics and alerts, instead of just throwing stuff at the wall.
I’ve had to push back so many times with teams whose manager at one point said “we need better monitoring / alerting” and they interpreted that to mean more metrics / alerts.
This is rarely the case.
I personally am really fond of just using a few alerts. The important thing to know that something went wrong. Not necessarily where / why / how something went wrong.
And yes, inertia is real, and false / invaluable alerts need to be killed immediately, without remorse. They are SRE’s cancer.
b112•1h ago
As you say, few is better. And a well chosen few.
alansaber•1h ago
dandellion•1h ago