Gradually, with each false positive (or negative), you learn to tweak your alerts and update your dashboards to reduce the noise as much as possible.
To a first approximation, monitoring tools are built for teams, for projects running at scale, and for systems where falling over actually matters. And monitoring as a "best practice" is good engineering practice only in those contexts.
You don't have that context, and you should probably resist the temptation to boilerplate it in and count it as moving the project forward. Because monitoring doesn't get you customers/users; it doesn't solve customer/user problems; and nobody cares whether you monitor or not (except assholes who want to tell you you are morally failing unless you monitor).
Good engineering is doing what makes sense only in terms of the actual problems at hand. Good luck.
A single 80% CPU spike isn't anything to worry about by itself... but if it is prolonged, frequent, and accompanied by a significant impact on p95/p99 latency and response times, it could be a critical warning that you need to either mitigate an issue or upgrade soon.
I would be more inclined to set limits on response latency, or other metrics that are impactful to users, at whatever level is tolerable, and use those as the critical alert levels. The rest you can put in reports over, say, hourly or half-hourly windows: where the performance hits are, what the top latency values were in addition to p95/p99, etc.
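As an illustration only (the limit and window size are made-up numbers, and the p95 calculation assumes you already collect raw latency samples), a per-window summary that flags a critical condition solely on the user-impacting metric might look something like this:

```python
import statistics

# Illustrative limit on what users will tolerate; everything below it is just reporting.
P95_CRITICAL_MS = 500

def summarize_window(latencies_ms: list[float]) -> dict:
    """Per-window (e.g. hourly) report: p95/p99, worst latencies, and a
    'critical' flag only when the user-impacting limit is breached."""
    cuts = statistics.quantiles(latencies_ms, n=100)
    p95, p99 = cuts[94], cuts[98]
    return {
        "p95_ms": p95,
        "p99_ms": p99,
        "top_5_ms": sorted(latencies_ms, reverse=True)[:5],
        "critical": p95 > P95_CRITICAL_MS,
    }
```

CPU and the rest can go into the same report without ever paging anyone.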
YAGNI
There isn't a simple way, but having some tooling to go from alert -> relevant dashboards -> remediation steps can help cut down on the process. It takes a lot of time investment to make these things work in a way that lets you save time rather than spend more time solving issues. FWIW, I think developers need to be deeply involved in this process and basically own it.

Static thresholds would usually just be a warning to look at later; what you really want are service level indicators. For example, if you have a streaming system, you probably want to know if one of your consumers is stuck or behind by a certain amount, and also whether there is any measurable data loss. If you have automated pushes, you probably want alerting for a push that is x amount of time stale. For RPC-type systems, you would want recurrent health checks that might warn on CPU etc., but put higher-severity alerting on whether responses are correct and as expected, or not happening at all.
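A minimal sketch of what those SLI checks could look like; the thresholds are made up, and get_consumer_lag, get_last_push_ts, and probe_rpc are hypothetical hooks into your own system:

```python
import time

# Made-up thresholds; tune per system.
MAX_CONSUMER_LAG = 10_000        # messages behind before we call a consumer "stuck"
MAX_PUSH_STALENESS_S = 6 * 3600  # an automated push older than this is stale

def check_slis(get_consumer_lag, get_last_push_ts, probe_rpc):
    """Emit (severity, message) pairs for the SLIs described above:
    consumer lag, push staleness, and RPC response correctness."""
    alerts = []
    lag = get_consumer_lag()
    if lag > MAX_CONSUMER_LAG:
        alerts.append(("page", f"consumer is {lag} messages behind"))
    staleness_s = time.time() - get_last_push_ts()
    if staleness_s > MAX_PUSH_STALENESS_S:
        alerts.append(("page", f"last push is {staleness_s / 3600:.1f}h stale"))
    if not probe_rpc():              # recurrent health check with a known expected answer
        alerts.append(("page", "RPC probe got a wrong or missing response"))
    return alerts
```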
As a solo dev it might be easier just to do the troubleshooting process every time, but as a team grows it becomes a huge time sink and troubleshooting production issues is stressful, so the goal is to make it as easy as possible. Especially if downtime == $$.
I don't have good recommendations for tooling because I have used mostly internal tools but generally this is my experience.
If you can take a reasonable amount of time and come up with something better for your system, great; do it. I've worked with a lot of systems where noisy alerts and human filtering were the best we could reasonably do, and it was fine. In a system like that, not every alert demands an immediate response: a single high-CPU page doesn't demand a quick reaction, and the appropriate response could be 'CPU was high for a short time, I don't need to look at anything else.' Of course, you could also be missing an important signal that should have been investigated. OTOH, if you get high-CPU alerts from many hosts at once, something is up, but it could just be an external event causing high usage, and hopefully your system survives those autonomously anyway.
Ideally, monitoring, operations and development feed into each other, so that the system evolves to work best with the human needs it has.
My biggest advice is to leverage alerting levels, and to only send high priority alerts for things visible to users.
For alert levels, I usually have 3. P1 (the highest level) is the only one that will fire a phone call/alarm 24/7/365, and only alerts if some kind of very user-visible issue happens (increase in error rate, unacceptable latency, etc). P2 is a mid-tier and only expected to get a response during business hours. That's where I send things that are maybe an issue or can wait, like storage filling up (but not critically so). P3 alerts get sent to a Slack channel, and exist mostly so if you get a P1 alert you can get a quick view of "things that are odd" like CPU spiking.
For monitoring, I try to only page on user-visible issues. E.g. I don't routinely monitor CPU usage, because it doesn't correlate to user-visible issues very well. Lots of things can cause CPU to spike, and if it's not impacting users then I don't care. Ditto for network usage, disk IO, etc, etc. Presuming your service does network calls, the 2 things you really care about are success rate and latency. A drop in success rate should trigger a P1 page; an increase in latency should trigger a P2 alert if it's higher than you'd like but still okay, and a P1 alert at the "this is impacting users" point. You may want to split those out by endpoint as well, because your acceptable latency probably differs by endpoint.
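A rough illustration of that mapping; the endpoints and thresholds below are invented, not recommendations:

```python
# Per-endpoint thresholds (illustrative only).
THRESHOLDS = {
    "/search":   {"min_success": 0.999, "p2_latency_ms": 300, "p1_latency_ms": 1000},
    "/checkout": {"min_success": 0.999, "p2_latency_ms": 500, "p1_latency_ms": 2000},
}

def classify(endpoint: str, success_rate: float, p99_latency_ms: float) -> str | None:
    """Map user-visible metrics to an alert level: success-rate drops page
    immediately (P1); latency escalates from P2 to P1 as it worsens."""
    t = THRESHOLDS[endpoint]
    if success_rate < t["min_success"] or p99_latency_ms > t["p1_latency_ms"]:
        return "P1"
    if p99_latency_ms > t["p2_latency_ms"]:
        return "P2"
    return None
```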
If your service can't scale, you might also want to adjust those alerts by traffic levels (i.e. if you know you can't handle 10k QPS and you can't scale past 10k QPS, there's no point in paging someone).
You can also add some automation, especially if the apps are stateless. If api-server-5 is behaving weirdly, kill it and spin up a new api-server-5 (or reboot it if physical). A lot of the common first line of defense options are pretty automatable, and can save you from getting paged if an automated restart will fix it. You probably do want some monitoring and rate limiting over that as well, though. E.g. a P2 alert that api-server-5 has been rebooted 4 times today, because repeated reboots are probably an indication of an underlying issue even if reboots temporarily resolve it.
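One way the rate-limited restart idea could look, assuming a plain reboot over ssh as a stand-in for whatever your orchestrator actually provides:

```python
import subprocess
from collections import defaultdict

REBOOTS_BEFORE_P2 = 4
reboots_today = defaultdict(int)   # host -> count; a real system would reset this daily

def remediate_unhealthy(host: str, healthy: bool, notify):
    """First line of defense: restart a misbehaving stateless instance,
    but raise a P2 once repeated restarts suggest a deeper problem."""
    if healthy:
        return
    # Placeholder remediation; swap in your orchestrator's restart API.
    subprocess.run(["ssh", host, "sudo", "reboot"], check=False)
    reboots_today[host] += 1
    if reboots_today[host] >= REBOOTS_BEFORE_P2:
        notify("P2", f"{host} rebooted {reboots_today[host]} times today")
```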
Ideally alerts should only be generated when ($severity_of_potential_bad_state * $probability_of_that_state) is high. In other words, for marginally bad states, you want a high confidence before alerting. For states that are really mega bad, it may be OK to loosen that and alert when you are less confident that it is actually occurring.
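In made-up numbers (both scales arbitrary, here in [0, 1]), that rule is just an expected-badness product:

```python
def should_alert(severity: float, probability: float, threshold: float = 0.5) -> bool:
    """Alert when expected badness (severity x probability) crosses a threshold."""
    return severity * probability >= threshold

# Marginally bad state: only alert with high confidence it is happening.
should_alert(severity=0.3, probability=0.9)   # False (0.27 < 0.5)
# Mega bad state: alert even on weaker evidence.
should_alert(severity=1.0, probability=0.6)   # True  (0.60 >= 0.5)
```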
IME CPU% alerts are typically totally spurious in a modern cloud application. In general, to get the most out of your spend, you actually want your instances working close to their limits because the intent is to scale out when your application gets busy. Therefore, you instead want to monitor things that are as close to user experience or business metric as possible. P99 request latency, 5xx rate, etc. are OK, but ideally you go even further into application-specific metrics. For example, Facebook might ask: What's the latency between uploading a cat picture and getting its first like?
The problem you’re starting to run into is that you’re seeing the monitors as useless, which will ultimately lead to ignoring them, so when there is a problem you won’t know it.
What you should be doing is tuning the monitors to make them useful. If your app will see occasional spikes that last 10 minutes, and the monitor checks every 5 minutes, set it to only create an alert after 3 consecutive failures. That creates some tolerance for spikes, but will still alert you if there is a prolonged issue that needs to be addressed due to the inevitable performance issues it will cause.
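That tolerance is just a streak counter; a minimal sketch:

```python
CONSECUTIVE_FAILURES_TO_ALERT = 3   # 3 checks at 5-minute intervals tolerates ~10-minute spikes
failure_streak = 0

def record_check(passed: bool) -> bool:
    """Return True (fire the alert) only after enough consecutive failed checks,
    so short spikes are tolerated but prolonged issues still page."""
    global failure_streak
    failure_streak = 0 if passed else failure_streak + 1
    return failure_streak >= CONSECUTIVE_FAILURES_TO_ALERT
```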
If there are other alerts that happen often and need repeatable action taken, that’s where EDA (Event Driven Automation) comes in. Write some code to fix what needs to be fixed, and when the alert comes in the code automatically runs to fix it. You then only need to handle the cases where that EDA code can’t fix the issue. Fix it once in code instead of every time you get an alert.
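A bare-bones sketch of that flow, where remediations maps alert names to hypothetical fix functions and escalate is whatever pages a human:

```python
def handle_alert(alert_name: str, remediations: dict, escalate):
    """Event-driven automation: run the known fix for an alert and only
    involve a human when there is no fix or the fix didn't resolve it."""
    fix = remediations.get(alert_name)
    if fix is None or not fix():     # fix() returns True when it resolved the issue
        escalate(alert_name)

# Example wiring (clean_old_logs and page_oncall are hypothetical):
# handle_alert("disk_full", {"disk_full": clean_old_logs}, page_oncall)
```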
I would recommend alerting on reliability. If errors for an endpoint go above whatever threshold you judge is right, e.g. 1% or 0.1% or 0.01%, for a sustained period, then alarm.
Maybe do the same for latency.
For hobby projects though I just point a free tier of one of those down detector things at a few urls. I may make a health check url.
Every false alarm should lead to some decision about how to fix it, e.g. a different alarm, a different threshold, or even just dropping that alarm.