Gradually, with each false positive (or negative), you learn to tweak your alerts and update your dashboards to reduce the noise as much as possible.
To a first approximation, monitoring tools are built for teams, for projects running at scale, and for systems where falling over actually matters. And monitoring as a "best practice" is good engineering practice only in those contexts.
You don't have that context, and you should probably resist the temptation to boilerplate it in and count it as moving the project forward. Because monitoring doesn't get you customers/users; it doesn't solve customer/user problems; and nobody cares whether you monitor or not (except assholes who want to tell you you are morally failing unless you monitor).
Good engineering is doing what makes sense only in terms of the actual problems at hand. Good luck.
A single 80% CPU spike isn't anything to worry about by itself... but if it is prolonged, frequent, and accompanied by a significant impact on p95/p99 latency and response times, it could be a critical warning that you need to either mitigate an issue or upgrade soon.
I would be more inclined to set limits on response latency, or other metrics that are impactful to users, at whatever level is tolerable, and use those as the critical alert levels. The rest you can put in reports over, say, hourly or half-hourly windows: where the performance hits are, what the top latency values were in addition to p95/p99, etc.
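As an illustration only (the limit and window size are made-up numbers, and the p95 calculation assumes you already collect raw latency samples), a per-window summary that flags a critical condition solely on the user-impacting metric might look something like this:

```python
import statistics

# Illustrative limit on what users will tolerate; everything below it is just reporting.
P95_CRITICAL_MS = 500

def summarize_window(latencies_ms: list[float]) -> dict:
    """Per-window (e.g. hourly) report: p95/p99, worst latencies, and a
    'critical' flag only when the user-impacting limit is breached."""
    cuts = statistics.quantiles(latencies_ms, n=100)
    p95, p99 = cuts[94], cuts[98]
    return {
        "p95_ms": p95,
        "p99_ms": p99,
        "top_5_ms": sorted(latencies_ms, reverse=True)[:5],
        "critical": p95 > P95_CRITICAL_MS,
    }
```

CPU and the rest can go into the same report without ever paging anyone.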
YAGNI
There isn't a simple way, but having some tooling to go from alert -> relevant dashboards -> remediation steps can help cut down on the process. It takes a lot of time investment to make these things work in a way that lets you save time rather than spend more time solving issues. FWIW, I think developers need to be deeply involved in this process and basically own it.

Static thresholds would usually just be a warning to look at later; what you really want are service level indicators. For example, if you have a streaming system, you probably want to know if one of your consumers is stuck or behind by a certain amount, and also whether there is any measurable data loss. If you have automated pushes, you probably want alerting for a push that is x amount of time stale. For RPC-type systems, you would want recurrent health checks that might warn on CPU etc., but put higher-severity alerting on whether responses are correct and as expected, or not happening at all.
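A minimal sketch of what those SLI checks could look like; the thresholds are made up, and get_consumer_lag, get_last_push_ts, and probe_rpc are hypothetical hooks into your own system:

```python
import time

# Made-up thresholds; tune per system.
MAX_CONSUMER_LAG = 10_000        # messages behind before we call a consumer "stuck"
MAX_PUSH_STALENESS_S = 6 * 3600  # an automated push older than this is stale

def check_slis(get_consumer_lag, get_last_push_ts, probe_rpc):
    """Emit (severity, message) pairs for the SLIs described above:
    consumer lag, push staleness, and RPC response correctness."""
    alerts = []
    lag = get_consumer_lag()
    if lag > MAX_CONSUMER_LAG:
        alerts.append(("page", f"consumer is {lag} messages behind"))
    staleness_s = time.time() - get_last_push_ts()
    if staleness_s > MAX_PUSH_STALENESS_S:
        alerts.append(("page", f"last push is {staleness_s / 3600:.1f}h stale"))
    if not probe_rpc():              # recurrent health check with a known expected answer
        alerts.append(("page", "RPC probe got a wrong or missing response"))
    return alerts
```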
As a solo dev it might be easier just to do the troubleshooting process every time, but as a team grows it becomes a huge time sink and troubleshooting production issues is stressful, so the goal is to make it as easy as possible. Especially if downtime == $$.
I don't have good recommendations for tooling because I have used mostly internal tools but generally this is my experience.
If you can take a reasonable amount of time and come up with something better for your system, great; do it. I've worked with a lot of systems where noisy alerts and human filtering were the best we could reasonably do, and it was fine. In a system like that, not every alert demands an immediate response: a single high-CPU page doesn't demand a quick reaction, and the appropriate response could be 'CPU was high for a short time, I don't need to look at anything else.' Of course, you could also be missing an important signal that should have been investigated. OTOH, if you get high-CPU alerts from many hosts at once, something is up, but it could just be an external event causing high usage, and hopefully your system survives those autonomously anyway.
Ideally, monitoring, operations and development feed into each other, so that the system evolves to work best with the human needs it has.
My biggest advice is to leverage alerting levels, and to only send high priority alerts for things visible to users.
For alert levels, I usually have 3. P1 (the highest level) is the only one that will fire a phone call/alarm 24/7/365, and only alerts if some kind of very user-visible issue happens (increase in error rate, unacceptable latency, etc). P2 is a mid-tier and only expected to get a response during business hours. That's where I send things that are maybe an issue or can wait, like storage filling up (but not critically so). P3 alerts get sent to a Slack channel, and exist mostly so if you get a P1 alert you can get a quick view of "things that are odd" like CPU spiking.
For monitoring, I try to only page on user-visible issues. E.g. I don't routinely monitor CPU usage, because it doesn't correlate to user-visible issues very well. Lots of things can cause CPU to spike, and if it's not impacting users then I don't care. Ditto for network usage, disk IO, etc, etc. Presuming your service does network calls, the 2 things you really care about are success rate and latency. A drop in success rate should trigger a P1 page; an increase in latency should trigger a P2 alert if it's higher than you'd like but still okay, and a P1 alert at the "this is impacting users" point. You may want to split those out by endpoint as well, because your acceptable latency probably differs by endpoint.
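A rough illustration of that mapping; the endpoints and thresholds below are invented, not recommendations:

```python
# Per-endpoint thresholds (illustrative only).
THRESHOLDS = {
    "/search":   {"min_success": 0.999, "p2_latency_ms": 300, "p1_latency_ms": 1000},
    "/checkout": {"min_success": 0.999, "p2_latency_ms": 500, "p1_latency_ms": 2000},
}

def classify(endpoint: str, success_rate: float, p99_latency_ms: float) -> str | None:
    """Map user-visible metrics to an alert level: success-rate drops page
    immediately (P1); latency escalates from P2 to P1 as it worsens."""
    t = THRESHOLDS[endpoint]
    if success_rate < t["min_success"] or p99_latency_ms > t["p1_latency_ms"]:
        return "P1"
    if p99_latency_ms > t["p2_latency_ms"]:
        return "P2"
    return None
```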
If your service can't scale, you might also want to adjust those alerts by traffic levels (i.e. if you know you can't handle 10k QPS and you can't scale past 10k QPS, there's no point in paging someone).
You can also add some automation, especially if the apps are stateless. If api-server-5 is behaving weirdly, kill it and spin up a new api-server-5 (or reboot it if physical). A lot of the common first line of defense options are pretty automatable, and can save you from getting paged if an automated restart will fix it. You probably do want some monitoring and rate limiting over that as well, though. E.g. a P2 alert that api-server-5 has been rebooted 4 times today, because repeated reboots are probably an indication of an underlying issue even if reboots temporarily resolve it.
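One way the rate-limited restart idea could look, assuming a plain reboot over ssh as a stand-in for whatever your orchestrator actually provides:

```python
import subprocess
from collections import defaultdict

REBOOTS_BEFORE_P2 = 4
reboots_today = defaultdict(int)   # host -> count; a real system would reset this daily

def remediate_unhealthy(host: str, healthy: bool, notify):
    """First line of defense: restart a misbehaving stateless instance,
    but raise a P2 once repeated restarts suggest a deeper problem."""
    if healthy:
        return
    # Placeholder remediation; swap in your orchestrator's restart API.
    subprocess.run(["ssh", host, "sudo", "reboot"], check=False)
    reboots_today[host] += 1
    if reboots_today[host] >= REBOOTS_BEFORE_P2:
        notify("P2", f"{host} rebooted {reboots_today[host]} times today")
```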
Ideally alerts should only be generated when ($severity_of_potential_bad_state * $probability_of_that_state) is high. In other words, for marginally bad states, you want a high confidence before alerting. For states that are really mega bad, it may be OK to loosen that and alert when you are less confident that it is actually occurring.
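In made-up numbers (both scales arbitrary, here in [0, 1]), that rule is just an expected-badness product:

```python
def should_alert(severity: float, probability: float, threshold: float = 0.5) -> bool:
    """Alert when expected badness (severity x probability) crosses a threshold."""
    return severity * probability >= threshold

# Marginally bad state: only alert with high confidence it is happening.
should_alert(severity=0.3, probability=0.9)   # False (0.27 < 0.5)
# Mega bad state: alert even on weaker evidence.
should_alert(severity=1.0, probability=0.6)   # True  (0.60 >= 0.5)
```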
IME CPU% alerts are typically totally spurious in a modern cloud application. In general, to get the most out of your spend, you actually want your instances working close to their limits because the intent is to scale out when your application gets busy. Therefore, you instead want to monitor things that are as close to user experience or business metric as possible. P99 request latency, 5xx rate, etc. are OK, but ideally you go even further into application-specific metrics. For example, Facebook might ask: What's the latency between uploading a cat picture and getting its first like?
The problem you’re starting to run into is that you’re seeing the monitors as useless, which will ultimately lead to ignoring them, so when there is a problem you won’t know it.
What you should be doing is tuning the monitors to make them useful. If your app will see occasional spikes that last 10 minutes, and the monitor checks every 5 minutes, set it to only create an alert after 3 consecutive failures. That creates some tolerance for spikes, but will still alert you if there is a prolonged issue that needs to be addressed due to the inevitable performance issues it will cause.
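That tolerance is just a streak counter; a minimal sketch:

```python
CONSECUTIVE_FAILURES_TO_ALERT = 3   # 3 checks at 5-minute intervals tolerates ~10-minute spikes
failure_streak = 0

def record_check(passed: bool) -> bool:
    """Return True (fire the alert) only after enough consecutive failed checks,
    so short spikes are tolerated but prolonged issues still page."""
    global failure_streak
    failure_streak = 0 if passed else failure_streak + 1
    return failure_streak >= CONSECUTIVE_FAILURES_TO_ALERT
```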
If there are other alerts that happen often and need repeatable action taken, that’s where EDA (Event Driven Automation) comes in. Write some code to fix what needs to be fixed, and when the alert comes in the code automatically runs to fix it. You then only need to handle the cases where that EDA code can’t fix the issue. Fix it once in code instead of every time you get an alert.
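A bare-bones sketch of that flow, where remediations maps alert names to hypothetical fix functions and escalate is whatever pages a human:

```python
def handle_alert(alert_name: str, remediations: dict, escalate):
    """Event-driven automation: run the known fix for an alert and only
    involve a human when there is no fix or the fix didn't resolve it."""
    fix = remediations.get(alert_name)
    if fix is None or not fix():     # fix() returns True when it resolved the issue
        escalate(alert_name)

# Example wiring (clean_old_logs and page_oncall are hypothetical):
# handle_alert("disk_full", {"disk_full": clean_old_logs}, page_oncall)
```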
I would recommend alerting on reliability. If errors for an endpoint go above whatever threshold you judge is right, e.g. 1% or 0.1% or 0.01%, for a sustained period, then alarm.
Maybe do the same for latency.
For hobby projects though I just point a free tier of one of those down detector things at a few urls. I may make a health check url.
Every false alarm should lead to some decision about how to fix it, e.g. a different alarm, a different threshold, or even just dropping that alarm.