In several environments I've seen monitoring alerts arrive only after the system is already degrading.
Examples:
- disk usage spikes faster than expected
- network latency gradually increases
- services degrade slowly before failing
Tools like Datadog, Zabbix, and Prometheus are great for alerting, but they still feel mostly reactive.
How do you deal with this in your infrastructure?
Do you rely more on:
- anomaly detection
- predictive monitoring
- custom scripts
- or just good incident response?
I'm trying to understand what actually works in real-world environments.
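To make the "custom scripts" option concrete, here's a minimal sketch of what I mean by predictive monitoring: fit a linear trend to recent disk-usage samples and estimate when the disk will fill, instead of waiting for a fixed threshold to fire. The sample readings and the 24-hour alert horizon are hypothetical.

```python
# Minimal sketch of predictive monitoring: fit a least-squares line to
# recent disk-usage samples and project hours until the disk is full.
# The readings below and the 24h alert horizon are made up for illustration.

def hours_until_full(samples, capacity_pct=100.0):
    """samples: list of (hour, used_pct) pairs, oldest first.
    Returns projected hours until capacity, or None if usage is flat or falling."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    var = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = cov / var  # percentage points per hour
    if slope <= 0:
        return None
    _, latest_used = samples[-1]
    return (capacity_pct - latest_used) / slope

readings = [(0, 70.0), (1, 72.0), (2, 74.0), (3, 76.0)]  # hypothetical samples
eta = hours_until_full(readings)
if eta is not None and eta < 24:
    print(f"disk projected full in {eta:.1f}h")
```

(In Prometheus the same idea is available out of the box as the PromQL `predict_linear()` function.)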
gabdiax•1h ago
For example:
- unusual latency patterns
- slow resource saturation
- network anomalies
Do people actively monitor these patterns or mostly rely on threshold alerts?
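For comparison, here's roughly what an anomaly check looks like next to a fixed threshold: flag a latency sample that sits far above the rolling mean of a recent window. A minimal sketch; the window size, 3-sigma limit, and latency samples are all assumptions for illustration.

```python
# Minimal sketch of statistical anomaly detection vs. a fixed threshold:
# flag a sample more than z_limit standard deviations above the rolling mean.
# Window size, 3-sigma limit, and the sample latencies are hypothetical.
from collections import deque
from statistics import mean, stdev

def make_detector(window=30, z_limit=3.0):
    history = deque(maxlen=window)
    def check(latency_ms):
        anomalous = False
        if len(history) >= 5:  # need a few samples before judging
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and (latency_ms - mu) / sigma > z_limit:
                anomalous = True
        history.append(latency_ms)
        return anomalous
    return check

check = make_detector()
baseline = [20, 21, 19, 22, 20, 21, 20, 19]  # hypothetical steady traffic
flags = [check(v) for v in baseline + [90]]
print(flags[-1])  # the 90 ms spike stands out against the rolling baseline
```

The point of the pattern: a static 100 ms threshold would stay silent here, while the rolling baseline flags the spike because it's unusual *for this service*.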