
Do monitoring tools still miss early signals before incidents?

3•gabdiax•1h ago
I'm curious how teams detect early signals of infrastructure problems before they turn into incidents.

In some environments I've seen cases where monitoring alerts arrive only after the system is already degrading.

Examples:
- disk usage spikes faster than expected
- network latency gradually increases
- services degrade slowly before failing

Tools like Datadog, Zabbix, Prometheus etc. are great for alerts, but they still feel mostly reactive.

How do you deal with this in your infrastructure?

Do you rely more on:
- anomaly detection
- predictive monitoring
- custom scripts
- or just good incident response?

I'm trying to understand what actually works in real-world environments.
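
To make the "disk usage spikes faster than expected" case concrete, here is a minimal sketch of what predictive monitoring could look like: fit a straight line to recent usage samples and estimate how long until the disk is full. The function name `hours_until_full`, the hourly sampling, and the linear-growth assumption are all illustrative, not taken from any particular tool.

```python
from typing import Optional, Sequence


def hours_until_full(
    timestamps_h: Sequence[float],
    used_fraction: Sequence[float],
) -> Optional[float]:
    """Estimate hours until usage reaches 1.0 via a least-squares line.

    Returns None when usage is flat or shrinking (no projected fill time).
    """
    n = len(timestamps_h)
    if n < 2 or n != len(used_fraction):
        raise ValueError("need at least two paired samples")
    mean_t = sum(timestamps_h) / n
    mean_u = sum(used_fraction) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps_h)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")
    slope = sum(
        (t - mean_t) * (u - mean_u)
        for t, u in zip(timestamps_h, used_fraction)
    ) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    # Time at which the fitted line crosses 100% usage,
    # measured from the most recent sample.
    t_full = (1.0 - intercept) / slope
    return max(0.0, t_full - timestamps_h[-1])


# Hypothetical samples: usage grows 10 points per hour from 50%.
remaining = hours_until_full([0, 1, 2, 3], [0.5, 0.6, 0.7, 0.8])
print(f"projected hours until full: {remaining:.1f}")
```

Alerting on the projected time remaining, rather than on a fixed usage percentage, is one way to catch a fast-filling disk while it still looks healthy. Prometheus exposes a comparable idea through its `predict_linear` function.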

Comments

gabdiax•1h ago
One thing I'm particularly curious about is whether teams see early signals in metrics or logs before incidents actually happen.

For example:
- unusual latency patterns
- slow resource saturation
- network anomalies

Do people actively monitor these patterns or mostly rely on threshold alerts?
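
As a rough illustration of the difference between watching a pattern and watching a threshold, the sketch below compares the mean of the most recent latency samples against a baseline window and flags an upward shift that is statistically unusual, even when every individual sample is still well under a static alert limit. The function name `drift_detected`, the window sizes, and the 3-sigma limit are assumptions chosen for the example.

```python
from statistics import mean, stdev
from typing import Sequence


def drift_detected(
    samples: Sequence[float],
    baseline_len: int = 30,
    recent_len: int = 5,
    z_limit: float = 3.0,
) -> bool:
    """Return True when the recent mean sits far above the baseline.

    The baseline is the `baseline_len` samples just before the recent
    window. The check is one-sided: only increases are flagged.
    """
    if len(samples) < baseline_len + recent_len:
        return False  # not enough history to judge
    baseline = samples[-(baseline_len + recent_len):-recent_len]
    recent = samples[-recent_len:]
    sigma = stdev(baseline)
    if sigma == 0:
        return mean(recent) > mean(baseline)
    # Standard error of a mean of `recent_len` samples from the baseline.
    z = (mean(recent) - mean(baseline)) / (sigma / recent_len ** 0.5)
    return z > z_limit


# Hypothetical latencies in ms: steady around 101, then a small shift to 103.
history = [100, 102] * 15
print(drift_detected(history + [103] * 5))                   # shifted
print(drift_detected(history + [101, 100, 102, 101, 101]))   # steady
```

A static threshold at, say, 150 ms would stay silent in both cases; the window comparison only reacts to the first.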

zippyman55•1h ago
My team was responsible for the system administration of a large-scale HPC center. We seemed to get blamed, incorrectly, for a lot of sloppy user code. I implemented statistical process controls for job aborts, and reported the results as mean time to failure rates over the years. It was pretty cool, as I could respond with failure rates for each of several thousand different programs. What did not work was changing the culture to get people to improve their code. But I was able to push back hard when my team was arbitrarily blamed for someone else’s bad code. It was easy to show that a job's failure rate was increasing and link it to a recent upgrade or change. But I felt I was often just shining the flashlight at an issue and trying to encourage a responsible party to take ownership.
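
For readers unfamiliar with statistical process control, here is a minimal sketch of one common form, a p-chart, applied to job aborts. The comment above does not say which chart type was used, so the p-chart, the function name `p_chart_flags`, and the sample numbers are assumptions for illustration only.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def p_chart_flags(
    periods: Sequence[Tuple[int, int]],
    baseline_periods: int,
) -> List[bool]:
    """Flag periods whose abort fraction exceeds the upper control limit.

    `periods` holds (aborted_jobs, total_jobs) per period, oldest first.
    The centre line comes from the first `baseline_periods` entries.
    """
    base = periods[:baseline_periods]
    base_total = sum(total for _, total in base)
    if base_total == 0:
        raise ValueError("baseline contains no jobs")
    p_bar = sum(aborted for aborted, _ in base) / base_total
    flags = []
    for aborted, total in periods:
        if total == 0:
            flags.append(False)  # nothing ran, nothing to judge
            continue
        # 3-sigma limit for a binomial proportion at this sample size.
        ucl = p_bar + 3 * sqrt(p_bar * (1 - p_bar) / total)
        flags.append(aborted / total > ucl)
    return flags


# Hypothetical weekly counts for one program: about 5% aborts, then 15%.
weeks = [(5, 100), (4, 100), (6, 100), (5, 100), (15, 100)]
print(p_chart_flags(weeks, baseline_periods=4))
```

Running one such chart per program is what makes it possible to say that a specific job's failure rate moved out of its normal range after a given upgrade or change.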
gabdiax•1h ago
That's really interesting. Using statistical process control for failure rates in HPC systems sounds like a very solid approach.

In your experience, were there usually early signals in metrics before job failures increased? For example patterns like latency changes, resource saturation or network anomalies.

I'm trying to understand whether those signals appear consistently enough to detect issues before incidents actually happen.
