In several environments I've seen monitoring alerts arrive only after the system is already degrading.
Examples:
- disk usage spikes faster than expected
- network latency gradually increases
- services degrade slowly before failing
Tools like Datadog, Zabbix, and Prometheus are great for alerting, but they still feel mostly reactive.
How do you deal with this in your infrastructure?
Do you rely more on:
- anomaly detection
- predictive monitoring
- custom scripts
- or just good incident response?
I'm trying to understand what actually works in real-world environments.
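To make the "custom scripts" option concrete, here's a minimal sketch of what I mean by predictive monitoring: fit a linear trend to recent disk-usage samples and estimate when the disk will fill, instead of waiting for a fixed threshold to fire. The sample readings and the 24-hour alert horizon are hypothetical.

```python
# Minimal sketch of predictive monitoring: fit a least-squares line to
# recent disk-usage samples and project hours until the disk is full.
# The readings below and the 24h alert horizon are made up for illustration.

def hours_until_full(samples, capacity_pct=100.0):
    """samples: list of (hour, used_pct) pairs, oldest first.
    Returns projected hours until capacity, or None if usage is flat or falling."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    var = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = cov / var  # percentage points per hour
    if slope <= 0:
        return None
    _, latest_used = samples[-1]
    return (capacity_pct - latest_used) / slope

readings = [(0, 70.0), (1, 72.0), (2, 74.0), (3, 76.0)]  # hypothetical samples
eta = hours_until_full(readings)
if eta is not None and eta < 24:
    print(f"disk projected full in {eta:.1f}h")
```

(In Prometheus the same idea is available out of the box as the PromQL `predict_linear()` function.)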
gabdiax•1h ago
For example:
- unusual latency patterns
- slow resource saturation
- network anomalies
Do people actively monitor these patterns or mostly rely on threshold alerts?
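For comparison, here's roughly what an anomaly check looks like next to a fixed threshold: flag a latency sample that sits far above the rolling mean of a recent window. A minimal sketch; the window size, 3-sigma limit, and latency samples are all assumptions for illustration.

```python
# Minimal sketch of statistical anomaly detection vs. a fixed threshold:
# flag a sample more than z_limit standard deviations above the rolling mean.
# Window size, 3-sigma limit, and the sample latencies are hypothetical.
from collections import deque
from statistics import mean, stdev

def make_detector(window=30, z_limit=3.0):
    history = deque(maxlen=window)
    def check(latency_ms):
        anomalous = False
        if len(history) >= 5:  # need a few samples before judging
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and (latency_ms - mu) / sigma > z_limit:
                anomalous = True
        history.append(latency_ms)
        return anomalous
    return check

check = make_detector()
baseline = [20, 21, 19, 22, 20, 21, 20, 19]  # hypothetical steady traffic
flags = [check(v) for v in baseline + [90]]
print(flags[-1])  # the 90 ms spike stands out against the rolling baseline
```

The point of the pattern: a static 100 ms threshold would stay silent here, while the rolling baseline flags the spike because it's unusual *for this service*.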