I also live with Type 1 diabetes. That forces me to run extremely disciplined systems in my personal life: continuous monitoring, feedback loops, automated corrections, stability under stress. It has shaped how I think about infrastructure in ways I didn't expect.
I literally have oncall in my blood: my blood sugar is basically a live monitoring system.
And that made me notice something strange about current observability stacks:
We measure everything except the one thing that actually resolves incidents: the human problem-solving process.
Every outage generates knowledge, but most of it evaporates:
- shell history disappears
- Slack conversations drift away
- senior engineers fix silently
- runbooks rot
- context is lost
- the same incident happens again and is solved again
So I’m exploring a new layer for the SRE stack: an Incident Intelligence Layer.
High-level idea (no deep tech here):
- troubleshooting sessions become structured, anonymous traces
- each incident type gets a shared knowledge feed
- engineers upvote or downvote solutions
- a local LLM summarizes recurring patterns
- a sanitized layer allows safe use of a public LLM
- repeated successful solutions gradually become recommended actions or potential automation candidates
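To make two of those bullets concrete, here is a rough Python sketch of what a sanitizing step and a vote-based promotion rule could look like. Everything in it is an assumption for illustration, not the actual design: the names `sanitize` and `recommended`, the two regex patterns, and the `min_votes` / `min_ratio` thresholds are all hypothetical, and a real sanitizer would need far broader coverage than IPs and `key=value` secrets.

```python
import re

# Hypothetical patterns: IPv4-looking strings and simple key=value secrets.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SECRET = re.compile(r"(?i)\b(password|token|secret|api_key)=\S+")


def sanitize(line):
    """Mask IPs and obvious secrets before a trace line leaves the local machine."""
    line = IPV4.sub("<ip>", line)
    return SECRET.sub(lambda m: f"{m.group(1)}=<redacted>", line)


def recommended(solutions, min_votes=5, min_ratio=0.8):
    """Return ids of solutions with enough votes and a high enough upvote share.

    Each solution is assumed to be a dict with "id", "up" and "down" keys.
    """
    picks = []
    for s in solutions:
        total = s["up"] + s["down"]
        if total >= min_votes and s["up"] / total >= min_ratio:
            picks.append(s["id"])
    return picks
```

With these assumed thresholds, a fix with 9 upvotes and 1 downvote would be promoted, while one with only 2 votes would not, no matter how positive they are.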
The goal is simple: every outage should make the system smarter, not just the engineer who fixed it.
I’m working on an early MVP:
- a minimal session recorder that emits structured JSON
- basic incident-type feeds
- voting
- a first pass of local LLM summarization
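For a sense of scale, a minimal session recorder that emits structured JSON could be as small as the Python sketch below. The schema is an assumption made up for this example (`incident_type`, `session_id`, and per-step `command`, `exit_code`, `ts`), and the class names `Step` and `SessionTrace` are hypothetical. It also leaves out how shell commands would actually be captured.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    command: str    # the command the engineer ran
    exit_code: int  # its exit status
    ts: float       # Unix timestamp when it was recorded


@dataclass
class SessionTrace:
    incident_type: str
    # Random id, so the trace carries no user or host identity.
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: list = field(default_factory=list)

    def record(self, command, exit_code, ts=None):
        """Append one troubleshooting step to the trace."""
        stamp = time.time() if ts is None else ts
        self.steps.append(Step(command, exit_code, stamp))

    def to_json(self):
        """Serialize the whole session as one JSON document."""
        return json.dumps(asdict(self), sort_keys=True)
```

Usage would be something like `trace = SessionTrace("disk-full")`, then `trace.record("df -h", 0)` for each step, and `trace.to_json()` at the end of the session.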
Not a full product. Just exploring the space and validating whether others see the same gap.
Would love to talk with people who:
- work in SRE or oncall
- build observability or internal tooling
- have tried to reduce repeated incidents
- think about AI-assisted remediation
- or have built infra startups before
If this resonates, feel free to DM me here on HN. Happy to share more privately.
mpingu•36m ago