Show HN: An Incident Intelligence Layer that learns from real oncall work

1•mpingu•2mo ago

I’ve worked in operations and SRE for years, mostly in environments where outages, oncall fatigue and recurring incidents were normal. Like many here, I’ve been through the 3 AM pages, the tribal knowledge, the fixes that disappear in scrolling Slack threads, and the runbooks nobody touches until it’s too late.

I also live with Type-1 diabetes. That forces me to run extremely disciplined systems in my personal life: continuous monitoring, feedback loops, automated corrections, stability under stress. It shaped how I think about infrastructure in an unexpected way.

I have oncall in my blood, literally. My blood sugar is basically a live monitoring system.

And that made me notice something strange about current observability stacks:

We measure everything except the one thing that actually resolves incidents: the human problem-solving process.

Every outage generates knowledge, but most of it evaporates:
- shell history disappears
- Slack conversations drift away
- senior engineers fix silently
- runbooks rot
- context is lost
- the same incident happens again and is solved again

So I’m exploring a new layer for the SRE stack: an Incident Intelligence Layer.

High-level idea (no deep tech here):

- troubleshooting sessions become structured, anonymous traces
- each incident type gets a shared knowledge feed
- engineers upvote or downvote solutions
- a local LLM summarizes recurring patterns
- a sanitized layer allows safe use of a public LLM
- repeated successful solutions gradually become recommended actions or potential automation candidates
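
To make the voting idea concrete, here is a rough sketch of how upvotes, downvotes and repeated successes could promote a solution to a "recommended action". Everything in it is an assumption for illustration: the `Solution` fields, the `recommended` helper, and the thresholds are hypothetical and not part of any existing implementation.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    """One proposed fix in an incident-type feed (hypothetical shape)."""
    summary: str
    upvotes: int = 0
    downvotes: int = 0
    successes: int = 0  # how many times this fix actually resolved an incident

    @property
    def score(self) -> int:
        # Net vote score: upvotes minus downvotes.
        return self.upvotes - self.downvotes


def recommended(solutions, min_score=3, min_successes=2):
    """Return solutions that pass both thresholds, best first.

    The thresholds are arbitrary placeholders, not tuned values.
    """
    picks = [
        s for s in solutions
        if s.score >= min_score and s.successes >= min_successes
    ]
    return sorted(picks, key=lambda s: (s.score, s.successes), reverse=True)


feed = [
    Solution("restart worker pool", upvotes=5, downvotes=1, successes=3),
    Solution("bump connection limit", upvotes=2, downvotes=0, successes=4),
    Solution("rotate logs", upvotes=7, downvotes=0, successes=2),
]
top = [s.summary for s in recommended(feed)]
```

Requiring both a vote score and a success count means a popular but unproven fix does not get promoted on votes alone.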

The goal is simple: every outage should make the system smarter, not just the engineer who fixed it.

I’m working on an early MVP:
- a minimal session recorder that emits structured JSON
- basic incident-type feeds
- voting
- a first pass of local LLM summarization
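
As an illustration of the session recorder, a minimal sketch that collects troubleshooting steps and emits them as structured JSON might look like the following. The class names, field names and the example incident type are all hypothetical; the actual trace format is not specified in this post. The trace carries a random session id and no user identity, in keeping with the "anonymous traces" idea.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    """One command run during troubleshooting (hypothetical schema)."""
    command: str
    exit_code: int
    ts: float = field(default_factory=time.time)


@dataclass
class SessionTrace:
    """An anonymous trace of one troubleshooting session."""
    incident_type: str
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: list = field(default_factory=list)
    resolved: bool = False

    def record(self, command: str, exit_code: int) -> None:
        # Append a step; no user or host identity is stored.
        self.steps.append(Step(command, exit_code))

    def to_json(self) -> str:
        # asdict() recurses into the list of Step dataclasses.
        return json.dumps(asdict(self), sort_keys=True)


trace = SessionTrace("disk-full")
trace.record("df -h", 0)
trace.record("journalctl --vacuum-size=200M", 0)
trace.resolved = True
payload = trace.to_json()
```

A real recorder would also need to sanitize command arguments (hostnames, tokens, paths) before anything leaves the machine; that step is left out here.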

Not a full product. Just exploring the space and validating whether others see the same gap.

Would love to talk with people who:
- work in SRE or oncall
- build observability or internal tooling
- have tried to reduce repeated incidents
- think about AI-assisted remediation
- or have built infra startups before

If this resonates, feel free to DM me here on HN. Happy to share more privately.

Comments

mpingu•2mo ago

Minor note: I used an AI translator for a few sentences, which is why some em dashes slipped in. The content is mine; only some of the phrasing was assisted. English is not my first language, so small wording issues can happen.

brihati•1mo ago

Frankly, I was unable to distill what you are trying to achieve here
