We open-sourced the core of IncidentFox, an AI SRE / on-call agent.
The main thing we’re working on is handling context for incident investigation. Logs, metrics, traces, runbooks, prior incidents — this data is large, fragmented, and doesn’t fit cleanly into an LLM context window.
For logs, we don’t fetch everything. We start with stats (counts, severity distribution, common patterns) and then sample intentionally (errors-only, around-anomaly, stratified). Most investigations end up touching tens of log lines instead of millions.
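Roughly, the flow is: one cheap aggregate pass first, then targeted samples. A minimal sketch of that idea (the record fields and helper names here are illustrative, not the actual IncidentFox code):

```python
# Sketch of "stats first, then sample intentionally" for log triage.
# Assumes each log record is a dict with "severity", "message", "ts".
from collections import Counter
import random

def summarize(logs):
    """Cheap aggregate view before pulling any raw lines."""
    return {
        "total": len(logs),
        "by_severity": Counter(l["severity"] for l in logs),
        "top_messages": Counter(l["message"] for l in logs).most_common(5),
    }

def sample(logs, anomaly_ts=None, window=60, per_bucket=10):
    """Intentional sampling: errors-only, around-anomaly, stratified by severity."""
    errors = [l for l in logs if l["severity"] in ("ERROR", "FATAL")]
    around = [
        l for l in logs
        if anomaly_ts is not None and abs(l["ts"] - anomaly_ts) <= window
    ]
    by_sev = {}
    for l in logs:
        by_sev.setdefault(l["severity"], []).append(l)
    stratified = []
    for bucket in by_sev.values():
        stratified.extend(random.sample(bucket, min(per_bucket, len(bucket))))
    # Dedupe and keep the result small enough to fit a model context.
    picked = {id(l): l for l in errors[:per_bucket] + around[:per_bucket] + stratified}
    return list(picked.values())
```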
For long documents like runbooks or postmortems, flat chunk-based RAG wasn’t working well, so we implemented a RAPTOR-style hierarchical retrieval to preserve higher-level context while still allowing drill-down.
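In broad strokes, a RAPTOR-style index clusters the leaf chunks, summarizes each cluster, and recurses, so a query can match either a high-level summary or a raw chunk. A minimal sketch, where embed(), cluster(), and summarize_texts() are placeholders for whatever embedding model, clustering step, and LLM call you plug in (not the actual stack):

```python
# RAPTOR-style indexing: build levels of increasingly abstract summaries,
# then retrieve over all levels at once ("collapsed tree" retrieval).
import numpy as np

def build_tree(chunks, embed, summarize_texts, cluster, max_levels=3):
    levels = [chunks]                      # level 0 = raw runbook/postmortem chunks
    for _ in range(max_levels):
        current = levels[-1]
        if len(current) <= 2:
            break
        vectors = embed(current)
        groups = cluster(vectors)          # e.g. k-means/GMM over embeddings -> lists of indices
        summaries = [summarize_texts([current[i] for i in idxs]) for idxs in groups]
        levels.append(summaries)           # each level is more abstract than the last
    return levels

def retrieve(query, levels, embed, top_k=5):
    # Search every level at once, so the answer can come from a
    # high-level summary, a raw chunk, or both.
    nodes = [text for level in levels for text in level]
    q = np.asarray(embed([query])[0])
    vecs = np.asarray(embed(nodes))
    scores = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [nodes[i] for i in np.argsort(-scores)[:top_k]]
```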
The open-source core is a tool-based agent runtime with integrations. You can run it locally via CLI (or Slack/GitHub), so it's effectively on-prem on your laptop.
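For a sense of what "tool-based agent runtime" means in practice, here's a toy version: tools register themselves, and a loop lets the model pick a tool, runs it, and feeds the result back. The tool names and dispatch format are hypothetical, not the real interface:

```python
# Toy tool-based agent loop; the registry and action format are made up
# for illustration and do not reflect the actual IncidentFox API.
TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("search_logs")
def search_logs(query: str) -> str:
    return f"(log lines matching {query!r})"

@tool("get_runbook")
def get_runbook(service: str) -> str:
    return f"(runbook for {service})"

def run_agent(ask_model, question, max_steps=5):
    transcript = [question]
    for _ in range(max_steps):
        # ask_model returns either {"tool": ..., "args": {...}} or {"final": "..."}
        action = ask_model(transcript)
        if action.get("final"):
            return action["final"]
        result = TOOLS[action["tool"]](**action["args"])
        transcript.append(result)
    return "no conclusion within step budget"
```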
We’re very early and trying to find our first users / customers. If you’ve been on call before, I’m curious:
- does “AI SRE” feel useful, or mostly hype?
- where would something like this actually help, if at all?
- what would you want it to do before you’d trust it?
If you try it and it’s not useful, that’s still helpful feedback. I’ll be around in the comments!
incidentiq•2w ago
1. "AI SRE" useful or hype? Useful in specific contexts, but the trust barrier is real. Most on-call engineers are skeptical of AI suggestions during incidents because the cost of a wrong recommendation at 3am is high. That said, the pain of digging through logs and finding relevant context is also real.
2. Where it helps: The biggest wins are in "pre-work": surfacing relevant past incidents before you start investigating, correlating alerts that are likely related, and summarizing what changed recently. That reduces the "context gathering" phase, which often eats 30%+ of incident time.
3. Trust requirements: For me to trust it:
- Show confidence levels and your reasoning. "Here's what I found and why" beats "do this."
- Be a copilot that accelerates my investigation, not one that acts on my behalf.
- Get the easy stuff 100% right before attempting the hard stuff. If log correlation is wrong on obvious patterns, I won't trust root cause suggestions.
The RAPTOR approach for runbooks is interesting - the "loss of context in chunked RAG" problem is real for long-form incident docs. How do you handle cases where relevant context spans multiple documents (e.g., a runbook that references an architecture doc)?