We open-sourced the core of IncidentFox, an AI SRE / on-call agent.
The main thing we’re working on is handling context for incident investigation. Logs, metrics, traces, runbooks, prior incidents — this data is large, fragmented, and doesn’t fit cleanly into an LLM context window.
For logs, we don’t fetch everything. We start with stats (counts, severity distribution, common patterns) and then sample intentionally (errors-only, around-anomaly, stratified). Most investigations end up touching tens of log lines instead of millions.
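To make that concrete, here’s a rough sketch of the stats-then-sample flow. This isn’t our actual code; the record shape, severity labels, and defaults are made up for illustration:

```python
import random
from collections import Counter

# Hypothetical record shape: (timestamp_seconds, severity, message).

def summarize_logs(records):
    """First pass: cheap stats instead of raw logs."""
    severities = Counter(sev for _, sev, _ in records)
    # Crude "pattern" bucketing: first token of the message.
    patterns = Counter(msg.split()[0] for _, _, msg in records if msg.split())
    return {
        "total": len(records),
        "by_severity": dict(severities),
        "top_patterns": patterns.most_common(10),
    }

def sample_errors(records, k=20):
    """Errors-only sampling."""
    errors = [r for r in records if r[1] in ("ERROR", "FATAL")]
    return random.sample(errors, min(k, len(errors)))

def sample_around(records, anomaly_ts, window=30, k=20):
    """Around-anomaly sampling: lines within `window` seconds of a spike."""
    nearby = [r for r in records if abs(r[0] - anomaly_ts) <= window]
    return random.sample(nearby, min(k, len(nearby)))

def sample_stratified(records, per_bucket=5):
    """Stratified sampling: a few lines from every severity bucket."""
    buckets = {}
    for r in records:
        buckets.setdefault(r[1], []).append(r)
    out = []
    for rows in buckets.values():
        out.extend(random.sample(rows, min(per_bucket, len(rows))))
    return out
```

The agent only sees the stats plus a small sample, which is usually enough to decide where to dig next.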
For long documents like runbooks or postmortems, flat chunk-based RAG wasn’t working well, so we implemented RAPTOR-style hierarchical retrieval to preserve higher-level context while still allowing drill-down into specific sections.
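Very roughly, the shape of it is: chunk the document, recursively group and summarize chunks into a tree, then retrieve top-down from summaries to leaves. A minimal sketch, assuming you bring your own summarizer and scoring function (the stubs here are not our real interfaces):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str                      # chunk text or cluster summary
    children: list = field(default_factory=list)

def build_tree(chunks, summarize, group_size=4):
    """Bottom-up: group chunks, summarize each group, repeat until one top layer."""
    layer = [Node(c) for c in chunks]
    while len(layer) > group_size:
        parents = []
        for i in range(0, len(layer), group_size):
            group = layer[i:i + group_size]
            parents.append(Node(summarize([n.text for n in group]), children=group))
        layer = parents
    return layer  # top layer of summary nodes

def retrieve(layer, query, score, max_leaves=8):
    """Best-first traversal: follow the best-scoring summaries down to leaf chunks."""
    frontier, leaves = list(layer), []
    while frontier and len(leaves) < max_leaves:
        frontier.sort(key=lambda n: score(query, n.text), reverse=True)
        node = frontier.pop(0)
        if node.children:
            frontier.extend(node.children)   # drill down into this summary
        else:
            leaves.append(node.text)
    return leaves

# Toy usage; real summarize/score would call an LLM and an embedding model.
chunks = [f"runbook step {i}: ..." for i in range(16)]
tree = build_tree(chunks, summarize=lambda texts: " | ".join(t[:24] for t in texts))
print(retrieve(tree, "step 7", score=lambda q, t: float(q in t)))
```

The upside over flat chunking is that the agent can answer “what is this runbook about” from the top of the tree and only pull specific steps when it needs them.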
The open-source core is a tool-based agent runtime with integrations. You can run it locally via the CLI (or hook it up to Slack / GitHub), which is effectively on-prem on your laptop.
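If “tool-based agent runtime” sounds vague, the general shape of the loop is the usual one below. This is a generic sketch with a stubbed model and fake data, not our actual implementation:

```python
import json

# Hypothetical tool registry; real tools would hit your log/metric backends.
TOOLS = {
    "log_stats": lambda service: {"total": 120_000, "errors": 312},
    "sample_errors": lambda service: ["timeout connecting to db", "..."],
}

def ask_model(history):
    """Stub: a real runtime would send `history` to an LLM and parse a tool call."""
    if len(history) == 1:
        return {"tool": "log_stats", "args": {"service": "checkout"}}
    if len(history) == 3:
        return {"tool": "sample_errors", "args": {"service": "checkout"}}
    return {"answer": "Likely a DB connection-pool issue; see sampled errors."}

def run(question, max_steps=5):
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = ask_model(history)
        if "answer" in step:
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])
        history.append({"role": "tool_call", "content": json.dumps(step)})
        history.append({"role": "tool_result", "content": json.dumps(result)})
    return "Ran out of steps."

print(run("Why is checkout erroring?"))
```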
We’re very early and trying to find our first users / customers. If you’ve been on call before, I’m curious:
- does “AI SRE” feel useful, or mostly hype?
- where would something like this actually help, if at all?
- what would you want it to do before you’d trust it?
If you try it and it’s not useful, that’s still helpful feedback. I’ll be around in the comments!