I built raglogs, a CLI tool that tries to answer a simple question during incidents:
"What actually happened?"
Instead of searching logs manually or sending thousands of lines to an LLM, raglogs analyzes a bounded time window of logs and produces a short explanation backed by evidence.
Example:
raglogs explain --since 30m
Output looks like:
Incident summary
Services affected: billing-worker, api
Primary issue: Stripe signature verification failures
Likely trigger: deploy of billing-worker v2.4.1
Secondary effects: checkout 500 errors, webhook retries
Evidence:
- 184 similar errors in billing-worker
- first occurrence 2 minutes after deploy
- same endpoint in 96% of failures
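The evidence stats above are simple aggregates over the clustered errors. A minimal sketch of how they could be derived, using hypothetical record tuples (not raglogs' actual data model):

```python
from datetime import datetime, timedelta
from collections import Counter

# Hypothetical clustered error records: (timestamp, service, endpoint).
deploy_time = datetime(2024, 1, 1, 12, 0)
errors = [
    (deploy_time + timedelta(minutes=2 + i), "billing-worker", "/webhooks/stripe")
    for i in range(184)
]

count = len(errors)
first_offset = min(ts for ts, _, _ in errors) - deploy_time
top_endpoint_share = Counter(ep for _, _, ep in errors).most_common(1)[0][1] / count

print(f"{count} similar errors")
print(f"first occurrence {int(first_offset.total_seconds() // 60)} minutes after deploy")
print(f"same endpoint in {top_endpoint_share:.0%} of failures")
```

Each claim in the summary stays traceable to a computation like this rather than to model output.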
Key idea: don't send raw logs to an LLM.
Pipeline:
logs
→ normalize messages (strip UUIDs/IPs/etc)
→ fingerprint similar messages
→ cluster by fingerprint
→ compare against a baseline window
→ detect triggers (deploys, restarts)
→ assemble an evidence packet
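The normalize → fingerprint → cluster → baseline steps can be sketched in a few lines. This is an illustration under assumed regexes and a hypothetical spike threshold, not raglogs' actual implementation:

```python
import re
import hashlib
from collections import Counter

UUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
IP = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
NUM = re.compile(r"\b\d+\b")

def normalize(msg: str) -> str:
    """Strip variable tokens so similar messages collapse to one template."""
    msg = UUID.sub("<uuid>", msg)
    msg = IP.sub("<ip>", msg)
    return NUM.sub("<n>", msg)

def fingerprint(msg: str) -> str:
    """Stable short ID for a normalized template."""
    return hashlib.sha1(normalize(msg).encode()).hexdigest()[:12]

def cluster(lines):
    """Count occurrences per fingerprint."""
    return Counter(fingerprint(line) for line in lines)

def anomalies(current: Counter, baseline: Counter, factor: float = 3.0):
    """Fingerprints whose count grew well beyond the baseline window."""
    return {fp: n for fp, n in current.items()
            if n > factor * baseline.get(fp, 0)}

window = [
    "signature check failed for request 192.168.0.7",
    "signature check failed for request 10.0.0.3",
    "db connection ok",
]
baseline_window = ["db connection ok"]
spikes = anomalies(cluster(window), cluster(baseline_window))
```

Here the two signature failures collapse to one fingerprint that is absent from the baseline, so it surfaces as a spike, while the steady "db connection ok" line does not.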
The explanation can be rendered deterministically or optionally polished by an LLM. The LLM never sees raw logs.
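Deterministic rendering means the evidence packet goes through a fixed template: same packet in, same text out. A sketch with a hypothetical packet shape (field names assumed, not taken from raglogs):

```python
# Hypothetical evidence packet: the only thing an optional LLM would see.
packet = {
    "services": ["billing-worker", "api"],
    "primary_issue": "Stripe signature verification failures",
    "trigger": "deploy of billing-worker v2.4.1",
    "evidence": [
        "184 similar errors in billing-worker",
        "first occurrence 2 minutes after deploy",
    ],
}

def render(p: dict) -> str:
    """Fixed template: no model involved, output is reproducible."""
    lines = [
        "Incident summary",
        f"Services affected: {', '.join(p['services'])}",
        f"Primary issue: {p['primary_issue']}",
        f"Likely trigger: {p['trigger']}",
        "Evidence:",
        *[f"- {e}" for e in p["evidence"]],
    ]
    return "\n".join(lines)

print(render(packet))
```

An LLM, if enabled, would only rephrase this already-structured summary; it never touches the raw log lines.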
araujo88•2h ago
Main commands:
There's also a small demo dataset so you can run:
Repo: https://github.com/leo-aa88/raglogs
I'm especially curious about feedback from people who deal with production incidents or SRE work.