Monitoring tools were built on the premise that humans write the code, read the logs, and query the dashboards. Today that is far from the truth: the machine writes the code and the machine debugs it. Yet logging tools have not evolved beyond complex LogQL queries whose documentation takes weeks to understand. On top of that, logging patterns keep changing, even within the same company.
It is impossible to detect anomalies in log patterns using deterministic methods alone. That is why companies like Datadog have stopped at anomaly detection at the metrics layer: metrics are numbers, and numbers you can predict. And you can't feed your entire log firehose to LLMs, because the compute cost blows up.
I developed Rocketgraph, which generates "snapshots" from billions of logs so that your agents can query them and find root causes without hallucinating or burning your engineering budget. First, we fingerprint the logs by masking out PII, then group similar logs together with fuzzy matching over TF-IDF. Then we apply IsolationForest to rank the resulting log patterns by anomaly score. At this point, the logs have been condensed into 100-1000 log patterns, which we call a "snapshot". Here is the interesting part: we just make an LLM call, passing in the service dependency graph, to root-cause over the log patterns, and the result is something like what is shown. The results are highly accurate.
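To make the fingerprinting and grouping steps above a bit more concrete, here is a minimal Python sketch using only the standard library. It is not Rocketgraph's actual code: the masking rules in `MASKS`, the function names, the greedy grouping strategy, and the 0.7 similarity threshold are all assumptions made for illustration. The IsolationForest ranking and the LLM call are left out.

```python
import math
import re
from collections import Counter

# Hypothetical masking rules; the real rule set is not described in the post.
MASKS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def fingerprint(line):
    """Replace variable or sensitive fields with placeholders, then lowercase."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line.lower()


def tfidf_vectors(templates):
    """Build L2-normalised TF-IDF vectors (as dicts) for each template."""
    docs = [Counter(re.findall(r"<\w+>|\w+", t)) for t in templates]
    n = len(docs)
    df = Counter(tok for doc in docs for tok in doc)  # document frequency
    vectors = []
    for doc in docs:
        # Smoothed IDF, so tokens present in every template keep a small weight.
        vec = {tok: tf * (math.log((1 + n) / (1 + df[tok])) + 1.0)
               for tok, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({tok: w / norm for tok, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b[tok] for tok, w in a.items() if tok in b)


def build_snapshot(lines, threshold=0.7):
    """Collapse raw log lines into (pattern, count) pairs in first-seen order."""
    counts = Counter(fingerprint(line) for line in lines)
    templates = list(counts)
    groups = []
    for tpl, vec in zip(templates, tfidf_vectors(templates)):
        # Greedy fuzzy match: join the first group that is similar enough.
        for group in groups:
            if cosine(vec, group["vector"]) >= threshold:
                group["count"] += counts[tpl]
                break
        else:
            groups.append({"pattern": tpl, "vector": vec, "count": counts[tpl]})
    return [(g["pattern"], g["count"]) for g in groups]


if __name__ == "__main__":
    sample = [
        "Payment failed for user alice@example.com from 10.0.0.1 after 3 retries",
        "Payment declined for user bob@example.org from 10.0.0.7 after 5 retries",
        "Connection to db-primary timed out after 30 ms",
    ]
    for pattern, count in build_snapshot(sample):
        print(count, pattern)
```

The greedy pass compares each new template against every existing group, which is fine for a toy example. At the scale of billions of logs you would presumably need something more scalable.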
It can be used to detect anomalous retry loops, unusual call patterns, previously unseen log formats, and similar issues.
Basically, I'm building Datadog, but the user is an AI agent: a monitoring tool whose output is queryable and consumable by an AI agent.
kvaranasi_•48m ago
I would love to hear your thoughts on this.