I kept running into the same issue with coding agents.

A test run fails, you get a huge wall of output, and most of the effort goes into figuring out what actually went wrong.

In many cases, the failures are not independent. It’s the same issue repeated across many tests.

In one case: 128 failures → 2 root causes

I built a small CLI that groups repeated failures into shared root causes before passing the result to the model.
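To make the grouping idea concrete, here is a minimal sketch in Python. It is not the CLI's actual implementation. It assumes failures arrive as `(test_name, error_message)` pairs and that messages differing only in numbers, hex addresses, or quoted values share a root cause. The names `normalize` and `group_failures` are made up for illustration.

```python
import re
from collections import defaultdict


def normalize(message: str) -> str:
    """Collapse details that tend to vary between tests hitting the same problem.

    Assumption: numbers, hex addresses, and quoted values are noise,
    and the remaining text identifies the root cause.
    """
    msg = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    msg = re.sub(r"\d+", "<n>", msg)
    msg = re.sub(r"'[^']*'|\"[^\"]*\"", "<str>", msg)
    return msg.strip()


def group_failures(failures):
    """Group (test_name, message) pairs by normalized message, largest group first."""
    groups = defaultdict(list)
    for test_name, message in failures:
        groups[normalize(message)].append(test_name)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical failures, invented for this example.
    failures = [
        ("test_create_user", "ConnectionError: could not connect to db at 10.0.0.5:5432"),
        ("test_delete_user", "ConnectionError: could not connect to db at 10.0.0.5:5432"),
        ("test_list_orders", "ConnectionError: could not connect to db at 10.0.0.7:5432"),
        ("test_parse_date", "ValueError: invalid date '2024-13-01'"),
    ]
    for signature, tests in group_failures(failures):
        print(f"{len(tests)} failing test(s): {signature}")
        for name in tests:
            print(f"  - {name}")
```

The model then sees one entry per signature with the affected tests listed under it, instead of the same traceback repeated for every test. A real tool would likely need smarter signatures (stack frames, exception types), but the shape is the same.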

It’s mainly built for coding agents, but works on raw CLI output as well.

On my backend tests, this reduced debugging time and token usage quite a bit.
