One of the biggest problems we had there was figuring out what went wrong with bad searches. We had tons of searches per day, but less than 1% of users gave any explicit feedback. So we were either manually digging through searches or making general system improvements and hoping they helped.
This problem gets harder with agents. Traces are longer and more complex. It takes more effort to review them, so I'm building a tool that lets you analyze LLM outputs directly to help developers of LLM apps and agents understand where things are breaking and why.
I've put together a demo using browser-use agent traces (gpt-5): https://trails-red.vercel.app/viewer
It's early, but I have lots of ideas - live querying of past failures for currently-running agents, preference models to expand sparse signal data.
Would love feedback on the demo. Also if you're building agents and have 10k+ traces per day that you're not looking at but would like to, I'd love to talk.