We had traces, evals, and Langfuse dashboards. Everything looked fine, but we kept finding failures we should have caught earlier.
The pattern kept repeating:
- ship an improvement
- it works for a while
- hit an edge case that breaks it
- don't notice until we've lost good candidates
That's when we realized the problem wasn't just our recruitment pipeline: almost every AI product has blind spots that evals miss.
So we built Verse, a tool that surfaces issues directly from real AI interactions, whether that's candidates talking to your recruitment pipeline, users interacting with your agent, or any other AI system making decisions.
Instead of relying solely on evals, we cluster conversations, identify the key ones to review, and flag the ones that show failure patterns. We use OpenTelemetry for trace ingestion, so it's compatible with Langfuse, LangSmith, Braintrust, and other AI observability tools, and you can add it right alongside your existing setup.
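To make "alongside your existing setup" concrete: because ingestion is plain OTLP, you can attach a second span processor to the tracer provider you already have and fan the same traces out to both backends. Here's a minimal sketch in Python; the endpoints and header names (including the Verse URL) are placeholders for illustration, not real config.

```python
# Minimal sketch: export the same spans to an existing backend and to Verse.
# Requires: opentelemetry-sdk, opentelemetry-exporter-otlp-proto-http
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# Your existing observability backend (Langfuse, LangSmith, Braintrust, ...).
# Placeholder endpoint/credentials; keep whatever your setup already uses.
existing_exporter = OTLPSpanExporter(
    endpoint="https://your-existing-backend.example/v1/traces",
    headers={"authorization": "Bearer EXISTING_BACKEND_KEY"},
)
provider.add_span_processor(BatchSpanProcessor(existing_exporter))

# Second processor pointing at Verse (hypothetical endpoint for illustration).
verse_exporter = OTLPSpanExporter(
    endpoint="https://ingest.verse.example/v1/traces",
    headers={"authorization": "Bearer VERSE_API_KEY"},
)
provider.add_span_processor(BatchSpanProcessor(verse_exporter))

trace.set_tracer_provider(provider)

# Instrument as usual; every span now reaches both backends.
tracer = trace.get_tracer("recruitment-pipeline")
with tracer.start_as_current_span("screening-conversation") as span:
    span.set_attribute("candidate.stage", "phone_screen")
```

Nothing about your existing instrumentation changes; the second `BatchSpanProcessor` just duplicates the export path.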
I'm posting this because I'm curious whether other teams are hitting the same wall. If you want, I'm happy to audit your AI implementation for free and show you where things commonly break, even if you never use Verse.
Happy to answer any technical questions.