Interesting approach. I've been particularly interested in tracking whether adding skills or tweaking prompts makes things better or worse.
Anyone know of other similar tools that let you track this across harnesses while coding?
Running evals as a solo dev is too cost-prohibitive, I think.
evantahler•11m ago
I feel like asking the thing that you are measuring, and don’t trust, to measure itself might not produce the best measurements.
john_strinlai•4m ago
"we investigated ourselves and found nothing wrong"
aleksiy123•15m ago