A powerful move beyond benchmarks — this paper redefines LLM evaluation through realistic, behavior-driven testing.
ankush9812•3mo ago
Nice work
mlop99•3mo ago
Curious whether the behaviour-driven testing could be done by another LLM agent (or a group of agents), with one LLM agent testing another. Could that lead to a self-improving loop?
harshv_03•3mo ago
Interesting
raj_maddipati•3mo ago
Excellent work
jlukecarlson•3mo ago
I appreciate the details shared in this paper, but it'd be great if they open-sourced their implementation!
saurabh_xen•3mo ago