So, for these certifications to actually hold weight and stay relevant, the benchmarks need to be truly living and adaptive. Think dynamic difficulty: if an agent solves scenario S1, then S1 itself (or the next scenario, S2) should automatically get harder based on that result. To adapt in anything like real time, the benchmarks themselves might need to be AI-generated, or hey, maybe just "vibe coded" by AI, as long as they stay fully adaptive and keep evolving case by case to really push what these agents can do.
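
To make the dynamic-difficulty idea a bit more concrete, here's a rough Python sketch. Everything in it is made up for illustration: the `Scenario` class, the 1 to 10 difficulty scale, and the `next_scenario` helper are assumptions, not part of any real benchmark. It also eases off after a failure, which is my own addition to the idea. A real system would presumably generate the harder scenario with a model rather than just bumping a number.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    difficulty: int  # hypothetical scale: 1 = easiest, 10 = hardest


def next_scenario(current: Scenario, solved: bool,
                  step: int = 1, max_difficulty: int = 10) -> Scenario:
    """Return the follow-up scenario, harder after a success, easier after a failure."""
    if solved:
        # Agent solved it, so raise the difficulty (capped at the top of the scale).
        level = min(current.difficulty + step, max_difficulty)
    else:
        # Agent failed, so ease off (never below the easiest level).
        level = max(current.difficulty - step, 1)
    return Scenario(name=f"{current.name}-next", difficulty=level)


s1 = Scenario(name="S1", difficulty=3)
s2 = next_scenario(s1, solved=True)
print(s2.name, s2.difficulty)
```

The point is only that S2 is derived from how the agent did on S1, instead of being picked from a fixed list ahead of time.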
deathspirate•9mo ago