Our ability to measure AI has been outpaced by our ability to develop it, and we believe this evaluation gap is one of the most important problems in AI. Open benchmarks are among the most powerful levers for advancing AI safely and responsibly, but the academic and open-source teams driving them often hit resource constraints, especially in the face of the rapidly expanding complexity of what tomorrow's benchmarks need to cover.
We think the next wave of benchmarks needs to push on three axes:

- Environment complexity - How realistic is the operating environment?
- Autonomy horizon - How far can an agent operate independently?
- Output complexity - How sophisticated is the work product?
Happy to answer questions about the grants and the framework, and I'd love to hear more about what you're building!