I’m aware that all of these measures have limitations, and that many are controversial or imperfect by design. I’m not assuming they’re “good” or that they map cleanly onto real-world capability.
I’d love to hear:
- What measures, benchmarks, or methodologies you think belong on this list
- What you see as their key strengths and failure modes
- How (or whether) you personally use them to interpret AI progress
My goal here is discovery and understanding, not to defend or attack any particular framework.