I wrote an essay arguing that common AI benchmarks are not terribly useful, and that we should mostly rely on ordinary user experience instead.
Key reasons:
1) Most questions do not have answers that are simply ‘right’ or ‘wrong’
2) Most user problems are poorly defined
3) Agents are getting popular, and they combine both of these problems
philecho•2h ago