The task appears to stress spatial reasoning: Gemini 3 models lead this benchmark by a decent margin.
Counterintuitively, “more reasoning” often reduces accuracy.
Even the top-performing model correctly scores only ~36% of darts.