+-------------+----------+-----------+-------+-------+-------+
| Task        | A22B-Ins | A22B      | K2    | Opus4 | Deeps |
+-------------+----------+-----------+-------+-------+-------+
| GPQA        | *77.5    | 62.9      | +75.1 | -74.9 | 68.4  |
| AIME25      | *70.3    | 24.7      | +49.5 | 33.9  | -46.6 |
| LiveCB_v6   | *51.8    | 32.9      | +48.9 | 44.6  | -45.2 |
| ArenaHard2  | *79.2    | -52.0     | +66.1 | 51.5  | 45.6  |
| BFCL_v3     | *70.9    | +68.0     | -65.2 | 60.1  | 64.7  |
+-------------+----------+-----------+-------+-------+-------+
* = 1st
+ = 2nd
- = 3rd
homarp•2h ago
and later they will release the thinking model
on selected benchmarks, it beats kimi