GLM 5.1 is surprisingly capable. Anecdotally, I couldn't notice a difference until ~120K tokens.
Qwen 3.6 35B A3B also exceeded my expectations. It's surprisingly performant, given that the previous generation couldn't even use the testing harness.
lebovic•1h ago
(TBD on Kimi K2.6; the eval is still running.)