Benchmarking open-weight models for security research

https://dualuse.dev/posts/benchmarking-open-models-for-security-research
1•lebovic•1h ago

Comments

lebovic•1h ago
GLM 5.1 is surprisingly capable. Anecdotally, I couldn't notice a difference until ~120K tokens.

Qwen 3.6 35B A3B also exceeded my expectations. It performs well, even though the previous generation couldn't use the testing harness at all.

(TBD on Kimi K2.6; the eval is still running.)