Anthropic's Mythos (10T parameters) scanned Mozilla Firefox and found 271 security issues, 3 of which became published CVEs. We wanted to see what a 9B model could find on the same codebase.
We built Roasty, a multi-agent hostile code review engine in Shipit running Qwen 3.5 9B on a single RTX 3090. Instead of one model doing everything, specialized reviewers each hunt a different vulnerability class. The scan is 43% complete (196/455 chunks) with 142 LLM findings and 235 from our static rules engine so far.
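To make the shape of that pipeline concrete, here is a minimal sketch. It is not Roasty's actual code. Every name in it (`Finding`, `chunk_lines`, `STATIC_RULES`, `static_scan`, `review_chunk`, `progress`) is hypothetical, the prompt wording is invented, and `ask_model` stands in for whatever call reaches the local model. It only illustrates splitting the source into chunks, running one reviewer pass per vulnerability class, and collecting static-rule hits separately.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    source: str       # "static" or "llm:<vulnerability class>"
    vuln_class: str
    chunk_id: int
    note: str


def chunk_lines(text: str, size: int) -> list[str]:
    """Split source text into chunks of at most `size` lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


# Hypothetical static rule: flag calls to classic unbounded C string functions.
STATIC_RULES = {
    "unsafe-copy": re.compile(r"\b(strcpy|sprintf|gets)\s*\("),
}


def static_scan(chunk_id: int, chunk: str) -> list[Finding]:
    """Run every static rule over one chunk and record each match."""
    findings = []
    for name, pattern in STATIC_RULES.items():
        for match in pattern.finditer(chunk):
            findings.append(Finding("static", name, chunk_id, match.group(0)))
    return findings


def review_chunk(
    chunk_id: int,
    chunk: str,
    vuln_classes: list[str],
    ask_model: Callable[[str], list[str]],
) -> list[Finding]:
    """One reviewer pass per vulnerability class; `ask_model` is a stand-in
    for the real LLM call and returns a list of free-text notes."""
    findings = []
    for vuln_class in vuln_classes:
        prompt = f"Act as a hostile reviewer. Report only {vuln_class} issues.\n\n{chunk}"
        for note in ask_model(prompt):
            findings.append(Finding(f"llm:{vuln_class}", vuln_class, chunk_id, note))
    return findings


def progress(done: int, total: int) -> float:
    """Percentage of chunks scanned so far."""
    return 100 * done / total
```

With the numbers above, `progress(196, 455)` comes out just over 43. Keeping static and LLM findings tagged by source is what lets the two counts (142 and 235) be reported separately.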
We deliberately chose our smallest model to maximize the size gap. If a 9B model with solid architecture behind it can match or outperform a 10T model, the argument that you need frontier-scale models for security auditing falls apart.
Early results and methodology are at the link. We will publish final verified stats when the scan completes (~May 3). We will publish stats only; the findings themselves will go through Mozilla under responsible disclosure.
apolloraines•1h ago