From the report: it ranks every file by “how sus it sounds,” loops over each with curt instructions to “find a bug,” and hands candidates to a judge + ASan checker. Zero-days simply pop out.
That should not work.
But it does.
On miniupnp with a $20 plan, Opus 4.6 reliably rediscovers known CVEs in older versions and even surfaced a new remote global buffer overflow (non-default config).
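For anyone who wants the shape of that loop spelled out, here is a rough sketch of how I read it. This is not the report's code: `run_harness`, `Candidate`, and `keyword_suspicion` are names I made up, the keyword scorer is a toy stand-in for asking the model how suspicious a file sounds, and the bug finder, judge, and ASan check are passed in as plain callables so the skeleton runs without any model or compiler attached.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Candidate:
    """A suspected bug in one file (hypothetical structure)."""
    path: str
    description: str


def keyword_suspicion(path: str) -> int:
    """Toy stand-in for the model's "how sus does this sound" ranking.

    Counts how many risky-sounding hints appear in the file name.
    """
    hints = ("parse", "packet", "http", "copy")
    return sum(hint in path.lower() for hint in hints)


def run_harness(
    files: Iterable[str],
    suspicion: Callable[[str], float],
    find_bug: Callable[[str], Optional[Candidate]],
    judge: Callable[[Candidate], bool],
    asan_confirms: Callable[[Candidate], bool],
) -> List[Candidate]:
    """Rank files, look for one bug per file, keep only confirmed findings."""
    confirmed: List[Candidate] = []
    # Most suspicious files first.
    for path in sorted(files, key=suspicion, reverse=True):
        candidate = find_bug(path)  # assumed: an LLM call told to "find a bug"
        if candidate is None:
            continue
        # A finding survives only if the judge accepts it AND the
        # ASan-based check reproduces it.
        if judge(candidate) and asan_confirms(candidate):
            confirmed.append(candidate)
    return confirmed
```

In a real setup, `find_bug` and `judge` would wrap model calls, and `asan_confirms` would build the target with AddressSanitizer and run a reproducer. Those parts are left out here because I do not know how the report wires them up.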
So what happens if the harness is actually good—i.e. equipped with proper security tooling?
I’m a student, not a security engineer, so I’d love ideas or critiques on my planned tool roadmap. (If you have a $200 plan with extra usage lying around, try it out and see if it churns out a zero-day in your own C code.)