The best solve rate was 50%. In the other 50%, some fixes look coherent and pass all regression tests, yet the vulnerability is still present.
The main differentiator I found between models is cost: gpt-5.5 is 12× more expensive than gpt-5.4-mini while producing statistically similar results. Within-family performance gaps are small, which suggests the difference is likely due to model training data. I also did a power analysis, and the task count needed to detect a meaningful within-family edge comes out at ~700.
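For a rough sense of how a task count like that can come about, here is a minimal sketch of a standard two-proportion sample-size calculation. It is not necessarily the method used in the write-up: the function name `tasks_needed`, the 5% significance level, the 80% power, and the example solve rates are all assumptions chosen for illustration, and it treats the two models' results as independent samples even though a benchmark usually runs both models on the same tasks.

```python
from math import ceil, sqrt
from statistics import NormalDist


def tasks_needed(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Tasks per model for a two-sided two-proportion z-test to detect p1 vs p2.

    Hypothetical helper: alpha and power are conventional defaults,
    not values taken from the write-up.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ, otherwise there is nothing to detect")

    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power

    p_bar = (p1 + p2) / 2                        # pooled rate under the null hypothesis
    null_term = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))

    return ceil((null_term + alt_term) ** 2 / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Assumed baseline of 50% (the best solve rate above) and illustrative gaps.
    for gap in (0.05, 0.075, 0.10):
        print(f"gap of {gap:.3f}: {tasks_needed(0.50, 0.50 + gap)} tasks per model")
```

The required count grows roughly with the inverse square of the gap, so halving the edge you want to detect roughly quadruples the number of tasks. A paired analysis on shared tasks (for example McNemar's test) would likely give a different number.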
Full write-up: https://giovannigatti.github.io/cve-bench
KyleTheDev•1h ago
The goal isn't to write an informative blog post describing what you learned, but to generate slop and expect other folks to read it.
I really wish people would stop doing this. I love reading about your side projects and all of the cool things you're doing. But, it just feels insulting to open up something that's so obviously completely AI generated. If you aren't willing to write it in your own voice, why would it be worth reading?
sdsdffsddfs•44m ago
I believe that's what we need to do here. People have some interesting information to share, but they don't care about penmanship and that's not just being lazy. It takes a lot of time to produce a nice post. I cannot guarantee the author used an LLM but there sure is a suspicious amount of em-dashes.
Anyway, there are still some interesting data points, so I'd recommend running the website through an LLM to get a nice summary if the prominent TL;DR is too short for you. Times are a-changing.
KyleTheDev•32m ago
For work communications, I agree with you. There's an inherent accountability there. If you send me AI slop, and something goes terribly wrong, you'll be held accountable for the slop. Here, the slop is just noise that prevents us from finding the truly interesting posts.