So far in this benchmark we based the tasks on a couple of open-source projects (like curl, jq, GNU Coreutils).
Even on those "simple" projects we managed to make the tasks difficult - Claude Opus 4.1 was the only model to correctly cross-compile curl for arm64 and link it statically [1].
In the future we'd like to test the models on projects like FFmpeg or Chromium - those should be much more difficult.
Now if it could fix React Native builds after package upgrades I'd be impressed...
The authors of the newer blog posts appear to scan forums like this one for objections ("AI" does not work for legacy code bases) and then create custom "benchmarks" for their salespeople to point to when they encounter these objections.
I found that "just" so funny, given how far the goalposts have moved over the last few years (as TFA does mention). I am personally certain that it would have taken me significantly longer than that to do it myself.