So far, the tasks in this benchmark are based on a handful of open-source projects (curl, jq, GNU Coreutils).
Even with those "simple" projects we managed to make the tasks difficult: Claude Opus 4.1 was the only model to correctly cross-compile curl for arm64 as a statically linked binary [1].
In the future we'd like to test the models on projects like FFmpeg or Chromium, which should be much more difficult.
Come to think of it, when it comes to getting old code to build, the CVS days of Firefox should be interesting, because the first command in that build step is "download the source code" and that CVS server isn't running anymore. On top of that, some of the components are downloaded from a CVS tag rather than trunk, and the converted CVS repositories I'm aware of only converted the trunk, not any of the branches or tags.
Now if it could fix React Native builds after package upgrades I'd be impressed...
The newer blog posts appear to scan forums like this one for objections (e.g. "'AI' does not work for legacy code bases") and then create custom "benchmarks" for their salespeople to point to when they encounter these objections.
I found that "just" there so funny, given how far the goalposts have moved over these last few years (as TFA does mention). Personally, I'm certain it would have taken me significantly longer than that to do it myself.
And here's me, after 4 straight days of wrangling an obscure cross-compilation toolchain to resurrect some ill-fated piece of software from 2011 in a modern embedded environment.
stared•2h ago
flenserboy•1h ago
johnisgood•1h ago