CompileBench: Can AI Compile 22-year-old Code?

https://quesma.com/blog/introducing-compilebench/

148•jakozaur•4mo ago

Comments

stared•4mo ago

Curious about the ultimate benchmark - can AI compile Doom on an arbitrary device?

flenserboy•4mo ago

that, & how well does it cope with Perl?

johnisgood•4mo ago

Claude is good enough at Perl with lots of hand-holding and reiterations, in my experience.

piotrgrabowski•4mo ago

Author here.

So far in this benchmark we based the tasks on a couple of open-source projects (like curl, jq, GNU Coreutils).

Even on those "simple" projects we managed to make the tasks difficult - Claude Opus 4.1 was the only one to correctly cross-compile curl for arm64 (+ make it statically-linked) [1].

In the future we'd like to test it with projects like FFmpeg or chromium - those should be much more difficult.

[1] https://www.compilebench.com/curl-ssl-arm64-static/
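For readers wondering what "cross-compiled for arm64 and statically linked" can mean in checkable terms, here is a minimal sketch that inspects an ELF binary directly. It is not the benchmark's actual verification, which is not described in this thread. The function name `inspect_elf` is hypothetical, and treating "no PT_INTERP program header" as "static" is a simplifying assumption.

```python
import struct

EM_AARCH64 = 183  # e_machine value for 64-bit ARM in the ELF specification
PT_INTERP = 3     # program header type that names the dynamic loader


def inspect_elf(data: bytes) -> dict:
    """Report whether an ELF image targets arm64 and requests no dynamic loader."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    # Fixed offsets in the ELF64 header: e_machine at 18, e_phoff at 32,
    # e_phentsize at 54 and e_phnum at 56.
    (e_machine,) = struct.unpack_from("<H", data, 18)
    (e_phoff,) = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    has_interp = False
    for i in range(e_phnum):
        # p_type is the first 4-byte field of every program header entry.
        (p_type,) = struct.unpack_from("<I", data, e_phoff + i * e_phentsize)
        if p_type == PT_INTERP:
            has_interp = True
    # Assumption: a binary without PT_INTERP is considered statically linked.
    return {"arm64": e_machine == EM_AARCH64, "static": not has_interp}
```

To try it on a real build, read the file with `open(path, "rb").read()` and pass the bytes to `inspect_elf`.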

OtherShrezzing•4mo ago

For the _reviving 20 year old code_ type tasks, are the tested outcomes things we'd expect to be in the public domain? For example, in the way the 'SWEBenchVerified' tests are poisoned tests, because the LLMs are able to look up bug fixes in the project git repository.

criemen•4mo ago

> because the LLMs are able to look up bug fixes in the project git repository

That's not the (only) problem: Even if you take the internet away, we know/assume that all LLMs are heavily trained on public GitHub repositories. Therefore, they know/remember details of the code and organization in a way they can't for your private (or new, past knowledge cut-off date) code.

jcranmer•4mo ago

A long time ago, I did a project where I downloaded a year's worth of nightly builds for Thunderbird so that I could collect nightly code coverage information. Over the course of doing so, I discovered that there was one dependency (pango, I think?) such that no version could support the entire year's worth of source--the newer version didn't work with the older builds, and the older version didn't work with the newer builds.

Come to think of it, in terms of trying to get old code building, the CVS days of Firefox should be interesting... because the first command in that build step is "download the source code" and that CVS server isn't running anymore. And some of the components are downloaded from a CVS tag rather than trunk, and the converted CVS repositories I'm aware of all only converted the trunk and none of the branches or tags.

fuhsnn•4mo ago

You didn't make the tasks difficult, you made them easier.

The entire coreutils is reduced to one utility (sha1sum) and the test doesn't even try to feed a real file to it (just a stdin string)[0]. The same goes for the jq task: there isn't even a JSON file fed to it, and what's being verified[1] is barely a calculator.

These projects ship with "make check", please tell the AI to use it.

[0] https://github.com/QuesmaOrg/CompileBench/blob/86d9aeda88a16...

[1] https://github.com/QuesmaOrg/CompileBench/blob/86d9aeda88a16...
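As an illustration of the "feed it a real file" point, a stricter check could look roughly like the sketch below. It writes a payload to disk, runs the freshly built binary on that file, and compares the result with an independent digest from Python's `hashlib`. The helper names and the `sha1sum_path` parameter are hypothetical and are not taken from the CompileBench repository.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path


def reference_sha1(path: Path) -> str:
    """Hash the file with hashlib as an independent reference."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


def parse_sha1sum_line(line: str) -> str:
    """Extract the hex digest from a sha1sum output line ("<digest>  <name>")."""
    return line.split()[0].lower()


def check_binary(sha1sum_path: str, payload: bytes) -> bool:
    """Run the built binary on a real temporary file and compare digests."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "payload.bin"
        target.write_bytes(payload)
        result = subprocess.run(
            [sha1sum_path, str(target)],
            capture_output=True,
            text=True,
            check=True,  # a crashing binary should fail the check loudly
        )
        return parse_sha1sum_line(result.stdout) == reference_sha1(target)
```

Calling `check_binary` with the path of the compiled `sha1sum` and a few kilobytes of arbitrary bytes would exercise file I/O as well as the hash itself. Running the project's own "make check" remains the broader test.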

nl•4mo ago

This is a really good benchmark. So much time is spent on these messy types of tasks and no one really likes doing it.

Now if it could fix React Native builds after package upgrades I'd be impressed...

bgwalter•4mo ago

LGTM! I'm sure it comes with a correctness proof, too!

The newer blog posts appear to scan forums like this one for objections ("AI" does not work for legacy code bases) and then create custom "benchmarks" for their sales people to point to if they encounter these objections.

falcor84•4mo ago

> Our toughest challenges include cross-compiling to Windows or ARM64 and resurrecting 22-year-old source code from 2003 on modern systems. Some agents needed 135 commands and 15 minutes just to produce a single working binary.

I found that "just" there to be so funny in terms of how far the goal posts moved over these last few years (as TFA does mention). I personally am certain that it would have taken me significantly longer than that to do it myself.

ACCount37•4mo ago

15 minutes?

And here's me, after 4 straight days of wrangling an obscure cross-compilation toolchain to resurrect some ill-fated piece of software from year 2011 in a modern embedded environment.

qazxcvbnmlp•4mo ago

Letting an agent figure out how to compile old projects is magical. What used to be multiple days of slog is now “compile this, make changes and download tools as needed” with 10 mins of git review to make sure it didn’t do anything stupid.

landl0rd•4mo ago

Given the amount of time I've spent wrestling toolchain unpleasantness, particularly for old or embedded systems, I will happily go take a fifteen-minute coffee break while the bot does it for me.

Of course, I will probably do this with OpenAI's option, not $20 of Anthropic API credits.

Palomides•4mo ago

man I dunno, I was expecting some magic but the tasks seem to boil down to untar, configure with some flags, make install

it does seem the machine is faster than me since I would have to spend a minute to copy each of the --disable-whatever flags for curl

it's somewhat cool to see a computer can do the same half-assed process I do of seeing what linker failures happen and googling for the missing lib flag

Philpax•4mo ago

Excellent benchmark. May I suggest an extension: "port any pre-uv Python ML codebase to uv so that it can actually be reliably reproduced"?

stared•4mo ago

Here the tricky part is writing tests that confirm it works correctly.

I did this upgrade a few times, and it works like a charm for simple stuff (e.g. removing requirements.txt and adding a proper pyproject.toml).

Even in Claude Code, it takes some prompting and a CLAUDE.md to make it run uv consistently; otherwise it sometimes runs `uv run python`, other times `python3 -m`, and is then surprised that some dependency is not available.

sujee_dev•4mo ago

I am doing this now. What are your instructions in CLAUDE.md? thx

stared•4mo ago

They are usually long and project-dependent, so I won't share them here.

But just a copy-paste of a piece of a project of mine

        ## Core Principles (IMPORTANT)
        - **NO FALLBACKS EVER** - Fail fast, fail hard. If something is missing, crash immediately
        - **NO SILENT DEFAULTS** - Never use default values when files/data is missing
        - **NO TRY/EXCEPT WRAPPING** - Let exceptions propagate for easier debugging
        - **PROPER TYPES ONLY** - Use Literal types for enums, Pydantic models for structures. No tuples for structured data
        - **NO JAVA-STYLE ABSTRACTIONS** - Don't create pointless constants like STATUS_OK = "OK". Just use literals or proper types
        - **PROPER PARSING** - No regex for structured formats, use real parsers
        - **CRASH EARLY** - Things should crash as soon and as hard as possible for easy debugging
        - **NO DEFENSIVE PROGRAMMING** - Don't worry about "backwards compatibility" during refactoring - just redesign things to be better.

        ## Development

        **IMPORTANT: Always use `uv run` for all Python commands. Never use plain `python` or `python3`.**

        ```bash
        # Format code
        uv run ruff format .

        # Check linting
        uv run ruff check .

        # Type checking
        uv run ty check

        # Run tests
        uv run pytest tests/
        ```

VMG•4mo ago

This is hilarious, I have almost exactly the same prompts. Why are the LLMs so afraid of breaking changes and propagating exceptions?

groby_b•4mo ago

Skip the fiddling with prompts, just sandbox so that these commands are not run (via permissions.deny)

In dire cases, use the PreToolUse hook to inspect/intercept (though it's usually not necessary).

Granted, I haven't tried it for huge projects yet, but after doing that my small-medium sized projects all got ported nicely.

(If you must change the prompt, mention PEP723 as well, it seems to have the same effect as showing a shiny trinket to a magpie ;)
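A PreToolUse hook of the kind mentioned above could be sketched roughly as follows. This assumes the hook receives the pending tool call as JSON with `tool_name` and `tool_input.command` fields and that exit code 2 blocks the call; check the current Claude Code hooks documentation before relying on either. The names `should_block` and `decide` and the regex are illustrative.

```python
import re
import sys

# Commands that bypass uv; this pattern is an illustrative assumption.
BARE_PYTHON = re.compile(r"^\s*(python3?|pip3?)\b")


def should_block(command: str) -> bool:
    """True when a shell command calls python/pip directly instead of via uv."""
    return bool(BARE_PYTHON.match(command))


def decide(event: dict) -> int:
    """Map a hook event to an exit code: 0 allows the call, 2 blocks it."""
    if event.get("tool_name") != "Bash":
        return 0  # only shell commands are inspected
    command = event.get("tool_input", {}).get("command", "")
    if should_block(command):
        # The message goes to stderr so the agent can see why it was refused.
        print("Use `uv run ...` instead of bare python/pip.", file=sys.stderr)
        return 2  # assumed convention: exit code 2 blocks the tool call
    return 0
```

To use it as an actual hook script, the file would additionally import `json` and end with `sys.exit(decide(json.load(sys.stdin)))`. A plain deny rule in the permissions settings is the simpler route when a fixed list of commands is enough.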

buildbot•4mo ago

I’ve been doing this a lot! AI seems to really excel at setting up compiler boilerplate/minor modifications for new arch. I made a simple cpu information utility work on HP PA-RISC and Sparc64 :)

sehugg•4mo ago

I have tried to get Claude to compile arbitrary C++ projects with Emscripten, and its track record is about as good as mine.

jclay•4mo ago

the libs in the bench don’t really have any external deps. will be much more interesting to see the results with ffmpeg, Qt, etc. The original source releases from any repo here would also be great candidates: https://github.com/id-software

shallichange•4mo ago

I hadn’t thought of that use case. Say for example you find 1990’s Clipper code and want to give it a try on a modern Linux. Thanks

mercurialuser•4mo ago

Use the Harbour compiler and run it under Windows, Linux, Mac and other less-used OSes.

jeffbee•4mo ago

If you asked me to do this I would want clarification on "cross-compile", "arm64" and "statically".

gregsadetsky•4mo ago

I recently downloaded the source code for Chocolate Doom [0], and even though a ton of human labor has been put into making it cross-platform and easy to build (and that work definitely deserves to be commended!), I still couldn't build it immediately on my M1 MacBook.

Asking Claude Code to build it - literally prompting it "fix whatever needs to be fixed until you get the binary to run" - and waiting ~20 minutes was the best investment of non-time I could make... It definitely felt magical. Claude would tweak headers, `make` it, try to run it, and apply more fixes based on the errors it got back.

Now that I think of it, I regret not opening an issue/PR with its findings...!

(((I then went on to make more vibe-changes to the Doom code and made a video out of those which went semi-viral, which I will now unashamedly plug [1])))

[0] https://github.com/chocolate-doom/chocolate-doom

[1] https://www.youtube.com/watch?v=LcnBXtttF28

bgwalter•4mo ago

So essentially, you are redundant now and celebrate it.

gregsadetsky•4mo ago

I celebrate that I did not have to spend cycles dealing with a non-interesting, non-intellectually-challenging issue aka figuring out the incantations to make a build system happy.

I'm also celebrating (although I forgot to do this - my bad!) that this automated discovery (i.e. of how to fix the build system for machines such as mine) could have been brought back to the Chocolate Doom community, and made the software better for everyone.

And finally, I'm also celebrating that this allowed my (if I may speak so boldly) creativity to express itself by helping me quickly bring a funny idea to life and share it, hopefully entertaining the world/making at least one person laugh/chuckle.

I don't see how any of this makes me redundant though. Efficient? Lazy? Both? Neither? But not redundant. I think! :-)

behringer•4mo ago

Be aware that if you don't understand every change then your contributions may not be welcome or helpful, depending on the project and situation.

varispeed•4mo ago

It's like saying chisel made carpenter redundant. AI still needs an operator and then more people to actually make the output production ready.

bgwalter•4mo ago

People here are claiming that "AI" emits fully working products, so with that reading they are not just a tool.

Also, you would own a chisel and the chisel does not spy on you. The "AI" factories are owned by oligopolies and you have to pay a steep monthly fee in order to continue receiving your warez that are derivative works of actually creative people's IP. Also, the "AI" factories know everything you do and ask and what kind of code you write.

bckr•4mo ago

For now. Open Source AI continues to make progress

a456463•4mo ago

You are correct and I agree with you. HN monoculture of AI fanbois won't understand this

bongodongobob•4mo ago

Of all the forums I frequent, hackernews is probably the most dismissive of AI, which I would not have guessed.

simonw•4mo ago

I think you're in the wrong thread. This isn't about AI emitting "fully working products", this is about AI brute-force figuring out how to compile stuff with gnarly constraints, a task which very few software developers look forward to.

Plus, as other commenters have pointed out already, you can run this stuff entirely free from risk of an AI company spying on what you are doing. The models that run locally got really good in the past 12 months, and if they don't work on your own machine you can rent a capable cloud GPU machine for a few bucks an hour.

ForOldHack•4mo ago

There are also people telling you the earth is flat, and that 30 years of experience can be compressed into a 4-minute YouTube video. Even if a chisel could spy on me, it becomes dull with use, whereas AI may become sharper with use; it still cannot distinguish which idiot is operating it. AI is just for people to learn prompting, which is an art, like Google searching. It still cannot fathom "taste" or a large host of other nuances that, again, only come with experience and enculturation.

ForOldHack•4mo ago

You are the master of understatement. I just spent 5+ hours getting an emulator to just work. Going back and forth with the AI required me to be very cognizant of the direction I was going. After it finally worked, the cleanup was huge: at least 15 broken images and hundreds of scratch files.

bongodongobob•4mo ago

Unless you are selling a service to compile things for people I'm not sure who is being made redundant here.

jstummbillig•4mo ago

Naturally, the primary source of purpose in life: Making Chocolate Doom compile.

warkdarrior•4mo ago

If only philosophers of the last 2500 years had known this...

a456463•4mo ago

Precisely

solsane•4mo ago

I’ve always thought that most devs would be elated by the idea of automatio^n!

magicalist•4mo ago

I mean this in the nicest possible way because you were just messing around on a fun thing, but...

I feel like there's a real metaphor here. 86+ people did work over two decades to maintain a cross-platform codebase and that "definitely deserves to be commended", but what "definitely felt magical" was Claude bumbling through header tweaks from compilation errors until the project compiled. And in the end what has AI wrought? A viral video but not anything to give back to the original project. Really there are multiple layers here :)

hbs18•4mo ago

To be fair, the topic is AI, so of course that's what he's focusing on

turtlebits•4mo ago

1M devs could have worked on it. You can neither fight bit rot nor predict the future.

The point was to get it running, not solve world peace. Without AI, the problem might not have been tackled at all.

pabs3•4mo ago

It builds fine on Linux arm64, what changes did you need to make?

https://buildd.debian.org/status/package.php?p=chocolate-doo...

camel-cdr•4mo ago

Ok, so I tried to build chocolate doom as well (on Debian WSL):

$ git clone --depth=1 https://github.com/chocolate-doom/chocolate-doom

$ cd c*doom; ls

Ok, there is a CMakeLists.txt, so it's probably a cmake project, so:

$ cmake .

Ok, that seems to work, but three libraries are missing, SDL2_Mixer, SDL2_Net and FluidSynth, so let's install them:

$ sudo apt install libsdl2-mixer-dev libsdl2-net-dev libfluidsynth-dev

Let's try again:

$ cmake .

Works, so now for compiling:

$ cmake --build . -j $(nproc)

Build completed in a few seconds first try.

gregsadetsky•4mo ago

I’m on macOS so sometimes things aren’t as easy :) I’ll give it another try.

fuhsnn•4mo ago

For C projects, the task should be passing the full test suite with at least address-sanitizer enabled. Amusing how some would discourage fellow humans from using a programming language because of its unsafeness or undefined behavior, yet AI doing unaudited source modification on the same language is encouraged.

peatmoss•4mo ago

Though this is more "LLM uses a variety of open source tools and compilers to compile source," I do wonder about whether there will eventually be a role for transformers in compiling code.

I've mentioned this before, but "sufficiently smart compiler" would be the dream here. Start with high level code or pseudo code, end up with something optimized.

calebkaiser•4mo ago

There's been a decent chunk of research in this direction over the years. Michael O'Boyle is pretty active as a researcher in the space, if you're looking for stuff to read: https://www.dcs.ed.ac.uk/home/mob/

peatmoss•4mo ago

Thank you! I'll take a read.

saltcured•4mo ago

I might start to accept this LLM stuff when it can directly compile programs, i.e. not spit out a compiler command but take in source code and output linked object code in an executable format via token inference. And have it be correct.

Then, I'd start to trust in its ability to manage context and reliably work through complex tasks.

groby_b•4mo ago

"I'll try these newfangled steel constructions if they actually forge each rebar on site".

You are saying that you'd trust the new and unproven technology more if it didn't rely on old and proven technology and instead reinvented everything from scratch. That's a somewhat illogical take.

saltcured•4mo ago

Of course I was being facetious.

But by "correct", I meant that it would need to be able to work through such multi-level tasks as a compiler with semantic analysis, error checking, optimization, and code generation to reliably transcribe the source code. Not just emit lorem ipsum executables.

groby_b•4mo ago

Why? Most engineers can't do that either. That's the whole point of having tools, that you don't need to stand there and mumble about "first principles".

It's about as useful as requiring your engineers to forge computers from sand on upwards.

saltcured•4mo ago

Because I'm interested in acquiring more advanced but reliable tools, not simulations of error-prone humans.

bigfishrunning•4mo ago

The problem is verifying the output's correctness -- you could easily end up with insidious issues like this classic: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...

myhf•4mo ago

OmniGraffle user spotted

anonzzzies•4mo ago

22 years is not that old. we run and maintain saas sites older than that.

olivia-banks•4mo ago

I think it definitely depends on the software. GNU GCC coreutils/binutils are a nightmare to build if you're trying to build them on a system more than 15 years their junior.

anonzzzies•4mo ago

I do on ultrasparcs (5-10 and e450). But 25 year old perl/php saas is a lot easier.

viraptor•4mo ago

An extreme test case to add would be compiling quake.js from scratch which requires a specific old version of emscripten and llvm.
