MirrorCode: What's the largest software project AI can complete on its own?

3•tadamcz•1h ago

Comments

tadamcz•1h ago

Hi HN! I'm the creator of MirrorCode, a benchmark of long-horizon SWE tasks.

In a MirrorCode task, AI models are tasked with reimplementing an entire program end-to-end, without access to the original source code. AI-generated solutions must match the original program’s stdout and stderr exactly on end-to-end tests. MirrorCode’s 25 target programs span different areas of computing: Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression.
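To make the "match stdout and stderr exactly" criterion concrete, here is a minimal sketch of that kind of check. It is not MirrorCode's actual harness: the helper names `run` and `outputs_match` and the toy commands are assumptions made up for illustration.

```python
import subprocess
import sys


def run(cmd, stdin_data=b""):
    """Run a command and return its (stdout, stderr) as raw bytes."""
    proc = subprocess.run(cmd, input=stdin_data, capture_output=True)
    return proc.stdout, proc.stderr


def outputs_match(reference_cmd, candidate_cmd, stdin_data=b""):
    """True only if both streams are byte-for-byte identical."""
    return run(reference_cmd, stdin_data) == run(candidate_cmd, stdin_data)


# Toy stand-ins for an "original" and two "reimplementations" (hypothetical).
ref = [sys.executable, "-c",
       "import sys; print('ok'); print('warn', file=sys.stderr)"]
same = [sys.executable, "-c",
        "import sys; sys.stdout.write('ok\\n'); sys.stderr.write('warn\\n')"]
diff = [sys.executable, "-c", "print('ok')"]  # forgets the stderr message

matches_same = outputs_match(ref, same)
matches_diff = outputs_match(ref, diff)
```

Here `same` is written differently from `ref` but passes, while `diff` fails only because its stderr is empty. Comparing bytes rather than decoded text means whitespace and encoding differences count as failures too.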

IMO, the key tension in eval design is now: how to make tasks that are difficult for AI, but actually fair? My default expectation in 2026 when I see a coding eval with low scores is that the tasks turn out to be impossible.

In MirrorCode, we leverage reimplementation of existing software as the raw material to make tasks hard. But we also manually select each of the 25 target programs, and carefully collect the end-to-end tests, so that the task is actually achievable (if extremely difficult).

It's very easy for MirrorCode-style tasks to degenerate into absurd reverse-engineering: stuff you'd never do as a human engineer.

E.g. one of our target programs is the brotli compression library. But we explicitly test the decompression path only (and call the task "brotlid").

RFC 7932 only defines decompression: given compressed bytes, there's one correct output. In the other direction, there's no unique mapping.

So the output of a compressor is governed by thousands of internal efficiency heuristics (e.g. how far back to look for repeated strings). Replicating those perfectly from a black box is extreme and pointless reverse-engineering.
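A small sketch of the asymmetry, using Python's standard-library `zlib` as a stand-in (brotli is not in the standard library, and the sample data is made up). The same input compressed at two effort levels gives different bytes, yet both decompress to the identical original.

```python
import zlib

# Arbitrary repetitive sample input (an assumption for illustration).
data = b"the quick brown fox jumps over the lazy dog. " * 200

# Two valid compressed encodings of the same input, at different effort levels.
fast = zlib.compress(data, 1)
best = zlib.compress(data, 9)

# The compressed bytes differ...
different_bytes = fast != best
# ...but decompression has a single correct answer.
same_roundtrip = zlib.decompress(fast) == zlib.decompress(best) == data
```

A byte-exact test on compressor output would pin a reimplementation to one particular set of internal choices, whereas a byte-exact test on decompressor output only pins it to the format.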

An aspect of MirrorCode I’m proud of: it’s the first benchmark I know of to implement a serious “security mindset”. The field fatalistically treats AI cheating as an unwinnable game of whack-a-mole. I think this is completely wrong.

My views: (1) in 2026, cheat-proofing must be a first-class consideration in benchmark design, not an afterthought; (2) most benchmarks CAN be made secure, by using known primitives like containers.

The problem is that benchmark authors aren’t even trying to be secure against human-level attackers, which AIs are.

Another key differentiator of MirrorCode: AIs are clearly told the scope of the task, rather than having to guess what they'll secretly be tested on.

We clarify task scope by showing AIs a list of test inputs (while also keeping some hidden to prevent AIs from cheating with a lookup table).
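One way such a visible/hidden split could be done is sketched below. This is an assumption, not a description of how MirrorCode actually selects hidden tests: the hash-based rule, the 20% fraction, and the names `is_hidden` and `split_tests` are all hypothetical.

```python
import hashlib


def is_hidden(test_id, hidden_percent=20):
    """Deterministically assign a test to the hidden set from its ID alone."""
    digest = hashlib.sha256(test_id.encode("utf-8")).digest()
    # Map the first digest byte (0..255) onto 0..99 and compare to the cutoff.
    return digest[0] * 100 // 256 < hidden_percent


def split_tests(test_ids, hidden_percent=20):
    """Return (visible, hidden) lists; every test lands in exactly one."""
    visible, hidden = [], []
    for test_id in test_ids:
        target = hidden if is_hidden(test_id, hidden_percent) else visible
        target.append(test_id)
    return visible, hidden


# Hypothetical test IDs.
test_ids = [f"case-{i:04d}" for i in range(1000)]
visible, hidden = split_tests(test_ids)
```

The visible inputs tell the model what is in scope. A solution that merely memorizes visible input/output pairs still fails on the hidden ones, because those exercise the same behaviour on inputs it never saw.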

If this sounds too easy or like giving them the answer, I assure you, you're thinking about it wrong. e.g. perfectly matching gotree's behaviour does not become easy because I show you 1,899 inputs (!) that your program will be tested on.

Meanwhile, if you don't show test cases and just say "reimplement all of gotree" the task is dominated by an impossible guessing-game: what's actually in scope?

Take the Nexus format: it was only loosely specified in its original publication (which we give to AIs). Among other omissions, the documentation does not mention that Nexus files may contain comments. Nexus comments are between brackets [], so they may appear in the middle of a line of data, unlike code comments. gotree _generally_ handles them without complaint, but errors with comments in certain locations.
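To show why mid-line comments matter to a parser, here is a deliberately naive sketch that strips bracketed comments from a line of data. It is not gotree's behaviour: it never errors, it treats every bracket pair as a comment, it assumes comments may nest, and it ignores quoted strings. All of those are simplifying assumptions, and `strip_bracket_comments` is a hypothetical name.

```python
def strip_bracket_comments(text):
    """Remove [...] spans, tracking nesting depth (nesting is an assumption)."""
    out = []
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1          # entering a comment
        elif ch == "]" and depth > 0:
            depth -= 1          # leaving a comment
        elif depth == 0:
            out.append(ch)      # keep only text outside comments
    return "".join(out)


# A made-up data line with a comment in the middle of the characters.
line = "taxon_A 0101[a comment]1100"
cleaned = strip_bracket_comments(line)
```

Even this toy version has to decide several things the documentation is silent on, such as what to do with an unmatched bracket. An implementer who was never told that comments exist could not make those decisions at all.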

I personally think guessing that comments exist, and all the ways comments must be handled, despite comments not even being mentioned in documentation, is functionally impossible. So the benchmark becomes dominated by the guessing-game, rather than actual implementation.

Human engineers gradually learn the scope of inputs a program should support thanks to external feedback from users of the program. They don't just think them all up in a vacuum before publication. Yet that's what some benchmarks test.

We’re releasing MirrorCode as open-source.
