frontpage.
newsnewestaskshowjobs

Open Source @Github

fp.

Open in hackernews

MirrorCode: What's the largest software project AI can complete on its own?

https://epoch.ai/MirrorCode
3•tadamcz•1h ago

Comments

tadamcz•1h ago
Hi HN! I'm the creator of MirrorCode, a benchmark of long-horizon SWE tasks.

In a MirrorCode task, AI models are tasked with reimplementing an entire program end-to-end, without access to the original source code. AI-generated solutions must match the original program’s stdout and stderr exactly on end-to-end tests. MirrorCode’s 25 target programs span different areas of computing: Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression.

IMO, the key tension in eval design is now: how to make tasks that are difficult for AI, but actually fair? My default expectation in 2026 when I see a coding eval with low scores is that the tasks turn out to be impossible.

In MirrorCode, we leverage reimplementation of existing software as the raw material to make tasks hard. But we also manually select each of the 25 target programs, and carefully collect the end-to-end tests, so that the task is actually achievable (if extremely difficult).

It's very easy for MirrorCode-style tasks to degenerate into absurd reverse-engineering: stuff you'd never do as a human engineer.

e.g. one of our target programs is the brotli compression library. But we explicitly test the decompression path only (and call the task "brotlid")

RFC 7932 only defines decompression: given compressed bytes, there's one correct output. In the other direction, there's no unique mapping.

So the output of a compressor is governed by thousands of internal efficiency heuristics (e.g. how far back to look for repeated strings). Replicating those perfectly from a black box is extreme and pointless reverse-engineering.

An aspect of MirrorCode I’m proud of: it’s the first benchmark I know of to implement a serious “security mindset”. The field fatalistically treats AI cheating as an unwinnable game of whack-a-mole. I think this is completely wrong.

My views: (1) in 2026, cheat-proofing must be a first-class consideration in benchmark design, not an afterthought (2) most benchmarks CAN be made secure, by using known primitives like containers

The problem is that benchmark authors aren’t even trying to be secure against human-level attackers, which AIs are.

Another key differentiator of MirrorCode: AIs are clearly told the scope of the task, rather than having to guess what they'll secretly be tested on.

We clarify task scope by showing AIs a list of test inputs (while also keeping some hidden to prevent AIs from cheating with a lookup table).

If this sounds too easy or like giving them the answer, I assure you, you're thinking about it wrong. e.g. perfectly matching gotree's behaviour does not become easy because I show you 1,899 inputs (!) that your program will be tested on.

Meanwhile, if you don't show test cases and just say "reimplement all of gotree" the task is dominated by an impossible guessing-game: what's actually in scope?

The Nexus format was only loosely specified in its original publication (which we give to AIs). Among other omissions, documentation does not mention that Nexus files may contain comments. Nexus comments are between brackets [], so they may appear in the middle of a line of data, unlike code comments. gotree _generally_ handles them without complaint, but errors with comments in certain locations.

I personally think guessing that comments exist, and all the ways comments must be handled, despite comments not even being mentioned in documentation, is functionally impossible. So the benchmark becomes dominated by the guessing-game, rather than actual implementation.

Human engineers gradually learn the scope of inputs a program should support thanks to external feedback from users of the program, they don't just think them all up in a vacuum before publication. Yet that's what some benchmarks test.

We’re releasing MirrorCode as open-source.

Anthropic Moves Toward Deal with US to Lift Curbs on AI Models

https://www.bloomberg.com/news/articles/2026-06-26/anthropic-moves-toward-deal-with-us-to-lift-cu...
1•mfiguiere•9m ago•0 comments

The action is off-balance sheet

https://marginpoints.substack.com/p/the-action-is-off-balance-sheet
1•historian1066•11m ago•0 comments

The Long-Term Threat to the Memory Chip Boom Is Innovation

https://www.wsj.com/finance/the-long-term-threat-to-the-memory-chip-boom-is-innovation-bb289488
1•bookofjoe•13m ago•1 comments

The open source DOCX editor submitted to HN a few weeks ago has been deleted

1•gcanyon•13m ago•1 comments

AI audio translator with speech-to-text, LLM translation, and text-to-speech

https://github.com/team-telnyx/telnyx-code-examples/tree/main/ai-content-translator-python
2•sona-coffee11•17m ago•0 comments

Show HN: RevealSafe: Buyer and seller privately submit price; reveal together

https://www.revealsafe.com/
1•wenbin•24m ago•0 comments

Show HN: I made a minimalist travel planner that uses wikvoyage data

https://triptip.cat
2•belforn•26m ago•1 comments

Don't Make Gates Optional, Make Them Flexible

https://wakamoleguy.com/p/flexible-gates
1•wakamoleguy•28m ago•0 comments

Trump Threatens 100% Tariffs over Digital Services Tax on U.S. Firms

https://www.cnbc.com/2026/06/26/trump-tariff-trade-tech-tax.html
5•billybuckwheat•29m ago•0 comments

The AI-Run Business Index: measuring execution, not AI adoption

https://www.leapd.ai/resources/state-of-ai-run-businesses-2026
2•Cyrus2050•29m ago•2 comments

Show HN: Even, the terminal-first desktop workspace

https://eventerm.com/
1•todience•30m ago•0 comments

Font-Family Recommendations

https://chrismorgan.info/font-family
2•birdculture•31m ago•0 comments

The Thing We All Obviously Want

https://kmicinski.com/thing-we-all-want
1•matt_d•31m ago•0 comments

Ask HN: Can distributed data centers in individual households provide UBI?

1•SuboptimalEng•33m ago•5 comments

Ask HN: If we could remake Linux in 2026, what would you change?

1•alonsovm44•34m ago•2 comments

Pystd, similar-ish functionality with a fraction of the compile time

https://nibblestew.blogspot.com/2026/06/pystd-standard-library-similar-ish.html
5•ibobev•34m ago•0 comments

The Dottie Number

https://lawrencecpaulson.github.io//2026/06/26/Dottie_Number.html
1•ibobev•40m ago•0 comments

Show HN: Forensic stock analysis from SEC filings, no LLM guessing (free)

https://stockonomy.net/proof
1•SEC_Lense•42m ago•0 comments

Show HN: Deskmate Live – AI Desktop Pet Companions

https://deskmatelive.com/
1•valisvalis•42m ago•0 comments

The Nationwide Backlash Against Cameras Watching Your Car

https://www.wsj.com/us-news/the-nationwide-backlash-against-cameras-watching-your-car-401a656a
6•JumpCrisscross•44m ago•0 comments

SpaceX bonds sell off days after AI and rocket group's $25B debt deal

https://www.ft.com/content/04f98e21-4ce7-43d2-8651-44557e12c31c
3•JumpCrisscross•47m ago•0 comments

President warns of 100% tariff on countries implementing digital services tax

https://www.ft.com/content/5d886d47-c509-44a4-9077-bcd25158b61e
7•JumpCrisscross•48m ago•1 comments

AgentKits – 60 production-ready AI agent blueprints with guardrails

https://www.agent-kits.com
2•stoicstoic•49m ago•0 comments

The National Parks Were Reportedly Told to Stay Silent on Deaths

https://www.outsideonline.com/outdoor-adventure/environment/nps-internal-memo-deaths/?link_source...
18•LostMyLogin•49m ago•1 comments

A C++ implementation of a fast hash map and hash set using hopscotch hashing

https://github.com/Tessil/hopscotch-map
19•gjvc•50m ago•1 comments

Evan's Jujutsu Tutorial

https://evmar.github.io/jjtut/
3•joecobb•50m ago•0 comments

A couple of months ago in Miami, I sat down and dumped my brains

https://ghuntley.com/miami/
1•ghuntley•51m ago•0 comments

After 80 Years, Mathematicians Give Famed 'Erdős Method' an Upgrade

https://www.quantamagazine.org/after-80-years-mathematicians-give-famed-erdos-method-an-upgrade-2...
3•ibobev•53m ago•0 comments

The gap between open weights LLMs and closed source LLMs

https://blog.doubleword.ai/frontier-os-llm
31•kkm•54m ago•14 comments

Primed for Malware: Stop Selling Compromised Android Devices

https://www.eff.org/deeplinks/2026/06/primed-malware-stop-selling-compromised-android-devices
3•hn_acker•54m ago•0 comments