However, having an LLM fully replicate the spec purely from memory—without referencing existing code—is still a significant challenge. It requires the underlying model to have strong anti-hallucination capabilities and solid long-term planning to keep from going astray. Because of this, building an NES emulator makes for an excellent LLM stress test.
Here is how the emulator was built:
Data Gathering: I asked Codex to download the necessary developer manuals and test suites. It was strictly prohibited from searching for reference implementations online.
Development: I instructed Codex to build the emulator until all test suites passed. This process was mostly hands-free; I only chimed in to encourage it to continue when it paused.
First Draft: After just 4-5 prompts, Codex delivered a functional, pure-Python emulator—though it ran at a sluggish 7 FPS.
Optimization: Asking Codex to optimize the app completely on its own didn't work this time. Instead, I had it generate a flamegraph, which identified the PPU update as the bottleneck. I then instructed Codex to rewrite the PPU in Cython without breaking the passing tests.
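The profiling step above can be sketched roughly as follows. This is not the code Codex produced, and it swaps the flamegraph for the standard-library `cProfile`, which answers the same question (which function eats the time) without extra tooling. The functions `cpu_step`, `ppu_step`, and `run_frame` are hypothetical stand-ins for an emulator's main loop; the only hardware facts assumed are the NTSC timings of 262 scanlines per frame and 341 PPU dots per scanline.

```python
import cProfile
import pstats

SCANLINES_PER_FRAME = 262  # NTSC NES
DOTS_PER_SCANLINE = 341    # PPU dots per scanline


def cpu_step():
    """Hypothetical stand-in: pretend the CPU ran one scanline's worth of work."""
    return DOTS_PER_SCANLINE


def ppu_step(dots):
    """Hypothetical stand-in: per-dot work in pure Python, the slow part."""
    checksum = 0
    for dot in range(dots):
        checksum = (checksum * 31 + dot) & 0xFFFF
    return checksum


def run_frame():
    """Advance the toy emulator by one frame, scanline by scanline."""
    for _ in range(SCANLINES_PER_FRAME):
        dots = cpu_step()
        ppu_step(dots)


def hottest_function(func):
    """Profile func() and return the name of the function with the most self time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    key, _ = max(stats.stats.items(), key=lambda item: item[1][2])
    return key[2]


if __name__ == "__main__":
    print("Hottest function:", hottest_function(run_frame))
```

In this toy loop the per-dot Python loop in `ppu_step` dominates, which mirrors the shape of the problem described: a hot inner loop that is a natural candidate for a Cython rewrite, with the existing test suites acting as the regression guard.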
Overall, I'm incredibly impressed by Codex. I already knew it was capable of the task, but the speed was astonishing. It finished the project in under an hour, using only 2% of my weekly Pro quota.
While the NES might be a relatively easy system to emulate, I think emulation could serve as a fantastic benchmark for testing future LLMs.