Gemini 3 Pro vs. 2.5 Pro in Pokemon Crystal

https://blog.jcz.dev/gemini-3-pro-vs-25-pro-in-pokemon-crystal
114•alphabetting•4d ago

Comments

jwrallie•1h ago
Having been through the game recently, I am not surprised Goldenrod Underground was a challenge. It is very confusing, and even though I solved it through trial and error, I still don't know what I did. Olivine Lighthouse is the real surprise, as it felt quite obvious to me.
wild_pointer•1h ago
I wonder how much of it is due to the model being familiar with the game or parts of it, whether from training on the game itself or from reading/watching walkthroughs online.
andrepd•1h ago
There was a well-publicised "Claude plays Pokémon" stream where Claude failed to complete Pokémon Blue in spectacular fashion, despite weeks of trying. I think only a very gullible person would assume that future LLMs don't have this specifically baked into their training, as they do for popular benchmarks or for penguins riding a bike.
criley2•1h ago
While it is true that model makers are increasingly trying to game benchmarks, it's also true that benchmark-chasing is lowering model quality. GPT 5, 5.1 and 5.2 have been panned by almost every class of user, despite being benchmark monsters. In fact, the more OpenAI tries to benchmark-max, the worse their models seem to get.
astrange•28m ago
Hm? 5.1 Thinking is much better than 4o or o3. Just don't use the instant model.
oceansky•1h ago
"Crucially, it tells the agent not to rely on its internal training data (which might be hallucinated or refer to a different version of the game) but to ground its knowledge in what it observes. "

Does this even have any effect?

blibble•1h ago
I very much doubt it
tootyskooty•1h ago
I'm wondering about this too. Would be nice to see an ablation here, or at least see some analysis on the reasoning traces.

It definitely doesn't wipe its internal knowledge of Crystal clean (that's not how LLMs work). My guess is that it slightly encourages the model to explore more and second-guess its likely very strong Crystal game knowledge, but that's about it.

Workaccount2•48m ago
The model probably recognizes the need for a grassroots effort to solve the problem, to "show its work".
ragibson•1h ago
Yes, at least to some extent. The author mentions that the base model knows the answer to the switch puzzle but does not execute it properly here.

"It is worth noting that the instruction to "ignore internal knowledge" played a role here. In cases like the shutters puzzle, the model did seem to suppress its training data. I verified this by chatting with the model separately on AI Studio; when asked directly multiple times, it gave the correct solution significantly more often than not. This suggests that the system prompt can indeed mask pre-trained knowledge to facilitate genuine discovery."

hypron•1h ago
My issue with this is that the LLM could just be roleplaying that it doesn't know.
brianwawok•50m ago
To test, you would just need to edit the ROM and switch around the solution. Not sure how complicated that is; it likely depends on the ROM system.
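A minimal sketch of that idea in Python, assuming you already know where the puzzle data lives. The offset and bytes below are made up for illustration, not real Crystal addresses; a disassembly such as pret/pokecrystal would be the place to find the real ones.

    # Patch a hypothetical switch-order table in a Pokemon Crystal ROM.
    PUZZLE_OFFSET = 0x123456      # made-up offset, not the real location
    NEW_ORDER = bytes([2, 0, 1])  # made-up reshuffled solution

    rom = bytearray(open("crystal.gbc", "rb").read())
    rom[PUZZLE_OFFSET:PUZZLE_OFFSET + len(NEW_ORDER)] = NEW_ORDER

    # Recompute the global checksum at 0x014E-0x014F (big-endian 16-bit sum
    # of every byte except the checksum itself). Hardware and most emulators
    # ignore it, but some tooling warns when it is stale.
    checksum = (sum(rom) - rom[0x014E] - rom[0x014F]) & 0xFFFF
    rom[0x014E], rom[0x014F] = checksum >> 8, checksum & 0xFF

    open("crystal_shuffled.gbc", "wb").write(rom)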
Workaccount2•30m ago
I don't know why people still get wrapped around the axle of "training data".

Basically every benchmark worth its salt uses bespoke problems purposely tuned to force the models to reason and generalize. It's the whole point of the ARC-AGI tests.

Unsurprisingly, Gemini 3 Pro performs way better on ARC-AGI than 2.5 Pro, and unsurprisingly it did much better in Pokemon.

The benchmarks, by design, indicate you can mix up the switch puzzle pattern and it will still solve it.

jdiff•46m ago
Of course it is. It's not capable of actually forgetting or suppressing its training data. It's just double-checking rather than assuming, because of the prompt. Roleplaying is exactly what it's doing. At any point, it may stop doing that and spit out an answer solely based on training data.

It's a big part of why search overview summaries are so awful. Many times the answers are not grounded in the material.

baby•45m ago
Do we have examples of this in prompts in other contexts?
mkoubaa•42m ago
It might get things wrong on purpose, but deep down it knows what it's doing
astrange•29m ago
If they trained the model to respond to that, then it can respond to that; otherwise it can't necessarily do so.
oceansky•20m ago
I think you've got a point here. These companies are injecting a lot of new datasets into these models every day.
raincole•15m ago
It will definitely have some effect. Why wouldn't it? Even adding noise into prompts (like saying you will be rewarded $1000 for each correct answer) has some effect.

Whether the 'effect' is something implied by the prompt, or even something we can understand, is a totally different question.

soulofmischief•1h ago
Nice writeup! I need to start blogging about my antics. I rigged up several cutting-edge small local models to an emulator, all in-browser, and unsuccessfully tried to get them to play different Pokémon games. They just weren't as sharp as the frontier models.

This was a good while back but I'm sure a lot of people might find the process and code interesting even if it didn't succeed. Might resurrect that project.

giancarlostoro•1h ago
I have to think they need to know enough of the guides for the game for it to work out. How do they know what's on screen?
soulofmischief•1h ago
In my project I rigged up an in-browser emulator and directly fed captured images of the screen to local multimodal models.

So it just looks right at what's going on, writes a description for refinement, and uses all of that to create and manage goals, write to a scratchpad and submit input. It's minimal scaffolding because I wanted to see what these raw models are capable of. Kind of a benchmark.
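For a flavor of what that minimal scaffolding could look like, here is a rough Python sketch of the loop described above. The emulator and model interfaces (capture_screen, generate, press_button) and the parse_plan helper are hypothetical stand-ins, not a real API.

    # Rough sketch of the observe -> describe -> plan -> act loop.
    def agent_loop(emulator, model, max_steps=1000):
        scratchpad = []                       # notes the model keeps for itself
        goals = ["Leave the starting town"]   # goal list, managed by the model

        for _ in range(max_steps):
            frame = emulator.capture_screen()   # screenshot of the game

            # 1. Describe the screen from the image alone.
            description = model.generate(images=[frame],
                                         prompt="Describe this game screen.")

            # 2. Refine: update goals and notes, pick exactly one button.
            plan = model.generate(prompt=(
                f"Screen: {description}\nGoals: {goals}\n"
                f"Recent notes: {scratchpad[-5:]}\n"
                "Update the goals, add one note, and choose ONE button: "
                "A, B, UP, DOWN, LEFT, RIGHT, START, SELECT."))

            goals, note, button = parse_plan(plan)  # hypothetical parser
            scratchpad.append(note)
            emulator.press_button(button)           # 3. Act on the game.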

bbondo•1h ago
1.88 billion tokens * $12 / 1M tokens (output) suggests a total cost of $22,560 to solve the game with Gemini 3 Pro?
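The arithmetic checks out if you bill all 1.88B tokens at the $12/1M output rate quoted above, which overstates the real bill since input tokens are priced lower. A quick check:

    tokens = 1.88e9           # total tokens reported in the post
    usd_per_m_output = 12.0   # the $12 / 1M output-token rate from the comment
    print(f"${tokens / 1e6 * usd_per_m_output:,.0f}")  # -> $22,560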
brianwawok•51m ago
True, though I bet the $200/month plan could do it, maybe with a few extra days of downtime when the quota was maxed.
mkoubaa•44m ago
I can't believe how massively underpaid I was when I was 11
re-thc•29m ago
Did you hallucinate as a kid?
foundddit•3m ago
At that age, it's called "imagination"
elephanlemon•39m ago
“Gemini 3 Pro was often overloaded, which produced long spans of downtime that 2.5 Pro experienced much less often”

It was unclear to me whether this meant that the API was overloaded or whether he was on a subscription plan and had hit his limit for the moment. Although I think the Gemini plans just use weekly limits, so I guess it must be the API.

ogogmad•27m ago
:/ Damn. That needs to cost 1000x less before people can try it on their own games.
squimmy26•58m ago
How certain can we be that these improvements aren't just a result of Gemini 3 Pro pre-training on endless internet writeups of where 2.5 has struggled (and almost certainly what a human would have done instead)?

In other words, how much of this improvement is true generalization vs memorization?

zurfer•19m ago
You're too kind. Even the CEO of Google retweeted how well Gemini 2.5 did on Pokemon. There is a high chance that it's now explicitly part of the training regime. We'd need a different kind of game to know how well it generalizes.
cg5280•14m ago
I like the inclusion of the graph at the end to compare progress. It would be cool to compare this directly to competing models (Claude, GPT, etc.).
kqr•3m ago
It would unfortunately also need several runs of each to be reliable. There's nothing in TFA to indicate the results shown aren't to a large degree affected by random chance!

(I do think from personal benchmarks that Gemini 3 is better for the reasons stated by the author, but a single run from each is not strong evidence.)
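To put a number on how misleading a single run can be, here is a toy simulation with made-up win probabilities; it isn't modeling the actual benchmark, just illustrating the variance argument.

    import random

    # Suppose the genuinely better model "wins" a single head-to-head run
    # only 65% of the time (made-up number). How often does one run mislead?
    P_BETTER_WINS = 0.65
    TRIALS = 100_000
    wrong = sum(random.random() > P_BETTER_WINS for _ in range(TRIALS))
    print(f"one run crowns the worse model ~{wrong / TRIALS:.0%} of the time")
    # prints roughly 35%; multiple runs per model shrink this considerably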

Go ahead, self-host Postgres

https://pierce.dev/notes/go-ahead-self-host-postgres#user-content-fn-1
110•pavel_lishin•1h ago•56 comments

Log level 'error' should mean that something needs to be fixed

https://utcc.utoronto.ca/~cks/space/blog/programming/ErrorsShouldRequireFixing
63•todsacerdoti•3d ago•23 comments

Gemini 3 Pro vs. 2.5 Pro in Pokemon Crystal

https://blog.jcz.dev/gemini-3-pro-vs-25-pro-in-pokemon-crystal
114•alphabetting•4d ago•33 comments

Immersa: Open-source Web-based 3D Presentation Tool

https://github.com/ertugrulcetin/immersa
67•simonpure•3h ago•11 comments

NTP at NIST Boulder Has Lost Power

https://lists.nanog.org/archives/list/nanog@lists.nanog.org/message/ACADD3NKOG2QRWZ56OSNNG7UIEKKT...
299•lpage•9h ago•137 comments

Pure Silicon Demo Coding: No CPU, No Memory, Just 4k Gates

https://www.a1k0n.net/2025/12/19/tiny-tapeout-demo.html
3•a1k0n•7m ago•0 comments

Skills Officially Comes to Codex

https://developers.openai.com/codex/skills/
143•rochansinha•8h ago•73 comments

Privacy doesn't mean anything anymore, anonymity does

https://servury.com/blog/privacy-is-marketing-anonymity-is-architecture/
237•ybceo•10h ago•164 comments

CSS Grid Lanes

https://webkit.org/blog/17660/introducing-css-grid-lanes/
641•frizlab•18h ago•187 comments

Arduino UNO Q bridges high-performance computing with real-time control

https://www.arduino.cc/product-uno-q/
26•doener•3d ago•12 comments

Reflections on AI at the End of 2025

https://antirez.com/news/157
107•danielfalbo•7h ago•153 comments

Mistral OCR 3

https://mistral.ai/news/mistral-ocr-3
624•pember•2d ago•114 comments

Charles Proxy

https://www.charlesproxy.com/
235•handfuloflight•10h ago•90 comments

What Does a Database for SSDs Look Like?

https://brooker.co.za/blog/2025/12/15/database-for-ssd.html
104•charleshn•6h ago•83 comments

A train-sized tunnel is now carrying electricity under South London

https://www.ianvisits.co.uk/articles/a-train-sized-tunnel-is-now-carrying-electricity-under-south...
75•zeristor•8h ago•64 comments

Raycaster (YC F24) Is Hiring a Research Engineer (NYC, In-Person)

1•levilian•4h ago

Garage – An S3 object store so reliable you can run it outside datacenters

https://garagehq.deuxfleurs.fr/
642•ibobev•1d ago•140 comments

Airbus to migrate critical apps to a sovereign Euro cloud

https://www.theregister.com/2025/12/19/airbus_sovereign_cloud/
329•saubeidl•8h ago•256 comments

New Quantum Antenna Reveals a Hidden Terahertz World

https://www.sciencedaily.com/releases/2025/12/251213032617.htm
86•aacker•4d ago•4 comments

A terminal emulator that runs in your terminal. Powered by Turbo Vision

https://github.com/magiblot/tvterm
97•mariuz•3d ago•10 comments

Maximizing Compression of Apple II Hi-Res Images

http://deater.net/weave/vmwprod/hgr_compress/
3•deater•4d ago•0 comments

A proof of concept of a semistable C++ vector container

https://github.com/joaquintides/semistable_vector
19•joaquintides•4d ago•3 comments

Contrails Map

https://map.contrails.org/
97•schaum•9h ago•39 comments

Hash tables in Go and advantage of self-hosted compilers

https://rushter.com/blog/go-and-hashmaps/
40•f311a•5d ago•28 comments

NOAA deploys new generation of AI-driven global weather models

https://www.noaa.gov/news-release/noaa-deploys-new-generation-of-ai-driven-global-weather-models
126•hnburnsy•2d ago•80 comments

Fuzix on a Raspberry Pi Pico

https://ewpratten.com/blog/fuzix-pi-pico
95•ewpratten•5d ago•8 comments

TP-Link Tapo C200: Hardcoded Keys, Buffer Overflows and Privacy

https://www.evilsocket.net/2025/12/18/TP-Link-Tapo-C200-Hardcoded-Keys-Buffer-Overflows-and-Priva...
318•sibellavia•22h ago•96 comments

8-bit Boléro

https://linusakesson.net/music/bolero/index.php
301•Aissen•1d ago•43 comments

LLM Year in Review

https://karpathy.bearblog.dev/year-in-review-2025/
304•swyx•20h ago•115 comments

Graphite is joining Cursor

https://cursor.com/blog/graphite
250•fosterfriends•1d ago•242 comments