Insights into Claude Opus 4.5 from Pokémon

https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-into-claude-opus-4-5-from-pokemon
30•surprisetalk•5d ago

Comments

falcor84•1h ago
The idea of Claude having "anterograde amnesia" and the top-rated comment there by Noosphere89 really resonated with me:

  "I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way."

  This is an extremely underrated comparison, TBH. Indeed, I'd argue that frozen weights + lack of a long-term memory are easily one of the biggest reasons why LLMs are much more impressive than useful at a lot of tasks (with reliability being another big, independent issue).

  It emphasizes 2 things that are both true at once: LLMs do in fact reason like humans and can have (poor-quality) world-models, and there's no fundamental chasm between LLM capabilities and human capabilities that can't be cured by unlimited resources/time, and yet just as humans with anterograde amnesia are usually much less employable/useful to others than people who do have long-term memory, current AIs are much, much less employable/useful than future paradigm AIs.
skybrian•1h ago
For a coding agent, the project "learns" as you improve its onboarding docs (AGENTS.md), code, and tests. If you assume you're going to start a new conversation for each task and the LLM is a temp that's going to start from scratch, you'll have a better time.
kaashif•55m ago
Yeah but it feels terrible. I put as much as I can into Claude skills and CLAUDE.md but the fact that this is something I even have to think about makes me sad. The discrete points where the context gets compacted really feel bad and not like how I think AGI or whatever should work.

Just continuously learn and have a super duper massive memory. Maybe I just need a bazillion GPUs to myself to get that.

But no one wants to manage context all the time; it's incidental complexity.

falcor84•46m ago
I agree with essentially everything you said, except for the final claim that managing context is incidental complexity. From what I know of cognitive science, I would argue that context management is a central facet of intelligence, and a lot of the success of humans in society depends on their ability to do it. Looking at it from the other side, executive function disorders such as ADHD pose significant challenges for many humans, and they seem to be not quite entirely unlike these context issues that Claude faces.
falcor84•54m ago
But that's the thing: Claude Plays Pokemon is an experiment in having Claude work fully independently, so there's no "you" who would improve its onboarding docs or anything else, it has to do so on its own. And as long as it cannot do so reliably, it effectively has anterograde amnesia.

And just to be clear, I'm mentioning this because I think that Claude Plays Pokemon is a playground for any agentic AI doing any sort of long-term independent work; I believe that the solution needed here is going to bring us closer to a fully independent agent in coding and other domains. It reminds me of the codeclash.ai benchmark, where similar issues are seen across multiple "rounds" of an AI working on the same codebase.

oceansky•27m ago
I wonder if there's someone at Anthropic working to fine-tune the model's Pokemon-playing ability specifically.

Maybe not but it sure would be funny.

dbish•24m ago
If I recall correctly, a prior interview about Claude Plays Pokemon stated that they purposely chose Pokemon as a use case that was not meant to be trained or fine-tuned on. That's what makes it an interesting problem, so hopefully they aren't.
oceansky•19m ago
I believe the testing itself is done in very good faith.

But I believe the team at Anthropic looks for popular use cases like this one to improve their datasets. Same for every other big player in the LLM game.

martinald•19m ago
This actually matches my experience quite well. I often use vision for two main things in Claude Code:

1) give it text data from something that is annoying to copy and paste (eg labels off a chart or logs from a terrible web UI that doesn't make it easy to copy and paste).

2) give it screenshots of bugs, especially UI glitches.

It's extremely good at 1), can't remember when it got it wrong.

On 2) it _really_ struggled until Opus 4.5, almost comically so: I would post a screenshot and a description of the UI bug, and it would tell me "great it looks perfect! What next?"

With Opus 4.5 it's not quite as laughably bad, but it still often misses very obvious problems.

It's very interesting to see the rapid progression on these benchmarks, as it's probably a very good proxy for "agentic vision".

I've come to the conclusion that browser use without vision (e.g. based on the DOM or accessibility trees) is a dead end, simply because "modern" websites tend to use a comical number of tokens to render. So if this gets very good (close to human level/speed), then we have basically solved agents being able to browse any website/GUI effectively.
