"A million token context" Big AI says. But the model is accurate for 2-4K tokens

https://unagent.eu/2025/04/22/misleading-promises-of-long-context-llm/
2•kzawpl•8mo ago

Comments

kzawpl•8mo ago
Over the last two years there have been claims of better long-context capabilities for LLMs, but those are often tested on exact text search. A new benchmark called NoLiMa shows that the long-context capability of LLMs is still poor if you want the model to perform some abstraction and reasoning.
vessenes•8mo ago
Meh. NoLiMa is helpful, in that it shows what we all "feel" working with models -- there's a marked dropoff in accuracy and intelligence as we get past 4-32k of context, depending on the model.

But it seems unreasonable to be super worried about this -- a year or two ago, models couldn't easily find needles in haystacks of long context. As training and test strategies delivered trainable content, this became a thing that could be done perfectly across millions of tokens of context. There has not yet been a good way to incentivize models to do anything more than remember locations.

We are (mostly) paying the full costs of attending to the entire context in current architectures, and it seems pretty reasonable that we will therefore be able to train those architectures to more fully attend across context if we get the right training data into (ideally) an RL loop.

NoLiMa is an okay test, but I think the most recent OpenAI tests are significantly better and quite interesting; OpenAI-MRCR and Graphwalks are both super smart ideas about how to programmatically generate data that is easy to evaluate and forces better cross-context attention.

From their 4.1 announcement: Graphwalks fills the context window with a directed graph composed of hexadecimal hashes, and then asks the model to perform a breadth-first search (BFS) starting from a random node in the graph. We then ask it to return all nodes at a certain depth.
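To make the Graphwalks idea concrete, here is a rough Python sketch of how such a task and its ground truth could be generated. It is not OpenAI's actual generator: `make_graph`, `nodes_at_depth`, the 8-character node names, and the fixed out-degree are all assumptions for illustration, and "at a certain depth" is read here as "shortest distance from the start node is exactly that depth".

```python
import random
from collections import deque


def make_graph(num_nodes, out_degree, seed=0):
    """Build a random directed graph whose node names are 8-char hex strings.

    Hypothetical helper: the name length and fixed out-degree are
    illustrative choices, not taken from the real benchmark.
    """
    rng = random.Random(seed)
    names = set()
    while len(names) < num_nodes:  # retry until all names are unique
        names.add(f"{rng.getrandbits(32):08x}")
    nodes = sorted(names)  # sorted so the result is reproducible per seed
    return {node: rng.sample(nodes, out_degree) for node in nodes}


def nodes_at_depth(graph, start, depth):
    """Return, via BFS, the nodes whose shortest distance from start == depth."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # no need to expand past the requested depth
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return sorted(n for n, d in dist.items() if d == depth)


if __name__ == "__main__":
    graph = make_graph(num_nodes=50, out_degree=3, seed=42)
    start = sorted(graph)[0]
    # The edge list would be pasted into the prompt; this is the answer key.
    print(start, nodes_at_depth(graph, start, depth=2))
```

The appeal is that the answer key is cheap to compute exactly, while the model has to follow edges scattered all over the context rather than locate one string.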

MRCR asks for direct quotes at semantically identified locations in the text. For example, poems about tapirs, bears and ballerinas, as well as stories about tapirs, bears and ballerinas, are generated, perhaps fifty each, and the system is asked "give me the third poem about tapirs". This requires counting, conceptual attention, and also distinguishing between stories and poems.
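A toy version of that setup can be sketched in a few lines of Python. This is only an illustration of the "n-th item of a given kind and topic" idea, not the real OpenAI-MRCR format: `build_items`, `nth_match`, `is_correct`, the placeholder texts, and exact-match scoring are all assumptions made for the example.

```python
import random

KINDS = ["poem", "story"]
TOPICS = ["tapirs", "bears", "ballerinas"]


def build_items(per_combo, seed=0):
    """Create a shuffled sequence of placeholder poems and stories.

    Hypothetical helper: a real test would use generated prose; here each
    text is a short unique tag so the answer is unambiguous.
    """
    rng = random.Random(seed)
    labels = [(k, t) for k in KINDS for t in TOPICS for _ in range(per_combo)]
    rng.shuffle(labels)
    return [
        {"kind": k, "topic": t, "text": f"[{k} about {t}, id={i}]"}
        for i, (k, t) in enumerate(labels)
    ]


def nth_match(items, kind, topic, n):
    """Ground truth: text of the n-th (1-based) item with this kind and topic."""
    matches = [it for it in items if it["kind"] == kind and it["topic"] == topic]
    return matches[n - 1]["text"]


def is_correct(model_answer, expected):
    """Exact-match scoring after trimming whitespace (an assumed metric)."""
    return model_answer.strip() == expected


if __name__ == "__main__":
    items = build_items(per_combo=50, seed=7)
    context = "\n".join(it["text"] for it in items)  # would fill the prompt
    expected = nth_match(items, "poem", "tapirs", 3)
    print(len(items), expected)
```

Getting the answer right means telling poems from stories, filtering by topic, and counting occurrences in order, which is exactly what a plain string search cannot do.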

They only test their own models on MRCR for the benchmark graph, but it's still worth reviewing: the accuracy curves are super interesting. https://openai.com/index/gpt-4-1/
