"A million token context" Big AI says. But the model is accurate for 2-4K tokens

https://unagent.eu/2025/04/22/misleading-promises-of-long-context-llm/
2•kzawpl•7mo ago

Comments

kzawpl•7mo ago
Over the last two years there have been claims of better long-context capabilities for LLMs, but these are often tested on exact text search. A new benchmark called NoLiMa shows that the long-context capability of LLMs is still poor if you want the model to perform some abstraction and reasoning.
vessenes•7mo ago
Meh. NoLiMa is helpful, in that it shows what we all "feel" working with models -- there's a marked dropoff in accuracy and intelligence as we get past 4-32k of context, depending on the model.

But it seems unreasonable to be super worried about this -- a year or two ago, models couldn't easily find needles in haystacks of long context. As training and test strategies delivered trainable content, this became a thing that could be done perfectly across millions of tokens of context. There has not yet been a good way to incentivize models to do anything more than remember locations.

We are (mostly) paying the full costs of attending to the entire context in current architectures, and it seems pretty reasonable that we will therefore be able to train those architectures to more fully attend across context if we get the right training data into (ideally) an RL loop.

NoLiMa is an okay test, but I think the most recent OpenAI tests are significantly better and quite interesting; OpenAI-MRCR and Graphwalks are both super smart ideas about how to programmatically generate data that is easy to evaluate and forces better cross-context attention.

From their 4.1 announcement: Graphwalks fills the context window with a directed graph composed of hexadecimal hashes, and then asks the model to perform a breadth-first search (BFS) starting from a random node in the graph. We then ask it to return all nodes at a certain depth.
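To make the "easy to generate, easy to evaluate" point concrete, here is a rough Python sketch of a Graphwalks-style task. It is not OpenAI's generator: the function names, the `src -> dst` edge format, the question wording, and the reading of "depth" as shortest-path distance from the start node are all assumptions made for illustration.

```python
import random
from collections import deque


def make_graph(num_nodes, num_edges, seed=0):
    """Build a random directed graph whose nodes are 8-digit hex strings.

    Hypothetical helper; requires num_nodes >= 2 and
    num_edges <= num_nodes * (num_nodes - 1).
    """
    rng = random.Random(seed)
    # Sampling without replacement guarantees unique node names.
    nodes = [f"{value:08x}" for value in rng.sample(range(16**8), num_nodes)]
    edges = set()
    while len(edges) < num_edges:
        src, dst = rng.sample(nodes, 2)  # two distinct nodes, so no self-loops
        edges.add((src, dst))
    return nodes, sorted(edges)


def nodes_at_depth(edges, start, depth):
    """Return the nodes whose BFS distance from start is exactly depth."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # no need to expand past the requested depth
        for nxt in adjacency.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return sorted(n for n, d in dist.items() if d == depth)


def build_prompt(edges, start, depth):
    """Render the graph and the question as plain text (assumed format)."""
    lines = [f"{src} -> {dst}" for src, dst in edges]
    question = (
        f"Starting from node {start}, run a breadth-first search "
        f"and list all nodes at depth {depth}."
    )
    return "\n".join(lines) + "\n\n" + question
```

The ground truth comes from `nodes_at_depth`, so scoring a model's answer only needs a set comparison, and the graph can be scaled up until the prompt fills whatever context window is being tested.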

MRCR asks for direct quotes at semantically identified locations in the text. For example, poems about tapirs, bears and ballerinas, as well as stories about the same subjects, are generated, perhaps fifty of each. The system is then asked "give me the third poem about tapirs". This requires counting, conceptual attention, and also distinguishing between stories and poems.
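A toy Python sketch of that setup, under loud assumptions: the placeholder strings stand in for generated poems and stories, the question wording is invented, and exact-match scoring is a simplification rather than the metric OpenAI reports. All names here are hypothetical.

```python
import random

KINDS = ["poem", "story"]
TOPICS = ["tapirs", "bears", "ballerinas"]
ORDINALS = {1: "first", 2: "second", 3: "third"}


def make_items(per_combo, seed=0):
    """Create per_combo placeholder texts for every (kind, topic) pair, shuffled."""
    rng = random.Random(seed)
    items = [
        {"kind": kind, "topic": topic, "text": f"[{kind} about {topic}, variant {i}]"}
        for kind in KINDS
        for topic in TOPICS
        for i in range(per_combo)
    ]
    rng.shuffle(items)  # interleave kinds and topics across the context
    return items


def make_question(items, kind, topic, n):
    """Return the question and the expected quote for the n-th matching item."""
    matches = [it for it in items if it["kind"] == kind and it["topic"] == topic]
    expected = matches[n - 1]["text"]  # n counts from 1, in context order
    question = f"Give me the {ORDINALS[n]} {kind} about {topic}."
    return question, expected


def score(response, expected):
    """Exact match after trimming whitespace (assumed, simplified metric)."""
    return response.strip() == expected
```

Because the expected quote is fixed when the context is assembled, the task stays cheap to grade even though answering it means tracking kind, topic, and position at once.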

They only test their own models on MRCR for the benchmark graph, but it's still worth reviewing: the accuracy curves are super interesting. https://openai.com/index/gpt-4-1/
