
"A million token context" Big AI says. But the model is accurate for 2-4K tokens

https://unagent.eu/2025/04/22/misleading-promises-of-long-context-llm/
2•kzawpl•10mo ago

Comments

kzawpl•10mo ago
Over the last two years there have been claims of better long-context capabilities for LLMs, but these are often tested on exact text search. A new benchmark called NoLiMa shows that the long-context capability of LLMs is still poor when the model has to perform some abstraction and reasoning.
vessenes•10mo ago
Meh. NoLiMa is helpful, in that it shows what we all "feel" working with models -- there's a marked dropoff in accuracy and intelligence as we get past 4-32k tokens of context, depending on the model.

But it seems unreasonable to be super worried about this -- a year or two ago, models couldn't easily find needles in haystacks of long context. As training and test strategies delivered trainable content, this became a thing that could be done perfectly across millions of tokens of context. There has not yet been a good way to incentivize models to do anything more than remember locations.

We are (mostly) paying the full costs of attending to the entire context in current architectures, and it seems pretty reasonable that we will therefore be able to train those architectures to more fully attend across context if we get the right training data into (ideally) an RL loop.

NoLiMa is an okay test, but I think the most recent OpenAI tests are significantly better and quite interesting; OpenAI-MRCR and Graphwalks are both super smart ideas about how to programmatically generate data that is easy to evaluate and forces better cross-context attention.

From their 4.1 announcement: Graphwalks fills the context window with a directed graph composed of hexadecimal hashes, and then asks the model to perform a breadth-first search (BFS) starting from a random node in the graph. We then ask it to return all nodes at a certain depth.
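
To make the task concrete, here is a minimal sketch of what such a puzzle and its ground truth could look like. This is not OpenAI's generator: the function names `make_graph` and `nodes_at_depth`, the 8-character labels, and the reading of "depth" as shortest distance from the start node are all assumptions made for illustration.

```python
import random
from collections import deque


def make_graph(num_nodes, edges_per_node, seed=0):
    """Build a random directed graph whose node labels are 8-char hex strings.

    Hypothetical helper: the real benchmark's graph shape is not described
    in the quoted text. Requires edges_per_node <= num_nodes.
    """
    rng = random.Random(seed)
    nodes = []
    seen = set()
    while len(nodes) < num_nodes:
        label = f"{rng.getrandbits(32):08x}"
        if label not in seen:  # keep labels unique
            seen.add(label)
            nodes.append(label)
    # Each node gets a fixed number of distinct outgoing edges.
    return {n: rng.sample(nodes, edges_per_node) for n in nodes}


def nodes_at_depth(graph, start, depth):
    """Return the nodes whose shortest distance from start is exactly depth."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # no need to expand past the target depth
        for nxt in graph.get(node, []):
            if nxt not in dist:  # first visit in BFS gives the shortest distance
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return {n for n, d in dist.items() if d == depth}
```

The answer is a set that a plain BFS can compute, so grading is cheap, while the model has to follow edges scattered across the whole context to get there.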

MRCR asks for direct quotes at semantically identified locations in the text. For example, poems about tapirs, bears and ballerinas, as well as stories about tapirs, bears and ballerinas, are generated, perhaps fifty each. The system is then asked "give me the third poem about tapirs". This requires counting, conceptual attention, and also distinguishing between stories and poems.
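
A rough sketch of how the ground truth for a question like that could be produced, going only by the description above. The helper names `build_haystack` and `nth_match` and the placeholder texts are hypothetical; the real benchmark uses generated writing and its own grading, neither of which is reproduced here.

```python
import random


def build_haystack(kinds, topics, copies, seed=0):
    """Create shuffled (kind, topic, text) items, `copies` per kind/topic pair."""
    rng = random.Random(seed)
    items = []
    for kind in kinds:
        for topic in topics:
            for _ in range(copies):
                # Placeholder text standing in for a generated poem or story.
                tag = f"{rng.getrandbits(24):06x}"
                items.append((kind, topic, f"<{kind} about {topic}, variant {tag}>"))
    rng.shuffle(items)  # interleave so position in the context matters
    return items


def nth_match(items, kind, topic, n):
    """Return the text of the n-th (1-based) item with this kind and topic."""
    count = 0
    for item_kind, item_topic, text in items:
        if item_kind == kind and item_topic == topic:
            count += 1
            if count == n:
                return text
    return None  # fewer than n matching items


haystack = build_haystack(["poem", "story"], ["tapirs", "bears", "ballerinas"], 50)
expected = nth_match(haystack, "poem", "tapirs", 3)
```

The lookup is trivial in code, which is the point: the expected answer is known exactly, while the model has to count matching items by meaning and skip the near-identical stories.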

They only test their own models on MRCR for the benchmark graph, but it's still worth reviewing: the accuracy curves are super interesting. https://openai.com/index/gpt-4-1/
