"A million token context" Big AI says. But the model is accurate for 2-4K tokens

https://unagent.eu/2025/04/22/misleading-promises-of-long-context-llm/
2•kzawpl•1y ago

Comments

kzawpl•1y ago
Over the last two years there have been claims of better long-context capabilities for LLMs, but these are often tested on exact text search. A new benchmark called NoLiMa shows that the long-context capability of LLMs is still poor if you want the LLM to perform some abstraction and reasoning.
vessenes•1y ago
Meh. NoLiMa is helpful, in that it shows what we all "feel" working with models -- there's a marked dropoff in accuracy and intelligence as we get past 4-32k of context, depending on the model.

But it seems unreasonable to be super worried about this -- a year or two ago, models couldn't easily find needles in haystacks of long context. As training and test strategies delivered trainable content, this became a thing that could be done perfectly across millions of tokens of context. There has not yet been a good way to incentivize models to do anything more than remember locations.

We are (mostly) paying the full costs of attending to the entire context in current architectures, and it seems pretty reasonable that we will therefore be able to train those architectures to more fully attend across context if we get the right training data into (ideally) an RL loop.

NoLiMa is an okay test, but I think the most recent OpenAI tests are significantly better and quite interesting; OpenAI-MRCR and Graphwalks are both super smart ideas about how to programmatically generate data that is easy to evaluate and forces better cross-context attention.

From their 4.1 announcement: Graphwalks fills the context window with a directed graph composed of hexadecimal hashes, and then asks the model to perform a breadth-first search (BFS) starting from a random node in the graph. We then ask it to return all nodes at a certain depth.
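As a rough illustration of the Graphwalks task quoted above (a sketch, not OpenAI's actual harness): the snippet below builds a random directed graph with hex-string node labels and computes the reference answer with BFS. The function names, graph size, and edge count are hypothetical, and treating "at a certain depth" as "shortest distance from the start is exactly that depth" is an assumption.

```python
import random
from collections import deque


def random_hash_graph(n_nodes, n_edges, seed=0):
    """Build a random directed graph whose node labels are 8-char hex strings.

    Hypothetical generator for illustration; the real benchmark's
    graph construction is not specified in the quoted text.
    """
    rng = random.Random(seed)
    labels = set()
    while len(labels) < n_nodes:
        labels.add(f"{rng.getrandbits(32):08x}")
    nodes = sorted(labels)
    graph = {node: [] for node in nodes}
    for _ in range(n_edges):
        src, dst = rng.choice(nodes), rng.choice(nodes)
        if dst not in graph[src]:
            graph[src].append(dst)
    return graph


def nodes_at_depth(graph, start, depth):
    """Return nodes whose shortest distance from `start` is exactly `depth`.

    Assumption: "depth" means BFS layer, i.e. shortest-path distance.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # nodes past the target layer are not needed
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return {node for node, d in dist.items() if d == depth}


if __name__ == "__main__":
    g = random_hash_graph(n_nodes=50, n_edges=120, seed=1)
    start = random.Random(1).choice(sorted(g))
    print(start, sorted(nodes_at_depth(g, start, 2)))
```

The reference answer is cheap to compute and easy to check by set comparison, while the model has to follow edges scattered across the whole context to get it right.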

MRCR asks for direct quotes at semantically identified locations in the text, e.g. poems about tapirs, bears and ballerinas, as well as stories about tapirs, bears and ballerinas are generated, perhaps fifty each. The system is asked "give me the third poem about tapirs". This requires counting, conceptual attention, and also distinguishing between stories and poems.
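A minimal sketch of how an MRCR-style item could be generated and scored, assuming placeholder strings stand in for the generated poems and stories and that scoring is a plain exact match. Both are simplifications for illustration; the helper names are hypothetical and none of this is taken from OpenAI's implementation.

```python
import random


def build_haystack(topics, kinds, per_pair, seed=0):
    """Return a shuffled list of (kind, topic, text) items.

    The text is a unique placeholder; a real benchmark would use
    model-generated poems and stories instead.
    """
    rng = random.Random(seed)
    items = []
    for kind in kinds:
        for topic in topics:
            for i in range(per_pair):
                text = f"<{kind} about {topic}, variant {i}>"
                items.append((kind, topic, text))
    rng.shuffle(items)
    return items


def nth_match(items, kind, topic, n):
    """Return the n-th (1-based) item of this kind and topic, in document order."""
    seen = 0
    for item_kind, item_topic, text in items:
        if item_kind == kind and item_topic == topic:
            seen += 1
            if seen == n:
                return text
    raise ValueError(f"fewer than {n} items of kind={kind!r}, topic={topic!r}")


def is_correct(model_answer, expected):
    """Exact-match scoring; an assumption made to keep the sketch simple."""
    return model_answer.strip() == expected


if __name__ == "__main__":
    haystack = build_haystack(
        topics=["tapirs", "bears", "ballerinas"],
        kinds=["poem", "story"],
        per_pair=50,
    )
    expected = nth_match(haystack, "poem", "tapirs", 3)
    print(expected, is_correct(expected, expected))
```

Because the items are shuffled, the "third poem about tapirs" can only be found by counting matching poems in order while skipping tapir stories and poems on other topics, which is the combination of counting and conceptual attention the comment describes.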

They only test their own models on MRCR for the benchmark graph, but it's still worth reviewing: the accuracy curves are super interesting. https://openai.com/index/gpt-4-1/
