"A million token context" Big AI says. But the model is accurate for 2-4K tokens

https://unagent.eu/2025/04/22/misleading-promises-of-long-context-llm/
2•kzawpl•1y ago

Comments

kzawpl•1y ago
Over the last two years there have been claims of better long-context capabilities for LLMs, but these are often tested on exact text search. A new benchmark called NoLiMa shows that the long-context capability of LLMs is still poor when the model has to perform some abstraction and reasoning.
vessenes•1y ago
Meh. NoLiMa is helpful, in that it shows what we all "feel" working with models -- there's a marked dropoff in accuracy and intelligence as we get past 4-32k of context, depending on the model.

But it seems unreasonable to be super worried about this -- a year or two ago, models couldn't easily find needles in haystacks of long context. As training and test strategies delivered trainable content, this became a thing that could be done perfectly across millions of tokens of context. There has not yet been a good way to incentivize models to do anything more than remember locations.

We are (mostly) paying the full costs of attending to the entire context in current architectures, and it seems pretty reasonable that we will therefore be able to train those architectures to more fully attend across context if we get the right training data into (ideally) an RL loop.

NoLiMa is an okay test, but I think the most recent OpenAI tests are significantly better and quite interesting; OpenAI-MRCR and Graphwalks are both super smart ideas about how to programmatically generate data that is easy to evaluate and forces better cross-context attention.

From their 4.1 announcement: Graphwalks fills the context window with a directed graph composed of hexadecimal hashes, and then asks the model to perform a breadth-first search (BFS) starting from a random node in the graph. We then ask it to return all nodes at a certain depth.
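To make the Graphwalks idea concrete, here is a rough sketch of how such a task and its ground truth could be generated. This is not OpenAI's actual generator: `make_graph`, `nodes_at_depth`, the 8-character hash length, and the fixed out-degree are all assumptions chosen for illustration.

```python
import random
from collections import deque


def make_graph(num_nodes, edges_per_node, seed=0):
    """Build a random directed graph whose node names are hex strings.

    Hypothetical generator: node naming and edge sampling are assumptions,
    not the scheme used in the real benchmark.
    """
    rng = random.Random(seed)
    # Sample distinct integers so every node name is unique.
    nodes = [f"{n:08x}" for n in rng.sample(range(16 ** 8), num_nodes)]
    return {node: rng.sample(nodes, edges_per_node) for node in nodes}


def nodes_at_depth(graph, start, depth):
    """Return the sorted nodes whose BFS distance from start equals depth."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            # No need to expand beyond the requested depth.
            continue
        for nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return sorted(n for n, d in dist.items() if d == depth)
```

The edge list from `make_graph` would be serialized into the prompt, and the model's answer compared against `nodes_at_depth(graph, start, depth)`. Because the expected answer is computed programmatically, grading is cheap and exact.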

MRCR asks for direct quotes at semantically identified locations in the text. For example, poems about tapirs, bears and ballerinas, as well as stories about tapirs, bears and ballerinas, are generated, perhaps fifty each. The system is then asked "give me the third poem about tapirs". This requires counting, conceptual attention, and also distinguishing between stories and poems.
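The bookkeeping behind that kind of question can be sketched in a few lines. This only models the ground-truth lookup as described in this comment, not the real OpenAI-MRCR harness; `build_items` and `nth_match` are hypothetical names, and the real benchmark fills the context with generated text rather than bare labels.

```python
import random


def build_items(kinds, topics, total, seed=0):
    """Return a shuffled sequence of (kind, topic) labels.

    Hypothetical stand-in for the generated writing: each label marks
    where a poem or story about a topic would sit in the context.
    """
    rng = random.Random(seed)
    return [(rng.choice(kinds), rng.choice(topics)) for _ in range(total)]


def nth_match(items, kind, topic, n):
    """Return the index of the n-th (1-based) item matching kind and topic.

    Returns None when fewer than n matching items exist.
    """
    count = 0
    for index, (item_kind, item_topic) in enumerate(items):
        if item_kind == kind and item_topic == topic:
            count += 1
            if count == n:
                return index
    return None
```

A grader would look up `nth_match(items, "poem", "tapirs", 3)` and compare the text stored at that position with what the model quoted. The distractors (stories about tapirs, poems about bears) are what force the model to track both kind and topic while counting.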

They only test their own models on MRCR for the benchmark graph, but it's still worth reviewing: the accuracy curves are super interesting. https://openai.com/index/gpt-4-1/
