"A million token context" Big AI says. But the model is accurate for 2-4K tokens

https://unagent.eu/2025/04/22/misleading-promises-of-long-context-llm/
2•kzawpl•11mo ago

Comments

kzawpl•11mo ago
Over the last two years there have been claims of better long-context capabilities for LLMs, but those claims are often tested on exact text search. A new benchmark called NoLiMa shows that the long-context capability of LLMs is still poor when the model has to perform some abstraction and reasoning.
vessenes•11mo ago
Meh. NoLiMa is helpful, in that it shows what we all "feel" working with models -- there's a marked dropoff in accuracy and intelligence as we get past 4-32k of context, depending on the model.

But it seems unreasonable to be super worried about this -- a year or two ago, models couldn't easily find needles in haystacks of long context. As training and test strategies delivered trainable content, this became a thing that could be done perfectly across millions of tokens of context. There has not yet been a good way to incentivize models to do anything more than remember locations.

We are (mostly) paying the full costs of attending to the entire context in current architectures, and it seems pretty reasonable that we will therefore be able to train those architectures to more fully attend across context if we get the right training data into (ideally) an RL loop.

NoLiMa is an okay test, but I think the most recent OpenAI tests are significantly better and quite interesting; OpenAI-MRCR and Graphwalks are both super smart ideas about how to programmatically generate data that is easy to evaluate and forces better cross-context attention.

From their 4.1 announcement: Graphwalks fills the context window with a directed graph composed of hexadecimal hashes, and then asks the model to perform a breadth-first search (BFS) starting from a random node in the graph. We then ask it to return all nodes at a certain depth.
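
To make the Graphwalks task concrete, here is a minimal sketch of the ground-truth computation a grader could run. It is not OpenAI's harness: the function name `nodes_at_depth`, the tiny edge list, and the reading of "at a certain depth" as "shortest distance from the start is exactly that depth" are all assumptions for illustration.

```python
from collections import deque


def nodes_at_depth(edges, start, depth):
    """Return the set of nodes whose shortest distance from `start` is `depth`.

    `edges` is an iterable of (source, target) pairs of a directed graph.
    Hypothetical helper: the name and exact semantics are assumptions.
    """
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    dist = {start: 0}          # shortest known distance to each visited node
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue           # no need to expand past the target depth
        for nxt in adjacency.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return {n for n, d in dist.items() if d == depth}


# Toy graph with short hexadecimal labels (made up for this example).
edges = [
    ("a3f1", "09bc"), ("a3f1", "77e0"),
    ("09bc", "c4d2"), ("77e0", "c4d2"),
    ("c4d2", "a3f1"), ("77e0", "1b5e"),
]
print(sorted(nodes_at_depth(edges, "a3f1", 2)))  # ['1b5e', 'c4d2']
```

The point of the benchmark is that this is trivial for a program but forces a model to follow edges scattered across the whole context rather than look up a single location.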

MRCR asks for direct quotes at semantically identified locations in the text. For example, poems about tapirs, bears and ballerinas, as well as stories about the same subjects, are generated, perhaps fifty of each. The system is then asked "give me the third poem about tapirs". This requires counting, conceptual attention, and also distinguishing between stories and poems.
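
A rough sketch of how such an MRCR-style item could be generated and scored programmatically. Everything here is an assumption for illustration: the helper names `build_haystack` and `nth_match`, the placeholder texts, and the shuffling scheme are hypothetical, not the actual OpenAI-MRCR construction.

```python
import random


def build_haystack(kinds, topics, copies, seed=0):
    """Create a shuffled list of placeholder documents, `copies` per (kind, topic)."""
    rng = random.Random(seed)
    items = [(kind, topic) for kind in kinds for topic in topics for _ in range(copies)]
    rng.shuffle(items)
    return [
        {"position": i, "kind": kind, "topic": topic, "text": f"<{kind} about {topic} #{i}>"}
        for i, (kind, topic) in enumerate(items)
    ]


def nth_match(haystack, kind, topic, n):
    """Return the n-th (1-based) item of the given kind and topic, or None if absent."""
    matches = [item for item in haystack if item["kind"] == kind and item["topic"] == topic]
    return matches[n - 1] if 1 <= n <= len(matches) else None


haystack = build_haystack(["poem", "story"], ["tapirs", "bears", "ballerinas"], copies=50)
target = nth_match(haystack, "poem", "tapirs", 3)
print(target["text"])  # the quote a model would be expected to return
```

Because the expected answer is computed from the generation metadata, scoring reduces to comparing the model's quote with `target["text"]`, which is what makes this kind of test easy to evaluate.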

They only test their own models on MRCR for the benchmark graph, but it's still worth reviewing: the accuracy curves are super interesting. https://openai.com/index/gpt-4-1/
