"A million token context" Big AI says. But the model is accurate for 2-4K tokens

https://unagent.eu/2025/04/22/misleading-promises-of-long-context-llm/
2•kzawpl•7mo ago

Comments

kzawpl•7mo ago
Over the last two years there have been claims of better long-context capabilities for LLMs, but these are often tested on exact text search. A new benchmark called NoLiMa shows that the long-context capability of LLMs is still poor if you want the LLM to perform some abstraction and reasoning.
vessenes•7mo ago
Meh. NoLiMa is helpful, in that it shows what we all "feel" working with models -- there's a marked dropoff in accuracy and intelligence as we get past 4-32k tokens of context, depending on the model.

But it seems unreasonable to be super worried about this -- a year or two ago, models couldn't easily find needles in haystacks of long context. As training and test strategies delivered trainable content, this became a thing that could be done perfectly across millions of tokens of context. There has not yet been a good way to incentivize models to do anything more than remember locations.

We are (mostly) paying the full costs of attending to the entire context in current architectures, and it seems pretty reasonable that we will therefore be able to train those architectures to more fully attend across context if we get the right training data into (ideally) an RL loop.

NoLiMa is an okay test, but I think the most recent OpenAI tests are significantly better and quite interesting: OpenAI-MRCR and Graphwalks are both super smart ideas about how to programmatically generate data that is easy to evaluate and forces better cross-context attention.

From their 4.1 announcement: Graphwalks fills the context window with a directed graph composed of hexadecimal hashes, and then asks the model to perform a breadth-first search (BFS) starting from a random node in the graph. We then ask it to return all nodes at a certain depth.
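
To make the Graphwalks description concrete, here is a minimal sketch of the kind of graph and ground-truth answer such a test would need. This is not OpenAI's code: the function names, the 8-character hex node names, the fixed out-degree, and reading "at a certain depth" as "shortest distance exactly equal to that depth" are all my assumptions.

```python
import random
from collections import deque


def make_graph(num_nodes, edges_per_node, seed=0):
    """Build a random directed graph whose nodes are 8-character hex strings.

    Hypothetical generator: the real benchmark's node format and edge
    distribution are not described in the quoted text.
    """
    rng = random.Random(seed)
    nodes, seen = [], set()
    while len(nodes) < num_nodes:
        name = f"{rng.getrandbits(32):08x}"
        if name not in seen:  # keep node names unique
            seen.add(name)
            nodes.append(name)
    # Each node gets edges_per_node distinct outgoing edges (self-loops allowed).
    return {n: rng.sample(nodes, edges_per_node) for n in nodes}


def nodes_at_depth(graph, start, depth):
    """Return the nodes whose shortest distance from start is exactly depth."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # no need to expand beyond the requested depth
        for nxt in graph.get(node, []):
            if nxt not in dist:  # first visit in BFS is the shortest path
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return {n for n, d in dist.items() if d == depth}


if __name__ == "__main__":
    graph = make_graph(num_nodes=20, edges_per_node=2, seed=1)
    start = next(iter(graph))
    print(start, sorted(nodes_at_depth(graph, start, depth=2)))
```

The appeal is that the expected answer is cheap to compute and check exactly, while the model has to follow edges scattered across the whole context to produce it.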

MRCR asks for direct quotes at semantically identified locations in the text. For example, poems about tapirs, bears and ballerinas are generated, along with stories about the same subjects, perhaps fifty of each. The system is then asked "give me the third poem about tapirs". This requires counting, conceptual attention, and distinguishing between stories and poems.
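
A rough sketch of how the ground truth for an MRCR-style question could be built, going only by the description above. It is an illustration, not the actual benchmark: `build_haystack`, `nth_item`, and the placeholder text strings are hypothetical, and a real test would use model-generated poems and stories rather than labels.

```python
import random


def build_haystack(kinds, topics, copies, seed=0):
    """Create a shuffled sequence of items, `copies` per (kind, topic) pair.

    The "text" field is a placeholder standing in for generated writing.
    """
    rng = random.Random(seed)
    pairs = [(k, t) for k in kinds for t in topics for _ in range(copies)]
    rng.shuffle(pairs)
    return [
        {"kind": k, "topic": t, "text": f"<{k} about {t}, item {i}>"}
        for i, (k, t) in enumerate(pairs)
    ]


def nth_item(haystack, kind, topic, n):
    """Return the n-th (1-based) item matching kind and topic, or None."""
    count = 0
    for item in haystack:
        if item["kind"] == kind and item["topic"] == topic:
            count += 1
            if count == n:
                return item
    return None


if __name__ == "__main__":
    haystack = build_haystack(
        kinds=["poem", "story"],
        topics=["tapirs", "bears", "ballerinas"],
        copies=5,
    )
    # Expected answer for "give me the third poem about tapirs".
    print(nth_item(haystack, "poem", "tapirs", 3)["text"])
```

Scoring is then a string comparison between the model's quote and the stored text, which is what makes this kind of test easy to evaluate at scale.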

They only test their own models on MRCR for the benchmark graph, but it's still worth reviewing: the accuracy curves are super interesting. https://openai.com/index/gpt-4-1/
