The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

https://arxiv.org/abs/2509.09677
1•shash42•2h ago

Comments

shash42•2h ago
Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations: curiously, we observe a self-conditioning effect, in which models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning is not reduced by scaling model size alone. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.
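
A rough illustration of the compounding claim, assuming (as a simplification not stated in the abstract) that every step succeeds independently with the same accuracy p. A task of H steps then succeeds with probability p^H, so the longest task completed at success rate s is H = ln(s) / ln(p). The function name and the 50% threshold below are illustrative choices, not taken from the paper.

```python
import math


def horizon_length(step_accuracy: float, success_rate: float = 0.5) -> float:
    """Real-valued task length H at which step_accuracy**H equals success_rate.

    Assumes independent steps with constant per-step accuracy (an idealization).
    """
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    return math.log(success_rate) / math.log(step_accuracy)


# Compare how the horizon grows as per-step accuracy creeps toward 1.
for p in (0.9, 0.99, 0.999):
    print(f"step accuracy {p}: horizon at 50% success ~ {horizon_length(p):.1f} steps")
```

Under this idealization, each tenfold cut in the per-step error rate buys a roughly tenfold longer task. The abstract also reports that per-step accuracy degrades as steps accumulate, so real horizons would presumably fall short of these figures.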

The future of microoptimization

https://goldenstack.net/blog/future_of_microoptimization
1•GoldenStack•1m ago•0 comments

Why broot doesn't enter zip archives

https://dystroy.org/blog/why-broot-doesnt-enter-zip/
2•lioeters•8m ago•0 comments

Ask HN: Getting over Burnout with Imposter Syndrome

1•chrsig•9m ago•0 comments

Where are the security advisories of the recently compromised NPM packages?

https://gribnau.dev/posts/finding-security-advisories/
1•foresterre•10m ago•0 comments

After AI Led to Layoffs, Coders Are Being Hired to Fix 'Vibe-Coded' Screwups

https://gizmodo.com/after-ai-led-to-layoffs-coders-are-being-hired-to-fix-vibe-coded-screwups-200...
4•hackernj•21m ago•0 comments

PolaroidAI – Transform Your Ideas into Visuals

https://www.polaroidai.online/
2•yszhu•25m ago•0 comments

SQL processing over images, text, and audio

https://itrummer.github.io/thalamusdb/
1•itrummer•26m ago•0 comments

Get Excited About Postgres 18

https://www.crunchydata.com/blog/get-excited-about-postgres-18
2•thunderbong•28m ago•1 comment

World's oldest joke traced back to 1900 BC (2008)

https://www.reuters.com/article/lifestyle/worlds-oldest-joke-traced-back-to-1900-bc-idUSKUA147851/
2•thunderbong•29m ago•0 comments

Effects of natural extracts in cognitive function: systematic meta-analysis

https://www.frontiersin.org/journals/pharmacology/articles/10.3389/fphar.2025.1573034/full
1•mallowdram•31m ago•1 comment

Show HN: Free AI Mental Health Chat – Instant, Private Support

https://mental-ai-answer.vercel.app/
1•zh7788•33m ago•0 comments

HybridPetya: More proof that Secure Boot bypasses are not just an urban legend

https://www.theregister.com/2025/09/12/hopefully_just_a_poc_hybridpetya/
2•raybb•36m ago•0 comments

Scalable LLM approach to enhancing chatbot knowledge with user-generated content

https://careersatdoordash.com/blog/doordash-llm-chatbot-knowledge-with-ugc/
1•pykello•38m ago•0 comments

NASA's Juno Mission Captures [Green] Lightning on Jupiter (2023)

https://www.nasa.gov/image-article/nasas-juno-mission-captures-lightning-on-jupiter/
2•Jimmc414•41m ago•0 comments

Magical Systems Thinking

https://worksinprogress.co/issue/magical-systems-thinking/
1•skmurphy•50m ago•1 comment

Shell develops EV fluid tech that enables sub-10-minute charging

https://www.fleetnews.co.uk/news/shell-develops-ev-fluid-tech-that-enables-sub-10-minute-charging
1•breve•53m ago•1 comment

Call Center Staffing Calculator

https://www.callcentercalculator.com/
1•luu•54m ago•0 comments

Financial Speculation in Ancient Rome

https://substack.com/inbox/post/173374496
1•_1729•57m ago•0 comments

Basics of Equality Saturation

https://egglog-python.readthedocs.io/latest/tutorials/tut_1_basics.html
3•todsacerdoti•58m ago•0 comments

Demanding DARPA: Transparency on AI Autonomy

1•freemuserealai•59m ago•0 comments

Kissing bugs bring deadly Chagas disease to California

https://www.latimes.com/california/newsletter/2025-09-02/kissing-bugs-bring-deadly-disease-to-cal...
2•OutOfHere•1h ago•1 comment

B-17 Gunner Training Film (1944) [video]

https://www.youtube.com/watch?v=aoHOVUKOc0M
3•seadan83•1h ago•0 comments

Show HN: wcwidth-o1 – Find Unicode text cell width in no time for JavaScript/TS

https://github.com/dawsonhuang0/Wcwidth-O1
3•dawson0•1h ago•0 comments

Five Whys

https://en.wikipedia.org/wiki/Five_whys
3•nivethan•1h ago•0 comments

Employee Who Leaked 'Spider-Man' Blu-ray Sentenced to Nearly 5 Years Prison

https://torrentfreak.com/employee-who-leaked-spider-man-blu-ray-sentenced-to-nearly-5-years-in-pr...
2•airhangerf15•1h ago•0 comments

A Drinking Game for Sale

https://flippa.com/12063245-top-10-ranked-android-app-with-31k-installs-4-3-rating-1k-revenue-on-...
2•tamtam99•1h ago•0 comments

Amtrak NextGen Acela trains have arrived in D.C. We took a test ride

https://www.washingtonpost.com/travel/2025/09/07/amtrak-nextgen-acela-train-test-ride/
3•reaperducer•1h ago•0 comments

OCI Registry Explorer

https://oci.dag.dev/
8•jcbhmr•1h ago•0 comments

Hurricane Katrina haunts New Orleans as Trump guts disaster aid

https://www.theguardian.com/us-news/ng-interactive/2025/aug/26/hurricane-katrina-anniversary-trum...
2•PaulHoule•1h ago•0 comments

NASA confirms Moon landing by a private American spacecraft

https://out.reddit.com/t3_1nflc84
1•danielmorozoff•1h ago•1 comment