
We Analyzed 413K Agent Runs. Here's What Separates the Ones That Succeed

https://twitter.com/lihanc02/status/2032150260638941360

2•lihanc111•1h ago

Comments

lihanc111•1h ago

Hey HN,

We dug into 17 billion tokens of behavioral data from 413K AI agent trajectories (CoderForge-Preview), each attempting a real GitHub issue. Instead of just looking at final SWE-bench scores, we compared successful runs against failing runs on the exact same problem to filter out task-difficulty confounds.
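For anyone who wants to try a same-problem comparison on their own logs, here is a minimal sketch. It is not the analysis code from the article: the `(problem_id, succeeded, feature_value)` tuple layout and the function name `within_problem_gap` are assumptions made for illustration. It averages, over problems that have both passing and failing runs, the difference in a behavioral feature between the two groups, so problems that every run passes or every run fails drop out.

```python
from collections import defaultdict
from statistics import mean


def within_problem_gap(runs):
    """Average pass-minus-fail gap of a feature, computed per problem.

    runs: iterable of (problem_id, succeeded, feature_value) tuples
    (a hypothetical layout, not the dataset's real schema).
    Returns None when no problem has both outcomes.
    """
    by_problem = defaultdict(lambda: {True: [], False: []})
    for problem_id, succeeded, value in runs:
        by_problem[problem_id][bool(succeeded)].append(value)

    gaps = [
        mean(groups[True]) - mean(groups[False])
        for groups in by_problem.values()
        if groups[True] and groups[False]  # skip problems with only one outcome
    ]
    return mean(gaps) if gaps else None


runs = [
    ("issue-a", True, 0.75), ("issue-a", False, 0.25),
    ("issue-b", True, 0.5), ("issue-b", False, 0.5),
    ("issue-c", True, 1.0),  # no failing run, so it is ignored
]
gap = within_problem_gap(runs)
```

A positive gap means the feature is higher in passing runs once each problem is compared only against itself.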

The biggest surprise? Agents are not junior developers, and prompting them to act like humans actively hurts their success rate.

Here is what the data actually shows:

Human exploration rituals predict failure: "View-before-edit" and "grep-before-edit" are negatively correlated with success. Humans do this to build mental models. Agents already have the codebase in their context window; if they are heavily grepping, they aren't learning, they're flailing.

TDD is the ultimate predictor of success: The single strongest behavioral signal of a passing agent is the fraction of early bash commands dedicated exclusively to running the test suite.

The Single Responsibility Principle is law: Agents that scatter edits across 3 or more files in the first 30% of their run see their success rate plummet. Successful agents fix one targeted thing at a time.

Perseverance is a myth: If an agent runs the exact same bash command twice early on, it’s a massive failure signal. They don't adapt; they just get stuck in a loop.
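To make the last three signals concrete, here is a rough sketch of how they could be computed from a single trajectory. Everything about the data layout is an assumption: steps are hypothetical `(tool, arg)` pairs, "early" is taken to mean the first 30% of steps for all three signals (the post only gives that number for the file-scatter one), and test runs are detected by a simple `pytest` prefix match.

```python
import math

EARLY_PERCENT = 30  # the post's "first 30%"; reused for every "early" signal (assumption)
TEST_PREFIXES = ("pytest", "python -m pytest")  # assumed test-runner commands


def early_signals(steps):
    """steps: list of (tool, arg) pairs, e.g. ("bash", "pytest -q") or ("edit", "src/a.py")."""
    cutoff = math.ceil(len(steps) * EARLY_PERCENT / 100)
    early = steps[:cutoff]

    bash_cmds = [arg.strip() for tool, arg in early if tool == "bash"]
    edited_files = {arg for tool, arg in early if tool == "edit"}

    # Prefix match only; compound commands like "cd x && pytest" are not handled.
    test_cmds = [cmd for cmd in bash_cmds if cmd.startswith(TEST_PREFIXES)]

    return {
        "early_test_fraction": len(test_cmds) / len(bash_cmds) if bash_cmds else 0.0,
        "scattered_edits": len(edited_files) >= 3,  # 3 or more distinct files edited early
        "repeated_command": len(set(bash_cmds)) < len(bash_cmds),  # exact duplicate command
    }


steps = [
    ("bash", "pytest -q"), ("bash", "ls"), ("bash", "pytest -q"),
    ("edit", "a.py"), ("edit", "b.py"), ("bash", "pytest -q"),
    ("edit", "a.py"), ("bash", "git diff"), ("bash", "pytest -q"),
    ("bash", "git status"),
]
signals = early_signals(steps)
```

With ten steps the early window is the first three, so only those three bash commands count toward the fractions and flags.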

Check out the article for full content!

PaulHoule•1h ago

"This post is the first in a series. We are extending this analysis to more realistic workloads beyond artificial SWE benchmarks. Follow the account and stay tuned.---"

Did something get cut off at the end?

lihanc111•1h ago

Actually no, I think the "---" was just typed by mistake XD
