Measuring AI agent autonomy in practice

https://www.anthropic.com/research/measuring-agent-autonomy
28•jbredeche•3h ago

Comments

Havoc•1h ago
I still can't believe anyone in the industry measures it like:

>from under 25 minutes to over 45 minutes.

If I get my Raspberry Pi to run an LLM task it'll run for over 6 hours. And Groq will do it in 20 seconds.

It's a gibberish measurement in itself if you don't control for token speed (and quality of output).

dcre•1h ago
Tokens per second are similar across Sonnet 4.5, Opus 4.5, and Opus 4.6. More importantly, normalizing for speed isn't enough anyway, because smarter models can compensate for being slower by outputting fewer tokens to get the same result. The use of 99.9th-percentile duration is a considered choice on their part to get a holistic view across model, harness, task choice, user experience level, user trust, etc.
visarga•20m ago
I agree time is not what we are looking for; it is the maximum complexity the model can handle without failing the task, expressed in task length. Long tasks allow some slack: if you make an error, you have time to see the outcomes and recover.
saezbaldo•6m ago
The bigger gap isn't time vs tokens. It's that these metrics measure capability without measuring authorization scope. An agent that completes a 45-minute task by making unauthorized API calls isn't more autonomous, it's more dangerous. The useful measurement would be: given explicit permission boundaries, how much can the agent accomplish within those constraints? That ratio of capability-within-constraints is a better proxy for production-ready autonomy than raw task duration.
prodigycorp•1h ago
I hate how Anthropic uses data. You can't convince me that what they are doing is "privacy preserving".
FuckButtons•1h ago
They’re using React, they are very opaque, they don’t want you to use any other mechanism to interact with their model. They haven’t left people a lot of room to trust them.
mrdependable•16m ago
I agree. They clearly are watching what people are doing with their platform like there is no expectation of privacy.
swyx•1h ago
My highlights and writeup are here: https://www.latent.space/p/ainews-anthropics-agent-autonomy
esafak•31m ago
I wonder why there was a big downturn at the turn of the year until Opus was released.
saezbaldo•6m ago
This measures what agents can do, not what they should be allowed to do. In production, the gap between capability and authorization is the real risk. We see this pattern in every security domain: capability grows faster than governance. Session duration tells you about model intelligence. It tells you nothing about whether the agent stayed within its authorized scope. The missing metric is permission utilization: what fraction of the agent's actions fell within explicitly granted authority?
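
A minimal sketch of the "permission utilization" ratio described in the comment above, assuming an agent's actions and its granted permissions can both be represented as plain string labels. The function name, the action labels, and the set-based notion of scope are all hypothetical illustrations; none of this comes from the linked article.

```python
from typing import Iterable, Optional, Set


def permission_utilization(actions: Iterable[str], granted: Set[str]) -> Optional[float]:
    """Return the fraction of actions that fall inside the granted set.

    `actions` is a log of hypothetical action labels taken by an agent;
    `granted` is the set of labels it was explicitly allowed to use.
    Returns None when the log is empty, since the ratio is undefined.
    """
    log = list(actions)
    if not log:
        return None
    in_scope = sum(1 for action in log if action in granted)
    return in_scope / len(log)


# Hypothetical session: four actions, one of them outside the granted scope.
session = ["read_file", "read_file", "call_api", "delete_repo"]
scope = {"read_file", "call_api"}
ratio = permission_utilization(session, scope)
print(ratio)
```

A ratio below 1.0 flags that at least one action fell outside the granted scope, regardless of how long the session ran. A real system would likely need richer scoping than exact label matching (resources, arguments, time limits), which this sketch does not attempt.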

Gemini 3.1 Pro

https://deepmind.google/models/model-cards/gemini-3-1-pro/
310•PunchTornado•1h ago•187 comments

Show HN: Micasa – track your house from the terminal

https://micasa.dev
61•cpcloud•1h ago•16 comments

Dinosaur Food: 100M year old foods we still eat today (2022)

https://borischerny.com/food/2022/01/17/Dinosaur-food.html
56•simonebrunozzi•2h ago•38 comments

Pebble Production: February Update

https://repebble.com/blog/february-pebble-production-and-software-updates
183•smig0•5h ago•71 comments

Paged Out Issue #8 [pdf]

https://pagedout.institute/download/PagedOut_008.pdf
161•SteveHawk27•5h ago•35 comments

Don't Trust the Salt: AI Summarization, Multilingual Safety, and LLM Guardrails

https://royapakzad.substack.com/p/multilingual-llm-evaluation-to-guardrails
144•benbreen•2d ago•55 comments

Arrays in Forth

https://www.forth.org/svfig/Len/arrays.htm
18•tosh•4d ago•1 comment

Gemini 3.1 Pro Preview

https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-3.1-pro-preview?...
117•MallocVoidstar•2h ago•60 comments

-fbounds-safety: Enforcing bounds safety for C

https://clang.llvm.org/docs/BoundsSafety.html
78•thefilmore•3d ago•54 comments

Coding Tricks Used in the C64 Game Seawolves

https://kodiak64.co.uk/blog/seawolves-technical-tricks
70•atan2•5h ago•4 comments

Show HN: A physically-based GPU ray tracer written in Julia

https://makie.org/website/blogposts/raytracing/
106•simondanisch•6h ago•36 comments

America vs. Singapore: You Can't Save Your Way Out of Economic Shocks

https://www.governance.fyi/p/america-vs-singapore-you-cant-save
119•guardianbob•3h ago•139 comments

Large Language Models for Mortals: A Practical Guide for Analysts with Python

https://crimede-coder.com/blogposts/2026/LLMsForMortals
44•apwheele•4d ago•10 comments

Measuring AI agent autonomy in practice

https://www.anthropic.com/research/measuring-agent-autonomy
28•jbredeche•3h ago•10 comments

Bridging Elixir and Python with Oban

https://oban.pro/articles/bridging-with-oban
85•sorentwo•6h ago•41 comments

Sizing chaos

https://pudding.cool/2026/02/womens-sizing/
751•zdw•20h ago•390 comments

Why applicant tracking systems are broken by design

https://www.saj.ad/2026/ats
3•dajas•51m ago•0 comments

Zero downtime migrations at Petabyte scale

https://planetscale.com/blog/zero-downtime-migrations-at-petabyte-scale
31•Ozzie_osman•3d ago•8 comments

Show HN: Mini-Diarium - An encrypted, local, cross-platform journaling app

https://github.com/fjrevoredo/mini-diarium
79•holyknight•5h ago•43 comments

The Mongol Khans of Medieval France

https://www.historytoday.com/archive/feature/mongol-khans-medieval-france
78•Thevet•2d ago•32 comments

Against Theory-Motivated Experimentation

https://journals.sagepub.com/doi/10.1177/26339137261421577
18•paraschopra•3h ago•13 comments

27-year-old Apple iBooks can connect to Wi-Fi and download official updates

https://old.reddit.com/r/MacOS/comments/1r8900z/macos_which_officially_supports_27_year_old/
420•surprisetalk•20h ago•238 comments

Famous Signatures Through History

https://signatory.app/#famous-signatures
30•elliotbnvl•4h ago•28 comments

Old School Visual Effects: The Cloud Tank (2010)

http://singlemindedmovieblog.blogspot.com/2010/04/old-school-effects-cloud-tank.html
75•exvi•11h ago•14 comments

ShannonMax: A Library to Optimize Emacs Keybindings with Information Theory

https://github.com/sstraust/shannonmax
46•sammy0910•6h ago•7 comments

Voith Schneider Propeller

https://en.wikipedia.org/wiki/Voith_Schneider_Propeller
73•Luc•3d ago•18 comments

15 years of FP64 segmentation, and why the Blackwell Ultra breaks the pattern

https://nicolasdickenmann.com/blog/the-great-fp64-divide.html
178•fp64enjoyer•16h ago•66 comments

Mark Zuckerberg Grilled on Usage Goals and Underage Users at California Trial

https://www.wsj.com/us-news/law/meta-mark-zuckerberg-social-media-trial-0e9a7fa0
28•1vuio0pswjnm7•1h ago•1 comment

Step 3.5 Flash – Open-source foundation model, supports deep reasoning at speed

https://static.stepfun.com/blog/step-3.5-flash/
177•kristianp•15h ago•77 comments

Anthropic officially bans using subscription auth for third party use

https://code.claude.com/docs/en/legal-and-compliance
563•theahura•15h ago•687 comments