Measuring AI agent autonomy in practice

https://www.anthropic.com/research/measuring-agent-autonomy

28•jbredeche•3h ago

Comments

Havoc•1h ago

I still can't believe anyone in the industry measures it like:

>from under 25 minutes to over 45 minutes.

If I get my Raspberry Pi to run an LLM task, it'll run for over 6 hours, and Groq will do it in 20 seconds.

It's a gibberish measurement in itself if you don't control for token speed (and quality of output).

dcre•1h ago

Tokens per second are similar across Sonnet 4.5, Opus 4.5, and Opus 4.6. More importantly, normalizing for speed isn't enough anyway because smarter models can compensate for being slower by having to output fewer tokens to get the same result. The use of 99.9p duration is a considered choice on their part to get a holistic view across model, harness, task choice, user experience level, user trust, etc.
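To make the two positions above concrete, here is a minimal sketch in Python. It assumes a nearest-rank percentile (the article's exact percentile method is not stated in this thread), and the throughput numbers are made-up illustrations, not benchmarks of any real hardware. The names `percentile` and `tokens_emitted` are hypothetical helpers.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the
    data at or below it. Assumed method, chosen for simplicity."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def tokens_emitted(duration_s, tokens_per_s):
    """Work done expressed in tokens, which factors out raw serving speed."""
    return duration_s * tokens_per_s


# Illustrative only: the same job on a slow and a fast backend.
slow = tokens_emitted(6 * 3600, 2)   # 6 hours at a hypothetical 2 tokens/s
fast = tokens_emitted(20, 2160)      # 20 seconds at a hypothetical 2160 tokens/s
print(slow == fast)                  # wall-clock differs, token count does not

# Tail duration over a set of hypothetical session lengths, in minutes.
sessions_min = [3, 5, 8, 12, 20, 25, 31, 45]
print(percentile(sessions_min, 99.9))
```

Token count corrects for serving speed but, as noted above, not for a smarter model needing fewer tokens to reach the same result, which is one argument for reporting a tail duration instead.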

visarga•20m ago

I agree that time is not what we are looking for; it is the maximum complexity the model can handle without failing the task, expressed as task length. Long tasks allow some slack: if you make an error, you have time to see the outcomes and recover.

saezbaldo•6m ago

The bigger gap isn't time vs tokens. It's that these metrics measure capability without measuring authorization scope. An agent that completes a 45-minute task by making unauthorized API calls isn't more autonomous, it's more dangerous. The useful measurement would be: given explicit permission boundaries, how much can the agent accomplish within those constraints? That ratio of capability-within-constraints is a better proxy for production-ready autonomy than raw task duration.

prodigycorp•1h ago

I hate how Anthropic uses data. You can't convince me that what they are doing is "privacy preserving".

FuckButtons•1h ago

They're using React, they are very opaque, and they don't want you to use any other mechanism to interact with their model. They haven't left people a lot of room to trust them.

mrdependable•16m ago

I agree. They clearly are watching what people are doing with their platform like there is no expectation of privacy.

swyx•1h ago

My highlights and writeup are here: https://www.latent.space/p/ainews-anthropics-agent-autonomy

esafak•31m ago

I wonder why there was a big downturn at the turn of the year until Opus was released.

saezbaldo•6m ago

This measures what agents can do, not what they should be allowed to do. In production, the gap between capability and authorization is the real risk. We see this pattern in every security domain: capability grows faster than governance. Session duration tells you about model intelligence. It tells you nothing about whether the agent stayed within its authorized scope. The missing metric is permission utilization: what fraction of the agent's actions fell within explicitly granted authority?
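A rough sketch of the permission utilization idea in Python, assuming each agent action can be logged as a (tool, resource) pair and that granted authority is an explicit allowlist of such pairs. Both the `Action` type and the `permission_utilization` function are hypothetical names for illustration; real authorization models are usually richer than a flat allowlist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One logged agent action (hypothetical schema)."""
    tool: str
    resource: str


def permission_utilization(actions, granted):
    """Fraction of actions that fall inside the explicitly granted scope.

    `granted` is a set of (tool, resource) pairs, an assumed allowlist format.
    """
    if not actions:
        raise ValueError("no actions to evaluate")
    in_scope = sum(1 for a in actions if (a.tool, a.resource) in granted)
    return in_scope / len(actions)


granted = {("read_file", "repo"), ("write_file", "repo"), ("run_tests", "repo")}
log = [
    Action("read_file", "repo"),
    Action("write_file", "repo"),
    Action("run_tests", "repo"),
    Action("http_post", "billing-api"),  # outside the granted scope
]
print(permission_utilization(log, granted))  # 0.75
```

Under this sketch, a session's score is independent of how long it ran, which is the distinction being drawn from session duration.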
