Human-bench: an eval for "human shaped" agents

1•jam0xb797fd•1h ago

TLDR: the next important category of agents isn’t just "multiplayer" but human-shaped. today we’re releasing human-bench v0, a benchmark for any sufficiently human-shaped agent. we’d love feedback.

---

MY (STILL ABBREVIATED) BUT LONGER FORM THOUGHTS ON THE MATTER:

there's been much discussion about multiplayer agents recently [1], but the term is ill-defined. for instance, anthropic describes its new Claude Tag as multiplayer. and indeed it is, but it is not "human-shaped": there is only one Claude! [2] this can have weird downstream effects, such as other models claiming to be Claude, too! [3][4]

at the risk of anthropomorphising, our small team here at APC [5] believes there will be a substantial period of time in which frontier models and agent systems are powerful enough to do a large variety of useful work, but the world will still be optimised for humans. one sign of this: you can raise money from VCs today by promising to turn some human-optimal process into something agents can do more easily.

the implication of the above, in our view, is that during this period, "human-shaped" will be a useful thing for the vanguard of agents to be. and if we think human-shaped is a useful category, then we need a benchmark that measures progress in this area.

IN WHICH I OFFER A BRIEF DEFINITION OF "HUMAN SHAPED":

a "human-shaped" agent is one that can interact with the real world like we can. that is, a human-shaped agent should be able to use slack, browse the web, and use a desktop. [6] it should also be able to text, call, and email. [7]

doing this effectively generally requires a single, persistent identity and memory. without them, the agent is unequipped to handle the various threads and updates it is tasked with. altogether, this definition has the nice property that any agent with these abilities resembles a human, and humans generally feel as comfortable interacting with it as they do with one another.

WHICH BRINGS US TO A DISCUSSION OF EVALS:

one of the things we do here at APC is to build human-shaped agents. [8] we believe we do a pretty good job at this, but it's hard. [9]

one way to tell you're building something at the edge of ai capabilities is that you lack well-defined metrics and pre-built systems for measuring what you're doing.

so we built our own eval. we run it internally against any proposed change to our agent harness or environment, so we can measure the impact before pushing to prod.
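to make the "measure before pushing to prod" idea concrete, here is a minimal sketch of a regression gate. everything in it is an assumption for illustration: the metric names, the `gate_change` helper, and the 0.02 tolerance are hypothetical and are not taken from human-bench itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    regressions: dict[str, float]  # metric name -> size of the drop


def gate_change(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,  # hypothetical threshold, not from the post
) -> GateResult:
    """Flag every metric where the candidate drops more than `tolerance` below baseline."""
    regressions: dict[str, float] = {}
    for name, base_score in baseline.items():
        # A metric missing from the candidate run is treated as a score of 0.
        new_score = candidate.get(name, 0.0)
        drop = base_score - new_score
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return GateResult(passed=not regressions, regressions=regressions)


if __name__ == "__main__":
    # Hypothetical scores for the current harness and a proposed change.
    baseline = {"memory": 0.80, "accuracy": 0.90, "safety": 0.95}
    candidate = {"memory": 0.82, "accuracy": 0.85, "safety": 0.95}
    print(gate_change(baseline, candidate))
```

the design choice here is to compare per-metric rather than on one blended score, so a gain in one area cannot hide a loss in another.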

though we built it with our own agents in mind, we've recognized its potential use to the community and are working to generalize the system so that any sufficiently human-shaped agent [10] can be measured quantitatively, both on its performance with the tools above (slack, web, desktop, text, call, email) and on important qualities like memory, accuracy, and safety.
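as a rough illustration of what an entry check and score summary could look like, here is a sketch under stated assumptions. the eligibility rule follows footnote [10] (inbound and outbound texts, emails, calls); the `Agent` type, the capability encoding, and the result format are hypothetical and do not describe human-bench's actual interface.

```python
from dataclasses import dataclass, field
from statistics import mean

# Channels and directions taken from footnote [10]; the encoding is an assumption.
REQUIRED_CHANNELS = {"text", "email", "call"}
DIRECTIONS = {"inbound", "outbound"}


@dataclass
class Agent:
    name: str
    # Each capability is a (channel, direction) pair, e.g. ("email", "inbound").
    capabilities: set[tuple[str, str]] = field(default_factory=set)


def is_sufficiently_human_shaped(agent: Agent) -> bool:
    """True if the agent handles every required channel in both directions."""
    needed = {(c, d) for c in REQUIRED_CHANNELS for d in DIRECTIONS}
    return needed <= agent.capabilities


def summarize(results: list[dict]) -> dict[str, float]:
    """Average task scores per dimension (e.g. memory, accuracy, safety)."""
    by_dimension: dict[str, list[float]] = {}
    for r in results:
        by_dimension.setdefault(r["dimension"], []).append(r["score"])
    return {dim: mean(scores) for dim, scores in by_dimension.items()}


if __name__ == "__main__":
    full = Agent("demo", {(c, d) for c in REQUIRED_CHANNELS for d in DIRECTIONS})
    print(is_sufficiently_human_shaped(full))
    print(summarize([
        {"dimension": "memory", "score": 1.0},
        {"dimension": "memory", "score": 0.0},
        {"dimension": "safety", "score": 1.0},
    ]))
```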

to that end, we are excited to announce human-bench as the v0 of this community benchmark, and we are eagerly soliciting entrants who believe their agent is sufficiently human-shaped to compete in this trial. feedback is welcome.

- joseph, APC

---

[1]: https://news.ycombinator.com/item?id=48648039
[2]: https://x.com/joannejang/status/2069567286634267041?s=20
[3]: https://x.com/jmbollenbacher/status/2067361099612037610?s=20
[4]: https://x.com/peakcooper/status/2067062979091153030?s=20
[5]: the American Productivity Company, a small agent-research lab in san francisco, ca.
[6]: (the virtual world, already accessible to most agents)
[7]: (less accessible, though there are already start-ups which make it easier to plug tools like these into your agent. the memory and identity stack is left as an exercise to the reader.)
[8]: cf righthand.ai
[9]: imagine the space of edge cases when you give people a do-anything agent
[10]: that is, an agent which can freely handle inbound & outbound texts, emails, calls
