Human-bench: an eval for "human shaped" agents

https://www.human-bench.com/leaderboard
1•jam0xb797fd•1h ago

Comments

jam0xb797fd•1h ago
TLDR: the next important category of agents isn’t just "multiplayer" but human-shaped. today we’re releasing human-bench v0, a benchmark for any sufficiently human-shaped agent. we’d love feedback.

---

MY (STILL ABBREVIATED) BUT LONGER FORM THOUGHTS ON THE MATTER:

there's been much discussion of multiplayer agents recently [1], but the term is ill-defined. for instance, anthropic describes their new Claude Tag as multiplayer. and indeed it is, but it is not "human-shaped": there is only one Claude! [2] this can have weird downstream effects, such as other models claiming to be Claude, too! [3][4]

at the risk of anthropomorphising, our small team here at APC [5] believes there will be a substantial period in which frontier models and agent systems are powerful enough to do a wide variety of useful work, but the world is still optimised for humans. as a sign of this, you can raise money from VCs today by promising to turn some human-optimal process into something agents can do more easily.

the implication of the above, in our view, is that during this period, "human-shaped" will be a useful thing for the vanguard of agents to be. and if we think human-shaped is a useful category, then we need a benchmark that measures progress in this area.

IN WHICH I OFFER A BRIEF DEFINITION OF "HUMAN SHAPED":

a "human-shaped" agent is one that can interact with the real world the way we can. that is, a human-shaped agent should be able to use slack, browse the web, and use a desktop. [6] it should also be able to text, call, and email. [7]

doing this effectively generally requires a single, persistent identity and memory. without a persistent identity, the agent is unequipped to handle the various threads and updates it is tasked with. altogether, this definition has the nice property that any agent with these abilities resembles a human, and humans generally feel as comfortable interacting with it as they do with other humans.
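
as a rough illustration, the definition above can be read as a checklist. the sketch below is a minimal one, assuming hypothetical channel names and a hypothetical `AgentProfile` record; none of it is human-bench's actual schema or entry criteria.

```python
from dataclasses import dataclass

# Hypothetical channel names lifted from the definition above; not a human-bench schema.
VIRTUAL = {"slack", "web", "desktop"}
REAL_WORLD = {"text", "call", "email"}


@dataclass(frozen=True)
class AgentProfile:
    """Illustrative description of what an agent can do (assumed structure)."""
    name: str
    channels: frozenset
    persistent_identity: bool
    persistent_memory: bool


def missing_for_human_shaped(agent: AgentProfile) -> list:
    """Return what the agent lacks; an empty list means it fits the definition."""
    # Missing channels come first, in alphabetical order.
    gaps = sorted((VIRTUAL | REAL_WORLD) - agent.channels)
    if not agent.persistent_identity:
        gaps.append("persistent identity")
    if not agent.persistent_memory:
        gaps.append("persistent memory")
    return gaps
```

an agent with every channel plus persistent identity and memory returns an empty list; anything else returns the specific gaps, which is more useful for feedback than a bare yes/no.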

WHICH BRINGS US TO A DISCUSSION OF EVALS:

one of the things we do here at APC is to build human-shaped agents. [8] we believe we do a pretty good job at this, but it's hard. [9]

one way to tell you're building something at the edge of ai capabilities is that you lack well-defined metrics and pre-built systems for measuring what you're doing.

we've been running human-bench internally on any proposed change to our agent harness or environment, so we can measure the impact before pushing to prod.

though we built it with our own agents in mind, we recognize its potential use to the community, and we are working to generalize the system so that any agent which is sufficiently human-shaped [10] can be measured quantitatively, both on its performance with these tools and on important qualities like memory, accuracy, and safety.
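
to make the "measure before pushing to prod" idea concrete, here is a minimal sketch of a regression gate over the three qualities named above. the 0-to-1 score scale, the `tolerance` value, and the function names are all assumptions for illustration; this is not how human-bench itself scores or gates anything.

```python
# Quality names come from the post; the 0-1 scale and default tolerance are assumptions.
QUALITIES = ("memory", "accuracy", "safety")


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return qualities where the candidate drops below baseline by more than tolerance."""
    drops = {}
    for quality in QUALITIES:
        delta = candidate[quality] - baseline[quality]
        if delta < -tolerance:
            # Round only for readable reporting; the comparison uses the raw delta.
            drops[quality] = round(delta, 4)
    return drops


def safe_to_ship(baseline: dict, candidate: dict, tolerance: float = 0.02) -> bool:
    """A change passes the gate only if no quality regressed beyond tolerance."""
    return not regressions(baseline, candidate, tolerance)
```

checking each quality separately, rather than averaging them, means a gain in memory cannot hide a drop in safety.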

to that end, we are excited to announce human-bench as the v0 of this community benchmark, and we are eagerly soliciting entrants who believe their agent is sufficiently human-shaped to compete in this trial. feedback is welcome.

- joseph, APC

---

[1]: https://news.ycombinator.com/item?id=48648039
[2]: https://x.com/joannejang/status/2069567286634267041?s=20
[3]: https://x.com/jmbollenbacher/status/2067361099612037610?s=20
[4]: https://x.com/peakcooper/status/2067062979091153030?s=20
[5]: the American Productivity Company, a small agent-research lab in san francisco, ca.
[6]: (the virtual world, already accessible to most agents)
[7]: (less accessible, though there are already start-ups which make it easier to plug tools like these into your agent. the memory and identity stack is left as an exercise to the reader.)
[8]: cf righthand.ai
[9]: imagine the space of edge cases when you give people a do-anything agent
[10]: that is, an agent which can freely handle inbound & outbound texts, emails, calls

New split layout framework for nearly all Apple platforms (macOS, iOS, etc.)

https://twitter.com/mitchellh/status/2070273858154987537
1•simonebrunozzi•1m ago•0 comments

Rejecting Emails on AS Level

https://blog.vasi.li/june-spam-wave/
1•vsviridov•3m ago•1 comments

Tachio – Free esports API covering 13 games

https://tachiosports.com
1•domktt•3m ago•0 comments

YayText

https://yaytext.com/
1•visviva•6m ago•0 comments

Show HN: Mantis, A self-hosted LLM gateway

https://github.com/mantis-llm-gateway
2•rizsyed1•7m ago•0 comments

VibePHP

https://github.com/mnapoli/vibephp
1•_Microft•8m ago•0 comments

Building Voice AI Workflows with Branches Instead of One Giant Prompt

https://github.com/team-telnyx/telnyx-code-examples/tree/main/build-conversational-workflow-nodejs
1•anushathukral•9m ago•0 comments

Show HN: A free ACP payments module that adds Stripe payments to MCP tools

https://www.afcommerce.com/free-acp-payments-module/
1•abratabia•10m ago•0 comments

From API to Ontology: An Architecture for On-Demand Semantic Digital Twins

https://blog.ptidej.net/from-api-to-ontology-an-architecture-for-on-demand-semantic-digital-twins/
3•viniciusmioto•10m ago•0 comments

Summary of METR's predeployment evaluation of GPT-5.6 Sol

https://metr.org/blog/2026-06-26-gpt-5-6-sol/
1•pongogogo•10m ago•1 comments

Long Wave radio era set to end with Droitwich switch-off

https://www.bbc.com/news/articles/c74yn7v7k4qo
1•speckx•10m ago•0 comments

Amazon wouldn't let me, so I built my own 20TB Snowball [video]

https://www.youtube.com/watch?v=v0DEI4Ad7Ik
2•tambourine_man•11m ago•0 comments

Akrites: The Latest Attempt to Protect Open-Source from AI Attacks Has Arrived

https://devops.com/akrites-the-latest-attempt-to-protect-open-source-from-ai-attacks-has-arrived/
1•CrankyBear•12m ago•0 comments

A list of software and other offerings with free developer tiers

https://github.com/ripienaar/free-for-dev
2•faradtech•15m ago•0 comments

The French are painting their windows with chalk to beat the heat

https://www.bbc.com/future/article/20260625-why-the-french-are-painting-chalk-on-their-windows
3•geox•17m ago•0 comments

My-Pi Coding-Agent

https://github.com/spences10/my-pi
2•mainsong•17m ago•0 comments

HTML: Composer and Perl HTML Templating

https://rawley.xyz/posts/html-composer.html
1•rawleyfowler•18m ago•0 comments

My Love-Hate Relationship with Page Builders

https://eliotdill.substack.com/p/my-love-hate-relationship-with-page
1•DillyDally125•18m ago•0 comments

New Intel Linux Driver Patches Enable HDR over DP MST Connections

https://lore.kernel.org/dri-devel/20260626175510.3899476-1-gildekel@google.com/
1•DemiGuru•20m ago•0 comments

Roblox parental controls are a dystopian security disaster

1•notsure357•20m ago•1 comments

QA/Testing at Startups

1•ovi_firstqa•22m ago•0 comments

The EU Wants to Grow Homegrown Tech. Its Courts Keep Making That Impossible

http://www.techdirt.com/2026/06/26/the-eu-wants-to-grow-homegrown-tech-its-courts-keep-making-tha...
3•beardyw•24m ago•0 comments

In Loving Memory of Om Malik – Hodinkee

https://www.hodinkee.com/articles/in-loving-memory-of-om-malik-friend-writer-venture-capitalist-a...
1•adamfuhrer•24m ago•0 comments

AI Models Directory (To Compare)

https://aimodels.directory/
1•entempsllc•25m ago•0 comments

Transformers Explained for Software Engineers

https://bharad.dev/blog/transformers-and-attention
1•bharadwajp•25m ago•0 comments

Europe's largest datacentre hub leaves town sweltering

https://www.theguardian.com/environment/2026/jun/26/slough-is-like-an-experiment-europes-largest-...
1•speckx•26m ago•0 comments

Designing a Personal Pebble Watchface

https://www.jonashietala.se/blog/2026/06/26/designing_a_personal_pebble_watchface/
1•lawn•29m ago•0 comments

Auto-Charge Tracker makes Steam Controller move toward its charging dock

https://videocardz.com/newz/modder-makes-steam-controller-move-itself-to-the-charging-puck
1•LorenDB•34m ago•0 comments

The Ontological Consequences of AGI Autonomy

https://secondexpulsion.substack.com/p/the-apple-the-serpent-and-the-outside
2•cosmosjang•34m ago•0 comments

Cory Doctorow on the Right – and Wrong – Way to Criticize AI

https://jacobin.com/2026/06/ai-bubble-layoffs-workers-copyright
1•thunderbong•37m ago•0 comments