
Show HN: Amber, a capability-based runtime/compiler for agent benchmarks

https://github.com/RDI-Foundation/amber/

1•_nhynes•1h ago

Hi HN, since the Berkeley RDI benchmark integrity post recently got a lot of attention here [0], it seems like a good time to share Amber, related work aimed at making agent benchmarks easier to reproduce.

Amber grew out of the RDI AgentX-AgentBeats benchmarking competition [1] where the general public was invited to submit agents. To ensure trustworthy results, we needed submissions to be reproducible and have clear provenance. Reproducibility motivates declarative specifications of benchmarks, and provenance motivates the ability to safely and efficiently run benchmarks on hosted hardware. Once you add support for multi-phase multi-agent benchmarks (like Werewolf), the design for Amber mostly falls right out.

Amber is inspired by the Fuchsia OS Component Framework. In Amber's security model, a component such as an A2A agent or MCP tool serves only those components that have explicitly been given a capability to use it. For benchmarks, this means that an agent under test cannot reach into the evaluator, and that a tool can be revoked in a later phase of a benchmark.
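To make the deny-by-default idea concrete, here is a minimal Python sketch of capability-gated routing. It is not Amber's code or API: the `Router` class, the component names (`agent`, `evaluator`, `search_tool`), and the phase change are all hypothetical, chosen only to illustrate the two properties described above.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy stand-in for a guarded router (hypothetical, not Amber's API)."""

    # (consumer, provider) pairs that currently hold a capability.
    grants: set = field(default_factory=set)

    def grant(self, consumer: str, provider: str) -> None:
        self.grants.add((consumer, provider))

    def revoke(self, consumer: str, provider: str) -> None:
        self.grants.discard((consumer, provider))

    def call(self, consumer: str, provider: str, request: str) -> str:
        # Deny by default: without an explicit grant there is no route.
        if (consumer, provider) not in self.grants:
            raise PermissionError(f"{consumer} holds no capability for {provider}")
        return f"{provider} handled {request!r} for {consumer}"


def is_denied(router: Router, consumer: str, provider: str) -> bool:
    """Return True if the router refuses the call."""
    try:
        router.call(consumer, provider, "ping")
    except PermissionError:
        return True
    return False


router = Router()
router.grant("evaluator", "agent")    # the evaluator may query the agent under test
router.grant("agent", "search_tool")  # phase 1: the agent may use the tool

phase1_reply = router.call("agent", "search_tool", "lookup")

router.revoke("agent", "search_tool")  # phase 2: the tool is withdrawn
```

After the revocation, `is_denied(router, "agent", "search_tool")` is true, and `is_denied(router, "agent", "evaluator")` was true all along because that capability was never granted.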

Amber is a combination of a compiler and a runtime system: the compiler turns manifests describing agents, tools, and how they connect to each other into a deterministic plan. The plan can be executed against different backends like Docker, K8s, KVM, or the host OS. The compiler injects runtime components necessary to enforce the capability model: sidecar routers that provide guarded connectivity between components, and backend controllers that allow components to create and destroy components at runtime.
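As a rough illustration of what "deterministic plan" can mean, the Python sketch below canonicalizes a toy manifest so that the same components and connections always yield byte-identical output. The manifest shape (`components`, `uses`), the `compile_plan` function, and the plan fields (`start_order`, `routes`) are assumptions made up for this example and do not reflect Amber's actual manifest or plan format; ordering providers before consumers is likewise just one plausible choice.

```python
import json


def compile_plan(manifest: dict) -> str:
    """Turn a toy manifest into a canonical JSON plan (hypothetical format)."""
    components = manifest["components"]

    # Reject connections to components that were never declared.
    for name, spec in components.items():
        for dep in spec.get("uses", []):
            if dep not in components:
                raise ValueError(f"{name} uses undeclared component {dep}")

    order, done, visiting = [], set(), set()

    def visit(name: str) -> None:
        # Depth-first walk so that providers are listed before consumers.
        if name in done:
            return
        if name in visiting:
            raise ValueError(f"dependency cycle through {name}")
        visiting.add(name)
        for dep in sorted(components[name].get("uses", [])):
            visit(dep)
        visiting.discard(name)
        done.add(name)
        order.append(name)

    # Sorting at every step removes any dependence on input key order.
    for name in sorted(components):
        visit(name)

    routes = sorted(
        [consumer, provider]
        for consumer in components
        for provider in components[consumer].get("uses", [])
    )
    return json.dumps({"start_order": order, "routes": routes}, sort_keys=True)


manifest_a = {"components": {
    "evaluator": {"uses": ["agent"]},
    "agent": {"uses": ["search_tool"]},
    "search_tool": {},
}}
# Same components and connections, declared in a different order.
manifest_b = {"components": {
    "search_tool": {},
    "agent": {"uses": ["search_tool"]},
    "evaluator": {"uses": ["agent"]},
}}

plan_a = compile_plan(manifest_a)
plan_b = compile_plan(manifest_b)
```

Here `plan_a == plan_b`, so a plan can be hashed or diffed to check that two runs used the same benchmark topology. A real compiler would also have to pin images, inject routers and controllers, and target a specific backend, none of which is modeled here.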

Amber started out with just static `docker compose`, but benchmarks like TerminalBench and OSWorld required the addition of dynamic components and VM-backed components. Then competition participants wanted an easier way to test locally that didn't involve repeatedly rebuilding Docker images, so Amber got native binary support and a one-liner `amber run` interface. The concepts borrowed from Fuchsia have held up so far. Right now I'm working on making Amber's observability traces available to the benchmark evaluator so that it can judge based on the path an agent took, rather than just the final answer.

Overall, the goal we set out to achieve was to make it easy to reproduce agent benchmark results in a low-trust environment. Amber is not a complete solution, but it takes some burden off of benchmark authors and agent builders. Maybe it's even useful beyond benchmarks. I would be happy for you to batter the conceptual framework!

The AgentBeats tau2 benchmark manifest [2] is a real example. The in-tree mixed-site example [3] is a simple demo of Amber end-to-end with `amber run`.

[0]: https://news.ycombinator.com/item?id=47733217

[1]: https://rdi.berkeley.edu/agentx-agentbeats.html

[2]: https://github.com/RDI-Foundation/tau2-agentbeats/blob/main/...

[3]: https://github.com/RDI-Foundation/amber/tree/main/examples/m...
