Show HN: Amber, a capability-based runtime/compiler for agent benchmarks

https://github.com/RDI-Foundation/amber/
1•_nhynes•1h ago
Hi HN, since the Berkeley RDI benchmark integrity post recently got a lot of attention here [0], it seems like a good time to share Amber, related work aimed at making agent benchmarks easier to reproduce.

Amber grew out of the RDI AgentX-AgentBeats benchmarking competition [1] where the general public was invited to submit agents. To ensure trustworthy results, we needed submissions to be reproducible and have clear provenance. Reproducibility motivates declarative specifications of benchmarks, and provenance motivates the ability to safely and efficiently run benchmarks on hosted hardware. Once you add support for multi-phase multi-agent benchmarks (like Werewolf), the design for Amber mostly falls right out.

Amber is inspired by the Fuchsia OS Component Framework. In Amber's security model, a component such as an A2A agent or MCP tool serves only those components that have explicitly been given a capability to use it. In the context of benchmarks, this means that an agent under test cannot reach into the evaluator, and that a tool can be revoked in a later phase of a benchmark.
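
To make the capability model concrete, here is a minimal Python sketch of default-deny routing with revocation. It is an illustration only, not Amber's implementation or API: `CapabilityTable`, `route`, and the component names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityTable:
    """Hypothetical grant table holding allowed (consumer, provider) pairs."""

    grants: set[tuple[str, str]] = field(default_factory=set)

    def grant(self, consumer: str, provider: str) -> None:
        # Record that `consumer` may call `provider`.
        self.grants.add((consumer, provider))

    def revoke(self, consumer: str, provider: str) -> None:
        # Remove the grant, e.g. when a benchmark moves to a later phase.
        self.grants.discard((consumer, provider))

    def can_call(self, consumer: str, provider: str) -> bool:
        # Default deny: only an explicit grant allows the call.
        return (consumer, provider) in self.grants


def route(table: CapabilityTable, consumer: str, provider: str) -> str:
    """Stand-in for a guarded router: refuse calls that lack a capability."""
    if not table.can_call(consumer, provider):
        raise PermissionError(f"{consumer} has no capability for {provider}")
    return f"{consumer} -> {provider}"


table = CapabilityTable()
table.grant("evaluator", "agent_under_test")   # evaluator may call the agent
table.grant("agent_under_test", "search_tool")  # agent may call one tool

print(route(table, "evaluator", "agent_under_test"))

# Grants are directional, so the agent cannot reach back into the evaluator.
print(table.can_call("agent_under_test", "evaluator"))

# Revoking the tool for a later phase removes the agent's access to it.
table.revoke("agent_under_test", "search_tool")
print(table.can_call("agent_under_test", "search_tool"))
```

Because a grant names a direction, giving the evaluator access to the agent says nothing about the reverse path, which is the property the paragraph above relies on.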

Amber is a combination of a compiler and a runtime system: the compiler turns manifests describing agents, tools, and how they connect to each other into a deterministic plan. The plan can be executed against different backends like Docker, K8s, KVM, or the host OS. The compiler injects runtime components necessary to enforce the capability model: sidecar routers that provide guarded connectivity between components, and backend controllers that allow components to create and destroy components at runtime.
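
One way to picture "manifest in, deterministic plan out" is the Python sketch below. It assumes a made-up manifest shape (a `components` list and a `links` list of `from`/`to` pairs) and is not Amber's manifest format or compiler; `compile_plan` and `plan_digest` are hypothetical names. The point it shows is canonical ordering: two manifests that declare the same things in a different order yield the same plan and the same digest.

```python
import hashlib
import json


def compile_plan(manifest: dict) -> dict:
    """Turn a (hypothetical) manifest into a canonically ordered plan."""
    components = sorted(manifest["components"])
    links = sorted((link["from"], link["to"]) for link in manifest["links"])

    # Reject links that mention components the manifest never declared.
    declared = set(components)
    for src, dst in links:
        if src not in declared or dst not in declared:
            raise ValueError(f"link {src} -> {dst} uses an undeclared component")

    return {
        "components": components,
        "routes": [{"from": src, "to": dst} for src, dst in links],
    }


def plan_digest(plan: dict) -> str:
    """Hash a canonical JSON encoding so equal plans share one digest."""
    encoded = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


manifest_a = {
    "components": ["evaluator", "agent", "tool"],
    "links": [
        {"from": "evaluator", "to": "agent"},
        {"from": "agent", "to": "tool"},
    ],
}
# Same declarations as manifest_a, written in a different order.
manifest_b = {
    "components": ["tool", "agent", "evaluator"],
    "links": [
        {"from": "agent", "to": "tool"},
        {"from": "evaluator", "to": "agent"},
    ],
}

same = plan_digest(compile_plan(manifest_a)) == plan_digest(compile_plan(manifest_b))
print(same)
```

A digest like this could be recorded next to a benchmark result so that a later run can check it executed the same plan, though whether Amber does this is not stated in the post.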

Amber started out with just static `docker compose`, but benchmarks like TerminalBench and OSWorld required the addition of dynamic components and VM-backed components. Then competition participants wanted an easier way to test locally that didn't involve repeatedly rebuilding Docker images, so Amber got native binary support and a one-liner `amber run` interface. The concepts borrowed from Fuchsia have held up so far. Right now I'm working on making Amber's observability traces available to the benchmark evaluator so that it can judge based on the path an agent took, rather than just the final answer.

Overall, the goal we set out to achieve was to make it easy to reproduce agent benchmark results in a low-trust environment. Amber is not a complete solution, but it takes some burden off of benchmark authors and agent builders. Maybe it's even useful beyond benchmarks. I would be happy for you to batter the conceptual framework!

The AgentBeats tau2 benchmark manifest [2] is a real example. The in-tree mixed-site example [3] is a simple demo of Amber end-to-end with `amber run`.

[0]: https://news.ycombinator.com/item?id=47733217

[1]: https://rdi.berkeley.edu/agentx-agentbeats.html

[2]: https://github.com/RDI-Foundation/tau2-agentbeats/blob/main/...

[3]: https://github.com/RDI-Foundation/amber/tree/main/examples/m...

Show HN: SkillCompass – open-source quality evaluator for your AI skills

https://github.com/Evol-ai/SkillCompass
1•yo103jg•59s ago•0 comments

Turbo Pascal on Your iPhone

https://pascal.kulman.sk
1•ig0r0•4m ago•0 comments

Cursortab.nvim: Edit-Completions for Neovim

https://github.com/cursortab/cursortab.nvim
1•leonardcser•5m ago•0 comments

The Command Line that never died

https://ajitem.com/blog/iron-core-part-3-the-command-line-that-never-died/
2•Airplanepasta•5m ago•0 comments

Show HN: I built a social media management tool in 3 weeks with Claude and Codex

https://github.com/brightbeanxyz/brightbean-studio
1•JanSchu•8m ago•0 comments

Go-overlay: Nix overlay for complete go development environment

https://github.com/purpleclay/go-overlay
1•hambes•10m ago•0 comments

Six Characters

https://ajitem.com/blog/iron-core-part-2-six-characters/
1•Airplanepasta•10m ago•0 comments

AI conditionally allowed in the Linux kernel. "Linux lays down the law on AI.."

https://www.tomshardware.com/software/linux/linux-lays-down-the-law-on-ai-generated-code-yes-to-c...
1•aleksjess•12m ago•0 comments

Universal surface-growth law confirmed in two dimensions after 40 years

https://phys.org/news/2026-04-universal-surface-growth-law-dimensions.html
1•thunderbong•16m ago•0 comments

How to Reproduce Container Images

https://dangerzone.rocks/news/2026-03-02-repro-build/
1•almet•21m ago•0 comments

Defender – Local prompt injection detection for AI agents (no API calls)

https://www.npmjs.com/package/@stackone/defender
1•Hiskias•21m ago•0 comments

Abundant Ways to Address Scarcity

https://thelivingfossils.substack.com/p/abundant-ways-to-address-scarcity
1•paulpauper•23m ago•0 comments

Introduction to Spherical Harmonics for Graphics Programmers

https://gpfault.net/posts/sph.html
1•luu•29m ago•0 comments

They're Rich but Not Famous–and They're Suddenly Everywhere

https://www.wsj.com/economy/wealthy-americans-us-economy-dba0d26a
1•paulpauper•30m ago•0 comments

What if a few AI companies end up with all the money and power?

https://www.noahpinion.blog/p/what-if-a-few-ai-companies-end-up
2•paulpauper•33m ago•0 comments

Why do NES colors look so different in emulators? [video]

https://www.youtube.com/watch?v=7JupB4QHyGI
3•JuniperMesos•38m ago•0 comments

How to Monetize a Mobile App – 6 Proven Strategies That Work

https://apparencekit.dev/blog/how-to-monetize-mobile-app/
1•macfleid•40m ago•0 comments

Who's Hacking CRA Accounts?

https://www.cbc.ca/newsinteractives/features/whos-hacking-cra-accounts
1•luu•40m ago•0 comments

AI went viral among attorneys. We have the numbers on what happened next

https://www.theregister.com/2026/04/13/ai_attorneys/
2•jjgreen•42m ago•1 comments

Invisible Scars (2024)

https://www.psychologytoday.com/us/blog/beyond-school-walls/202410/invisible-scars
1•mpweiher•42m ago•0 comments

Utf8Regex – UTF-8 Regex for .NET (using SIMD/AVX)

https://github.com/Lokad/Utf8Regex
1•vermorel•44m ago•0 comments

Dynamic Export Rate Pilot – San Diego Gas and Electric [video]

https://www.youtube.com/watch?v=K3Gavo8OpyY
1•thelastgallon•45m ago•0 comments

Fastfind, a fast and featureful replacement to find and fd

https://github.com/RobertFlexx/fastfind
1•Kokonico•47m ago•0 comments

Convenient Trust Management for Emacs

https://github.com/eshelyaron/trust-manager
2•oskardrums•49m ago•0 comments

Finding Widespread Cheating on Popular Agent Benchmarks

https://debugml.github.io/cheating-agents/
1•stared•50m ago•0 comments

Ask HN: Do have any SaaS idea that give me knowledge of the business and money?

1•SRMohitkr•50m ago•0 comments

Brain on Poverty: Why Poor People Seem to Make Bad Decisions (2013)

https://www.theatlantic.com/business/archive/2013/11/your-brain-on-poverty-why-poor-people-seem-t...
1•downbad_•51m ago•2 comments

Benchmark LLM Inference on WebGPU

https://arxiv.org/abs/2604.02344
1•yu3zhou4•52m ago•0 comments

The McDonalds Monopoly Fraud (2014)

https://priceonomics.com/the-mcdonalds-monopoly-fraud/
1•downbad_•53m ago•1 comments

Ideomotor Phenomenon

https://en.wikipedia.org/wiki/Ideomotor_phenomenon
1•thinkingemote•54m ago•0 comments