Show HN: RewardHackBench: Using sandboxes to stop agents from cheating

https://github.com/islo-labs/reward-hack-bench

6•rotemtam•1h ago

hey all,

happy to share research i've been working on for islo.dev in recent months.

ever since the cheating agents (https://debugml.github.io/cheating-agents/) paper came out, revealing that reward hacking was 4x more prevalent than previously estimated, i've been looking into how we can deal with the issue.

the common approach (taken by the tbench team) is post hoc trajectory analysis.

i've been interested in the idea of reframing the problem as an endpoint security problem and tackling it via sandboxing.

i hope you find it interesting, and thanks to the islo.dev team for sponsoring this

happy to answer any Qs

Comments

adamgold7•54m ago

love this. we are actually looking at reward hacking from a cyber security perspective - refreshing (unless you're from Israel).

Any collaborators that want to join us?

yonSpektor•35m ago

Curious what the distribution of hacking strategies looked like across different models — would expect RL-heavy vs RLHF models to cheat very differently.
