Show HN: RewardHackWatch – Reward hacking detector for LLM agents

https://github.com/aerosta/rewardhackwatch
1•aerosta•2h ago

Comments

aerosta•2h ago
After METR reported frontier models modifying tests and scoring code to inflate results, I wanted to see whether reward hacking could be detected at runtime from agent trajectories. I built an open-source prototype for that.

It combines a DistilBERT classifier trained on 5,391 MALT trajectories with 45 regex patterns and optional LLM judges (Claude, OpenAI, or local Llama via Ollama). It catches things like sys.exit(0) to fake passing, test rewriting, reference answer copying, and validator patching.

The part I'm most interested in feedback on is RMGI - a metric that tracks whether hack scores and misalignment scores begin correlating over a trajectory, inspired by Anthropic's finding that reward hacking can generalize into broader misaligned behavior. It's a first attempt and probably has issues.

Runs on CPU, ~50ms per trajectory. Also includes a local dashboard and a batch eval workbench for scoring JSONL files.

Research context:

METR: https://metr.org/blog/2025-06-05-recent-reward-hacking/
OpenAI: https://openai.com/index/chain-of-thought-monitoring/
Anthropic: https://arxiv.org/abs/2511.18397

Repo: https://github.com/aerosta/rewardhackwatch
Project page: https://aerosta.github.io/rewardhackwatch

Known limitations in the README. Happy to answer questions.
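
For readers who want a concrete picture of the two ideas described above, here is a minimal sketch. It is not taken from the RewardHackWatch repo. The pattern names, the two regexes, and the functions `flag_step` and `rolling_correlation` are all hypothetical, and the windowed Pearson correlation is only one plausible reading of "hack scores and misalignment scores begin correlating over a trajectory"; the actual RMGI definition may differ.

```python
import math
import re

# Hypothetical patterns for illustration only; the project's actual
# 45 regexes are not reproduced here.
HACK_PATTERNS = {
    "fake_pass_exit": re.compile(r"\bsys\.exit\(\s*0\s*\)"),
    "test_file_write": re.compile(r"\b(?:open|write)\b.*\btest_\w+\.py"),
}


def flag_step(text):
    """Return the names of the patterns that match one trajectory step."""
    return [name for name, pat in HACK_PATTERNS.items() if pat.search(text)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def rolling_correlation(hack, misalign, window=3):
    """Correlate two per-step score series over a sliding window.

    Returns one value per full window, so the result has
    len(hack) - window + 1 entries.
    """
    if len(hack) != len(misalign):
        raise ValueError("score series must have the same length")
    if window < 2:
        raise ValueError("window must be at least 2")
    return [
        pearson(hack[i - window:i], misalign[i - window:i])
        for i in range(window, len(hack) + 1)
    ]


if __name__ == "__main__":
    print(flag_step("import sys; sys.exit(0)"))
    hack = [0.1, 0.2, 0.1, 0.3, 0.5, 0.7, 0.9]
    misalign = [0.5, 0.4, 0.5, 0.3, 0.5, 0.7, 0.9]
    print(rolling_correlation(hack, misalign, window=3))
```

In this toy series the two scores move in opposite directions early on and in lockstep at the end, so the windowed correlation rises over the trajectory. A rising trend of that kind is what such a metric would be looking for, under the assumptions stated above.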

Show HN: Social proof works 2-7x better on AI shopping agents than humans

https://github.com/aaronbatchelder/claude-marketing-susceptibility-eval
1•aaronmb7•1m ago•0 comments

How the Government Deceived Congress in the Debate over Surveillance Powers (2013)

https://www.eff.org/deeplinks/2013/06/director-national-intelligences-word-games-explained-how-go...
4•doener•7m ago•0 comments

Show HN: Reflex – local code search engine and MCP server for AI coding

https://github.com/reflex-search/reflex
1•therecluse26•7m ago•0 comments

Bind 2 Port 0

https://bengarcia.dev/b2p0
1•hahahacorn•8m ago•0 comments

Poll: AI Winter

6•amelius•11m ago•2 comments

Show HN: AI Sees Me – CLIP running in the browser

https://www.howaiseesme.com/
1•jayyvk•11m ago•0 comments

SaaS in, SaaS out: Here's what's driving the SaaSpocalypse

https://techcrunch.com/2026/03/01/saas-in-saas-out-heres-whats-driving-the-saaspocalypse/
1•palad1n•12m ago•0 comments

Dbslice: Extract a slice of your production database to reproduce bugs

https://github.com/nabroleonx/dbslice
1•rbanffy•14m ago•0 comments

Show HN: Updater – one command for macOS app updates

https://github.com/lu-zhengda/updater
2•zhengda-lu•16m ago•0 comments

PEP 747 – Annotating Type Forms – peps.python.org

https://peps.python.org/pep-0747/
1•rbanffy•18m ago•0 comments

Show HN: AfterLive – Preserve a Loved One's Voice and Personality with AI

https://afterlive.ai
1•crawde•19m ago•0 comments

Samsung Galaxy S26 Ultra Privacy Display Testing

https://www.lttlabs.com/articles/2026/03/01/samsung-galaxy-s26-ultra-privacy-display
1•LabsLucas•20m ago•1 comment

Securing AI Model Weights

https://www.rand.org/pubs/research_reports/RRA2849-1.html
1•fi-le•21m ago•0 comments

The information space around military AI is being weaponized against us

https://weaponizedspaces.substack.com/p/the-information-space-around-military
3•rbanffy•24m ago•0 comments

Show HN: ContractPulse – Free intelligence on federal government contracts

https://contractpulse.io
2•signalstackhq•25m ago•0 comments

Sam Altman AMA on DoD Collaboration

https://twitter.com/sama/status/2027900042720498089
8•Palmik•25m ago•1 comment

"All Lawful Use": More Than You Wanted to Know

https://www.astralcodexten.com/p/all-lawful-use-much-more-than-you
2•pchristensen•28m ago•0 comments

Show HN: Agentic Gatekeeper – Auto-patch your code to enforce Markdown rules

https://github.com/revanthpobala/agentic-gatekeeper
1•revanth1108•29m ago•0 comments

Show HN: Deploybase – Compare GPU and LLM pricing across all major providers

https://deploybase.ai
1•grasper_•30m ago•0 comments

TPM-Sniffing LUKS Keys on an Embedded Linux Device [CVE-2026-0714]

https://www.cyloq.se/en/research/cve-2026-0714-tpm-sniffing-luks-keys-on-an-embedded-device
3•Tiberium•30m ago•1 comment

Palantir Sues Swiss Magazine for Accurate Report

https://www.techdirt.com/2026/02/27/palantir-sues-swiss-magazine-for-accurately-reporting-that-th...
6•doener•31m ago•0 comments

3D dashboard to monitor and control your AI coding agents in real-time

https://github.com/coding-by-feng/ai-agent-session-center
1•kasonzhan•35m ago•0 comments

$10M factory in 600sqft room

https://www.youtube.com/watch?v=hqGFcwyXYI0
1•humbfool2•37m ago•0 comments

The Zero-Server Code Intelligence Engine

https://github.com/abhigyanpatwari/GitNexus
1•mercat•41m ago•0 comments

Google quantum-proofs HTTPS by squeezing 15kB of data into 700-byte space

https://arstechnica.com/security/2026/02/google-is-using-clever-math-to-quantum-proof-https-certi...
2•naves•42m ago•0 comments

Why Does A.I. Write Like That?

https://www.nytimes.com/2025/12/03/magazine/chatbot-writing-style.html
1•paulpauper•42m ago•0 comments

Show HN: Habitat – A Self-Hosted Social Platform for Local Communities

https://github.com/carlnewton/habitat
2•carlnewton•42m ago•0 comments

AI Accelerates the Zombification of Academia

https://www.wsj.com/opinion/ai-accelerates-the-zombification-of-academia-tech-class-america-unive...
4•paulpauper•42m ago•0 comments

What I Wish I'd Known When I Was Younger

https://www.theatlantic.com/ideas/2025/12/elderly-happiness-advice-stress/685290/
1•paulpauper•43m ago•0 comments

Show HN: Mrkd – A native macOS Markdown viewer with iTerm2/VSCode theme import

https://github.com/jahala/mrkd
2•jahala•43m ago•0 comments