Show HN: Terminal-Wrench, a dataset of 331 realistic hackable environments

https://github.com/few-sh/terminal-wrench
6•neversupervised•3h ago
I want to share a new dataset of 331 reward-hackable environments. These are real environments used in Terminal Bench and adjacent benchmarks. I first got interested in this because, as a reviewer of Terminal Bench, I noticed a lot of our tasks were hackable. I also noticed that many people contribute to the benchmark because it gives them credibility when selling environments to labs. Hence, TBench tasks are, in my opinion, held to a higher quality standard than the environments being used today for RL. No one is spending hours manually reviewing the $1B in tasks being purchased by major labs. As far as I understand, while everyone knows environments are hackable, nobody has released hundreds of "realistic" hackable environments.

Comments

kxzh•2h ago
how is it different from the berkeley 100% hack? https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

Show HN: Plain – The full-stack Python framework designed for humans and agents

https://github.com/dropseed/plain
71•focom•10h ago•27 comments

Show HN: LangAlpha – what if Claude Code was built for Wall Street?

https://github.com/ginlix-ai/langalpha
119•zc2610•13h ago•38 comments

Show HN: Keynot – Kill PowerPoint with HTML

https://github.com/shawnzam/keynot
2•shawnzam•1h ago•0 comments

Show HN: Kelet – Root Cause Analysis agent for your LLM apps

https://kelet.ai/
39•almogbaku•11h ago•19 comments

Show HN: OpenRig – agent harness that runs Claude Code and Codex as one system

https://github.com/mvschwarz/openrig
4•mschwarz•4h ago•1 comment

Show HN: Ithihāsas – a character explorer for Hindu epics, built in a few hours

https://www.ithihasas.in
169•cvrajeesh•1d ago•44 comments

Show HN: Run GUIs as Scripts

https://github.com/skinnyjames/hokusai-pocket
20•zero-st4rs•4d ago•7 comments

Show HN: Uninum – All elementary functions from a single operator, in Python

https://github.com/Brumbelow/uninum
3•brumbelow•6h ago•1 comment

Show HN: Run Python tools on Rust agents

https://github.com/eggermarc/tools-rs
2•eggermarc•7h ago•0 comments

Show HN: Send physical postcards from your coding harness

https://api.melonpost.com/SKILL.md
2•thevelop•8h ago•1 comment

Show HN: boringBar – a taskbar-style dock replacement for macOS

https://boringbar.app/
511•a-ve•2d ago•296 comments

Show HN: A stateful UI runtime for reactive web apps in Go

https://github.com/doors-dev/doors
11•derstruct•19h ago•4 comments

Show HN: A Claude Code–driven tutor for learning algorithms in Go

https://github.com/zuzuleinen/algotutor/
4•zuzuleinen•10h ago•0 comments

Show HN: Hacienda-CLI – CLI to reconcile Spanish tax returns with the tax agency

https://github.com/jatorre/hacienda-cli
2•jatorre•10h ago•0 comments

Show HN: AriaType – open-source privacy-first and local-first voice-to-text app

https://github.com/joe223/AriaType
3•Joe_Harris•12h ago•1 comment

Show HN: VibeDrift – Measure drift in AI-generated codebases

https://www.vibedrift.ai/
4•samiahmadkhan•17h ago•13 comments

Show HN: A memory database that forgets, consolidates, and detects contradiction

https://github.com/yantrikos/yantrikdb-server
46•pranabsarkar•12h ago•31 comments

Show HN: MōBrowser, a TypeScript-first desktop app framework with typed IPC

https://teamdev.com/mobrowser/
5•Ikryanov•12h ago•0 comments

Show HN: Pushduck – S3 uploads that run on Cloudflare Workers, no AWS SDK

11•abhay_ramesh•21h ago•7 comments

Show HN: Oberon System 3 runs natively on Raspberry Pi 3 (with ready SD card)

https://github.com/rochus-keller/OberonSystem3Native/releases
240•Rochus•2d ago•109 comments

Show HN: We built an MCP for Windows – ask Claude about CPU, temps, and privacy

https://github.com/AppControlLabs/appcontrol-mcp-go/
7•suprnurd•13h ago•5 comments

Show HN: Deflect One – command line dashboard for managing Linux servers via SSH

https://github.com/Frytskyy/deflect-one
8•whitemanv•22h ago•6 comments

Show HN: Pardonned.com – A searchable database of US Pardons

498•vidluther•3d ago•273 comments

Show HN: Mcptube – Karpathy's LLM Wiki idea applied to YouTube videos

https://github.com/0xchamin/mcptube
13•0xchamin•1d ago•2 comments

Show HN: Tracking takedown notices filed by UK Biobank

https://biobank.rocher.lc
2•Cynddl•15h ago•0 comments

Show HN: Claudraband – Claude Code for the Power User

https://github.com/halfwhey/claudraband
118•halfwhey•2d ago•44 comments

Show HN: I built a social media management tool in 3 weeks with Claude and Codex

https://github.com/brightbeanxyz/brightbean-studio
186•JanSchu•1d ago•128 comments

Show HN: FluidCAD – Parametric CAD with JavaScript

https://fluidcad.io/
156•maouida•4d ago•38 comments

Show HN: A CLI that writes its own integration code

https://docs.superglue.cloud/getting-started/cli-skills
15•adinagoerres•19h ago•10 comments