Show HN: Cua-Bench – a benchmark for AI agents in GUI environments

https://github.com/trycua/cua
22•someguy101010•2d ago
Hey HN, we're excited to share Cua-Bench (https://github.com/trycua/cua), an open-source framework for evaluating and training computer-use agents across different environments.

Computer-use agents show massive performance variance across different UIs: an agent with 90% success on Windows 11 might drop to 9% on Windows XP for the same task. The problem is variation in OS themes, browser versions, and UI details, which existing benchmarks don't capture.
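To make that variance concrete, here is a minimal sketch of how per-environment success rates and the gap between them could be computed from raw run outcomes. The function names and the `(env_name, succeeded)` input shape are assumptions made for illustration; they are not part of Cua-Bench's output format.

```python
from collections import defaultdict


def success_rate_by_env(results):
    """Map each environment name to its success rate.

    `results` is an iterable of (env_name, succeeded) pairs (hypothetical shape).
    """
    counts = defaultdict(lambda: [0, 0])  # env -> [successes, attempts]
    for env, succeeded in results:
        counts[env][0] += int(succeeded)
        counts[env][1] += 1
    return {env: wins / total for env, (wins, total) in counts.items()}


def spread(rates):
    """Gap between the best and the worst environment."""
    return max(rates.values()) - min(rates.values())
```

Feeding it the same task run on two environments gives one rate per environment, and `spread` reports how far apart they are.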

The existing benchmarks (OSWorld, Windows Agent Arena, AndroidWorld) are great but operate in silos: different harnesses, different formats, and no standardized way to test the same agent across platforms. More importantly, they are evaluation-only. We needed environments that could generate training data and run RL loops, not just measure performance. Cua-Bench takes a different approach: it's a unified framework that standardizes environments across platforms and supports the full agent development lifecycle (benchmark, train, deploy).

With Cua-Bench, you can:

- Evaluate agents across multiple benchmarks with one CLI (native tasks + OSWorld + Windows Agent Arena adapters)

- Test the same agent on different OS variations (Windows 11/XP/Vista, macOS themes, Linux, Android via QEMU)

- Generate new tasks from natural language prompts

- Create simulated environments for RL training (shell apps such as Spotify and Slack, with programmatic rewards)

- Run oracle validations to verify environments before agent evaluation

- Monitor agent runs in real-time with traces and screenshots

All of this works on macOS, Linux, Windows, and Android, and is self-hostable.

To get started:

Install cua-bench:

% pip install cua-bench

Run a basic evaluation:

% cb run dataset datasets/cua-bench-basic --agent demo

Open the monitoring dashboard:

% cb run watch <run_id>

For parallelized evaluations across multiple workers:

% cb run dataset datasets/cua-bench-basic --agent your-agent --max-parallel 8

Want to test across different OS variations? Just specify the environment:

% cb run task slack_message --agent your-agent --env windows_xp

% cb run task slack_message --agent your-agent --env macos_sonoma
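The two commands above can be wrapped in a small sweep script. This is a minimal sketch that assumes only the `cb run task <task> --agent <agent> --env <env>` form shown in this post; the helper names `build_commands` and `run_sweep` are hypothetical, and it defaults to a dry run that prints the commands without executing them.

```python
import subprocess

# Environment names taken from the two examples in the post.
ENVS = ["windows_xp", "macos_sonoma"]


def build_commands(task, agent, envs):
    """Return one `cb run task` argument list per environment."""
    return [
        ["cb", "run", "task", task, "--agent", agent, "--env", env]
        for env in envs
    ]


def run_sweep(task, agent, envs, dry_run=True):
    """Print each command; execute it only when dry_run is False.

    Returns {env_name: exit_code} for the commands that were executed.
    """
    exit_codes = {}
    for cmd in build_commands(task, agent, envs):
        print(" ".join(cmd))
        if not dry_run:
            env_name = cmd[-1]
            exit_codes[env_name] = subprocess.run(cmd).returncode
    return exit_codes


if __name__ == "__main__":
    run_sweep("slack_message", "your-agent", ENVS)
```

Passing `dry_run=False` runs the commands one after another; any environment names beyond the two shown here would need to be checked against the docs.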

Generate new tasks from prompts:

% cb task generate "book a flight on kayak.com"

Validate environments with oracle implementations:

% cb run dataset datasets/cua-bench-basic --oracle

The simulated environments are particularly useful for RL training—they're HTML/JS apps that render across 10+ OS themes with programmatic reward verification. No need to spin up actual VMs for training loops.
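As an illustration of what programmatic reward verification can look like, here is a minimal sketch of a reward function that inspects the final state of a simulated Slack app. Every name in it (`Message`, `SlackSimState`, `reward_message_sent`) is hypothetical and chosen for this example; it is not Cua-Bench's actual API, and the real environments may expose state differently.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    channel: str
    text: str


@dataclass
class SlackSimState:
    """Hypothetical end-of-episode state exported by a simulated Slack app."""
    messages: list = field(default_factory=list)


def reward_message_sent(state, channel, expected_text):
    """Return 1.0 if a matching message exists in the channel, else 0.0.

    Matching ignores case and surrounding whitespace.
    """
    wanted = expected_text.strip().lower()
    for msg in state.messages:
        if msg.channel == channel and msg.text.strip().lower() == wanted:
            return 1.0
    return 0.0
```

Because the check reads application state rather than pixels, the same reward function would apply no matter which OS theme the app was rendered with.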

We're seeing teams use Cua-Bench for:

- Training computer-use models on mobile and desktop environments

- Generating large-scale training datasets (working with labs on millions of screenshots across OS variations)

- RL fine-tuning with shell app simulators

- Systematic evaluation across OS themes and browser versions

- Building task registries (collaborating with Snorkel AI on task design and data curation, similar to their Terminal-Bench work)

Cua-Bench is 100% open-source under the MIT license. We're actively developing it as part of Cua (https://github.com/trycua/cua), our Computer Use Agent SDK, and we'd love your feedback, bug reports, or feature ideas.

GitHub: https://github.com/trycua/cua

Docs: https://cua.ai/docs/cuabench

Technical Report: https://cuabench.ai

We'll be here to answer any technical questions and look forward to your comments!

Comments

visarga•1h ago
Interesting, a computer-use environment. I made a CUA benchmark too: 200 web tasks with internal code-based evaluation. You can integrate them if you want.

https://github.com/UiPath/uipath_enterprise_benchmark

https://arxiv.org/abs/2511.17131

frabonacci•1h ago
Hey visarga - I'm the founder of Cua, we might have met at the CUA ICML workshop? The OS-agnostic VNC approach of your benchmark is smart and would make integration easy. We're open to collaborating - want to shoot me an email at f@trycua.com?
