Show HN: Cua-Bench – a benchmark for AI agents in GUI environments

40•someguy101010•1w ago

Hey HN, we're excited to share Cua-Bench ( https://github.com/trycua/cua ), an open-source framework for evaluating and training computer-use agents across different environments.

Computer-use agents show massive performance variance across different UIs—an agent with 90% success on Windows 11 might drop to 9% on Windows XP for the same task. The problem is OS themes, browser versions, and UI variations that existing benchmarks don't capture.

The existing benchmarks (OSWorld, Windows Agent Arena, AndroidWorld) were great but operated in silos—different harnesses, different formats, no standardized way to test the same agent across platforms. More importantly, they were evaluation-only. We needed environments that could generate training data and run RL loops, not just measure performance. Cua-Bench takes a different approach: it's a unified framework that standardizes environments across platforms and supports the full agent development lifecycle—benchmark, train, deploy.

With Cua-Bench, you can:

- Evaluate agents across multiple benchmarks with one CLI (native tasks + OSWorld + Windows Agent Arena adapters)

- Test the same agent on different OS variations (Windows 11/XP/Vista, macOS themes, Linux, Android via QEMU)

- Generate new tasks from natural language prompts

- Create simulated environments for RL training (shell apps like Spotify, Slack with programmatic rewards)

- Run oracle validations to verify environments before agent evaluation

- Monitor agent runs in real-time with traces and screenshots

All of this works on macOS, Linux, Windows, and Android, and is self-hostable.

To get started:

Install cua-bench:

% pip install cua-bench

Run a basic evaluation:

% cb run dataset datasets/cua-bench-basic --agent demo

Open the monitoring dashboard:

% cb run watch <run_id>

For parallelized evaluations across multiple workers:

% cb run dataset datasets/cua-bench-basic --agent your-agent --max-parallel 8

Want to test across different OS variations? Just specify the environment:

% cb run task slack_message --agent your-agent --env windows_xp

% cb run task slack_message --agent your-agent --env macos_sonoma

Generate new tasks from prompts:

% cb task generate "book a flight on kayak.com"

Validate environments with oracle implementations:

% cb run dataset datasets/cua-bench-basic --oracle

The simulated environments are particularly useful for RL training—they're HTML/JS apps that render across 10+ OS themes with programmatic reward verification. No need to spin up actual VMs for training loops.

We're seeing teams use Cua-Bench for:

- Training computer-use models on mobile and desktop environments

- Generating large-scale training datasets (working with labs on millions of screenshots across OS variations)

- RL fine-tuning with shell app simulators

- Systematic evaluation across OS themes and browser versions

- Building task registries (collaborating with Snorkel AI on task design and data curation, similar to their Terminal-Bench work)

Cua-Bench is 100% open-source under the MIT license. We're actively developing it as part of Cua (https://github.com/trycua/cua), our Computer Use Agent SDK, and we'd love your feedback, bug reports, or feature ideas.

GitHub: https://github.com/trycua/cua

Docs: https://cua.ai/docs/cuabench

Technical Report: https://cuabench.ai

We'll be here to answer any technical questions and look forward to your comments!

Comments

visarga•1w ago

Interesting, a computer use environment. I made a CUA benchmark too, 200 web tasks with internal code based evaluation. You can integrate them if you want.

https://github.com/UiPath/uipath_enterprise_benchmark

https://arxiv.org/abs/2511.17131

frabonacci•1w ago

Hey visarga - I'm the founder of Cua, we might have met at the CUA ICML workshop? The OS-agnostic VNC approach of your benchmark is smart and would make integration easy. We're open to collaborating - want to shoot me an email at f@trycua.com?

augusteo•1w ago

The trajectory export feature is smart. Evaluation and training data collection in the same tool.

I'm curious how the benchmarks handle non-determinism. Real GUIs have loading states, animations, popups that appear sometimes but not always. Does cuabench control for that, or is variance just part of the measurement?

Also interested in what "Windows Arena" tests specifically. Windows has so many edge cases - UAC prompts, driver install dialogs, random update notifications. Those feel like the hard mode for computer-use agents.

frabonacci•1w ago

Thanks - trajectory export was key for us since most teams want both eval and training data.

On non-determinism: we actually handle this in two ways. For our simulated environments (HTML/JS apps like the Slack/CRM clones), we control the full render state so there's no variance from animations or loading states. For native OS environments, we use explicit state verification before scoring - the reward function waits for expected elements rather than racing against UI timing. Still not perfect, but it filters out most flaky failures.

Windows Arena specifically - we're focusing on common productivity flows (file management, browser tasks, Office workflows) rather than the edge cases you mentioned. UAC prompts and driver dialogs are exactly the hard mode scenarios that break most agents today. We're not claiming to solve those yet, but that's part of why we're open-sourcing this - want to build out more adversarial tasks with the community.

rfw300•1w ago

Interesting project, but the lack of any actual benchmark results on existing models/agents is disappointing.

frabonacci•1w ago

Fair point - we just open-sourced this so benchmark results are coming. We're already working with labs on evals, focusing on tasks that are more realistic than OSWorld/Windows Agent Arena and curated with actual workers. If you want to run your agent on it we'd love to include your results.

alsetmusic•1w ago

I came across this last week when looking into how I'd run clawd (now Moltbot) on a Mac. I've been using their Lume cli for a few days to sandbox Claude and it's been reasonably good so far.

Expanding the VM's disk didn't propagate to the client; seems like it needs tooling like VMware-Tools to do some magic. I backed up the important stuff (I'd been using it less than a day) and created a new VM. That's been the only hurdle so far.

Anyway, I haven't used this, but thought it'd be useful to someone to hear that a related tool seems to be ok after ~five days of intermittent use. I've been an ESX (and recently Proxmox) fan in the past (and all the others cause I was curious), so I say this as someone who has kicked some VM rocks in my time.

As for locally run AI, this is the first time I've given one access to any data directly and I can't speak intelligently about any of that or this specific tool. Sorry.

arjunchint•1w ago

In our benchmark of web agents, we found that vision/GUI based agents get tripped up on popups/overlays, need large vision models and require using CDP in browsers.

Our own DOM-only web agent, rtrvr.ai, worked seamlessly underneath dialogs, can just use off the shelf Gemini Flash Lite and use Chrome native APIs leading to minimal infrastructure failures, SOTA performance and lowest cost.

https://www.rtrvr.ai/blog/web-bench-results

We Mourn Our Craft

Speed up responses with fast mode

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Al Lowe on model trains, funny deaths and working with Disney

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

The AI boom is causing shortages everywhere else

The Waymo World Model

Reinforcement Learning from Human Feedback

I Write Games in C (yes, C)

Start all of your commands with a comma (2009)

SectorC: A C Compiler in 512 bytes

Vocal Guide – belt sing without killing yourself

France's homegrown open source online office suite

Coding agents have replaced every framework I used

A Fresh Look at IBM 3270 Information Display System

Selection Rather Than Prediction

History and Timeline of the Proco Rat Pedal (2021)

72M Points of Interest

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Learning from context is harder than we thought

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Making geo joins faster with H3 indexes

Software factories and the agentic moment

We Mourn Our Craft

Speed up responses with fast mode

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Al Lowe on model trains, funny deaths and working with Disney

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

The AI boom is causing shortages everywhere else

The Waymo World Model

Reinforcement Learning from Human Feedback

I Write Games in C (yes, C)

Start all of your commands with a comma (2009)

SectorC: A C Compiler in 512 bytes

Vocal Guide – belt sing without killing yourself

France's homegrown open source online office suite

Coding agents have replaced every framework I used

A Fresh Look at IBM 3270 Information Display System

Selection Rather Than Prediction

History and Timeline of the Proco Rat Pedal (2021)

72M Points of Interest

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Learning from context is harder than we thought

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Making geo joins faster with H3 indexes

Software factories and the agentic moment

Show HN: Cua-Bench – a benchmark for AI agents in GUI environments

Comments