Show HN: Cua-Bench – a benchmark for AI agents in GUI environments

https://github.com/trycua/cua
21•someguy101010•1d ago
Hey HN, we're excited to share Cua-Bench (https://github.com/trycua/cua), an open-source framework for evaluating and training computer-use agents across different environments.

Computer-use agents show massive performance variance across different UIs—an agent with 90% success on Windows 11 might drop to 9% on Windows XP for the same task. The problem is OS themes, browser versions, and UI variations that existing benchmarks don't capture.
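To make the variance claim concrete, here is a minimal sketch of how per-environment success rates and their spread could be computed from run results. The record layout `(env, task, success)` and the sample data are hypothetical and invented for illustration; they are not Cua-Bench's actual output format.

```python
from collections import defaultdict


def success_rate_by_env(results):
    """Return {env: fraction of successful runs} from (env, task, success) records."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for env, _task, success in results:
        totals[env] += 1
        wins[env] += int(success)
    return {env: wins[env] / totals[env] for env in totals}


# Hypothetical run records, invented for illustration only.
results = [
    ("windows_11", "slack_message", True),
    ("windows_11", "slack_message", True),
    ("windows_xp", "slack_message", False),
    ("windows_xp", "slack_message", True),
]

rates = success_rate_by_env(results)
# Spread between the best and worst environment for the same task set.
spread = max(rates.values()) - min(rates.values())
print(rates, spread)
```

A large spread for the same task across environments is the kind of signal a single-environment benchmark would not surface.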

The existing benchmarks (OSWorld, Windows Agent Arena, AndroidWorld) were great but operated in silos—different harnesses, different formats, no standardized way to test the same agent across platforms. More importantly, they were evaluation-only. We needed environments that could generate training data and run RL loops, not just measure performance. Cua-Bench takes a different approach: it's a unified framework that standardizes environments across platforms and supports the full agent development lifecycle—benchmark, train, deploy.

With Cua-Bench, you can:

- Evaluate agents across multiple benchmarks with one CLI (native tasks + OSWorld + Windows Agent Arena adapters)

- Test the same agent on different OS variations (Windows 11/XP/Vista, macOS themes, Linux, Android via QEMU)

- Generate new tasks from natural language prompts

- Create simulated environments for RL training (shell apps like Spotify, Slack with programmatic rewards)

- Run oracle validations to verify environments before agent evaluation

- Monitor agent runs in real-time with traces and screenshots

All of this works on macOS, Linux, Windows, and Android, and is self-hostable.

To get started:

Install cua-bench:

% pip install cua-bench

Run a basic evaluation:

% cb run dataset datasets/cua-bench-basic --agent demo

Open the monitoring dashboard:

% cb run watch <run_id>

For parallelized evaluations across multiple workers:

% cb run dataset datasets/cua-bench-basic --agent your-agent --max-parallel 8

Want to test across different OS variations? Just specify the environment:

% cb run task slack_message --agent your-agent --env windows_xp

% cb run task slack_message --agent your-agent --env macos_sonoma

Generate new tasks from prompts:

% cb task generate "book a flight on kayak.com"

Validate environments with oracle implementations:

% cb run dataset datasets/cua-bench-basic --oracle
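As a rough illustration of what oracle validation checks, the sketch below replays a known-good action sequence against an environment and confirms the reward fires. `ToyCounterEnv` and `validate_with_oracle` are hypothetical names made up for this example; they are not part of the Cua-Bench API.

```python
class ToyCounterEnv:
    """Hypothetical environment: the task is to bring a counter to a target value."""

    def __init__(self, target):
        self.target = target
        self.value = 0

    def step(self, action):
        # Only two actions exist in this toy environment.
        if action == "increment":
            self.value += 1
        elif action == "reset":
            self.value = 0

    def reward(self):
        return 1.0 if self.value == self.target else 0.0


def validate_with_oracle(make_env, oracle_actions):
    """Replay a scripted reference solution and check that it earns full reward."""
    env = make_env()
    for action in oracle_actions:
        env.step(action)
    return env.reward() == 1.0


ok = validate_with_oracle(lambda: ToyCounterEnv(target=3), ["increment"] * 3)
print(ok)
```

If the reference solution cannot earn the reward, the task or its checker is broken, and any agent failures on it would be meaningless.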

The simulated environments are particularly useful for RL training—they're HTML/JS apps that render across 10+ OS themes with programmatic reward verification. No need to spin up actual VMs for training loops.
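For a sense of what programmatic reward verification can look like, here is a minimal sketch that scores an episode by inspecting the final state of a simulated chat app. The state layout and the function `message_sent_reward` are assumptions made for illustration, not the actual Cua-Bench implementation.

```python
def message_sent_reward(app_state, channel, expected_text):
    """Return 1.0 if the expected message appears in the channel, else 0.0."""
    messages = app_state.get("channels", {}).get(channel, [])
    return 1.0 if any(m.get("text") == expected_text for m in messages) else 0.0


# Hypothetical final state of a simulated chat app after an agent episode.
state_after_episode = {
    "channels": {"general": [{"author": "agent", "text": "standup at 10"}]}
}

reward = message_sent_reward(state_after_episode, "general", "standup at 10")
print(reward)
```

Because the check reads application state rather than pixels, the same reward function would apply unchanged across every OS theme the app renders in.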

We're seeing teams use Cua-Bench for:

- Training computer-use models on mobile and desktop environments

- Generating large-scale training datasets (working with labs on millions of screenshots across OS variations)

- RL fine-tuning with shell app simulators

- Systematic evaluation across OS themes and browser versions

- Building task registries (collaborating with Snorkel AI on task design and data curation, similar to their Terminal-Bench work)

Cua-Bench is 100% open-source under the MIT license. We're actively developing it as part of Cua (https://github.com/trycua/cua), our Computer Use Agent SDK, and we'd love your feedback, bug reports, or feature ideas.

GitHub: https://github.com/trycua/cua

Docs: https://cua.ai/docs/cuabench

Technical Report: https://cuabench.ai

We'll be here to answer any technical questions and look forward to your comments!

Comments

visarga•1h ago
Interesting, a computer-use environment. I made a CUA benchmark too: 200 web tasks with internal code-based evaluation. You can integrate them if you want.

https://github.com/UiPath/uipath_enterprise_benchmark

https://arxiv.org/abs/2511.17131

frabonacci•1h ago
Hey visarga - I'm the founder of Cua, we might have met at the CUA ICML workshop? The OS-agnostic VNC approach of your benchmark is smart and would make integration easy. We're open to collaborating - want to shoot me an email at f@trycua.com?
