Show HN: Cua-Bench – a benchmark for AI agents in GUI environments

https://github.com/trycua/cua
40•someguy101010•1w ago
Hey HN, we're excited to share Cua-Bench (https://github.com/trycua/cua), an open-source framework for evaluating and training computer-use agents across different environments.

Computer-use agents show massive performance variance across different UIs—an agent with 90% success on Windows 11 might drop to 9% on Windows XP for the same task. The problem is OS themes, browser versions, and UI variations that existing benchmarks don't capture.
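To make the variance claim concrete, here is a minimal sketch of how per-environment success rates and their spread could be computed from raw run outcomes. The function names and the sample data are illustrative assumptions, not part of Cua-Bench, and the numbers are made up rather than measured.

```python
from collections import defaultdict

def success_by_env(results):
    """results: iterable of (env_name, succeeded) pairs -> {env: success rate}."""
    totals = defaultdict(lambda: [0, 0])  # env -> [successes, attempts]
    for env, ok in results:
        totals[env][0] += int(ok)
        totals[env][1] += 1
    return {env: s / n for env, (s, n) in totals.items()}

def spread(rates):
    """Gap between the best and the worst environment."""
    return max(rates.values()) - min(rates.values())

# Made-up outcomes for illustration only, not measured results.
runs = (
    [("windows_11", True)] * 9 + [("windows_11", False)]
    + [("windows_xp", True)] + [("windows_xp", False)] * 9
)
rates = success_by_env(runs)
gap = spread(rates)
```

Reporting the spread alongside the mean makes it harder for a strong result on one theme to hide a collapse on another.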

The existing benchmarks (OSWorld, Windows Agent Arena, AndroidWorld) were great but operated in silos—different harnesses, different formats, no standardized way to test the same agent across platforms. More importantly, they were evaluation-only. We needed environments that could generate training data and run RL loops, not just measure performance. Cua-Bench takes a different approach: it's a unified framework that standardizes environments across platforms and supports the full agent development lifecycle—benchmark, train, deploy.

With Cua-Bench, you can:

- Evaluate agents across multiple benchmarks with one CLI (native tasks + OSWorld + Windows Agent Arena adapters)

- Test the same agent on different OS variations (Windows 11/XP/Vista, macOS themes, Linux, Android via QEMU)

- Generate new tasks from natural language prompts

- Create simulated environments for RL training (shell apps like Spotify, Slack with programmatic rewards)

- Run oracle validations to verify environments before agent evaluation

- Monitor agent runs in real-time with traces and screenshots

All of this works on macOS, Linux, Windows, and Android, and is self-hostable.

To get started:

Install cua-bench:

% pip install cua-bench

Run a basic evaluation:

% cb run dataset datasets/cua-bench-basic --agent demo

Open the monitoring dashboard:

% cb run watch <run_id>

For parallelized evaluations across multiple workers:

% cb run dataset datasets/cua-bench-basic --agent your-agent --max-parallel 8
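Conceptually, a parallelism cap like this bounds how many episodes are in flight at once. The sketch below illustrates the idea with Python's standard thread pool; it is not how `cb` is implemented, and `run_task` is a hypothetical stand-in for a single evaluation episode.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(task_id):
    """Hypothetical stand-in for one evaluation episode; returns (task_id, reward)."""
    reward = 1.0 if len(task_id) % 2 == 0 else 0.0  # dummy outcome for the demo
    return task_id, reward

def run_dataset(task_ids, max_parallel=8):
    """Run every task with at most max_parallel episodes in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() preserves input order, so results line up with task_ids.
        return dict(pool.map(run_task, task_ids))

results = run_dataset(["open_file", "send_mail", "rename_tab"], max_parallel=2)
```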

Want to test across different OS variations? Just specify the environment:

% cb run task slack_message --agent your-agent --env windows_xp

% cb run task slack_message --agent your-agent --env macos_sonoma
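For a sweep over many environments, the same command can be generated programmatically. This sketch only builds the argument lists, using the flags shown above; the helper name is an assumption, and the environment names are the two from the examples.

```python
def env_matrix_commands(task, agent, envs):
    """Build one `cb run task` argv per environment, using only the flags shown above."""
    return [["cb", "run", "task", task, "--agent", agent, "--env", env] for env in envs]

commands = env_matrix_commands("slack_message", "your-agent", ["windows_xp", "macos_sonoma"])
for argv in commands:
    print(" ".join(argv))
```

Each list can then be passed to `subprocess.run(argv, check=True)`, assuming `cb` is on your PATH.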

Generate new tasks from prompts:

% cb task generate "book a flight on kayak.com"

Validate environments with oracle implementations:

% cb run dataset datasets/cua-bench-basic --oracle
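The idea behind oracle validation can be sketched as follows: run a scripted reference solution against each task and confirm the reward function awards full credit, so that a broken task is not mistaken for an agent failure. The `Task` structure and field names here are hypothetical and are not the Cua-Bench task format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]

@dataclass
class Task:
    name: str
    initial_state: Callable[[], State]  # builds a fresh environment state
    oracle: Callable[[State], None]     # scripted reference solution
    reward: Callable[[State], float]    # programmatic check, 1.0 on success

def validate_with_oracle(tasks: List[Task]) -> List[str]:
    """Return the names of tasks whose oracle does not earn full reward."""
    broken = []
    for task in tasks:
        state = task.initial_state()
        task.oracle(state)
        if task.reward(state) != 1.0:
            broken.append(task.name)
    return broken

tasks = [
    Task(
        name="send_message",
        initial_state=lambda: {"messages": []},
        oracle=lambda s: s["messages"].append("hello"),
        reward=lambda s: 1.0 if "hello" in s["messages"] else 0.0,
    ),
    Task(
        name="mislabeled_reward",
        initial_state=lambda: {"messages": []},
        oracle=lambda s: s["messages"].append("hello"),
        # Deliberately wrong: checks a key the oracle never sets.
        reward=lambda s: 1.0 if s.get("sent") else 0.0,
    ),
]
broken = validate_with_oracle(tasks)
```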

The simulated environments are particularly useful for RL training—they're HTML/JS apps that render across 10+ OS themes with programmatic reward verification. No need to spin up actual VMs for training loops.
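As a rough illustration of programmatic reward verification, the sketch below scores a toy chat app by inspecting its state rather than its pixels, so the same check holds under every theme. `SlackSim` and `reward` are hypothetical names, and the real simulators are HTML/JS apps, not Python classes.

```python
from dataclasses import dataclass, field

@dataclass
class SlackSim:
    """Toy stand-in for a simulated chat app; the theme affects rendering only."""
    theme: str
    messages: dict = field(default_factory=dict)  # channel -> list of texts

    def send(self, channel, text):
        self.messages.setdefault(channel, []).append(text)

def reward(app, channel, expected_text):
    """1.0 if the expected message was posted to the channel, else 0.0."""
    return 1.0 if expected_text in app.messages.get(channel, []) else 0.0

themes = ["windows_xp", "windows_11", "macos_sonoma"]
scores = {}
for theme in themes:
    app = SlackSim(theme=theme)
    app.send("#general", "standup at 10")  # what a successful agent would cause
    scores[theme] = reward(app, "#general", "standup at 10")
```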

We're seeing teams use Cua-Bench for:

- Training computer-use models on mobile and desktop environments

- Generating large-scale training datasets (working with labs on millions of screenshots across OS variations)

- RL fine-tuning with shell app simulators

- Systematic evaluation across OS themes and browser versions

- Building task registries (collaborating with Snorkel AI on task design and data curation, similar to their Terminal-Bench work)

Cua-Bench is 100% open-source under the MIT license. We're actively developing it as part of Cua (https://github.com/trycua/cua), our Computer Use Agent SDK, and we'd love your feedback, bug reports, or feature ideas.

GitHub: https://github.com/trycua/cua

Docs: https://cua.ai/docs/cuabench

Technical Report: https://cuabench.ai

We'll be here to answer any technical questions and look forward to your comments!

Comments

visarga•1w ago
Interesting, a computer-use environment. I made a CUA benchmark too: 200 web tasks with internal code-based evaluation. You can integrate them if you want.

https://github.com/UiPath/uipath_enterprise_benchmark

https://arxiv.org/abs/2511.17131

frabonacci•1w ago
Hey visarga - I'm the founder of Cua, we might have met at the CUA ICML workshop? The OS-agnostic VNC approach of your benchmark is smart and would make integration easy. We're open to collaborating - want to shoot me an email at f@trycua.com?

augusteo•1w ago
The trajectory export feature is smart. Evaluation and training data collection in the same tool.

I'm curious how the benchmarks handle non-determinism. Real GUIs have loading states, animations, popups that appear sometimes but not always. Does cuabench control for that, or is variance just part of the measurement?

Also interested in what "Windows Arena" tests specifically. Windows has so many edge cases - UAC prompts, driver install dialogs, random update notifications. Those feel like the hard mode for computer-use agents.

frabonacci•1w ago
Thanks - trajectory export was key for us since most teams want both eval and training data.
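One way to picture a single export serving both purposes is the sketch below, where the same trajectory records feed an evaluation metric and a training set. The record layout is an assumption made for illustration, not the actual Cua-Bench export format.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Step:
    screenshot_path: str
    action: str

@dataclass
class Trajectory:
    task: str
    env: str
    steps: List[Step]
    reward: float

def to_jsonl(trajectories):
    """Serialize one trajectory per line."""
    return "\n".join(json.dumps(asdict(t)) for t in trajectories)

def success_rate(trajectories):
    """Evaluation view: fraction of episodes with full reward."""
    return sum(t.reward == 1.0 for t in trajectories) / len(trajectories)

def training_examples(trajectories):
    """Training view: (screenshot, action) pairs from successful episodes only."""
    return [(s.screenshot_path, s.action)
            for t in trajectories if t.reward == 1.0
            for s in t.steps]

trajs = [
    Trajectory("slack_message", "windows_xp", [Step("0.png", "click(10, 20)")], 1.0),
    Trajectory("slack_message", "macos_sonoma", [Step("0.png", "type('hi')")], 0.0),
]
exported = to_jsonl(trajs)
```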

On non-determinism: we actually handle this in two ways. For our simulated environments (HTML/JS apps like the Slack/CRM clones), we control the full render state so there's no variance from animations or loading states. For native OS environments, we use explicit state verification before scoring - the reward function waits for expected elements rather than racing against UI timing. Still not perfect, but it filters out most flaky failures.
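The "wait for expected elements" idea can be sketched as a polling loop with a deadline. This is a generic pattern under assumed names (`wait_for`, `score`, and the fake UI), not the Cua-Bench reward code.

```python
import time

def wait_for(predicate, timeout=5.0, interval=0.1):
    """Poll predicate until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

def score(get_ui_elements, expected, timeout=5.0, interval=0.1):
    """Reward 1.0 only once the expected element is observed; 0.0 on timeout."""
    seen = wait_for(lambda: expected in get_ui_elements(), timeout, interval)
    return 1.0 if seen else 0.0

# Fake UI for the demo: the confirmation toast appears on the third poll.
polls = {"n": 0}

def fake_ui():
    polls["n"] += 1
    return ["spinner"] if polls["n"] < 3 else ["spinner", "message_sent_toast"]

late_reward = score(fake_ui, "message_sent_toast", timeout=1.0, interval=0.01)
```

Scoring on the first snapshot would have returned 0.0 here even though the task succeeded, which is the kind of flaky failure the polling avoids.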

Windows Arena specifically - we're focusing on common productivity flows (file management, browser tasks, Office workflows) rather than the edge cases you mentioned. UAC prompts and driver dialogs are exactly the hard mode scenarios that break most agents today. We're not claiming to solve those yet, but that's part of why we're open-sourcing this - want to build out more adversarial tasks with the community.

rfw300•1w ago
Interesting project, but the lack of any actual benchmark results on existing models/agents is disappointing.

frabonacci•1w ago
Fair point - we just open-sourced this so benchmark results are coming. We're already working with labs on evals, focusing on tasks that are more realistic than OSWorld/Windows Agent Arena and curated with actual workers. If you want to run your agent on it we'd love to include your results.

alsetmusic•1w ago
I came across this last week when looking into how I'd run clawd (now Moltbot) on a Mac. I've been using their Lume CLI for a few days to sandbox Claude and it's been reasonably good so far.

Expanding the VM's disk didn't propagate to the client; seems like it needs tooling like VMware-Tools to do some magic. I backed up the important stuff (I'd been using it less than a day) and created a new VM. That's been the only hurdle so far.

Anyway, I haven't used this, but thought it'd be useful to someone to hear that a related tool seems to be ok after ~five days of intermittent use. I've been an ESX (and recently Proxmox) fan in the past (and all the others cause I was curious), so I say this as someone who has kicked some VM rocks in my time.

As for locally run AI, this is the first time I've given one access to any data directly and I can't speak intelligently about any of that or this specific tool. Sorry.

arjunchint•1w ago
In our benchmark of web agents, we found that vision/GUI based agents get tripped up on popups/overlays, need large vision models and require using CDP in browsers.

Our own DOM-only web agent, rtrvr.ai, worked seamlessly underneath dialogs, ran on off-the-shelf Gemini Flash Lite, and used Chrome native APIs, which led to minimal infrastructure failures, SOTA performance, and the lowest cost.

https://www.rtrvr.ai/blog/web-bench-results