Show HN: Skouriasmeno Papaki – S3 transfer tool, up to 12x faster than AWS-CLI

https://github.com/NetViper-Labs/skouriasmeno-papaki
1•NetViper•47s ago•0 comments

Oracle's $248 Billion Rent is Another AI 'Bombshell'

https://www.bloomberg.com/opinion/articles/2025-12-16/ai-bubble-oracle-delivers-next-bombshell-wi...
1•voxadam•4m ago•1 comments

Show HN: F. Incantatem – AI-Powered Exception Analysis for Python

https://github.com/aguilar-ai/fincantatem
1•Paralus•5m ago•0 comments

Officials Say Gunman Likely 'Cased' Campus Before Brown Shooting

https://www.nytimes.com/2025/12/16/us/brown-university-shooter-search-investigation.html
1•Bender•6m ago•0 comments

Show HN: AI Trolley Problem Arena

https://www.aitrolleyproblem.com/
3•justintorre75•9m ago•1 comments

Moving Cloudflare out of the critical path

https://duggan.ie/posts/moving-cloudflare-out-of-the-critical-path
1•duggan•9m ago•0 comments

LLMs Excel at Easy Verification Problems

https://wiki.roshangeorge.dev/w/Blog/2025-12-11/LLMs_Excel_At_Easy_Verification_Problems
1•handfuloflight•10m ago•0 comments

How is Google's AI Mode so fast and so good?

1•nthypes•10m ago•0 comments

Instacart's AI-Enabled Pricing Experiments May Be Inflating Your Grocery Bill

https://www.consumerreports.org/money/questionable-business-practices/instacart-ai-pricing-experi...
4•bookofjoe•15m ago•2 comments

Welcome to the New Project Zero Blog

https://projectzero.google/2025/12/welcome.html
3•tech234a•15m ago•0 comments

The Uncertain Origins of Aspirin

https://www.asimov.press/p/aspirin
2•dearwell•16m ago•0 comments

The great porn panic: Causal arguments rest on pseudo-science

https://unherd.com/2025/12/the-great-porn-panic/
1•iamben•18m ago•1 comments

US Threatens to Retaliate Against EU Firms over Digital Tax

https://www.bloomberg.com/news/articles/2025-12-16/us-threatens-to-retaliate-against-eu-companies...
5•petethomas•18m ago•0 comments

Hawk from Movement Labs clocks in at 22.5% on ARC-AGI-2 – Launched 40 min ago

https://movementlabs.ai
1•movementlabsAI•19m ago•0 comments

Ask HN: Do People who treat coding as a job look normal?

1•danver0•20m ago•1 comments

Open Scouts: AI-driven web monitoring

https://openscouts.firecrawl.dev/
1•mustaphah•21m ago•0 comments

I had a private chat with an LLM

https://depew.substack.com/p/i-had-a-private-chat-with-an-llm
4•dwa3592•22m ago•0 comments

Netflix Taps Snoop Dogg for Christmas Day NFL Halftime Show

https://www.hollywoodreporter.com/music/music-news/netflix-nfl-halftime-snoop-dogg-christmas-day-...
2•andsoitis•24m ago•0 comments

Trump Overtime Tax Break More a Political Tagline Than Tax Relief

https://news.bloombergtax.com/tax-insights-and-commentary/trump-overtime-tax-break-more-a-politic...
5•tldrthelaw•26m ago•0 comments

Space Data Center SIM

https://astrocompute.dev/
2•printerlover•29m ago•0 comments

Learning a new programming language with an LLM

https://feeding.cloud.geek.nz/posts/learning-new-programming-language-with-ai/
1•edward•29m ago•0 comments

Role of anthropogenic climate change in wildfire smoke concentrations in the US

https://www.pnas.org/doi/10.1073/pnas.2421903122
2•bikenaga•31m ago•1 comments

Microplastic exposure is associated with epigenomic effects in model organism

https://pubmed.ncbi.nlm.nih.gov/38742563/
2•donsupreme•32m ago•0 comments

Dafny: Verification-Aware Programming Language

https://dafny.org/
3•handfuloflight•33m ago•0 comments

Efficient Dockerfile templating for complex build scenarios

https://gagor.pro/2025/01/efficient-dockerfile-templating-for-complex-build-scenarios/
1•___timor___•35m ago•0 comments

I Ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in 4.5h

https://simonwillison.net/2025/Dec/15/porting-justhtml/
2•pbowyer•35m ago•0 comments

Google Fi Web Calls

https://fi.google.com/webcalls/calls
2•pcvetkovski•36m ago•0 comments

Launching ChinaRxiv, an automated translation pipeline of all Chinese preprints

https://twitter.com/seconds_0/status/2000606845644505093
2•Anon84•43m ago•1 comments

The "Commons Clause" License Condition

https://commonsclause.com/
1•Kerrick•50m ago•0 comments

Show HN: BoardSpace – AI that draws on a whiteboard in realtime for Calculus

https://www.useboardspace.com/
1•jonnotdoe•51m ago•1 comments

Why Windows XP is the ultimate AI benchmark

https://cuabench.ai
5•frabonacci•7h ago

Comments

frabonacci•7h ago
We spent the last few months trying to understand why computer-use agents (Claude Computer-Use, OpenAI CUA, Gemini 2.5 Computer-Use) fail so inconsistently.

The pattern we kept seeing: same agent, same task, different OS theme = notably different results.

Claude Sonnet 4 scores 31.9% on OSWorld and Windows Agent Arena (2 of the most relevant benchmarks for computer-use agents) — but with massive variance. An agent trained on Windows 11 light mode fails on dark mode. Works on macOS Ventura, breaks on Monterey. Works on Win11, collapses on Vista.
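
To make the "massive variance" point concrete, here is a minimal Python sketch of how per-theme success rates and their spread could be reported. It is not taken from cua-bench, OSWorld, or Windows Agent Arena: the `success_by_theme` helper, the theme names, and the run outcomes are all hypothetical placeholders, not measured results.

```python
from collections import defaultdict
from statistics import mean, pstdev


def success_by_theme(runs):
    """Group (theme, succeeded) pairs and return {theme: success rate}."""
    buckets = defaultdict(list)
    for theme, ok in runs:
        buckets[theme].append(1.0 if ok else 0.0)
    return {theme: mean(vals) for theme, vals in buckets.items()}


# Placeholder outcomes for illustration only, not measured results.
runs = [
    ("win11_light", True), ("win11_light", True),
    ("win11_dark", True), ("win11_dark", False),
    ("vista", False), ("vista", False),
]

rates = success_by_theme(runs)
# Population standard deviation across themes: a single aggregate score
# hides this spread, which is the failure mode described above.
spread = pstdev(list(rates.values()))
print(rates, round(spread, 3))
```

Reporting the per-theme breakdown next to the headline number is one way to show whether an agent is robust or only good on the configuration it was trained on.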

The root cause: training data lacks visual diversity. Current benchmarks (OSWorld, Windows Agent Arena) rely on static VM snapshots with fixed configurations. They don't capture the reality of diverse OS themes, window layouts, resolution differences, or desktop clutter.

We built cua-bench — HTML-based simulated environments that render across 10+ OS themes (macOS, Win11, WinXP, Win98, Vista, iOS, Android). Define a task once, generate thousands of visual variations.

This enables:
- Oracle trajectory generation via a Playwright-like API (verified ground truth for training)
- Trajectory replotting: record 1 demo → re-render across 10 OS themes = 10 training trajectories
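
A rough Python sketch of the replotting idea, under stated assumptions. Nothing here is the cua-bench API: `Action`, `Theme`, `replot`, and the element names are hypothetical. The sketch assumes a demo is recorded against semantic element IDs rather than pixel coordinates, so each theme can resolve the same steps to its own screen positions.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str       # e.g. "click" or "type"
    target: str     # semantic element id, not pixel coordinates
    text: str = ""  # payload for "type" actions


@dataclass
class Theme:
    name: str
    layout: dict    # element id -> (x, y) center of that element in this theme


def replot(demo, themes):
    """Re-render one recorded demo once per theme.

    Returns {theme name: [(kind, x, y, text), ...]}.
    """
    trajectories = {}
    for theme in themes:
        steps = []
        for action in demo:
            x, y = theme.layout[action.target]  # theme-specific coordinates
            steps.append((action.kind, x, y, action.text))
        trajectories[theme.name] = steps
    return trajectories


# One recorded demo (hypothetical task: open a search box and type a query).
demo = [Action("click", "start_button"), Action("type", "search_box", "notepad")]

# Two hypothetical themes with made-up layouts.
themes = [
    Theme("winxp", {"start_button": (30, 750), "search_box": (120, 700)}),
    Theme("win11", {"start_button": (640, 780), "search_box": (640, 100)}),
]

trajectories = replot(demo, themes)
print(len(trajectories))
```

With 10 themes in the list, the same single demo would yield 10 trajectories, which is the 1 → 10 multiplication described above.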

The technical report covers our approach to trajectory generation, Android/iOS environments, cross-platform HTML snapshots, and a comparison with existing benchmarks.

We’re currently working with research labs on training data generation and benchmarks, but we’d really value input from the HN community:
- What tasks or OS environments should be standardized to actually stress computer-use agents?
- Legacy OSes? Weird resolutions? Broken themes? Cluttered desktops? Modal hell?

Curious what people here think are the real failure modes we should be benchmarking.

someguy101010•6h ago
As an infrastructure engineer, the idea of being able to train computer-use agents without provisioning infrastructure sounds amazing!

A common use case I run into is configuring corporate VPN software on Windows machines. Is there a getting-started guide I could use to try this out?

frabonacci•6h ago
Yes, in a simulated environment you can do this today using plain JS and connecting to a real VPN, while driving the desktop UI. No infra provisioning needed.

If you need a real Windows OS + corporate VPN, we also support binding agents to actual Windows sandboxes. This example shows automating a Windows app behind a VPN: https://cua.ai/docs/example-usecases/windows-app-behind-vpn

You'll need to define a new task in the cua-bench registry first, though. Just sign up on the website for early access!