Show HN: Convert your articles into videos in one click

https://vidinie.com/
1•kositheastro•34s ago•0 comments

Red Queen's Race

https://en.wikipedia.org/wiki/Red_Queen%27s_race
2•rzk•48s ago•0 comments

The Anthropic Hive Mind

https://steve-yegge.medium.com/the-anthropic-hive-mind-d01f768f3d7b
2•gozzoo•3m ago•0 comments

A Horrible Conclusion

https://addisoncrump.info/research/a-horrible-conclusion/
1•todsacerdoti•3m ago•0 comments

I spent $10k to automate my research at OpenAI with Codex

https://twitter.com/KarelDoostrlnck/status/2019477361557926281
2•tosh•4m ago•0 comments

From Zero to Hero: A Spring Boot Deep Dive

https://jcob-sikorski.github.io/me/
1•jjcob_sikorski•5m ago•0 comments

Show HN: Solving NP-Complete Structures via Information Noise Subtraction (P=NP)

https://zenodo.org/records/18395618
1•alemonti06•10m ago•1 comments

Cook New Emojis

https://emoji.supply/kitchen/
1•vasanthv•12m ago•0 comments

Show HN: LoKey Typer – A calm typing practice app with ambient soundscapes

https://mcp-tool-shop-org.github.io/LoKey-Typer/
1•mikeyfrilot•15m ago•0 comments

Long-Sought Proof Tames Some of Math's Unruliest Equations

https://www.quantamagazine.org/long-sought-proof-tames-some-of-maths-unruliest-equations-20260206/
1•asplake•16m ago•0 comments

Hacking the last Z80 computer – FOSDEM 2026 [video]

https://fosdem.org/2026/schedule/event/FEHLHY-hacking_the_last_z80_computer_ever_made/
1•michalpleban•17m ago•0 comments

Browser-use for Node.js v0.2.0: TS AI browser automation parity with PY v0.5.11

https://github.com/webllm/browser-use
1•unadlib•18m ago•0 comments

Michael Pollan Says Humanity Is About to Undergo a Revolutionary Change

https://www.nytimes.com/2026/02/07/magazine/michael-pollan-interview.html
1•mitchbob•18m ago•1 comments

Software Engineering Is Back

https://blog.alaindichiappari.dev/p/software-engineering-is-back
1•alainrk•19m ago•0 comments

Storyship: Turn Screen Recordings into Professional Demos

https://storyship.app/
1•JohnsonZou6523•19m ago•0 comments

Reputation Scores for GitHub Accounts

https://shkspr.mobi/blog/2026/02/reputation-scores-for-github-accounts/
1•edent•22m ago•0 comments

A BSOD for All Seasons – Send Bad News via a Kernel Panic

https://bsod-fas.pages.dev/
1•keepamovin•26m ago•0 comments

Show HN: I got tired of copy-pasting between Claude windows, so I built Orcha

https://orcha.nl
1•buildingwdavid•26m ago•0 comments

Omarchy First Impressions

https://brianlovin.com/writing/omarchy-first-impressions-CEEstJk
2•tosh•31m ago•1 comments

Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2504.12501
2•onurkanbkrc•32m ago•0 comments

Show HN: Versor – The "Unbending" Paradigm for Geometric Deep Learning

https://github.com/Concode0/Versor
1•concode0•33m ago•1 comments

Show HN: HypothesisHub – An open API where AI agents collaborate on medical res

https://medresearch-ai.org/hypotheses-hub/
1•panossk•36m ago•0 comments

Big Tech vs. OpenClaw

https://www.jakequist.com/thoughts/big-tech-vs-openclaw/
1•headalgorithm•38m ago•0 comments

Anofox Forecast

https://anofox.com/docs/forecast/
1•marklit•39m ago•0 comments

Ask HN: How do you figure out where data lives across 100 microservices?

1•doodledood•39m ago•0 comments

Motus: A Unified Latent Action World Model

https://arxiv.org/abs/2512.13030
1•mnming•39m ago•0 comments

Rotten Tomatoes Desperately Claims 'Impossible' Rating for 'Melania' Is Real

https://www.thedailybeast.com/obsessed/rotten-tomatoes-desperately-claims-impossible-rating-for-m...
3•juujian•41m ago•2 comments

The protein denitrosylase SCoR2 regulates lipogenesis and fat storage [pdf]

https://www.science.org/doi/10.1126/scisignal.adv0660
1•thunderbong•42m ago•0 comments

Los Alamos Primer

https://blog.szczepan.org/blog/los-alamos-primer/
1•alkyon•45m ago•0 comments

NewASM Virtual Machine

https://github.com/bracesoftware/newasm
2•DEntisT_•47m ago•0 comments

Why Windows XP is the ultimate AI benchmark

https://cuabench.ai
6•frabonacci•1mo ago

Comments

frabonacci•1mo ago
We spent the last few months trying to understand why computer-use agents (Claude Computer-Use, OpenAI CUA, Gemini 2.5 Computer-Use) fail so inconsistently.

The pattern we kept seeing: same agent, same task, different OS theme = notably different results.

Claude Sonnet 4 scores 31.9% on OSWorld and Windows Agent Arena (two of the most relevant benchmarks for computer-use agents), but with massive variance. An agent trained on Windows 11 light mode fails on dark mode. It works on macOS Ventura but breaks on Monterey. It works on Win11 but collapses on Vista.
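To make "variance across themes" concrete, here is a minimal sketch of how per-theme success rates and their spread could be computed from run logs. The function name, the theme labels, and the run outcomes are all hypothetical placeholders, not measurements from any benchmark.

```python
from collections import defaultdict


def success_rate_by_theme(runs):
    """Return {theme: fraction of successful runs} from (theme, succeeded) pairs."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for theme, succeeded in runs:
        totals[theme] += 1
        wins[theme] += int(succeeded)
    return {theme: wins[theme] / totals[theme] for theme in totals}


# Placeholder outcomes for illustration only; these are NOT real results.
runs = [
    ("win11-light", True),
    ("win11-light", True),
    ("win11-dark", True),
    ("win11-dark", False),
    ("vista", False),
    ("vista", False),
]

rates = success_rate_by_theme(runs)
# Spread between the best and worst theme for the same agent and task.
spread = max(rates.values()) - min(rates.values())
print(rates, spread)
```

A single aggregate score hides this spread, which is why a per-theme breakdown is the more useful number to report.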

The root cause: training data lacks visual diversity. Current benchmarks (OSWorld, Windows Agent Arena) rely on static VM snapshots with fixed configurations. They don't capture the reality of diverse OS themes, window layouts, resolution differences, or desktop clutter.

We built cua-bench — HTML-based simulated environments that render across 10+ OS themes (macOS, Win11, WinXP, Win98, Vista, iOS, Android). Define a task once, generate thousands of visual variations.
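As a rough illustration of "define once, generate many", the sketch below enumerates render configurations as a Cartesian product of variation axes. It is not cua-bench's actual API: the function, the axis names, the color modes, and the resolutions are assumptions, and only the theme names come from the list above.

```python
from itertools import product

# Themes taken from the list above; the other axes are hypothetical examples.
THEMES = ["macos", "win11", "winxp", "win98", "vista", "ios", "android"]
COLOR_MODES = ["light", "dark"]
RESOLUTIONS = [(1280, 720), (1920, 1080), (2560, 1440)]


def enumerate_variations(task_id):
    """Yield one render config per combination of theme, color mode, and resolution."""
    for theme, mode, (width, height) in product(THEMES, COLOR_MODES, RESOLUTIONS):
        yield {
            "task_id": task_id,
            "theme": theme,
            "color_mode": mode,
            "width": width,
            "height": height,
        }


variations = list(enumerate_variations("open-settings"))
print(len(variations))  # 7 themes * 2 modes * 3 resolutions = 42
```

Adding further axes (window layout, desktop clutter) multiplies the count, which is how a single task definition could reach thousands of variations.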

This enables:
- Oracle trajectory generation via a Playwright-like API (verified ground truth for training)
- Trajectory replotting: record 1 demo → re-render across 10 OS themes = 10 training trajectories
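One way to picture trajectory replotting: if a demo is recorded against semantic element ids rather than pixels, it can be resolved into concrete coordinates once per theme. The sketch below assumes that design; the `Step` type, the `replot` function, and the layout coordinates are hypothetical and do not come from cua-bench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str     # e.g. "click" or "type"
    target: str     # semantic element id, not pixel coordinates
    text: str = ""  # payload for "type" actions


# Hypothetical per-theme layouts: element id -> (x, y) center in that rendering.
LAYOUTS = {
    "win11": {"start_button": (24, 1060), "search_box": (200, 1060)},
    "winxp": {"start_button": (50, 748), "search_box": (120, 400)},
}


def replot(demo, layouts):
    """Resolve one recorded demo into a pixel-level trajectory for each theme."""
    trajectories = {}
    for theme, layout in layouts.items():
        trajectories[theme] = [
            {
                "action": step.action,
                "x": layout[step.target][0],
                "y": layout[step.target][1],
                "text": step.text,
            }
            for step in demo
        ]
    return trajectories


demo = [
    Step("click", "start_button"),
    Step("click", "search_box"),
    Step("type", "search_box", "notepad"),
]
trajectories = replot(demo, LAYOUTS)
print(len(trajectories))  # one trajectory per theme in LAYOUTS
```

With ten layouts instead of two, the same three-step demo would yield ten trajectories, matching the 1 demo → 10 trajectories arithmetic above.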

The technical report covers our approach to trajectory generation, Android/iOS environments, cross-platform HTML snapshots, and a comparison with existing benchmarks.

We’re currently working with research labs on training data generation and benchmarks, but we’d really value input from the HN community:
- What tasks or OS environments should be standardized to actually stress computer-use agents?
- Legacy OSes? Weird resolutions? Broken themes? Cluttered desktops? Modal hell?

Curious what people here think are the real failure modes we should be benchmarking.

someguy101010•1mo ago
As an infrastructure engineer, the idea of being able to train computer-use agents without provisioning infrastructure sounds amazing!

A common use case I run into is configuring corporate VPN software on Windows machines. Is there a link to a getting-started guide I could try this out with?

frabonacci•1mo ago
Yes, in a simulated environment you can do this today using plain JS and connecting to a real VPN, while driving the desktop UI. No infra provisioning needed.

If you need a real Windows OS + corporate VPN, we also support binding agents to actual Windows sandboxes. This example shows automating a Windows app behind a VPN: https://cua.ai/docs/example-usecases/windows-app-behind-vpn

You'll need to define a new task in the cua-bench registry first, though. Just sign up on the website for early access!