
The First Fully General Computer Action Model

https://si.inc/posts/fdm1/
107•nee1r•2d ago

Comments

rio_popper•2d ago
Curious about the masked diffusion IDM choice. They mention CTC loss and cross-entropy both underperformed — I'd love to see ablations on that. The claim that typos were "extremely common" with non-causal cross-entropy is interesting but hand-wavy without numbers.
nee1r•2d ago
the main chain of experiments was trying causal => non-causal => non-causal with ctc and CE. i think a good intuition here is that you need a generative approach fundamentally because there definitely are multiple correct IDM labels.
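A toy numpy sketch (entirely hypothetical, not from the post) of the intuition above: when one clip admits multiple correct action strings, a per-token cross-entropy model fits their mixture, and greedy decoding of that mixture can emit a string that matches neither label — i.e. a "typo":

```python
import numpy as np

# Two equally valid labelings of the same clip, as one-hot token rows
# over a toy two-symbol vocabulary.
vocab = ["a", "b"]
label_ab = np.array([[1.0, 0.0], [0.0, 1.0]])  # valid label "ab"
label_ba = np.array([[0.0, 1.0], [1.0, 0.0]])  # equally valid label "ba"

# Per-token CE training on both labels converges to their mixture.
mixture = (label_ab + label_ba) / 2

# Greedy decoding of the mixture picks the per-position argmax,
# which here yields "aa": a sequence that matches neither label.
# A generative model instead samples one whole coherent sequence.
decoded = "".join(vocab[i] for i in mixture.argmax(axis=1))
print(decoded)  # "aa"
```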
ennucore•2d ago
The car thing is very impressive. By the way, do you have plans to handle the computer’s audio output?
g413n•2d ago
yeah we've done audio work in the past so we'll def merge the recipes at some point. long term we should have the full io a human has (except maybe not generating video for video calls, that seems a bit much)
ennucore•2d ago
How do you tokenize the mouse inputs?
nee1r•2d ago
good question! we use exponential binning (map the mouse movements onto a plane with exponentially increasing tick marks https://si.inc/fdm1/exponential_binning.webp) but tried a bunch of other methods (linear creates too many tokens for the model to learn well). Polar coordinates seem like a better solution but empirically didn't work well because the tokens got too coarse too fast.
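For illustration only — the post doesn't give the base, bin count, or decoding scheme, so every parameter here is an assumption — a minimal 1-D sketch of exponential binning for a mouse delta, with base-2 tick marks and a bin cap:

```python
import math

def exp_bin(delta: int, base: float = 2.0, num_bins: int = 8) -> int:
    """Map a signed 1-D mouse displacement onto exponentially spaced bins.

    Bin 0 means "no movement"; tick marks sit at 1, base, base^2, ...,
    so small precise moves get fine resolution while large flicks share
    coarse bins. Sign is carried by positive/negative bin ids.
    """
    if delta == 0:
        return 0
    magnitude = min(int(math.log(abs(delta), base)) + 1, num_bins)
    return magnitude if delta > 0 else -magnitude

def unbin(bin_id: int, base: float = 2.0) -> int:
    """Decode a bin id back to a representative displacement (bin midpoint)."""
    if bin_id == 0:
        return 0
    sign = 1 if bin_id > 0 else -1
    lo = base ** (abs(bin_id) - 1)  # lower tick of the bin
    hi = base ** abs(bin_id)        # upper tick of the bin
    return sign * int((lo + hi) / 2)
```

Linear binning would need one token per pixel of displacement; here the token count grows only logarithmically with movement size, at the cost of coarser reconstruction for big jumps.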
g413n•2d ago
we do exponential binning but fwiw I think we can do way better just hasn't been the main research area initially
nee1r•2d ago
Hey guys! I’m Neel, been holed up in our south park office for the past year working on model training. excited to share our research!

This is a preview of a very different type of computer use model—we train on the internet. Specifically we have 11 million hours of computer video stored on our storage cluster (previously shared https://news.ycombinator.com/item?id=45438496 !) and the model works at 30 FPS. Since we match the fundamental form factor of computer-use, we can get our model to do CAD, browse websites, and even drive a car using arrow keys. I’m super excited to see what our model can do as we scale more, it's a fun frontier to work on (not language models :) ).

The team and I will be online responding to the comments, so drop any questions.

AndrewKemendo•48m ago
This looks like a really promising approach

In particular the Forward rollout module is very important. It aligns your (effectively) world model with what it expects from the world, and keeping those in sync, I think, gives this the power it needs to generate the state-action pairs to continuously train semi-supervised.

dangoodmanUT•47m ago
11 million hours of data is a lot, did you have to synthesize it at all, or was it purely collected?
clemvonstengel•2d ago
I really liked the point about ctrl-c only being labellable retrocausally. I do think that with enough past context you should be able to know what was copied - in some sense the past does encode the future - but an agentic decision is precisely the kind where the future is more informative than the past for reconstructing that decision.

It does make me wonder if you should have the inverse dynamics model split into specifically retrocausal and causal. You kind of do this already with the inverse and forward dynamics model, but the idea of a model that knows only about the future training in a feedback loop with a model that knows only about the past is kind of interesting.

I think you could just do a clever masking regime in your diffusion model to achieve the same effect without a whole architecture change.
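A hypothetical numpy sketch of that masking idea, assuming per-frame transformer attention masks (the function names, shapes, and the per-row split are mine, not from the post):

```python
import numpy as np

def causal_mask(t: int) -> np.ndarray:
    """Frame i may attend to frames 0..i (past only)."""
    return np.tril(np.ones((t, t), dtype=bool))

def retrocausal_mask(t: int) -> np.ndarray:
    """Frame i may attend to frames i..t-1 (future only)."""
    return np.triu(np.ones((t, t), dtype=bool))

def split_idm_mask(t: int, future_rows: np.ndarray) -> np.ndarray:
    """Rows flagged in `future_rows` see only the future, the rest only the
    past -- one network, two information regimes, no architecture change."""
    return np.where(future_rows[:, None], retrocausal_mask(t), causal_mask(t))
```

Because the split is just a boolean attention mask, the "past-only" and "future-only" views can share weights and be trained jointly, which is roughly the clever-masking shortcut suggested above.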

g413n•2d ago
yeah we actually had some wacky ideas with ctc + a reverse-causal mask but diffusion does just make it all a bit more simple
ClaireBookworm•2d ago
What sort of fine tuning data was needed to allow the model to self-drive? One hour of video of someone driving, or extra labeling?
nee1r•2d ago
i actually drove the car (with arrow keys) around south park for ~45 minutes as finetuning data, no extra labelling other than that. think the car line graph is super cool because you actually see the videogame prior working
g413n•2d ago
relevant note is that we finetuned by having the human also use arrow keys which keeps it in-distribution but also slower to collect
kdrag0n•2d ago
what tasks can the model do out of the box? was each of the examples a different fine tuned model?
g413n•2d ago
it's a pretty general policy but this is all super early. it's great at exploring websites so fuzzing was easy; for CAD it has good enough base rates with the few-shot prompt when we do the repetitive stuff, and we gave it checkpoints on each step. the other stuff in the mosaic are just some of our favorite clips from internal evals
aakashks•2d ago
The video compression is very cool. And the small tricks like binning the mouse movements.

Wonder how much data is generalizable across different UIs? ie how good will the model be at using Figma if it’s never seen it before but has seen a lot of Photoshop

nee1r•2d ago
this is honestly an issue for the inverse dynamics (for app specific shortcuts etc.) but for general UI learning we still see promising eval trends
alyxya•1d ago
This looks extremely impressive, really deserves more attention here.

Are the inverse dynamics and forward dynamics models trained separately? It sounds like if the inverse dynamics model is meant to extrapolate more training data, then perhaps all that means is it takes very little data to generalize directly with the forward dynamics model assuming the right architecture.

nee1r•1d ago
thanks! the inverse dynamics model is trained first on 40k hours of data and then frozen to label all 11 million hours. yup! the idea is that it should take a small amount of data to generalize environment dynamics, then you can use a lot of data to understand actions.
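The scale of that two-stage pipeline, using only the figures quoted in this comment:

```python
# Stage 1: train the inverse dynamics model on contractor-labeled video.
labeled_hours = 40_000

# Stage 2: freeze it and pseudo-label the full corpus.
pseudo_labeled_hours = 11_000_000

# "Amplification" here is just the ratio of the two figures.
amplification = pseudo_labeled_hours / labeled_hours
print(f"{amplification:.0f}x more data than was hand-labeled")  # 275x
```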
152334H•1d ago
holy crap, this is so good. How did it get buried?
nee1r•1d ago
real
sheepscreek•15m ago
Are you guys affiliated with Meta’s ex-CTO in any way? I remember he famously implied that LLMs are hyped. The demos are very impressive. Does this use an attention-based mechanism too? Just trying to understand (as a layman) how these models handle context and if long contexts lead to weaker results. Could be catastrophic in the real world!
sheepscreek•12m ago
I think in the long run, we may need something like a batch job that compresses context from the last N conversations (in LLMs) and applies that as an update to weights. A looser form of delayed automated reinforcement learning.

Or make something like LoRA mainstream for everyone (probably scales better for general use models shared by everyone).

yoyohello13•1h ago
Too technical for HN
Obscura-•2h ago
Amazing!
piva00•1h ago
Just wanted to say: this is mighty impressive research.

Really interesting breakdown, proper nerdsniped into this, thanks for the refreshing AI news outside of language models :)

wasmainiac•1h ago
Can it defeat captchas?
sp1nningaway•1h ago
May I suggest a driving demo in a parking lot with a mannequin instead of a real world video where it drives way too close to a pedestrian?

Otherwise, very cool and exciting!

cs702•1h ago
At first glance, this looks incredible to me. The authors train one model on 40K hours of computer-use video, previously labeled by contractors with keyboard and mouse actions, then use that model, in effect, to label 11M hours of computer-use video, which they use to train the computer-action model. The key advance is in compression. Quoting from the OP:

> [previous models] burn a million tokens to understand just one minute of 30 FPS computer data. Our video encoder encodes nearly 2 hours of video in the same number of tokens—that’s 50x more token-efficient than the previous state-of-the-art and 100x more token-efficient than OpenAI’s encoder.
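
Taking the quoted figures at face value (a million tokens per minute of 30 FPS video for previous models, versus roughly two hours in the same budget here), the per-frame token counts work out to:

```python
FPS = 30
TOKENS = 1_000_000  # the token budget quoted for both encoders

# Previous models: ~1 minute of 30 FPS video per million tokens.
prev_tokens_per_frame = TOKENS / (1 * 60 * FPS)      # ~556 tokens/frame

# Their encoder: roughly 2 hours of video in the same budget.
new_tokens_per_frame = TOKENS / (2 * 60 * 60 * FPS)  # ~4.6 tokens/frame

print(round(prev_tokens_per_frame / new_tokens_per_frame))  # 120
```

That ~120x ratio is consistent with the post's "nearly 2 hours" phrasing and its ~100x claim against OpenAI's encoder; the 50x figure presumably compares against a different, stronger baseline.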

While I was already aware that there are people working on new, more efficient "world models," this is the first one I've seen in action. I'm a bit in shock at how good it is, quite frankly.

I've added the OP, as well as a related 2018 paper on Behavioral Cloning from Observation (BCO), to my reading list.[a] So far, I've only skimmed the 2018 paper, but it's already evident that it's well-written. I'm no expert in deep RL, and I can understand it. BTW, "Behavioral Cloning from Observation" is a really good name, with an easy-to-remember acronym.

Thank you for sharing this on HN.

[a] https://arxiv.org/abs/1805.01954

akoboldfrying•1h ago
My tech-informed but ML-ignorant take: This will soon be the biggest thing since ChatGPT.
bitwize•33m ago
Looks like it's playing the special stages from Knuckles' Chaotix?
LorenDB•20m ago
Nice that it can drive a car, but you could just use openpilot.
davidguetta•16m ago
Beware of ending up on the top page of "things HN didn't like" with such a comment (see post a few days ago)
vessenes•17m ago
dammmmmmnnnn - lots to like here. I'm impressed with the 80,000 parallel website fuzzing desktops. And the 30 Hz (everything). Amazing.
