The first general computer action model

https://si.inc/posts/fdm1/
21•nee1r•2h ago

Comments

rio_popper•2h ago
Curious about the masked diffusion IDM choice. They mention CTC loss and cross-entropy both underperformed — I'd love to see ablations on that. The claim that typos were "extremely common" with non-causal cross-entropy is interesting but hand-wavy without numbers.
nee1r•1h ago
the main chain of experiments was trying causal => non-causal => non-causal with ctc and CE. i think a good intuition here is that you need a generative approach fundamentally because there definitely are multiple correct IDM labels.
ennucore•1h ago
The car thing is very impressive. By the way, do you have plans to handle the computer’s audio output?
g413n•1h ago
yeah we've done audio work in the past so we'll def merge the recipes at some point, long term should have full io that a human has (except maybe not generating video for video calls that seems a bit much)
ennucore•1h ago
How do you tokenize the mouse inputs?
nee1r•1h ago
good question! we use exponential binning (map the mouse movements onto a plane with exponentially increasing tick marks https://si.inc/fdm1/exponential_binning.webp) but tried a bunch of other methods (linear creates too many tokens for the model to learn well). Polar coordinates seem like a better solution but empirically didn't work well because the tokens got too coarse too fast.
g413n•1h ago
we do exponential binning but fwiw I think we can do way better just hasn't been the main research area initially
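A minimal sketch of the exponential binning idea described above, for readers who want something concrete. This is not the team's tokenizer: the bin count, the base of 2, the 128-pixel clip, and the function names `exp_bin_edges`, `tokenize_delta`, and `detokenize` are all assumptions made for illustration. It treats one axis at a time, so a 2-D mouse movement would be tokenized as separate dx and dy values.

```python
import bisect

def exp_bin_edges(num_bins: int, max_delta: float, base: float = 2.0) -> list[float]:
    """Positive bin edges that grow geometrically up to max_delta (hypothetical layout)."""
    return [max_delta * base ** (i - num_bins) for i in range(1, num_bins + 1)]

def tokenize_delta(delta: float, edges: list[float]) -> int:
    """Map a signed one-axis mouse delta to a token id in [0, 2 * len(edges)]."""
    n = len(edges)
    if delta == 0:
        return n  # the center token means "no movement"
    # Clip large movements to the outermost bin.
    magnitude = min(abs(delta), edges[-1])
    # Index of the first edge that is >= magnitude, shifted to 1..n.
    k = bisect.bisect_left(edges, magnitude) + 1
    return n + k if delta > 0 else n - k

def detokenize(token: int, edges: list[float]) -> float:
    """Return the upper edge of the token's bin as a representative delta."""
    n = len(edges)
    if token == n:
        return 0.0
    value = edges[abs(token - n) - 1]
    return value if token > n else -value

# Assumed settings: 8 bins per direction, movements clipped at 128 pixels.
EDGES = exp_bin_edges(num_bins=8, max_delta=128.0)
VOCAB_SIZE = 2 * len(EDGES) + 1  # 17 tokens per axis
```

With these assumed settings, each axis needs 17 tokens, whereas one token per integer pixel over the same -128..128 range would need 257. Small movements keep fine resolution and large ones are bucketed coarsely, which matches the trade-off mentioned in the comment about linear binning creating too many tokens.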
nee1r•1h ago
Hey guys! I’m Neel, been holed up in our south park office for the past year working on model training. excited to share our research!

This is a preview of a very different type of computer use model: we train on the internet. Specifically we have 11 million hours of computer video stored on our storage cluster (previously shared https://news.ycombinator.com/item?id=45438496 !) and the model can run at 30 FPS. Since we match the fundamental form factor of computer-use, we can get our model to do CAD, browse websites, and even drive a car using arrow keys. I’m super excited to see what our model can do as we scale more, it's a fun frontier to work on (not language models :) ).

The team and I will be online responding to the comments, so drop any questions.

clemvonstengel•1h ago
I rly liked the point about ctrl-c only being able to be labelled retrocausally. I do think that with enough past context you should be able to know what was copied - in some sense the past does encode the future - but also an agentic decision is precisely the kind where the future is more informative than the past for reconstructing that decision.

It does make me wonder if you should have the inverse dynamics model split into specifically retrocausal and causal. You kind of do this already with the inverse and forward dynamics model, but the idea of a model that knows only about the future training in a feedback loop with a model that knows only about the past is kind of interesting.

I think you could just do a clever masking regime in your diffusion model to achieve the same effect without a whole architecture change.

g413n•1h ago
yeah we actually had some wacky ideas with ctc + a reverse-causal mask but diffusion does just make it all a bit more simple
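To make the causal versus reverse-causal distinction in this subthread concrete, here is a small sketch of the three attention masks being discussed. It is only an illustration of the masking idea, not the architecture from the post: the function name `attention_mask`, the mode names, and the five-frame toy sequence are assumptions.

```python
def attention_mask(seq_len: int, mode: str) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    rules = {
        "causal": lambda i, j: j <= i,          # past and present only
        "reverse_causal": lambda i, j: j >= i,  # present and future only
        "non_causal": lambda i, j: True,        # full context
    }
    allowed = rules[mode]
    return [[allowed(i, j) for j in range(seq_len)] for i in range(seq_len)]

# Toy example: ctrl-c is pressed at frame 1, and the pasted text only
# becomes visible on screen at frame 3.
COPY_FRAME, PASTE_FRAME = 1, 3

causal = attention_mask(5, "causal")
reverse = attention_mask(5, "reverse_causal")
full = attention_mask(5, "non_causal")

# Under a causal mask the copy frame cannot see the paste frame,
# so the evidence for labelling ctrl-c is hidden from it.
copy_sees_paste_causal = causal[COPY_FRAME][PASTE_FRAME]
# Under a reverse-causal mask it can.
copy_sees_paste_reverse = reverse[COPY_FRAME][PASTE_FRAME]
```

In this toy setup `copy_sees_paste_causal` is `False` and `copy_sees_paste_reverse` is `True`, which is the sense in which ctrl-c can only be labelled retrocausally.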
ClaireBookworm•1h ago
What sort of fine tuning data was needed to allow the model to self-drive? One hour of video of someone driving, or extra labeling?
nee1r•1h ago
i actually drove the car (with arrow keys) around south park for ~45 minutes as finetuning data, no extra labelling other than that. think the car line graph is super cool because you actually see the videogame prior working
g413n•1h ago
relevant note is that we finetuned by having the human also use arrow keys which keeps it in-distribution but also slower to collect
kdrag0n•1h ago
what tasks can the model do out of the box? was each of the examples a different fine tuned model?
g413n•1h ago
it's a pretty general policy but this is all super early, it's great at exploring websites so fuzzing was easy, for CAD it has good enough base rates with the few-shot prompt when we do the repetitive stuff, and we gave it checkpoints on each step, the other stuff in the mosaic are just some of our favorite clips from internal evals
snowhale•47m ago
curious about distribution in the training data. 11M hours of internet computer use is probably heavily skewed toward browser, email, and productivity apps -- the long tail of specialized tools (CAD, financial software, lab instruments) is thin. the car demo is impressive but driving is actually well-represented in internet video. how much fine-tune data did you need for blender vs the car task?
nee1r•35m ago
no finetuning data for the blender task! we actually think it's the opposite, there are a lot of video tutorials for complex tasks like onshape/blender/fusion360 but not as much of people idly browsing.

but also at the 11M-hour scale it still sees a substantial amount of data

aakashks•24m ago
The video compression is very cool. And the small tricks like binning the mouse movements.

Wonder how much data is generalizable across different UIs? ie how good will the model be at using Figma if it’s never seen it before but has seen a lot of Photoshop

nee1r•21m ago
this is honestly an issue for the inverse dynamics (for app specific shortcuts etc.) but for general UI learning we still see promising eval trends
