Show HN: SoMatic – Vision-based OS automation framework for AI agents

https://github.com/Smyan1909/SoMatic
1•smyansondur•38m ago
Hi HN, I'm Smyan and I enjoy building agents. Modern multimodal LLMs are great at vision and perception but quite poor at localization. This creates a massive problem when we hand our RPA frameworks to agents for computer-use tasks.

For browsers, we have been able to solve this by using the DOM tree to supply the LLM with structural hints. More recently, browser-use frameworks have adopted Set-Of-Marks prompting, which takes the structural information of the webpage and converts it into labeled visual bounding boxes. This lets the LLM use its strong vision and perception and turn them into accurate localization. Functionally, the LLM now simply needs to say "click 4" instead of "click 443 213".

This methodology, however, fails horribly when we try to apply it to native OS automation. The accessibility tree, which often exists for native apps, is usually quite brittle, exposes non-deterministic selectors, and is often stripped by developers, which can make it hard to localize elements. Fuzzy matching can help with this, but it is nonetheless very hard to get right.

This is exactly why I made SoMatic. SoMatic is a purely vision-based framework that uses a fine-tuned YOLO model (heavily inspired by OmniParser v2) to identify text and interactable elements in a UI. The YOLO model runs locally on the CPU with ONNX and is quite fast. SoMatic draws the bounding boxes and labels, then maps the id of each bounding box to the coordinates of that box's center. This enables Set-Of-Marks prompting for, in principle, ANY user interface.
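To make the id-to-center mapping concrete, here is a minimal Python sketch. It is not SoMatic's actual code: the `Box` type and the `build_mark_map` and `resolve_click` helpers are hypothetical names, and the boxes are hard-coded stand-ins for what a detector would return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screen pixels (hypothetical type)."""
    x1: int
    y1: int
    x2: int
    y2: int

    def center(self) -> tuple[int, int]:
        # Integer midpoint of the box, used as the click target.
        return ((self.x1 + self.x2) // 2, (self.y1 + self.y2) // 2)


def build_mark_map(boxes: list[Box]) -> dict[int, tuple[int, int]]:
    """Assign ids 1..N in detection order and map each id to its box center."""
    return {i: box.center() for i, box in enumerate(boxes, start=1)}


def resolve_click(action: str, marks: dict[int, tuple[int, int]]) -> tuple[int, int]:
    """Turn a model action such as "click 3" into screen coordinates."""
    verb, _, arg = action.strip().partition(" ")
    if verb != "click" or not arg.isdigit():
        raise ValueError(f"unsupported action: {action!r}")
    mark_id = int(arg)
    if mark_id not in marks:
        raise KeyError(f"no mark with id {mark_id}")
    return marks[mark_id]


# Stand-in detections; a real run would take these from the detector output.
boxes = [Box(10, 10, 110, 40), Box(200, 300, 260, 330), Box(400, 200, 486, 226)]
marks = build_mark_map(boxes)
print(resolve_click("click 3", marks))  # prints (443, 213)
```

The model only ever has to name a small integer; the lookup table carries the pixel coordinates, so a localization error can only come from picking the wrong mark, not from guessing a wrong pixel.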

I ran an ablation benchmark using the framework with GPT-5.5 (high) and got ~20% higher accuracy than with the raw model alone. What was surprising, however, was that the model performed slightly better when it knew just the locations of the bounding boxes (without actually seeing them). This could be due to the threshold tuning for the YOLO model drawing either too many or too few boxes (I'm not entirely sure).
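As a rough illustration of why threshold tuning changes how many boxes get drawn, here is a small Python sketch. The detections and their confidence scores are invented for the example (they are not benchmark data), and `keep_detections` is a hypothetical helper, not part of SoMatic.

```python
def keep_detections(detections, threshold):
    """Keep only detections whose confidence is at or above the threshold."""
    return [d for d in detections if d[1] >= threshold]


# Invented (label, confidence) pairs standing in for detector output.
detections = [
    ("button", 0.92),
    ("icon", 0.61),
    ("text", 0.48),
    ("icon", 0.33),
    ("text", 0.12),
]

for threshold in (0.25, 0.5, 0.9):
    kept = keep_detections(detections, threshold)
    print(threshold, len(kept))
```

A low threshold keeps most candidates and can clutter the screenshot with overlapping marks, while a high one can drop real elements the agent needs to click.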

Either way, if you have been wanting to give your AI agents full autonomy over your computer (Windows, Mac and Linux), you can download the CLI with

   npm install -g somatic-cli/cli 
and the corresponding skill with

   npx skills add Smyan1909/SoMatic 
The CLI also comes with a stdio MCP server if you want the model to parse the screenshots (base64-encoded) directly from the chosen API instead of having to read the image after each screenshot.

I'd love to get your feedback on the vision-only approach. Are we at the point where we can finally abandon the mess that is the OS accessibility tree for automation?
