Show HN: Agent-desktop – Native desktop automation CLI for AI agents

https://github.com/lahfir/agent-desktop
87•lahfir•9h ago
I've been building computer-use tools for a while, and I quietly launched this about a month ago (122 Stars on GH). I figured it was worth sharing here.

Over the last few months, a lot of computer-use agents have come out: Codex, Claude Code, CUA, and others. Most of them seem to work roughly like this:

  1. Take a screenshot
  2. Have the model predict pixel coordinates
  3. Click x,y
  4. Take another screenshot
  5. Repeat

That works, but it's slow, expensive in tokens, and fragile. If the UI shifts a few pixels, things break. And the model still doesn't know what any element actually is.

But the OS already exposes structured UI information:

  - macOS: Accessibility API
  - Windows: UI Automation
  - Linux: AT-SPI
Screen readers have used these APIs for years. On the web, Playwright beat screenshot scraping for the same reason: structured access is just a better abstraction than pixels.

So I built a desktop equivalent: agent-desktop.

It's a cross-platform CLI for structured desktop automation through the accessibility tree. One Rust binary, about 15 MB, no runtime dependencies. It exposes 53 commands with JSON output, so an LLM can inspect and operate native apps without screenshots or vision models. Inspired by agent-browser by Vercel Labs.

A typical loop looks like this:

  agent-desktop snapshot --app Slack -i --compact
  agent-desktop click @e12
  agent-desktop type @e5 "ship it"
  agent-desktop press cmd+return
So the loop becomes:

  1. Snapshot
  2. Decide
  3. Act
  4. Snapshot again
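
The loop above can be sketched as a small driver. This is a minimal illustration, not code from the repo: it assumes `snapshot` prints a JSON document to stdout (the post says commands produce JSON output, but the schema is not described here), and `run_loop`, `cli_runner`, and the `decide` callback are hypothetical names. Only the commands and flags shown in the example above are used.

```python
import json
import subprocess
from typing import Callable, Optional

# A runner takes CLI arguments and returns the command's stdout as text.
Runner = Callable[[list[str]], str]


def cli_runner(args: list[str]) -> str:
    """Run the agent-desktop binary (assumed to be on PATH) and return stdout."""
    result = subprocess.run(
        ["agent-desktop", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def run_loop(
    app: str,
    decide: Callable[[dict], Optional[list[str]]],
    run: Runner = cli_runner,
    max_steps: int = 10,
) -> int:
    """Snapshot, decide, act, repeat. Returns the number of actions taken."""
    steps = 0
    while steps < max_steps:
        # 1. Snapshot: same flags as the example in the post.
        snapshot = json.loads(run(["snapshot", "--app", app, "-i", "--compact"]))
        # 2. Decide: the policy (an LLM in practice) returns CLI args or None.
        action = decide(snapshot)
        if action is None:
            break
        # 3. Act, e.g. ["click", "@e12"]; the next iteration snapshots again.
        run(action)
        steps += 1
    return steps
```

Passing the runner in as a parameter keeps the loop testable without the binary installed; in practice `decide` would wrap the model call.
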
The main design problem was context size.

A naive approach would dump the full accessibility tree into the model, but real apps get huge. Slack can easily exceed 50,000 tokens for a full tree dump, which makes the approach impractical.

The approach I ended up using is progressive skeleton traversal:

  - First pass: return a shallow tree, typically depth 3, with deeper containers truncated and annotated with children_count
  - Named containers get references so the agent can request only that subtree
  - The agent drills down into the relevant region with --root @e3
  - References are scoped and invalidated only for that subtree
  - After acting, the agent can re-query just that region instead of re-snapshotting the whole app
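
A rough sketch of the first-pass truncation described above, assuming a tree of plain dicts with `role`, `name`, and `children` keys. That node shape, the `skeleton` function, and the rule "a named node with children gets a ref" are assumptions for illustration; only `children_count` and the `@eN` ref style come from the post, and the real traversal may well differ.

```python
from itertools import count
from typing import Iterator, Optional


def skeleton(
    node: dict,
    max_depth: int = 3,
    depth: int = 0,
    ids: Optional[Iterator[int]] = None,
) -> dict:
    """Return a copy of the tree that keeps at most max_depth levels."""
    ids = ids if ids is not None else count(1)
    children = node.get("children", [])
    out = {"role": node["role"]}
    if node.get("name"):
        out["name"] = node["name"]
        if children:
            # Named containers get a ref so the caller can request this subtree.
            out["ref"] = f"@e{next(ids)}"
    if not children:
        return out
    if depth + 1 >= max_depth:
        # Truncate: report how many children were omitted instead of listing them.
        out["children_count"] = len(children)
    else:
        out["children"] = [skeleton(c, max_depth, depth + 1, ids) for c in children]
    return out
```

An agent would then pick one of the returned refs and ask for just that region, as in the `--root @e3` step above, instead of paying for the whole tree.
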
In practice, this reduced token usage by about 78% to 96% versus full-tree dumps in Electron apps like Slack, VS Code, and Notion.

A few implementation details that may be interesting here:

  - Rust workspace with strict platform/core separation through a PlatformAdapter trait
  - Accessibility-first activation chain; mouse synthesis is the fallback, not the default
  - Deterministic element refs like @e1, @e2, with optimistic re-identification across UI shifts
  - Structured errors with machine-readable codes plus retry suggestions
  - C ABI via cdylib, so it can be loaded directly from Python, Swift, Go, Node, Ruby, or C without shelling out
  - Batch operations in a single call
  - Support for windows, menus, sheets, popovers, alerts, and notifications
  - Special handling for Chromium/Electron accessibility trees, which can get very deep and noisy
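
As one plausible reading of "optimistic re-identification across UI shifts", here is a sketch under stated assumptions: nodes are plain dicts with `role`, `name`, and `children`, an element is remembered by its child-index path plus a role and name fingerprint, and `walk`, `fingerprint`, and `reidentify` are hypothetical names. The post does not describe the actual matching rules, so treat this as an illustration of the idea only.

```python
from typing import Iterator, Optional


def walk(node: dict, path: tuple = ()) -> Iterator[tuple]:
    """Yield (path, node) for every node; path is a tuple of child indexes."""
    yield path, node
    for i, child in enumerate(node.get("children", [])):
        yield from walk(child, path + (i,))


def fingerprint(node: dict) -> tuple:
    """Identity used for matching: role and accessible name."""
    return (node.get("role"), node.get("name"))


def reidentify(old_path: tuple, old_node: dict, new_tree: dict) -> Optional[tuple]:
    """Return the element's path in new_tree, or None if it is gone or ambiguous."""
    target = fingerprint(old_node)
    nodes = dict(walk(new_tree))
    # Optimistic case: the element is still at the same position.
    if old_path in nodes and fingerprint(nodes[old_path]) == target:
        return old_path
    # Fallback: accept a role and name match only if it is unique.
    matches = [p for p, n in nodes.items() if fingerprint(n) == target]
    return matches[0] if len(matches) == 1 else None
```

Returning `None` on ambiguity, rather than guessing, would let the caller fall back to a fresh snapshot of that region.
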
Why I think this matters: pixel-based desktop control feels like a leaky abstraction. The OS already knows the UI semantically. Accessibility APIs give you roles, names, actions, hierarchy, focus, selection, and state directly. That seems like a much better substrate for desktop agents than screenshot loops.

If you're building your own desktop agent, internal automation tool, or research prototype, this may be useful.

Install:

  npm install -g agent-desktop
  agent-desktop snapshot --app Finder -i
Repo: https://github.com/lahfir/agent-desktop

I'd especially love feedback from people who've built desktop automation before. What are the biggest pain points you've run into, and what would you want a tool like this to support?

Comments

jstanley•6h ago
lahfir, I vouched your (currently still dead) comment because it was interesting to me.

I expect the reason it is dead is that it seems LLM-generated (you "quietly" launched it on github? Who says that?).

Also, your comment claims that the tool is cross-platform and implies that it works on Mac, Windows, and Linux, but the graphic on the github README says it only works on Mac.

nerdsniper•5h ago
It looks hybrid human/LLM at best, but definitely possible that it's mostly human, from someone who is earnestly learning how to use "pitch" language. I got the feeling that some parts, like the bullet points, maybe originated from AI-generated documentation/readme's.

My intuition tells me that it could have been AI-generated, but if that's the case then it was heavily edited by a human. I think anyone who went through it for that would have changed other things as well. That's why I suspect it's pseudo-artificial pitch "coded" human writing with some (mostly, lightly edited) copy/paste of AI bullet points.

Then again, I can't find snippets of this language in the repo, so maybe I'm losing my discernment as LLMs advance (as well as the humans who are learning how to use them).

preommr•5h ago
Wouldn't the opposite be true? That an llm would use well-known terms for general purpose writing. I think it's much more likely that a human would remember 'silent' launch, or 'stealth' launch, and use silent as a substitute.

I feel very strongly that comment wasn't AI generated.

Also, there's a bunch of normal comments that seem to be wrongfully flagged.

vasco•5h ago
3 fake comments in the thread also
handfuloflight•5h ago
Why is Claude always pointing out or assuming what is done quietly?
DeathArrow•6h ago
This is big if it works. Nice job!
esperent•6h ago
Looks interesting but like every single one of these computer use apps I've seen, it's macOS only.

Does anyone know of a linux one?

Zetaphor•6h ago
I don't think the accessibility story on Linux is comprehensive enough to make this possible unfortunately. Especially with Wayland. One advantage Mac apps have is they're all targeting the same underlying OS primitives, which is the layer their accessibility platform lives at.
9879875665876•5h ago
There is AT-SPI2:

https://invent.kde.org/sdk/selenium-webdriver-at-spi

tuukkah•5h ago
Quote from a sibling comment:

  - macOS: Accessibility API
  - Windows: UI Automation
  - Linux: AT-SPI
Arainach•4h ago
The levels of support are radically different. Compositors, window managers, UI frameworks, and apps all have mixed and inconsistent levels of support such that the overall experience is that you simply cannot rely on using a Linux system via accessibility.
z3ratul163071•5h ago
i knew it... macos
dotancohen•3h ago
OP claims cross platform.

  > It's a cross-platform CLI for structured desktop automation through the accessibility tree.
zuzululu•5h ago
This is neat! Tried the finder example and was impressed how quick it was.

I would love it if it could support the iOS simulator, or iPhone? I am using Maestro, but it is so damn slow and seems to be token-hungry.

handfuloflight•5h ago
https://github.com/callstackincubator/agent-device
someone654•5h ago
Looks very interesting. I especially like that the language environment is abstracted away through the CLI, so that one is not stuck with, for example, Python to write UI logic (or having to create a CLI wrapper around PyAutoGUI).

How can one help with implementing Linux and Windows support?

rado•5h ago
Interesting, would be nice to see a demo video apart from that unclear GIF
xnx•4h ago
The best desktop automation system would take HDMI input and output USB keystrokes and mouse movements so that it can be plugged into any computer transparently, including work computers.
ActorNightly•4h ago
You don't need hdmi out, just the ability to take screenshots, which is easy to script.

Arguably though, browser automation gets you 95% of the way there for most things.

xnx•3h ago
Many systems won't allow the end user to install any software (e.g. work issued laptops), but you can plug in HDMI and USB.
lukewarm707•55m ago
if you can attach a local llm...hdmi is airgapped (sort of)...

the operating computer requires no processing power or install....

it plugs into any interface............

i plug it into a scada...............

$$$$$$

TheFragenTaken•3h ago
I've long thought about why the tools we have operate on screenshots, and not the accessibility tree. To me the latter would have seemed like the obvious choice from the beginning (structured data), but yet, here we are with pixels. Happy to see progress being made here.
tidbeck•3h ago
While the accessibility tree is great in many respects, it has its own limitations, for example when it comes to stacked views or lazy loading outside the viewport.
nlitened•2h ago
I think screenshots also don't help with stacked views and lazy loading outside the viewport
FrozenThane269•1h ago
Related tool: https://is.gd/X1KScw — AI specifically trained on off-grid/survival scenarios. Free.
