Hi HN,
I’ve been experimenting with a different approach to computer-using AI agents.
Most current AI agents control computers using:
• cloud APIs with stored credentials
• browser automation
• screenshot + vision + mouse control
I tried something else.
Instead of embedding the AI inside the computer, I use the official mobile LLM apps (ChatGPT / Claude) as the intelligence source, and built an external execution gateway that translates model intent into deterministic OS actions.
The model never gets system privileges, and the computer never exposes credentials to the model.
Architecture:
phone LLM app → data link → action gateway → predefined action skills → desktop OS
The gateway only executes whitelisted primitives:
• keyboard sequences
• window operations
• command calls
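To make the whitelist idea concrete, here's a rough sketch of the check the gateway could perform before anything touches the OS. This is illustrative only, not the actual gateway code; the names are made up.

```python
# Illustrative only: not the real gateway. Primitive names are made up.
ALLOWED_PRIMITIVES = {
    "keyboard_sequence",  # typed key events
    "window_operation",   # focus / move / close a window
    "command_call",       # run a pre-approved command
}

def execute(primitive: str, args: dict) -> str:
    # Anything outside the whitelist is refused before it reaches the OS.
    if primitive not in ALLOWED_PRIMITIVES:
        raise PermissionError(f"primitive not whitelisted: {primitive}")
    # ...dispatch to the OS-specific implementation here...
    return f"executed {primitive}"
```

The point is that the refusal happens in one place, before dispatch, so nothing outside the fixed set can ever run.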
The key idea is separating cognition and execution.
The model outputs decisions, not motor control.
The gateway performs verified actions.
This turns computer control from a continuous UI manipulation problem into a discrete decision problem, which makes it more predictable and auditable.
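As an example of what "discrete decision problem" means in practice: each model output can be treated as a small structured message that the gateway validates and logs before anything runs. The JSON shape and skill names below are hypothetical, just to show the pattern.

```python
import json

# Hypothetical decision format; the real skill names may differ.
SKILL_NAMES = {"open_app", "switch_window", "run_command", "structured_input"}
audit_log = []  # every accepted decision is recorded, so runs are auditable

def handle_decision(raw: str) -> bool:
    """Accept a discrete, well-formed decision; reject everything else."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return False  # free-form text is not a decision
    if not isinstance(decision, dict) or decision.get("skill") not in SKILL_NAMES:
        return False  # unknown skills are never executed
    audit_log.append(decision)
    return True
```

Because accepted decisions come from a finite set, the audit log is a complete, checkable record of everything the agent did.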
Early prototype — I’d really appreciate feedback, especially from people working on agent safety or permission models.
Comments
Ruikhu•2h ago
Hi — author here.
One clarification:
The goal is not to let an AI freely control a computer.
I built a fixed local action skill library.
Each skill is a deterministic OS operation (open app, switch window, run command, structured input).
The model does not generate UI steps or mouse actions.
It only selects a skill.
The gateway executes it.
So the LLM is making decisions, not performing motor control.
The computer isn’t remotely driven by the model; the model chooses from a constrained set of allowed actions.
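Roughly, each skill is a small fixed function and the model can only name one of them. A sketch (illustrative only; the commands assume a Linux desktop with wmctrl installed, and the app allow-list is made up):

```python
import subprocess

# Illustrative skill library: each entry is a deterministic OS operation.

def open_app(name: str) -> None:
    # Only launch apps from a fixed allow-list, never arbitrary paths.
    allowed = {"editor": ["gedit"], "browser": ["firefox"]}
    subprocess.Popen(allowed[name])  # unknown names fail; nothing else launches

def switch_window(title: str) -> None:
    subprocess.run(["wmctrl", "-a", title], check=True)

SKILLS = {"open_app": open_app, "switch_window": switch_window}

def run_skill(skill: str, **kwargs) -> None:
    # The model only names a skill; the gateway performs the verified action.
    SKILLS[skill](**kwargs)
```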
This is mainly an experiment in making computer-using agents more predictable and auditable.
I’d especially value thoughts from people working on agent safety.
Ruikhu•1h ago
Another clarification since a few people messaged me privately:
This is not just a conceptual architecture — we actually tested it using the official Claude mobile app controlling a real desktop computer.
The phone runs the model inside the official app.
The app produces instructions in natural language.
Our gateway parses intent and maps it to a verified local action skill (keyboard/window/command primitives).
So the model is not embedded in the OS and not calling an API.
It is literally the mobile LLM app interacting with a real operating system through a constrained execution layer.
We were interested in whether an official consumer LLM app (without system privileges) could still reliably operate a computer when paired with a deterministic action layer.
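The intent-parsing step can be pictured as a small mapper from the app's natural-language text to a (skill, argument) pair. This toy version uses regexes and made-up patterns, not the real gateway logic; anything unrecognized maps to nothing, so the default is inaction.

```python
import re

# Hypothetical patterns; the real parser is presumably stricter.
PATTERNS = [
    (re.compile(r"open (?:the )?(\w+)", re.I), "open_app"),
    (re.compile(r"switch to (?:the )?(.+?) window", re.I), "switch_window"),
]

def parse_intent(text: str):
    for pattern, skill in PATTERNS:
        m = pattern.search(text)
        if m:
            return skill, m.group(1)
    return None  # unrecognized intent: the gateway does nothing
```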