Hi HN, I built *native-devtools-mcp*, a Model Context Protocol (MCP) server for interacting with the UIs of native desktop applications. Right now it supports macOS and Windows, but I intend to add more platforms in the future.
Motivation: Most MCP servers today target specific environments (the Chrome DevTools MCP server for browser automation is a good example), but there’s no general MCP bridge for native desktop GUIs. native-devtools-mcp gives AI agents the ability to:
- capture screenshots and extract text (OCR) from the screen (a sample MCP request is sketched right after this list)
- simulate user input (mouse clicks, typing, scrolling) with high precision by using the OS's local OCR
- manage windows and focus
- optionally connect to deeper UI trees for instrumented apps
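To make that concrete, here is roughly what an agent-issued MCP tools/call request looks like on the wire. The tool name and arguments below are illustrative placeholders, not the server's actual schema:

    {
      "jsonrpc": "2.0",
      "id": 1,
      "method": "tools/call",
      "params": {
        "name": "take_screenshot",
        "arguments": { "target": "frontmost_window" }
      }
    }

The response comes back as a standard MCP tool result (e.g. image and text content blocks), which the agent then uses to decide where to click or type next.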
It runs locally, does not upload any data externally (except for the LLM integration), and supports both macOS and Windows for now. The goal is to enable AI-driven workflows for GUI testing, automation, and desktop tool interaction.
Tech stack/highlights:
- MCP JSON-RPC interface for tool clients (a minimal client sketch follows this list)
- Visual feedback (images + OCR) plus input simulation
- Dual interaction modes: a universal visual mode relying on OCR/screenshots, plus a structural debug-kit mode where available (macOS)
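If you want to drive it outside of Claude, any MCP client library works. Here is a minimal sketch using the official TypeScript SDK over stdio; the launch command, package name, and tool name are assumptions for illustration, so check the README for the real ones:

    import { Client } from "@modelcontextprotocol/sdk/client/index.js";
    import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

    // Spawn the server as a child process and talk to it over stdio.
    // "native-devtools-mcp" is a placeholder command, not necessarily the real one.
    const transport = new StdioClientTransport({
      command: "npx",
      args: ["native-devtools-mcp"],
    });

    const client = new Client({ name: "example-client", version: "0.1.0" });
    await client.connect(transport);

    // Discover what the server actually exposes...
    const { tools } = await client.listTools();
    console.log(tools.map((t) => t.name));

    // ...and call one tool (name and arguments are illustrative).
    const result = await client.callTool({
      name: "take_screenshot",
      arguments: {},
    });
    console.log(result.content);

This listTools/callTool round trip is the same thing Claude Code and Claude Desktop do under the hood once the server is registered.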
Limitations / roadmap:
- Early stage; improvements to accuracy and reliability planned
- Expanding deeper (structural) support to more app platforms (Android is next!)
- Integration with more AI tools: right now it's tested with Claude Code and Claude Desktop (and Cowork); it should work with other AI platforms too, but I haven't had time to test them yet (a sample Claude Desktop config is below)
- Better documentation and tooling around agent integration
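If you want to try it with Claude Desktop today, MCP servers are registered in claude_desktop_config.json under the mcpServers key. Here is a sketch of what the entry might look like; the command and args are assumptions, so substitute whatever launch command the README documents:

    {
      "mcpServers": {
        "native-devtools": {
          "command": "npx",
          "args": ["native-devtools-mcp"]
        }
      }
    }

Claude Code can register the same server from the terminal via its "claude mcp add" command.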
Feedback I’m looking for:
- Practical use cases where this changed your automation or testing workflow
- Ideas to make MCP server integration with existing AI agent stacks easier
Happy to answer questions.