Phone GUI agents (e.g., AutoGLM-Phone, GELab) can already handle natural-language-driven taps, navigation, and form filling.
My observation: smaller GUI models (often 4B/9B class) work well for single interactions but become brittle on long workflows that involve branching and recovery.
I built a Skill layer that separates planning from execution (interfaces sketched below):
- Planner: Claude Code / Codex (task decomposition, decision-making, replanning)
- Orchestrator: Skill layer (state machine, retries/rollback, tool protocol)
- Executor: phone GUI model (screen parsing + UI actions + cross-app execution)
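A minimal sketch of the contract between the three layers, in Python. All names here (`Step`, `Observation`, `Planner`, `Executor`) are hypothetical placeholders for illustration; neither AutoGLM-Phone nor GELab exposes this exact interface.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Step:
    """One planner-emitted step: what to do, when it counts as done,
    and what to try if it fails."""
    goal: str                    # e.g. "open the compose screen"
    success_condition: str       # e.g. "compose button visible"
    fallback: str | None = None  # e.g. "navigate back and retry"


@dataclass
class Observation:
    """What the executor reports back after acting."""
    screenshot: bytes
    ui_state: dict  # parsed screen elements
    structured_output: dict = field(default_factory=dict)


class Planner(Protocol):
    """Claude Code / Codex behind a thin wrapper: decomposes the task,
    makes decisions, replans on failure."""
    def plan(self, task: str) -> list[Step]: ...
    def replan(self, task: str, failed: Step, obs: Observation) -> list[Step]: ...


class Executor(Protocol):
    """The phone GUI model: parses the screen, performs one atomic action."""
    def execute(self, action: dict) -> Observation: ...
```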
Execution loop (a minimal code sketch follows the list):
1. Goal arrives as NL or a template
2. Planner emits a step plan + success conditions + fallback strategy
3. Skill layer compiles each step into atomic actions (tap/type/swipe/wait/verify)
4. GUI executor runs them on a real or cloud phone, returns screenshots/state/structured output
5. Planner/orchestrator decide the next step until success or fallback
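Building on the hypothetical interfaces above, here is a sketch of the orchestrator's retry/replan loop. `compile_step`, `verify`, and the retry budget are illustrative stand-ins for the Skill layer's real state machine, not its implementation.

```python
MAX_RETRIES = 2  # hypothetical per-step retry budget


def compile_step(step: Step) -> list[dict]:
    # Lower one planner step into atomic actions over the tool protocol.
    # Hardcoded here for illustration; the real layer is template-driven.
    return [
        {"type": "tap", "target": step.goal},
        {"type": "wait", "seconds": 1},
        {"type": "verify", "condition": step.success_condition},
    ]


def verify(step: Step, obs: Observation) -> bool:
    # Hypothetical check: a real implementation would match the success
    # condition against parsed screen elements, not a string dump.
    return step.success_condition in str(obs.ui_state)


def run(task: str, planner: Planner, executor: Executor) -> None:
    steps = list(planner.plan(task))
    while steps:
        step = steps.pop(0)
        for _attempt in range(MAX_RETRIES + 1):
            for action in compile_step(step):
                obs = executor.execute(action)
            if verify(step, obs):
                break  # step reached its success condition; move on
        else:
            # Retries exhausted: hand the failure back to the planner,
            # which replaces the remaining plan (step 5 above).
            steps = list(planner.replan(task, step, obs))
```

The point of making verify an atomic action is that the orchestrator gates progress on observed UI state rather than trusting the executor's self-report, which is what keeps long branching workflows from drifting.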
Potential use cases:
- recruiting outreach automation
- multi-platform content distribution
- social outreach workflows
- lead extraction
- competitor monitoring