We build DOM-native web agents (no screenshot-based vision, no CDP/Playwright debugger-port control). We handle captchas natively, including Google reCAPTCHA image challenges, by traversing cross-origin iframes and shadow DOM. Latency on the reCAPTCHA image challenges is currently high.
The problem: when debugging image-selection captchas ("select all images with traffic lights"), logs don't tell you why the agent clicked the wrong tiles. I found myself staring at execution logs thinking "did it even see the grid correctly?" and realized I just wanted to watch it work.
So we built live VNC view + takeover for serverless Chrome workers on Cloud Run.
Key learnings:
1. Session affinity is best-effort; "attach later" can hit a different instance
2. A separate relay service that pairs viewer↔runner by short-lived tokens makes attach deterministic
3. Runner stays clean: concurrency=1, one browser per container, no mixed traffic
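To make learning 2 concrete, here is a minimal sketch of token-based pairing, not our actual relay code. It assumes the runner asks the relay for a token when it starts a session and the viewer later presents that token to be routed to the same runner. The names (PairingRelay, issue_token, claim, runner_id) are hypothetical, and the in-memory dict stands in for whatever store a real relay would use.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class PairingRelay:
    """Pairs a viewer with a runner via short-lived, single-use tokens.

    Hypothetical sketch: state lives in process memory only.
    """
    ttl_seconds: float = 60.0
    # Injectable clock so expiry can be exercised without sleeping.
    clock: Callable[[], float] = time.monotonic
    # token -> (runner_id, expiry timestamp on the relay's clock)
    _pending: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def issue_token(self, runner_id: str) -> str:
        """Called on behalf of a runner; returns an unguessable token."""
        token = secrets.token_urlsafe(16)
        self._pending[token] = (runner_id, self.clock() + self.ttl_seconds)
        return token

    def claim(self, token: str) -> Optional[str]:
        """Called when a viewer attaches; returns the runner id or None.

        The token is removed on first use, so a replayed or leaked
        token cannot attach a second viewer.
        """
        entry = self._pending.pop(token, None)
        if entry is None:
            return None  # unknown or already used
        runner_id, expires_at = entry
        if self.clock() >= expires_at:
            return None  # expired before the viewer attached
        return runner_id


relay = PairingRelay(ttl_seconds=30)
token = relay.issue_token("runner-abc")
first = relay.claim(token)   # resolves to the runner that issued it
second = relay.claim(token)  # None: tokens are single-use
```

Because the viewer is routed by the token rather than by load-balancer affinity, "attach later" resolves to the runner that issued the token or fails outright; it never silently lands on a different instance.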
Would love feedback from folks who've shipped similar:
1. What replaced VNC for you (WebRTC etc) and why?
2. Best approach for recording/replay without huge storage?
3. How do you handle "attach later" safely in serverless?