It uses:
- YOLOX for gesture detection
- ONNX Runtime Web for in-browser inference (rough sketch of the inference path below)
- Plain JS for the UI
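
Roughly, the inference path looks like this. This is a minimal sketch, not my actual code: the model URL, the `images` input name, and the 416x416 input size all depend on how the YOLOX model was exported.

```js
// Minimal sketch: load a YOLOX ONNX export and run it on a video frame with
// onnxruntime-web. Model URL, "images" input name, and 416x416 input size are
// assumptions about the export, not confirmed details.
import * as ort from "onnxruntime-web";

const INPUT_SIZE = 416;

async function loadModel(url) {
  // The WASM backend runs everywhere; WebGL/WebGPU can be tried where available.
  return ort.InferenceSession.create(url, { executionProviders: ["wasm"] });
}

function frameToTensor(video) {
  // Draw the current video frame at the model's input resolution.
  // (Plain stretch-resize for brevity; YOLOX's reference preprocessing
  // letterboxes with gray padding instead.)
  const canvas = document.createElement("canvas");
  canvas.width = INPUT_SIZE;
  canvas.height = INPUT_SIZE;
  const ctx = canvas.getContext("2d");
  ctx.drawImage(video, 0, 0, INPUT_SIZE, INPUT_SIZE);
  const { data } = ctx.getImageData(0, 0, INPUT_SIZE, INPUT_SIZE);

  // RGBA uint8 (HWC) -> RGB float32 (CHW). Recent YOLOX exports take raw
  // 0-255 pixel values; older ones may expect mean/std normalization here.
  const plane = INPUT_SIZE * INPUT_SIZE;
  const chw = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    chw[i] = data[i * 4];                 // R
    chw[i + plane] = data[i * 4 + 1];     // G
    chw[i + 2 * plane] = data[i * 4 + 2]; // B
  }
  return new ort.Tensor("float32", chw, [1, 3, INPUT_SIZE, INPUT_SIZE]);
}

async function detect(session, video) {
  const outputs = await session.run({ images: frameToTensor(video) });
  // The raw YOLOX head output still needs grid decoding + NMS before it maps
  // to individual seals; omitted here.
  return outputs;
}
```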
The original goal was simple: Could I make real-time gesture-based input usable inside a browser without freezing the UI?
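
One way to keep the UI from freezing is to run inference in a Web Worker and only post results back to the main thread for drawing. A minimal sketch of that pattern, not necessarily how this project is wired up (file names and the drawOverlay helper are illustrative):

```js
// Illustrative setup: inference runs in a worker, the main thread only draws.

// --- main.js ---
const worker = new Worker("detector-worker.js", { type: "module" });

worker.onmessage = (e) => {
  drawOverlay(e.data.detections); // app-specific; only light UI work here
};

async function onFrame(video) {
  // Transfer the frame as an ImageBitmap so the pixels move without copying.
  const bitmap = await createImageBitmap(video);
  worker.postMessage(bitmap, [bitmap]);
}

// --- detector-worker.js ---
import * as ort from "onnxruntime-web";

// Model file name is a placeholder for whatever YOLOX export is being served.
const sessionPromise = ort.InferenceSession.create("yolox.onnx", {
  executionProviders: ["wasm"],
});

self.onmessage = async (e) => {
  const session = await sessionPromise;
  const bitmap = e.data;

  // Read pixels inside the worker via OffscreenCanvas, then preprocess and
  // run exactly as in the earlier sketch.
  const canvas = new OffscreenCanvas(416, 416);
  const ctx = canvas.getContext("2d");
  ctx.drawImage(bitmap, 0, 0, 416, 416);
  bitmap.close();
  // const tensor = ...; const outputs = await session.run({ images: tensor });

  self.postMessage({ detections: [] }); // decoded boxes would go here
};
```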
A few observations:
- In-browser ML performance is better than I expected on modern laptops
- Subtle gesture distinctions (e.g. similar seals like Tiger vs Ram) require stronger detection than MediaPipe provided; YOLOX performed noticeably better
- Lighting consistency matters more than hand size
It’s obviously not production-grade, but it was an interesting exploration of browser-based vision input.
Curious what others think about gesture interfaces as alternative input systems.