I’m building a voice-first AI language teacher in the browser using:
- Next.js
- Gemini Audio API for STT
- Gemini TTS
- Supabase
The product vision is a structured AI “language university” rather than a general chatbot.
Right now my biggest technical problem is: How would you implement lightweight browser-native lip sync for a static avatar image while TTS audio is playing?
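One browser-native approach that stays lightweight: tap the playing TTS `<audio>` element with a Web Audio `AnalyserNode`, compute RMS amplitude each animation frame, and map it to a small set of pre-rendered mouth sprites (closed → wide open). This is a sketch under stated assumptions: the frame count, sprite paths, and element wiring are hypothetical, not from your repo.

```typescript
const FRAME_COUNT = 4; // hypothetical: closed, slightly open, open, wide

// Pure mapping: RMS amplitude (0..1) → mouth sprite index.
export function amplitudeToFrame(rms: number, frames = FRAME_COUNT): number {
  const boosted = Math.min(1, rms * 3); // speech RMS rarely exceeds ~0.3
  return Math.min(frames - 1, Math.floor(boosted * frames));
}

// Browser glue: sample the TTS audio element and swap mouth sprites.
// Returns a stop function that cancels the animation loop.
export function startLipSync(
  audio: HTMLAudioElement,
  mouth: HTMLImageElement,
): () => void {
  const ctx = new AudioContext();
  const source = ctx.createMediaElementSource(audio);
  const analyser = ctx.createAnalyser();
  analyser.fftSize = 256;
  source.connect(analyser);
  analyser.connect(ctx.destination); // keep the audio audible

  const buf = new Float32Array(analyser.fftSize);
  let raf = 0;
  const tick = () => {
    analyser.getFloatTimeDomainData(buf);
    let sum = 0;
    for (const s of buf) sum += s * s;
    const rms = Math.sqrt(sum / buf.length);
    mouth.src = `/mouth-${amplitudeToFrame(rms)}.png`; // hypothetical sprite paths
    raf = requestAnimationFrame(tick);
  };
  tick();
  return () => cancelAnimationFrame(raf);
}
```

Amplitude alone won’t give true visemes, but with 3–5 mouth frames and a small boost/smoothing curve it reads as "alive" at near-zero dependency cost, and it degrades gracefully on mobile since it is just an image swap per frame.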
I’ve tried:
- Three.js + VRM (too heavy / unstable for this use case)
- simple canvas mouth animation
- CSS-only pulse effects
I want something that is:
- realistic enough to feel alive
- low in dependency weight
- web-compatible
- stable on mobile
Secondary issues:
- MediaRecorder reliability on mobile Safari
- reducing transcript latency
- voice UX for guided teaching rather than free-form chat
Demo: https://koshe-al.onrender.com
Repo: https://github.com/Bugsbuny24/Koshe-Al-
Would love technical suggestions, architecture criticism, or examples of similar systems done well.