What moved the needle:
Voice is a turn-taking problem, not a transcription problem. VAD alone fails; you need semantic end-of-turn detection.
The system reduces to one loop: speaking vs listening. The two transitions - cancel instantly on barge-in, respond instantly on end-of-turn - define the experience.
STT → LLM → TTS must stream. Sequential pipelines are dead on arrival for natural conversation.
Time to first token (TTFT) dominates everything. In voice, the first token is the critical path. Groq’s ~80 ms TTFT was the single biggest win.
Geography matters more than prompts. Colocate everything or you lose before you start.
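
To make the speaking/listening loop concrete, here is a minimal Python sketch of the two transitions as a tiny state machine. It is an illustration only, not the code from the repo: `TurnLoop`, `State`, `Event`, and the action names `start_response` and `cancel_response` are hypothetical. It also assumes that something upstream (a semantic end-of-turn detector, a barge-in detector) emits the events.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class Event(Enum):
    END_OF_TURN = auto()  # assumed: semantic end-of-turn detector fired
    BARGE_IN = auto()     # assumed: user started talking over the agent
    AGENT_DONE = auto()   # agent finished playing its reply


@dataclass
class TurnLoop:
    """Hypothetical two-state turn-taking loop."""
    state: State = State.LISTENING
    actions: list = field(default_factory=list)  # records side effects for inspection

    def handle(self, event: Event) -> State:
        if self.state is State.LISTENING and event is Event.END_OF_TURN:
            # Respond instantly: kick off the streaming STT -> LLM -> TTS reply.
            self.actions.append("start_response")
            self.state = State.SPEAKING
        elif self.state is State.SPEAKING and event is Event.BARGE_IN:
            # Cancel instantly: stop generation and playback, go back to listening.
            self.actions.append("cancel_response")
            self.state = State.LISTENING
        elif self.state is State.SPEAKING and event is Event.AGENT_DONE:
            # Reply finished on its own; nothing to cancel.
            self.state = State.LISTENING
        # Any other (state, event) pair is ignored.
        return self.state


loop = TurnLoop()
loop.handle(Event.END_OF_TURN)  # user stops talking, agent starts replying
loop.handle(Event.BARGE_IN)     # user interrupts, agent cancels
print(loop.state, loop.actions)
```

In a real system the two actions would start and cancel asynchronous streaming tasks rather than append strings. The point of the sketch is that everything else hangs off these two transitions.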
GitHub Repo: https://github.com/NickTikhonov/shuo
Follow whatever I tinker with next: https://x.com/nick_tikhonov