Some technical details: - Predicts conversational floor ownership, not speech endpoints - Audio-native streaming model, no ASR dependency - Human-timed responses without silence-based delays - Zero interruptions at sub-100ms median latency - In benchmarks Sparrow-1 beats all existing models at real world turn-taking baselines
I wrote more about the work here: https://www.tavus.io/post/sparrow-1-human-level-conversation...