Built by augmenting Gemma 3 12B with our new text-to-speech and speech-to-text models, both of which we will release as open source soon. Stay tuned.
The latency is about 500ms once we detect that it's the bot's turn to speak (roughly 200ms for the LLM's time to first token and 300ms for the TTS audio to start), plus a variable delay from the semantic pause detection (VAD, voice activity detection).
If it's clear that you're done talking, like when you ask a question, the model will reply very fast. If you stop mid-sentence as if you have more to say, it will wait longer to avoid interrupting you.
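To make the latency arithmetic concrete, here is a rough sketch of the budget described above. Only the 200ms and 300ms figures come from the comment; the `end_of_turn_prob` input, the linear mapping from confidence to wait time, and the 2000ms ceiling are assumptions for illustration and not how the actual pause detection works.

```python
# Figures quoted in the comment above.
LLM_TTFT_MS = 200   # LLM time to first token
TTS_START_MS = 300  # time until TTS audio starts


def pause_wait_ms(end_of_turn_prob: float,
                  min_wait_ms: float = 0.0,
                  max_wait_ms: float = 2000.0) -> float:
    """Hypothetical mapping: the more confident we are that the speaker
    is done, the shorter we wait. The linear form and the default
    bounds are assumptions, not values from the real system."""
    if not 0.0 <= end_of_turn_prob <= 1.0:
        raise ValueError("end_of_turn_prob must be in [0, 1]")
    return min_wait_ms + (1.0 - end_of_turn_prob) * (max_wait_ms - min_wait_ms)


def response_latency_ms(wait_ms: float) -> float:
    """Total delay: variable pause-detection wait plus the fixed ~500ms."""
    return wait_ms + LLM_TTFT_MS + TTS_START_MS
```

Under these assumptions, a clear question (`end_of_turn_prob` near 1) gets a reply in roughly 500ms, while a mid-sentence stop pushes the total toward the ceiling plus 500ms.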
lightbulbish•7mo ago
To the author: what happens to my voice after I upload it? What is your plan moving forward? I am too far out in left field to understand how to build a business and monetize an open-source product like this, even though I found it fun to play around with.
unmute-sh•7mo ago
edit: Ah yes, and we do not store the voice sample on our server. The voice embedding is cached for 24 hours.
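For anyone curious what "cached for 24 hours" could look like, here is a minimal in-memory sketch of a time-limited cache. The `EmbeddingCache` class, its method names, and the in-memory dict are all hypothetical; the only detail taken from the comment is the 24-hour lifetime, and the real service may work quite differently.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

TTL_SECONDS = 24 * 60 * 60  # the 24-hour lifetime mentioned above


class EmbeddingCache:
    """Hypothetical cache: keeps only the embedding, never the raw audio,
    and forgets each entry once its TTL has elapsed."""

    def __init__(self, ttl_seconds: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, List[float]]] = {}

    def put(self, key: str, embedding: List[float]) -> None:
        # Store the embedding together with its expiry time.
        self._entries[key] = (self._clock() + self._ttl, embedding)

    def get(self, key: str) -> Optional[List[float]]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, embedding = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so it cannot be served again.
            del self._entries[key]
            return None
        return embedding
```

Expiry here is lazy (checked on read); a production setup would more likely lean on a store with built-in key expiry.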