This aligns with what I've been thinking and chatting with my peers about - technical documentation would be useful to benchmark performance globally, but I have heard murmurs of it already being used for voice-gen use cases by a WITCH company.
shivekkhurana•1h ago
The TTS/STT models are actually good and aggressively priced. I personally built a voice-mode AI assistant.
STT time to first token is ~300ms. ~20 seconds of audio is transcribed in under 1 second.
TTS time to first token is ~700ms. ~20 seconds of audio is generated in under 2 seconds.
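To put those numbers in more comparable terms, here is a rough sketch that turns them into a real-time factor (processing time divided by audio duration, so lower is faster). The function name and the use of the quoted upper bounds as inputs are my own assumptions for illustration, not anything from the vendor's SDK, and the latency sum assumes the two stages run back to back with LLM time left out.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return processing_s / audio_s


# Upper bounds taken from the figures quoted above (hypothetical inputs, not measurements).
stt_rtf = real_time_factor(1.0, 20.0)  # ~20 s clip transcribed in under 1 s
tts_rtf = real_time_factor(2.0, 20.0)  # ~20 s of speech generated in under 2 s

# Rough time-to-first-audio for a voice loop if STT and TTS first-token
# latencies simply add up; the LLM step in between is not included.
first_token_sum_ms = 300 + 700

print(f"STT RTF <= {stt_rtf:.2f}, TTS RTF <= {tts_rtf:.2f}")
print(f"STT + TTS first-token latency ~ {first_token_sum_ms} ms")
```

So both directions run at least 10x faster than real time on these figures, and the first-token latencies are what actually dominate how responsive a voice assistant feels.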
alephnerd•1h ago
Absolutely! The TTS/STT approach that Sarvam and the other Indian firms are taking is more intuitive for a larger share of people and use cases. The "replace an SDR" or "replace a call center" use case is such an easy win to show POV.
I feel this is also why you don't see the same degree of hype as you would with the other players. When you are taking an application-driven approach to launching AI products, hype matters less than targeting decision-makers and showing that your product directly aligns with their outcomes.