I couldn’t find anything that combined these, so I reluctantly built and self-hosted a conversational voice pipeline (STT → LLM → TTS). It sounds close to ElevenLabs but costs roughly 10× less per minute at scale. However, it’s difficult to maintain and requires serious GPU capacity, so it feels like total overkill for just my one app.
I’m considering exposing this as a turnkey conversational voice API or an embeddable widget.
Is this something others would want to use?
willbdavenport•1h ago