This release supports English text-to-speech applications in eight voices: four male and four female. The model is quantized to int8 + fp16 and uses ONNX for its runtime. It is designed to run literally anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required!
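For anyone wondering what the int8 part buys you, here is a minimal sketch of symmetric int8 weight quantization in plain Python. This is an illustration of the general technique only, not the scheme used for this model; the function names and the sample weights are made up for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Each weight now needs 1 byte instead of 4 (fp32), plus one shared scale.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the scale, which is the trade-off for storing each weight in a quarter of the fp32 size.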
We're releasing this to give early users a sense of the latency and voices that will be available in our next release (hopefully next week). We'd love your feedback! Just FYI, this model is an early checkpoint trained on less than 10% of our total data.
We started working on this because existing expressive OSS models require big GPUs to run on-device, and the cloud alternatives are too expensive for high-frequency use. We think there's a need for frontier open-source models that are tiny enough to run on edge devices!
GaggiX•6h ago
https://github.com/KittenML/KittenTTS
This is the model and GitHub page. This blog post looks very much AI-generated.