Most open TTS models are either computationally heavy or generate 16-24kHz audio. Mira achieves high fidelity and speed by combining two things:
FlashSR: For generating crisper, clearer 48kHz audio output.
LMDeploy: Heavily optimized inference, allowing for 100x real-time speed and low latency (roughly 150ms).
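
To make the "100x real-time" figure concrete, here is a rough sketch of the arithmetic, assuming it means 100 seconds of audio produced per second of wall-clock compute. The helper names and example timings are hypothetical illustrations, not MiraTTS code or measured results.

```python
def realtime_speedup(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock compute."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def num_samples(audio_seconds: float, sample_rate_hz: int) -> int:
    """Number of samples in a mono clip of the given length."""
    return round(audio_seconds * sample_rate_hz)


# Illustrative numbers only: a 60 s clip synthesized in 0.6 s of compute.
speedup = realtime_speedup(audio_seconds=60.0, wall_seconds=0.6)
samples = num_samples(audio_seconds=60.0, sample_rate_hz=48_000)
print(f"speedup: {speedup:.0f}x, samples at 48kHz: {samples}")
```

At 48kHz the same clip holds twice as many samples as it would at 24kHz, which is why output sample rate and generation speed are worth quoting together.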
I built this so users can run a high-quality text-to-speech model locally for any use case. It's still in its early stages, and I'm experimenting with multilingual and multi-speaker versions. Streaming is coming soon as well.
Repo: https://github.com/ysharma3501/MiraTTS
Model: https://huggingface.co/YatharthS/MiraTTS
I also wrote a breakdown on how these LLM based TTS models work: https://huggingface.co/blog/YatharthS/llm-tts-models