I wanted to share a small project I've been working on to solve a personal pain point: TinyTTS.
We all love our massive 70B+ LLMs, but when building local voice assistants, running a heavy TTS framework alongside them often eats up way too much precious VRAM and compute. I wanted something absurdly small and fast that "just works" locally.
TL;DR Specs:
Size: ~9 Million parameters
Disk footprint: ~20 MB checkpoint (G.pth)
Speed (CPU): ~0.45s to generate 3.7s of audio (~8x faster than real-time)
Speed (GPU - RTX 4060): ~0.056s (~67x faster than real-time)
Peak VRAM: ~126 MB
License: Apache 2.0 (Open Weights)
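For reference, the speed multipliers above fall out of a one-line calculation: seconds of audio produced divided by seconds spent generating. Using the rounded figures from the spec list (the author's exact timings may differ slightly):

```python
# Real-time speedup = audio duration / generation time.
# Numbers taken from the TL;DR specs above (rounded).
audio_s = 3.7       # seconds of generated audio

cpu_gen_s = 0.45    # CPU generation time
gpu_gen_s = 0.056   # GPU (RTX 4060) generation time

cpu_speedup = audio_s / cpu_gen_s   # ≈ 8.2x faster than real-time
gpu_speedup = audio_s / gpu_gen_s   # ≈ 66x with these rounded inputs

print(f"CPU: {cpu_speedup:.1f}x, GPU: {gpu_speedup:.1f}x")
```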
Why TinyTTS? It is designed specifically for edge devices, CPU-only setups, or situations where your GPU is entirely occupied by your LLM. It's fully self-contained, meaning you don't need to run a complex pipeline of multiple models just to get audio out.
How to use it? It's completely plug-and-play with a simple Python API. Even better, on your first run it automatically downloads the tiny 20 MB checkpoint from Hugging Face into your local cache.
pip install git+https://github.com/tronghieuit/tiny-tts.git
Python API:
from tiny_tts import TinyTTS
# Auto-detects device (CPU/CUDA) and downloads the 20MB checkpoint
tts = TinyTTS()
tts.speak("The weather is nice today, and I feel very relaxed.", output_path="output.wav")
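Once `tts.speak(...)` has written `output.wav`, you can sanity-check the result with Python's stdlib `wave` module. A minimal sketch, which synthesizes a placeholder 3.7s sine-wave WAV so it runs without TinyTTS installed (the 16 kHz sample rate is an assumption; check the model card for the real one):

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # assumed sample rate, not confirmed by the project
DURATION_S = 3.7

# Placeholder: write a mono 16-bit PCM sine tone standing in for output.wav.
with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit samples
    wf.setframerate(SAMPLE_RATE)
    n_frames = int(SAMPLE_RATE * DURATION_S)
    wf.writeframes(b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * 220 * t / SAMPLE_RATE)))
        for t in range(n_frames)
    ))

# Sanity check: read the file back and compute its duration.
with wave.open("output.wav", "rb") as wf:
    duration = wf.getnframes() / wf.getframerate()

print(f"{duration:.2f}s of audio")  # → 3.70s of audio
```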
CLI:
tiny-tts --text "Local AI is the future" --device cpu
Links:
GitHub: https://github.com/tronghieuit/tiny-tts
Gradio Web Demo: available on HF Spaces
Hugging Face Model: backtracking/tiny-tts
What's next? I plan to clean up and publish the training code soon so the community can fine-tune it easily. I am also looking into adding ultra-lightweight zero-shot voice cloning.
Would love to hear your feedback or see if anyone manages to run this on a literal potato! Let me know what you think.
If you find this project helpful, please give it a star on GitHub.