ulan_kg•1h ago
We've been tinkering with TTS models for a while, and I'm excited to share KaniTTS – an open-source text-to-speech system we built at NineNineSix.ai. It's designed for speed and quality, hitting real-time generation on consumer GPUs while sounding natural and expressive.
Quick overview:
Architecture: Two-stage pipeline – a LiquidAI LFM2-350M backbone generates compact semantic/acoustic tokens from text (handling prosody, punctuation, etc.), then NVIDIA's NanoCodec synthesizes them into 22kHz waveforms. Trained on ~50k hours of data.
Performance: On an RTX 5080, it generates 15s of audio in ~1s with only 2GB VRAM.
Languages: English-focused, but tokenizer supports Arabic, Chinese, French, German, Japanese, Korean, Spanish (fine-tune for better non-English prosody).
Use cases: Conversational AI, edge devices, accessibility, or research. Batch up to 16 texts for high throughput.
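To make the throughput claims above concrete, here is a small back-of-the-envelope sketch. The chunking helper and constants are my own illustration (not code from the KaniTTS repo); the numbers are just the figures quoted above: batches of up to 16 texts, and ~15 s of audio generated in ~1 s.

```python
from typing import Iterator, List

MAX_BATCH = 16  # stated maximum batch size for high-throughput inference

def batches(texts: List[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Split a list of input texts into consecutive batches of at most `size`."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]

# Real-time factor implied by the quoted benchmark: 15 s of audio in ~1 s of compute.
AUDIO_SECONDS, GEN_SECONDS = 15.0, 1.0
rtf = AUDIO_SECONDS / GEN_SECONDS  # ~15x faster than real time on an RTX 5080

texts = [f"utterance {n}" for n in range(40)]
groups = list(batches(texts))
print(len(groups), [len(g) for g in groups], rtf)  # 3 [16, 16, 8] 15.0
```

Actual batch throughput will depend on text length and hardware; treat the 15x figure as the single-stream number from the post, not a guarantee under batching.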
It's Apache 2.0 licensed, so fork away.
Check the audio comparisons on the page – it holds up well against ElevenLabs or Cartesia.
Repo: https://github.com/nineninesix-ai/kani-tts
Model: https://huggingface.co/nineninesix/kani-tts-450m-0.1-pt
Page: https://www.nineninesix.ai/n/kani-tts
Feedback welcome – what's your go-to TTS setup?

homarp•1h ago
There are lots of TTS models out there; what I care about most is how easily I can actually use one – for example, can I clone my voice?

ulan_kg•1h ago
Most TTS models are either small and too robotic, or big and slow. We try to hit the sweet spot: near-human quality (MOS 4.3/5) while running fast on consumer GPUs like an RTX 5080.
For voice cloning, it's better to fine-tune the base model if you're chasing top quality. What's your use case?