But yeah, if it's like any of the others we'll likely see a different "model" per language down the line based on the same techniques
[0] https://old.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...
It sounds only OK, but that's impressive for the size.
https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
(seems reverted now)
If anyone else wants to try:
> Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. Our smallest model is less than 25 megabytes.
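For the curious, the quick start is just a pip install and a few lines of Python. This sketch follows the repo README at the time of writing; the wheel URL, model ID, voice name, and 24 kHz output rate are all taken from there and may have changed since:

    # KittenTTS quick start, per the repo README (details may have changed since).
    # pip install https://github.com/KittenML/KittenTTS/releases/download/0.1/kittentts-0.1.0-py3-none-any.whl
    from kittentts import KittenTTS
    import soundfile as sf

    m = KittenTTS("KittenML/kitten-tts-nano-0.1")
    audio = m.generate("This high quality TTS model works without a GPU",
                       voice="expr-voice-2-f")
    sf.write("output.wav", audio, 24000)  # README lists 24 kHz output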
Doesn't seem to work with Thai.
This is an ouroboros that will continue.
(Not saying this is or isn't, simply that these claims are rampant on a huge number of posts and seem to be growing.)
Because, well, there's a huge number of models. Are they all, as they say, "in cahoots"? (working together, clandestinely)
This is a good list: https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing
It's one thing to observe "LLM-generated writing all looks the same". Whether the LLMs were all post-trained the same way is a different question.
I don't agree that "everyone says everything is AI". Do you have examples where a consensus of people is accusing something of being AI-generated even though it lacks those indicators?
It’s not slop — it’s inspiration!
The problem I see this leading to is plenty of legitimate writing getting thrown away because somebody's online exposure bubbles don't happen to include Medium or Tumblr or a certain Discord or whatever bubble where _huge_ groups of people actually do write in whatever $STYLE the reader and commenter is identifying as AI. And then, because of their post, other people won't even look.
It seems like a disaster, frankly.
No human comments on meta formatting like that outside the deepest trenches of Apple/FB corporate stuff.
Is that tested and proven or just gut feeling?
M1-Mac-mini ~ % ls -lah /usr/bin/say
-rwxr-xr-x 1 root wheel 193K 15 Nov 2024 /usr/bin/say
M1-Mac-mini ~ % say "hello world this is the kitten TTS model speaking"
That being said, the ‘classical’ (pre-AI) speech synthesisers are much smaller than Kitten, so you’re not wrong per se, just right for the wrong reason.
https://project64.c64.org/Software/SAM10.TXT
Obviously it's not fair to compare these with ML models.
Running `man say` reveals that "this tool uses the Speech Synthesis manager", so I'm guessing the Apple Intelligence stuff is kicking in.
For STT, Whisper is really amazing, but I'm missing a good TTS, and I don't mind throwing GPU power at it. Anyway, this isn't it either; it sounds worse than Kokoro.
This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.
Back in the pre-Tacotron2 days, I was running slim TTS and vocoder models like GlowTTS and MelGAN on DigitalOcean droplets. No GPU to speak of. It cost next to nothing to run.
Since then, the trend has been to scale up. We need more models to scale down.
In the future we'll see small models living on-device. Embedded within toys and tools that don't need or want a network connection. Deployed with Raspberry Pi.
Edge AI will be huge for robotics, toys and consumer products, and gaming (i.e. world models).
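For a sense of how little that older GlowTTS/MelGAN-era stack needs, here's a minimal CPU-only sketch using Coqui TTS; the LJSpeech checkpoint name is an assumption based on Coqui's published model zoo, not necessarily the exact setup described above:

    # CPU-only GlowTTS synthesis via Coqui TTS (pip install TTS).
    # The checkpoint downloads once, then runs fully offline; Coqui pulls a
    # matching vocoder for it automatically.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/en/ljspeech/glow-tts", gpu=False)
    tts.tts_to_file(text="Small models run fine without a GPU.",
                    file_path="out.wav")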
Aside: are there any models for voice-to-text (speech recognition) that run fully offline, without training?
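Whisper itself fits that bill: the weights download once and inference then runs fully offline, no training needed. A minimal sketch with the openai-whisper package (the "base" size and the recording.wav input are assumptions):

    # Offline speech-to-text with openai-whisper (pip install -U openai-whisper).
    # Weights download on first use; after that everything runs locally, CPU included.
    import whisper

    model = whisper.load_model("base")          # any of tiny/base/small/medium/large
    result = model.transcribe("recording.wav")  # assumed local audio file
    print(result["text"])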
I will be very impressed when we're able to have a conversation with an AI at a natural pace, and not "prompt, pause, response".
While I think this is indeed impressive and has a specific use case (e.g. in the embedded sector), I'm not totally convinced that the quality is good enough to replace bigger models.
With fish-speech[1] and F5-TTS[2] there are at least two open-source models pushing the quality limits of offline text-to-speech. I tested F5-TTS on an old Nvidia 1660 (6GB VRAM) and it worked OK-ish, so running it on slightly more modern hardware will not cost you a fortune and will produce MUCH higher quality, with multi-language and zero-shot support.
For Android there is SherpaTTS[3], which plays pretty well with most TTS Applications.
1: https://github.com/fishaudio/fish-speech
2: https://github.com/SWivid/F5-TTS
Here is the link to our repo: https://github.com/KittenML/KittenTTS
We would appreciate a star!
Thanks
It would be great if the training data were released too!
https://github.com/KittenML/KittenTTS
This is the model and GitHub page; the blog post looks very much AI-generated.