Show HN: I trained a 9M speech model to fix my Mandarin tones

https://simedw.com/2026/01/31/ear-pronunication-via-ctc/

150•simedw•4h ago

Built this because tones are killing my spoken Mandarin and I can't reliably hear my own mistakes.

It's a 9M Conformer-CTC model trained on ~300h (AISHELL + Primewords), quantized to INT8 (11 MB), runs 100% in-browser via ONNX Runtime Web.

Grades per-syllable pronunciation + tones with Viterbi forced alignment.

Try it here: https://simedw.com/projects/ear/
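For readers wondering what "Viterbi forced alignment" over CTC output looks like, here is a minimal sketch, not the code behind the demo. It assumes the model emits per-frame log-probabilities over a vocabulary with a blank at index 0, and that the expected syllables are already mapped to token ids. The function name `ctc_forced_align` and the mean-log-probability score are illustrative assumptions, not the author's grading method.

```python
import math


def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi-align `targets` to frames of CTC output.

    log_probs: list of T frames, each a list of log-probabilities per token id.
    targets:   non-empty list of expected token ids (hypothetical syllable ids).
    Returns one (first_frame, last_frame, mean_log_prob) tuple per target token.
    """
    if not targets:
        raise ValueError("targets must not be empty")
    T = len(log_probs)
    # Extended state sequence: blank, t1, blank, t2, ..., blank
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    S = len(ext)
    dp = [[-math.inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A path may start in the leading blank or in the first token.
    dp[0][0] = log_probs[0][ext[0]]
    dp[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            best, prev = dp[t - 1][s], s  # stay in the same state
            if s >= 1 and dp[t - 1][s - 1] > best:
                best, prev = dp[t - 1][s - 1], s - 1  # advance by one state
            # Skipping a blank is only allowed between two different tokens.
            if (s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]
                    and dp[t - 1][s - 2] > best):
                best, prev = dp[t - 1][s - 2], s - 2
            dp[t][s] = best + log_probs[t][ext[s]]
            back[t][s] = prev
    # A valid path ends in the last token or in the trailing blank.
    s = S - 1 if dp[T - 1][S - 1] >= dp[T - 1][S - 2] else S - 2
    if dp[T - 1][s] == -math.inf:
        raise ValueError("not enough frames to align all targets")
    path = [s]
    for t in range(T - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()
    spans = []
    for i, tok in enumerate(targets):
        # Target i lives at extended state 2*i + 1.
        frames = [t for t, state in enumerate(path) if state == 2 * i + 1]
        score = sum(log_probs[t][tok] for t in frames) / len(frames)
        spans.append((frames[0], frames[-1], score))
    return spans


# Toy posteriors: 5 frames, vocabulary {0: blank, 1: syllable A, 2: syllable B}.
toy = [[0.8, 0.1, 0.1],
       [0.1, 0.8, 0.1],
       [0.1, 0.8, 0.1],
       [0.1, 0.1, 0.8],
       [0.8, 0.1, 0.1]]
toy_log = [[math.log(p) for p in frame] for frame in toy]
for start, end, score in ctc_forced_align(toy_log, [1, 2]):
    print(start, end, round(score, 3))
```

A low per-syllable score would flag a syllable whose frames the model did not believe matched the expected token; how the real tool turns that into a grade is not stated in the post.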

Comments

jellojello•4h ago

This is amazing. If you feel like opening up an entire language to being learned more easily: Farsi is a VERY overlooked language. My wife and her family speak it, but it's so difficult to find great language lessons (it's also called Persian/Dari).

simedw•3h ago

Thank you.

I had a quick look at Farsi datasets, and there seem to be a few options. That said, written Farsi doesn’t include short vowels… so can you derive pronunciation from the text using rules?

kranner•3h ago

> written Farsi doesn’t include short vowels… so can you derive pronunciation from the text using rules?

You can't, but Farsi dictionaries list the missing short vowels/diacritics/"eraab" for every word.

For instance, see this entry: https://vajehyab.com/dehkhoda/%D8%AD%D8%B3%D8%A7%D8%A8?q=%D8...

With the short vowel on the first letter it would be written حِساب (normally written as just حساب)

The dictionary entry linked shows that there is a ِ on the first letter ح

But you would have to disambiguate between homographs that differ only in the eraab.
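A rough sketch of that dictionary-lookup idea, assuming you already have a list of fully vocalized headwords. The tiny lexicon here holds only the حِساب example, and the helper names `strip_diacritics` and `build_index` are made up for illustration.

```python
# Arabic-script marks that are normally omitted in written Farsi:
# U+064B..U+0652 covers tanwin, fatha, damma, kasra, shadda and sukun.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(word):
    """Return the word as it is normally written, without short-vowel marks."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def build_index(vocalized_entries):
    """Map each unvocalized spelling to every vocalized entry that produces it."""
    index = {}
    for entry in vocalized_entries:
        index.setdefault(strip_diacritics(entry), []).append(entry)
    return index


# Toy lexicon: only the example above, with kasra (U+0650) on the first letter.
lexicon = ["\u062D\u0650\u0633\u0627\u0628"]
index = build_index(lexicon)

plain = "\u062D\u0633\u0627\u0628"  # the usual spelling without the kasra
candidates = index.get(plain, [])
print(len(candidates), candidates)
```

When a spelling maps to more than one candidate, that is exactly the homograph case: the lookup alone cannot pick one, and you need context to decide.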

vunderba•4h ago

When I was living in Taiwan, one of the ways I forced myself to remember to pronounce the tones distinctly was by waving my hand in front of me, tracing the arc of each character’s tone.

It helped a lot even if I did look like an insane expat conducting an invisible orchestra.

One more thing: there's quite a bit of variation in how regional accents in the mainland can affect tonal pronunciation. It might be worth reaching out to some native speakers to give you some baseline figures.

simedw•3h ago

For accents, I’ve mostly tested with a few friends so far. I’m wondering whether region should be a parameter, because training on all dialects might make the system too lax.

zdragnar•3h ago

In a university Mandarin class, one of the adult students (i.e. probably 40 or so) WAY over exaggerated his tones, to the point that the little old lady teaching us laughed out loud after one of his answers.

A few years later, he had the most clean and consistent pronunciation out of anyone I'd been in a class with, and easily switched between the Beijing and other accents depending on which teacher we had on any given day.

I rather regret not emulating him, even though I haven't really used it for nearly 20 years and have forgotten most of it.

ecshafer•3h ago

From a language learning standpoint that does make sense. Over-exaggeration while you are learning helps cement the idea, and then when you are speaking more naturally you will fall back into a regular kind of tone.

luckydata•2h ago

that's EXACTLY how I taught myself to speak with a Spanish accent from Madrid. I repeated the way TV celebrities and the speakers on the metro announced the stations, and it gave me a base for how to use my mouth and throat appropriately. After a while I was able to tone it down and my accent got so good that locals couldn't tell I wasn't Spanish - I had this cool party trick pulling out my id and showing them I was truly a foreigner!

devin•3h ago

This sounds like how solfège training works. You use a hand signal to indicate a specific tone: do re mi fa so la ti

cyberax•2h ago

Hand motions help! Especially when you want to memorize new words, because initially you need to treat tone as something additional to remember.

I used simple index finger motions to mark tones.

sowbug•57m ago

You'll love Mike Laoshi: https://youtu.be/cna89A2KAU4?si=SQEZ_0ooO1z119_k

rahimnathwani•3h ago

This is incredible. When I was first learning Chinese (casually, ~20 years ago), my teacher used some Windows software that drew a diagram of the shape of my pronunciation, so she could illustrate what I was getting wrong in some objective way.

The thing you've built is so good, and I would have loved to have it when I was learning Mandarin.

I tried it with a couple of sentences and it did a good job of identifying which tones were off.

drekipus•3h ago

instantly awesome.

I suck at Chinese but I want to get better and I'm too embarrassed to try and talk with real people and practise.

This is a great compromise. even just practising for a few minutes I already feel way more confident based on its feedback, and I feel like I know more about the details of pronunciation.

I'm worried this might get too big and start sucking like everything else.

btrlsnqtn•3h ago

The article mentions the bitter lesson. I'm confused about the status of Sutton's opinion of the bitter lesson. On the one hand, he invented the concept. On the other hand, he appears to be saying that LLMs are not the correct approach to artificial intelligence, which to a naive outsider looks like a contradiction. What gives?

affogarty•3h ago

This is extremely cool, although I asked my wife (who is Chinese) to try it out and it said she made some mistakes.

hawflakes•26m ago

I tried it out and it has some issues with my native speech. I grew up with more Taiwan Mandarin but I know the Beijing standard and the recognizer was flagging some of my utterances incorrectly.

dapangzi•3h ago

Longtime lurker, made an account specifically to give feedback here as an intermediate speaker. :)

This is a great initiative and I hope to see more come out of this; I am not criticizing, but just want to provide my user experience here so you have data points.

In short, my experience lines up with your native speakers.

I found that it loses track of the phonemes when speaking quickly, and tones don't seem to line up when speaking at normal conversational speed.

For example, if I say 他是我的朋友 at normal conversational speed, it will assign `de` to 我, and sometimes it decides I didn't produce the retroflex in `shi` and renders it `si`. I listened back to make sure I said everything: the phonemes are there in the recording, but the UI displays the wrong phonemes and tones.

By contrast, if I speak slowly and really push each tone, the phonemes and tones all register correctly.

Also, is this taking into account tone transformation? Example, third tones (bottom out tone) tend to smoosh into a second tone (rising) when multiple third tones are spoken in a row. Sometimes the first tone influences the next tone slightly, etc.

Again, great initiative, but I think it needs a way to deal with speech that is conversationally spoken and maybe even slurred a bit due to the nature of conversational level speech.
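To make the third-tone case concrete, here is a minimal sketch of applying the basic sandhi rule to the expected tones before grading. It is a simplification: in a run of consecutive third tones it turns all but the last into second tones, whereas real speech groups longer runs by phrasing. Whether the tool does anything like this is unknown, and `apply_third_tone_sandhi` is a made-up name.

```python
def apply_third_tone_sandhi(tones):
    """Return the tones as typically spoken, given dictionary tones.

    tones: list of ints, 1-4 for the four tones and 5 for the neutral tone.
    Rule applied: a third tone directly followed by another third tone
    is pronounced as a second (rising) tone.
    """
    spoken = list(tones)
    for i in range(len(tones) - 1):
        # Look at the dictionary tones so a whole run is handled uniformly.
        if tones[i] == 3 and tones[i + 1] == 3:
            spoken[i] = 2
    return spoken


# 你好: dictionary ni3 hao3 is spoken as ni2 hao3.
print(apply_third_tone_sandhi([3, 3]))  # [2, 3]
```

A grader could then accept either the dictionary tone or the sandhi tone for the affected syllables instead of flagging the rising tone as a mistake.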

tifan•1h ago

I had the same issue! Perhaps being another dapangzi is the problem here lol

et-al•29m ago

I'm not familiar with this slang: what's a big plate?

dirteater_•10m ago

the commenter's username (i'm guessing they mean 大胖子, feel free to google translate)

sqs•1h ago

I don't think it takes care of tone transformation (eg 你是 ni3shi4 -> ni2shi4). Or if it does, my tones are just off. But it's a really cool idea!

mercanlIl•1h ago

The tool definitely needs to address tone transformations, it’s a big part of how the language is spoken. Otherwise it’s mostly useful for a first year student speaking in isolation.

Hoping to see improvements in this area

ecshafer•3h ago

For anyone who is a native speaker of a European language and hasn't tried to learn Chinese or some other tonal language, it's really hard to understand how hard it is. The tones can be very subtle, and your ear is not fine-tuned to them. So you think you are saying it right, but native speakers have no idea what you are saying.

cyberax•3h ago

I'm a native Russian speaker, and I decided to learn Mandarin, because it's linguistically almost the opposite of Russian.

I had no problems with tone pronunciation, but tone recognition was indeed much trickier. I still often get lost when listening to fast speech although I can follow formal speech (news) usually without problems.

dionian•2h ago

it's critical because without proper tonal enunciation the words can be ambiguous.

laurieg•2h ago

For someone who hasn't grown up speaking a language with tones or pitches, the process of learning them can be maddening. I applaud anyone who makes tools like this to try to make the process easier.

My experience in learning Japanese pitch accent was eye-opening. At the start, I couldn't hear any difference. On quizzes I essentially scored the same as random guessing.

The first thing that helped me a lot was noticing how there were things in my native language (English) that used pitch information. For example, "uh-oh" has a high-low pitch. If you say it wrong it sounds very strange. "Uh-huh" to show understanding goes low-high. Again, if you reverse it it sounds unusual.

The next part was just doing lots of practice with minimal pairs. Each time I would listen and try my best to work out where the pitch changed. This took quite a lot of time. I feel like massed practice (many hours in a day) helped me more than trying to do 10 minutes regularly. Try to hear them correctly, but don't try too hard. I didn't have any luck with trying harder to 'understand' what was going on. I liken it to trying to learn to see a new color. There isn't much conscious thought.

The final piece of the puzzle was learning phrases, not individual words, that had pitch changes. For example: "yudetamago" could be boiled egg or boiled grandchildren. Somehow my brain just had a much easier time latching on to multi-word phrases instead of single words. Listening to kaki (persimmon) vs kaki (oyster) again and again seemed much harder.

Of course, your mileage may vary with these techniques. I already spoke decent Japanese when I started doing this.

danparsonson•1h ago

Wholeheartedly (or maybe downheartedly?) agree with this - sometimes I try to say the simplest things and people just stare at me like I'm speaking Martian. Which I suppose I might as well be! One of my big problems is implicit use of tones for things like expressing uncertainty; that's a very difficult habit to get out of.

bunderbunder•41m ago

Another one that I wish I had realized sooner is that, contrary to the impression teachers tend to convey, tones aren’t just a pitch contour thing. There are also intensity and cadence elements. Native speakers can fairly accurately recognize tones in recordings that have had all the pitch contour autotuned out.

vjvjvjvjghv•1h ago

Agree. It’s really hard. It also explains why a lot of people born in China tend to make serious pronunciation errors when speaking English or German. They are used to focusing on different things than us westerners.

It took me a very long time to really understand how important tone is in Chinese.

bytesandbits•3h ago

great work! I am going to try it out. Currently about to learn some Mandarin to be able to talk with hawker stand owners for a trip I am doing soon. I am trilingual and can speak a few languages on top of that, but none of them tonal. I am new to tonal languages and I find myself struggling with this... a lot!

anonzzzies•3h ago

good luck! I speak 6 languages fluently but none of them tonal and I find Mandarin very challenging; it does not help that people in places where you might need it are not very forgiving; asking for green fork in a tea shop has people very bewildered.

cmuguythrow•2h ago

Awesome idea!

nirvanatikku•2h ago

talk about 30 seconds to wow. great app, UX and demo. would love to use this. kudos.

jrockway•2h ago

Interesting application! A friend of mine built a model like this to help her make her voice more feminine, and it is neat to see a similar use case here.

dionian•2h ago

it heard wu2 but i heard wo2 from you fine. and it should sound like wo2 not wo3 if spoken quickly. not a native speaker though so i could be wrong

byb•2h ago

Neat. A personal tone trainer. Seriously, shut up and take my money now. Of course, it needs a vocabulary trainer, and zhuyin/traditional character support.

SequoiaHope•2h ago

Amazingly I just did the same thing! Only with AISHELL. It needs work. I used the encoder from the Meta MMS model.

https://github.com/sequoia-hope/mandarin-practice

ChadNauseam•2h ago

This is amazing. I'm also working on free language learning tech. (I have some SOTA NLP models on huggingface and a free app.) My most recent research is a list of every phrase [0].

Pronunciation correction is an insanely underdeveloped field. Hit me up via email/twitter/discord (my bio) if you're interested in collabing.

[0]: https://gist.github.com/anchpop/acbfb6599ce8c273cc89c7d1bb36...

stuxnet79•2h ago

How difficult would it be to adapt this to Cantonese? It is a surprisingly difficult language to learn. It has more tones than Mandarin plus comparatively less access to learning resources (in my experience)

baby•1h ago

For people trying to say the "j" sound correctly, as in "jiu" (old), just say "dz", so in that example "dziu"

iamanllm•1h ago

holy crap, I was literally imagining how I wanted something exactly like this yesterday! you are a hero!

tifan•1h ago

Well, it would work only when I speak word by word, not as a sentence or at a normal speed for daily conversations. The model thinks I am making mistakes when I speak casually (as a native Chinese speaker, I have Mandarin 2A certification, which is required for teachers or other occupations that require a very high degree of Mandarin accuracy). You wouldn’t really notice it, but pronunciation is very different between casual and formal speech…

rablackburn•1h ago

> And if there’s one thing we’ve learned over the last decade, it’s the bitter lesson: when you have enough data and compute, learned representations usually beat carefully hand-tuned systems.

There are still holdouts!

Come back to me in a couple of decades when the trove of humanity's data has been pored over and drifted further out of sync with (verifiable) reality.

Hand-tuning is the only way to make progress when you've hit a domain's limits. Go deep and have fun.

memalign•1h ago

I wish this had a pinyin mode…! I am learning to speak Mandarin but I am not learning to read/write.

( I’m learning using a flashcards web app I made and continue to update with vocab I encounter or need: https://memalign.github.io/m/mandarin/cards/index.html )

data_ders•1h ago

same! but if you get it inevitably wrong the first time it gives you the pinyin. but i struggled to get it to transcribe the consonants I was making let alone the tones. i'm pretty sure i'm not as bad as that!

bunderbunder•57m ago

This is very cool, but from one Mandarin learner to another I’d caution against relying too heavily on any external feedback mechanism for improving your pronunciation.

If you can’t easily hear your pronunciation mistakes so clearly it hurts, consider putting more energy into training your ear. Adult language learners usually have brains that have become resistant to, but not incapable of, changing the parts of the brain responsible for phoneme recognition. The neuroplasticity is still there but it needs some nudging with focused exercises that make it clear to your brain exactly what the problem is. Minimal pair recognition drills, for example, are a great place to start.

It’s not the most fun task, but it’s worth it. You will tighten the pronunciation practice feedback loop much more than is possible with external feedback, so a better accent is the most obvious benefit. But beyond that, it will make a night and day difference for your listening comprehension. And that will get you access to more interesting learning materials sooner. Which hopefully increases your enjoyment and hence your time on task. Plus, more accurate and automatic phoneme recognition leaves more neurological resources free for processing other aspects of your input materials. So it may even help speed things like vocabulary and grammar acquisition.

zdc1•19m ago

I completely agree with this. There's a certain confidence you get when you can hear a word you don't know, but can still comprehend it well enough to know what pinyin to type into your dictionary app. Mandarin Blueprint has a nice pinyin pronunciation video on YouTube that I worked through a while ago; after following that with a few weeks of immersion in Taiwan, I was able to really pick out what people were saying.

I feel like listening is the key to speaking. You don't necessarily need to rote learn the tones for each word. You just need to say words as you hear them spoken by others.

cocoa19•42m ago

Have you tried the Azure Speech Studio? I wonder how your custom model compares to this solution.

I played around with python scripts for the same purpose. The AI gives feedback that can be transformed to a percentage of correctness. One annoyance is that for Mandarin, the percentage is calculated at the character level, whereas with English, it gives you a more granular score at the phoneme level.

frozennothing•34m ago

This is really cool. Thank you for sharing. Before now I had not sought to understand how this technology works under the hood, but seeing it done at this scale made me curious to see if I could do something similar.
