High-fidelity simultaneous speech-to-speech translation

https://arxiv.org/abs/2502.03382

115•Bluestein•7mo ago

Comments

benlivengood•7mo ago

Now to get the model to run in an earbud...

lapink•7mo ago

The model can actually run on an iPhone 16 Pro, so if the earbud is connected to one that could work!

Bluestein•7mo ago

That would be insane.-

Thinking of it, the whole "stack" from earbuds to phone to cloud - even in just something so "commonplace" as Assistant or Alexa ...

... Is amazing: All that computing power at our disposal.-

gcanyon•7mo ago

Almost as good as a babel fish!

wedn3sday•7mo ago

For anyone else looking for examples: https://huggingface.co/spaces/kyutai/hibiki-samples

littlestymaar•7mo ago

The high fidelity examples (see CFG-10 in the page) where the translated version has a very heavy French accent are kind of impressive (not that it is really useful, but impressive indeed).

AIorNot•7mo ago

This is amazing, I'd love to play with this. What about other languages besides French to English?

lapink•7mo ago

Adding more languages is definitely planned! This was Tom's (the first author's) master’s internship project with Kyutai, and it was easier to prototype the idea with a single pair. Also he will be presenting this work at ICML in two weeks if anyone is around and wants to learn more.

iambateman•7mo ago

This is why I wonder about the value of language learning for reasons other than “I’m really passionate about it.”

We are so close to interfaces that reduce the language barrier by a lot…

rafale•7mo ago

What about brain development and general intelligence? Knowledge will always have a value, or else we become slaves to the machine.

numpad0•7mo ago

Well, if you take a look at the Multistream Visualization examples provided in the demo page, it's just the same as existing human-provided interpretation solutions at best. Constant 3-5s delays, random pauses, and likely lots of omissions here and there to absorb differences in sentence structures. I'd argue this only nullified another one of the excuses to not learn a language.

nisa•7mo ago

It's not personal, but I can't help thinking that's such a sad post here. Reducing learning a different culture through language to plugging in an earbud. If the battery is gone or your phone is stolen, you realize you can't automate anything and that you've learned nothing. It's not about the tech: if it works it's amazing, it's like the babel fish. But it's so shallow to assume everything has some direct and simple "value" that can be replaced by some machine or, even better, some paid service. It's so common here. Is this a US thing?

CamperBob2•7mo ago

It's a much older theme, going all the way back to the Biblical legend of the Tower of Babel (hence the name of the fish.) Like most of that material, the Babel myth was probably stolen from the Babylonians or even older cultures.

The powers that be -- whether gods or governors -- tend to feel threatened when people can communicate freely with each other. Don't join their side.

nisa•7mo ago

I think you misunderstood my post. It's wonderful technology and a great aid. I just wanted to say there is so much more to learning a foreign language (and culture) than machine translation, even if almost perfect. At least that was my takeaway from learning Czech as a German. Lots of subtle details.

CamperBob2•7mo ago

No, I was making a larger point: there shouldn't be any such thing as a "foreign language." We're all members of the same species. (Yes, even Americans.) Technology like this is what will realize that ideal.

If cultures around the world had all grown up alongside each other, speaking the same language, and someone came along and said, "That's no good, every nation and every ethnic group should speak a different language," we wouldn't rush to embrace that point of view, would we? Who would benefit from such a policy? Certainly not you and me.

boplicity•7mo ago

Change is hard, but diversity is good, and certainly better than monoculture (of language).

CamperBob2•7mo ago

What's the point of diversity if people can't communicate with each other, or if only educated elites within each subculture can do so? Diversity should bring different people together, not divide them artificially.

boplicity•7mo ago

The irony here, is that diversity is actually extremely aligned with conservative values: freedom of expression and the ability to do what one wants (without regard to others).

Freedom of diversity allows for the flourishing of unique ideas and perspectives, which in turn, has many benefits, in terms of the creation of new value in unexpected ways. Diversity, in a sense, can be a synonym for independence and freedom.

nisa•7mo ago

Ah I see. I disagree because it's impossible. Even the next village or town has a different language even if it's subtle. I'm more for embracing the differences.

On the other hand we are probably almost there - it's English and social media is the global teacher.

caymanjim•7mo ago

I think it'll greatly increase cultural learning, by increasing the opportunity to interact with people. I've traveled to a lot of countries, and never learned more than a handful of words in each, primarily related to basic service interactions. I enjoyed talking to locals when they spoke English. I couldn't interact in any meaningful way with the vast majority of people, though.

Learning languages is great. If you can become fluent in two that's impressive. Even simple conversational ability in a few languages is impressive. But it's a big world.

nisa•7mo ago

Thanks. Wonderful take and optimistic. You are correct I think.

nottorp•7mo ago

He's not, because those locals will stop being able to speak English in a few generations. Either you'll have battery and signal or you'll point at things and make monkey noises.

caymanjim•7mo ago

What do we care what happens in a few generations? We'll all be dead, and the people alive will probably have universal translators implanted in their brains at birth. We absolutely won't need a "signal" to translate on a device anymore (that'll happen in just a few years, forget about generations), and there won't be anyplace on the entire planet that doesn't have network connectivity (that will also happen in just a few years; it's already reality with Starlink cellular).

iambateman•7mo ago

I think you’re reading a sense of cultural reductionism in my comment that I didn’t intend.

There’s more to learning a culture than the language. And having a real-time translator makes it possible to enjoy a huge range of cultures much more directly than before. The fact is, I’m not going to learn Chinese and Swahili and Japanese. So my choices are to go through a human translator or nothing if I want to talk to those people.

How is it sad that a technology is going to allow me to directly talk to a huge number of people that I never could have before?

ViscountPenguin•7mo ago

I don't know if you're multilingual, but some concepts are just legitimately easier to express in some languages; and the different grammatical structures that languages have can be useful for emphasising certain things, or to express subtle relationships between concepts.

I'm not a particularly fluent speaker of Japanese and Russian, but I still find it helpful to drop into them sometimes when speaking with someone who understands them.

Escapado•7mo ago

I have to second this. I study Japanese myself and the entire way the Japanese communicate is reflected so deeply in the language. There is so, so much nuance to pretty much every sentence they speak, and there are certain grammar points that carry more meaning in three syllables than what can be expressed in English or German in a full sentence. And in turn this way of communicating shapes their culture too, I believe. If I were to translate a German conversation into Japanese, even if I did so idiomatically, it would most likely come off as a rude exchange, because of all the unapologetic directness in the source language.

coderatlarge•7mo ago

I’ve tried to learn Mandarin and failed because of lack of memory and practice. mostly i’m shocked at how ambiguous it appears to an english-trained mind - you have to fill in a lot of fine article/pronoun detail from custom and common understanding. which is why i think a lot of automatic translations are poor.

GaggiX•7mo ago

So many nuances are lost in translation. I also can't imagine speaking English with actual people through a machine instead of speaking it directly.

noiv•7mo ago

Well, the value is obvious for romantically involved people not sharing a language, when the batteries run empty. :)

cs702•7mo ago

Nice. I'm impressed.

Translator jobs are going to go poof! overnight.

Just sayin'.

mschuster91•7mo ago

As long as youtube keeps translating "ham" to "Schinken" no matter the context, translators will have jobs.

desultir•7mo ago

Translators sure, interpreters no.

Interpreters also have to factor in cultural context and customs, ensuring that meaning is conveyed without offence being given in formal contexts.

esafak•7mo ago

I don't see why software couldn't do that, if you give them the context.

yorwba•7mo ago

The end-user is unlikely to know which part of the context is relevant, and it may also change from moment to moment depending on who is speaking to whom. Of course you could imagine an AI interpreter that has cameras for situational awareness and asks for clarification if anything important is unclear while smoothing over minor stuff without interrupting, but you could equally easily imagine an AGI, so it's not clear that this could be built to a reasonable quality standard with current technology.

cortesoft•7mo ago

That seems like something LLMs could eventually get good at

nottorp•7mo ago

They'll just push everyone to use corporate wooden language and then they won't have to worry about tone and implied meanings :)

gagabity•7mo ago

Yandex Browser has been doing this for Russian for a while, if you go to YT it offers to translate to Russian, it does multiple speakers and voices from what I remember. Not sure if all the technicalities are the same.

Grosvenor•7mo ago

This is so cool. The future is cool!

I wonder how it will work on languages that have different grammatical structure than french/english? Like Finno-Ugric languages which have sort of a Yoda speech to them. Edit: In Finno-Ugric languages words later on in a sentence can completely change the meaning. Will be interesting to look at.

It's considerate of them to name it after my favourite whisky.

lapink•7mo ago

The alignment between source and target is automatically inferred, basically by searching when the uncertainty over a given output word reduces the most once enough input words are seen. This is then lifted to the audio domain. In theory the same trick should work even with longer grammatical inversions between languages, although this will lead to larger delays. To be tested!
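A rough sketch of the idea described above, in case it helps: for each output word, look at how its log-probability changes as more input words become visible, and align it to the input position where the gain is largest. Everything here is an assumption made for illustration: the function name `infer_alignment`, the layout of the `logp` table, the toy numbers, and the monotonicity step are not taken from the paper or the repo, and a real system would get the log-probabilities from a text translation model.

```python
def infer_alignment(logp):
    """Align each target word to a source position.

    logp[j][i] is the (hypothetical) log-probability of target word j
    when only the first i source words are visible, for i = 0..n.
    Returns, for each target word, the number of source words that
    must be seen before emitting it.
    """
    alignment = []
    latest = 0
    for row in logp:
        # Gain in log-probability from revealing source word i (1-based).
        gains = [row[i] - row[i - 1] for i in range(1, len(row))]
        # Source position whose arrival reduces uncertainty the most.
        best = 1 + max(range(len(gains)), key=gains.__getitem__)
        # Assumption: keep the alignment monotonic, since an output word
        # cannot be emitted before the words that precede it.
        latest = max(latest, best)
        alignment.append(latest)
    return alignment


# Toy table for a 4-word source and a 3-word target (made-up numbers).
toy_logp = [
    [-5.0, -0.5, -0.4, -0.4, -0.3],  # certain after source word 1
    [-6.0, -5.5, -2.0, -1.8, -0.2],  # biggest gain at source word 2
    [-7.0, -6.8, -6.5, -0.5, -0.3],  # biggest gain at source word 3
]
print(infer_alignment(toy_logp))
```

With a long grammatical inversion, the largest gain for an early output word would only show up late in the input, which is where the larger delays mentioned above would come from.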

nine_k•7mo ago

If Finnish is not widely known, German is more familiar, and there you can put the "nicht" at the very end of a sentence, reversing its meaning. Also, the verb may come close to the end, after an extended description of the subject / object; in English, you want the verb early.

Human translators somehow handle that; machines would likely exhibit a similar delay.

mananaysiempre•7mo ago

Vaguely related anecdote: have you ever dictated a number to a French speaker? When you say “forty-two” or “seventy-six”, an English speaker will start writing the 4 or the 7 the moment they hear the “forty” or the “seventy”. The French speaker will also write the 4 the moment they hear the “quarante” in “quarante-deux” (40+2), but when you say “soixante-seize” (60+16), they will (without thinking about it!) only start writing 76 at the end of the whole thing, because after only hearing the “soixante” they can’t tell if they’ll need to write a 6 or a 7.

dgan•7mo ago

Belgians have figured this out correctly

amy214•7mo ago

>If Finnish is not widely known, German is more familiar, and there you can put the "nicht" at the very end of a sentence,

I've never heard of an english speaker doing that.. .. .. NOT!

yalok•7mo ago

even in regular languages with similar structure, sometimes the ending of a sentence forces you to change how you would say the whole sentence. Human synchronous translators usually correct themselves in such cases, which is a trade-off of having better latency in most cases, at the cost of having to correct yourself once in a while.

jauntywundrkind•7mo ago

Link to repo: https://github.com/kyutai-labs/hibiki

totetsu•7mo ago

All these Japanese project names and no Japanese support (ToT)

woodson•7mo ago

Check out this model based on the same architecture for Japanese: https://github.com/nu-dialogue/j-moshi

usui•7mo ago

I wonder why it's so popular to use Japanese words for random software projects. Bonus points if the project's application of the loanword is off-target from the word's usual meaning/usage, or if it's completely unrelated to the project.

notphilipmoran•7mo ago

It will be interesting to see if it runs into issues with the syntax of sentences. What I am thinking of is specifically between Spanish and English, where sentence structures often look completely different. How will this real-time interpretation be affected?

jdkee•7mo ago

They just open sourced their newest TTS today.

https://x.com/kyutai_labs/status/1940767331921416302

wenc•7mo ago

Wow, that's impressive! It even has a "sarcastic" voice which drips with sarcasm.

clueless•7mo ago

"Hibiki currently only supports French-to-English translation."

almaight•7mo ago

https://fanyi.caiyunapp.com/

lukax•7mo ago

Soniox also supports real-time speech-to-text translation with 60 languages. You can hook that to a TTS and you have Speech-to-Speech translation. That failed Google I/O real-time translation demo? With Soniox it just works.

You can try it out here (select translation instead of transcription) https://soniox.com/

Disclaimer: I work at Soniox.

jhurliman•7mo ago

I didn’t see any mention of running a model locally on device like the Hibiki abstract states. Is this available?

nottorp•7mo ago

Is this deterministic or random like a LLM?

l-m-z•7mo ago

Hibiki is an auto-regressive model with temperature-based sampling, so very similar to an LLM: generations are "random", and you can make them deterministic by fixing the RNG seed.
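For anyone unfamiliar with what that means in practice, here is a minimal sketch of temperature sampling with a seeded RNG, using only the Python standard library. It is not Hibiki's code: `sample_token`, `generate`, and the toy logits are hypothetical, and a real model would compute each step's logits from the tokens generated so far.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Sample one token index from logits softened by a temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    # Unnormalized softmax weights; subtracting the peak avoids overflow.
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits_per_step, temperature, seed):
    """Sample one token per step using an RNG fixed by `seed`."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for logits in logits_per_step]


# Toy logits for three steps over a 4-token vocabulary (made-up numbers).
toy_steps = [
    [2.0, 1.0, 0.5, 0.1],
    [0.3, 0.2, 2.5, 0.1],
    [1.0, 1.0, 1.0, 1.0],
]

first = generate(toy_steps, temperature=0.8, seed=1234)
second = generate(toy_steps, temperature=0.8, seed=1234)
print(first == second)  # same seed, same sequence
```

A different seed may give a different sequence, and a lower temperature concentrates the probability on the highest-scoring token.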
