The Arabic text "رجمة نانسي قنقر" translates to English as: "Nancy Qanqar's translation" or "Translation by Nancy Qanqar"
"رجمة" means "translation" and "نانسي قنقر" is the name "Nancy Qanqar"
I am pretty sure they didn't get asked.
[1] https://en.wikipedia.org/wiki/ARD_ZDF_Deutschlandradio_Beitr...
> We have a public service mandate, which means that we have very clear responsibilities according to the state media treaty. For us, this means that our top priority is actually reaching our target audience, namely approximately 15 million people living in Germany between the age of 14 and 29 who have internet access
It's not a binding contract for sure, but I don't think that OpenAI or other AI scrapers are their target audience.
The MPA must be so proud.
The AI industry - soaking up every bit of media available online for commercial purposes, often reproducing it nearly identically - has enough money and capital to influence things its way. And only its way, in case anyone was hoping this might change anything at all for the little guy.
I don't think that there are any clear examples of cases where ONLY downloading has resulted in huge fines. All the big bankrupting level fines have been for both downloading and sharing.
You mention that 'torrenting' could bankrupt you, and that is true, but the main reason for the huge fines are that you are taking part in distribution rather than just 'downloading for personal use'.
They [1, and others] have been hunting and fining downloaders for over a decade now, with the only "evidence" being IP addresses connected to the torrent [2].
1: https://www.njordlaw.com/filesharing-and-downloading-films/q...
2: https://admin.ovpn.com/en/blog/online-integrity-new-threats-...
These regurgitations combined with proof that a model is familiar with a work could be sufficient evidence to force discovery to determine if the work was pirated.
It's the LLM equivalent of thinking that an out-of-office reply is the translation: https://www.theguardian.com/theguardian/2008/nov/01/5
"Translated by Nancy Qanfar"
I'm not sure this is really overfitting; the network does exactly what the training data demands. According to the training data, silence at the end transcribes to a copyright notice or subtitle credits.
What do you think overfitting is, if not that?
But in this case the behavior seems to generalize over multiple languages, with the model choosing representative "outro silence" captions depending on the language. Which is consistent with the training data showing that outro silence is captioned.
If the model was generalizing perfectly it would show something like "[subtitle credits here]" but that'd be demanding a bit much.
Transcribing outro silence as silence, despite the training data consistently transcribing outro silence differently from regular silence, would be underfitting.
- This behavior damages the model's performance on out-of-sample data; every word you predict during silence increases the transcript's Word Error Rate.
- These translation credits are an artifact of our training data, and not a reflection of the process we are modeling (spoken language).
So, while you are correct about the mechanism at work here, it is still correct to call learning a spurious pattern which damages our performance "overfitting".
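The WER point above can be made concrete: when the reference transcript for a silent segment is empty, every hallucinated word is charged as an insertion error. A minimal sketch of word-level WER via edit distance (real evaluations usually use a library such as jiwer; this toy version is just to show the mechanics):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance divided by
    the number of reference words (floored at 1 to avoid 0/0)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

# Silence (empty reference) vs. hallucinated credits:
print(wer("", "Translated by Nancy Qanqar"))  # -> 4.0
```

Four invented words against an empty reference give a WER of 4.0, i.e. the hallucination is pure error.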
This is just wrong training data.
Side-note: it's also yet more evidence that AI companies hoover up all data with no regard for legality or copyright status, the very same offences that have landed other people in jail or saddled them with heavy fines.
Instead, it reverted to what it has seen before (in the training data), hence the overfit.
But the way you phrase it, it's just "the model is not properly able to generalize", i.e. it doesn't understand the concept of silence, which also makes sense.
But couldn’t you then argue that any type of mistake / unknown could be explained as “overfitting” ? Where do you draw the line ?
Way to go Nancy! Keep up the good work, ya crazy bastard!
"Big AI" is transparent and open about the fact they use all sorts of copyrighted material to train the data. How would "we see an exact chunk of text from our copyrighted material" add to that?
So not only are they training on copyrighted material, but they didn't even pay for it once, and then they didn't even do minimal data cleaning before training. Which, by the way, is the type of cleaning their LLMs could have done.
Having models hallucinate copyright notices shows that some content is being copy-pasted as-is, which kind of goes against the transformative argument.
(Note: I think that trying to litigate AI with current copyright laws is weird. They were created before LLMs were even imagined, so of course they can't handle them clearly. New laws are needed around this, rather than bending over backwards to guess what a lawmaker a century ago would have thought about how transformative a thing they couldn't have imagined is.)
The videos I tried to transcribe were also Mandarin Chinese, using whisper-large-v3. Besides the usual complaints that it would phonetically "mishear" things and generate nonsense, it was still surprisingly good, compared to other software I played around with.
That said, it would often invent names for the speakers and prefix their lines, or randomly switch between simplified and traditional Chinese. For the videos I tested, intermittent silence would often result in repeating the last line several times, or occasionally, it would insert direction cues (in English for some reason). I've never seen credits or anything like that.
In one video I transcribed, somebody had a cold and was sniffling. Whisper decided the person was crying (transcribed as "* crying *", a cough was turned into "* door closing *"). It then transcribed the next line as something quite unfriendly. It didn't do that anymore after I cut the sniffling out (but then the output switched back to traditional Chinese again).
It's even more important in audio DSP: processing near-zeroes can end up being extremely CPU intensive, look up denormal/subnormal floats.
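For illustration: subnormal (denormal) doubles are the nonzero values smaller in magnitude than `sys.float_info.min`, and arithmetic on them can be dramatically slower in native DSP code, so a common fix is to flush decaying tails (reverb, filter feedback) to zero before they enter that range. A Python sketch of the flush logic (Python itself won't exhibit the native slowdown; the function name and threshold choice are illustrative):

```python
import sys

# Smallest positive *normal* double; anything nonzero below this
# magnitude is subnormal (~2.225e-308 on IEEE 754 doubles).
SMALLEST_NORMAL = sys.float_info.min

def flush_denormals(samples, threshold=SMALLEST_NORMAL):
    """Replace values smaller in magnitude than `threshold` with 0.0,
    so a decaying signal snaps to silence instead of lingering in
    the slow subnormal range."""
    return [0.0 if abs(x) < threshold else x for x in samples]

print(flush_denormals([1e-300, 5e-324, 0.25]))  # -> [1e-300, 0.0, 0.25]
```

In real audio code this is usually done with CPU flags (FTZ/DAZ on x86) or by adding a tiny DC offset, but the effect is the same: the feedback path never computes on subnormals.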
Quite a lot of algorithms use some form of division, and zero is the only number in our typical structures (Z, Q, R, C) that you cannot divide by.
Error: division by please upvote, share and like!
What good is a speech recognition tool that literally hears imaginary voices?
Well, if it is supposed to work after silence detection, then it is good for speech recognition, I guess. It's like blaming a wheel for being circular because you can't sit on it. It's a part of a larger machine.
If it couldn't understand it, it was "foreign" for the longest time.
violets are blue
unregistered hypercam 2
Silence is golden,
Translated by Nancy,
To copyright, we aren't beholden
"[ sub by sk cn2 ]"
or
"Anyways, thanks for watching! Please subscribe and like! Thanks for watching! Bye!"
or
"This is the end of the video. Thank you for watching. If you enjoyed this video, please subscribe to the channel. Thank you."
Leaving personal comments, jokes, reactions, and intros in subtitles is very common in eastern cultures.
Turkish readers will probably remember "esekadam iyi seyirler diler" (roughly, "esekadam wishes you a pleasant viewing") :)
I suppose the cause is the same, generally subtitle creators adding all kinds of stuff during the credits that is NOT a transcript.
Seems to me it could have been filtered out relatively easily during training, by clipping the first and last few minutes of all audio clips. But I guess that's just hindsight.
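That kind of filtering is a one-liner if the audio is a flat sample array; a hedged sketch (function name and default are hypothetical, and a real pipeline would trim aligned subtitles too, not just audio):

```python
def trim_edges(samples, sample_rate, trim_seconds=120):
    """Drop the first and last `trim_seconds` of an audio clip,
    where intro credits and "Translated by ..." outros tend to live.
    Returns the clip unchanged if it is too short to trim."""
    n = trim_seconds * sample_rate
    if len(samples) <= 2 * n:
        return samples
    return samples[n:-n]
```

The guard matters: short clips (trailers, song snippets) would otherwise vanish entirely.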
Whisper also likes to transcribe cut off speech or unintelligible noise as "Thank you". I have no idea where that is coming from, but I guess it's a very polite model...
I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used. But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?
Indeed, the captioning is copyrighted work and you are not legally allowed to copy and redistribute it.
> But isn't it already known and admitted (and allowed?)
No, and I don't see where you got that from. Meta [1], OpenAI [2] and everybody else is being sued as we speak.
1: https://petapixel.com/2025/01/10/lawsuit-alleges-mark-zucker...
2: https://www.reuters.com/legal/litigation/openai-hit-with-new...
- they indeed seem to have trained on movies/subtitles
- you absolutely positively must use Voice Activity Detection (VAD) in front of whisper
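A toy illustration of what VAD buys you, gating fixed-size frames on RMS energy so only speech-bearing spans reach the transcriber. Production pipelines use a trained VAD (e.g. Silero VAD or WebRTC VAD) rather than a fixed energy threshold; all names and numbers here are illustrative:

```python
import math

def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def speech_frames(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Toy energy-based VAD: yield (start, end) sample ranges whose
    RMS exceeds `threshold`. Frames below the threshold (silence)
    are dropped, so they never get fed to the ASR model."""
    frame_len = sample_rate * frame_ms // 1000
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        if frame_rms(frame) > threshold:
            yield (i, i + frame_len)
```

Feeding Whisper only the returned ranges means the trailing silence, where the "Translated by ..." hallucinations appear, is simply never transcribed.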
It would produce seemingly ok output until you started paying attention.
One example: it insisted that Biggie Smalls sings "Puttin five carrots in my baby girl ear" (it's "carats").
It's apparently not useful in transcription as it don't reason [sic].
That's an example I gave after having used Whisper, the topic of discussion.
I suspect, as others mentioned, that these were extracted from torrented movies.