The Arabic text "رجمة نانسي قنقر" translates to English as: "Nancy Qanqar's translation" or "Translation by Nancy Qanqar"
"رجمة" means "translation" and "نانسي قنقر" is the name "Nancy Qanqar"
I am pretty sure they didn't get asked.
[1] https://en.wikipedia.org/wiki/ARD_ZDF_Deutschlandradio_Beitr...
> We have a public service mandate, which means that we have very clear responsibilities according to the state media treaty. For us, this means that our top priority is actually reaching our target audience, namely approximately 15 million people living in Germany between the age of 14 and 29 who have internet access
It's not a binding contract for sure, but I don't think that OpenAI or other AI scrapers are their target audience.
The MPA must be so proud.
The AI industry - soaking up every bit of media available online for commercial purposes, often reproducing it nearly identically - has enough money and capital to influence things its way. And only its way, in case anyone was hoping this might change anything at all for the little guy.
I don't think that there are any clear examples of cases where ONLY downloading has resulted in huge fines. All the big bankrupting level fines have been for both downloading and sharing.
You mention that 'torrenting' could bankrupt you, and that is true, but the main reason for the huge fines are that you are taking part in distribution rather than just 'downloading for personal use'.
They [1, and others] have been hunting and fining downloaders for over a decade now, with the only "evidence" being IP addresses connected to the torrent [2].
1: https://www.njordlaw.com/filesharing-and-downloading-films/q...
2: https://admin.ovpn.com/en/blog/online-integrity-new-threats-...
These regurgitations combined with proof that a model is familiar with a work could be sufficient evidence to force discovery to determine if the work was pirated.
It's the LLM equivalent of thinking that an out-of-office reply is the translation: https://www.theguardian.com/theguardian/2008/nov/01/5
"Translated by Nancy Qanfar"
I'm not sure this is really overfitting; the network does exactly what the training data demands. According to the training data, silence at the end transcribes to a copyright notice or subtitle credits.
What do you think overfitting is, if not that?
But in this case the behavior seems to generalize over multiple languages, with the model choosing representative "outro silence" captions depending on the language. Which is consistent with the training data showing that outro silence is captioned.
If the model was generalizing perfectly it would show something like "[subtitle credits here]" but that'd be demanding a bit much.
Transcribing outro silence as silence, despite the training data consistently transcribing outro silence differently from regular silence, would be underfitting.
- This behavior damages the model's performance on out-of-sample data; every word you predict during silence increases the transcript's Word Error Rate.
- These translation credits are an artifact of our training data, and not a reflection of the process we are modeling (spoken language).
So, while you are correct about the mechanism at work here, it is still correct to call learning a spurious pattern which damages our performance "overfitting".
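The WER point above can be made concrete: when the reference transcript for a silent segment is empty, every hallucinated word is charged as an insertion error. A minimal sketch of word-level WER via edit distance (real evaluations usually use a library such as jiwer; this toy version is just to show the mechanics):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance divided by
    the number of reference words (floored at 1 to avoid 0/0)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

# Silence (empty reference) vs. hallucinated credits:
print(wer("", "Translated by Nancy Qanqar"))  # -> 4.0
```

Four invented words against an empty reference give a WER of 4.0, i.e. the hallucination is pure error.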
This is just wrong training data.
Side-note: it's also yet more evidence that AI companies hoover up all data with no regard for legality or copyright status, the very same offences that have landed other people in jail or saddled them with heavy fines.
Instead, it reverted to what it has seen before (in the training data), hence the overfit.
But the way you phrase it, it's just "the model is not properly able to generalize", i.e. it doesn't understand the concept of silence, which also makes sense.
But couldn’t you then argue that any type of mistake / unknown could be explained as “overfitting” ? Where do you draw the line ?
Way to go Nancy! Keep up the good work, ya crazy bastard!
"Big AI" is transparent and open about the fact they use all sorts of copyrighted material to train the data. How would "we see an exact chunk of text from our copyrighted material" add to that?
So not only are they training on copyrighted material, but they didn't even pay for it once, and then they didn't even do minimal data cleaning before training. Which, by the way, is the type of cleaning their LLMs could have done.
Having models hallucinate copyright notices shows that some content is being copy-pasted as-is, which kind of goes against the transformative argument.
(Note: I think that trying to litigate AI with current copyright laws is weird. They were created before LLMs were even imagined, so of course they can't handle them clearly. New laws are needed around this, rather than bending over backwards to guess what a lawmaker a century ago would have thought about how transformative a thing they couldn't have imagined is.)
The videos I tried to transcribe were also Mandarin Chinese, using whisper-large-v3. Besides the usual complaints that it would phonetically "mishear" things and generate nonsense, it was still surprisingly good, compared to other software I played around with.
That said, it would often invent names for the speakers and prefix their lines, or randomly switch between simplified and traditional Chinese. For the videos I tested, intermittent silence would often result in repeating the last line several times, or occasionally, it would insert direction cues (in English for some reason). I've never seen credits or anything like that.
In one video I transcribed, somebody had a cold and was sniffling. Whisper decided the person was crying (transcribed as "* crying *", a cough was turned into "* door closing *"). It then transcribed the next line as something quite unfriendly. It didn't do that anymore after I cut the sniffling out (but then the output switched back to traditional Chinese again).
It's even more important in audio DSP: processing near-zeroes can end up being extremely CPU intensive, look up denormal/subnormal floats.
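For illustration: subnormal (denormal) doubles are the nonzero values smaller in magnitude than `sys.float_info.min`, and arithmetic on them can be dramatically slower in native DSP code, so a common fix is to flush decaying tails (reverb, filter feedback) to zero before they enter that range. A Python sketch of the flush logic (Python itself won't exhibit the native slowdown; the function name and threshold choice are illustrative):

```python
import sys

# Smallest positive *normal* double; anything nonzero below this
# magnitude is subnormal (~2.225e-308 on IEEE 754 doubles).
SMALLEST_NORMAL = sys.float_info.min

def flush_denormals(samples, threshold=SMALLEST_NORMAL):
    """Replace values smaller in magnitude than `threshold` with 0.0,
    so a decaying signal snaps to silence instead of lingering in
    the slow subnormal range."""
    return [0.0 if abs(x) < threshold else x for x in samples]

print(flush_denormals([1e-300, 5e-324, 0.25]))  # -> [1e-300, 0.0, 0.25]
```

In real audio code this is usually done with CPU flags (FTZ/DAZ on x86) or by adding a tiny DC offset, but the effect is the same: the feedback path never computes on subnormals.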
Quite a lot of algorithms use some form of division, and zero is the only number in our typical structures (Z, Q, R, C) that you cannot divide by.
Error: division by please upvote, share and like!
What good is a speech recognition tool that literally hears imaginary voices?
Well, if it is supposed to work after silence detection, then it is good for speech recognition, I guess. It's like blaming a wheel for being circular because you can't sit on it. It's a part of a larger machine.
If it couldn't understand it, it was "foreign" for the longest time.
violets are blue
unregistered hypercam 2
Silence is golden,
Translated by Nancy,
To copyright, we aren't beholden
"[ sub by sk cn2 ]"
or
"Anyways, thanks for watching! Please subscribe and like! Thanks for watching! Bye!"
or
"This is the end of the video. Thank you for watching. If you enjoyed this video, please subscribe to the channel. Thank you."
Leaving personal comments, jokes, reactions, and intros in subtitles is very common in eastern cultures.
Turkish readers will probably remember "esekadam iyi seyirler diler" (roughly, "esekadam wishes you a pleasant viewing") :)
I suppose the cause is the same, generally subtitle creators adding all kinds of stuff during the credits that is NOT a transcript.
Seems to me it could have been filtered out relatively easily during training, by clipping the first and last few minutes of all audio clips. But I guess that's just hindsight.
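That kind of filtering is a one-liner if the audio is a flat sample array; a hedged sketch (function name and default are hypothetical, and a real pipeline would trim aligned subtitles too, not just audio):

```python
def trim_edges(samples, sample_rate, trim_seconds=120):
    """Drop the first and last `trim_seconds` of an audio clip,
    where intro credits and "Translated by ..." outros tend to live.
    Returns the clip unchanged if it is too short to trim."""
    n = trim_seconds * sample_rate
    if len(samples) <= 2 * n:
        return samples
    return samples[n:-n]
```

The guard matters: short clips (trailers, song snippets) would otherwise vanish entirely.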
Whisper also likes to transcribe cut off speech or unintelligible noise as "Thank you". I have no idea where that is coming from, but I guess it's a very polite model...
I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used. But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?
Indeed, the captioning is copyrighted work and you are not legally allowed to copy and redistribute it.
> But isn't it already known and admitted (and allowed?)
No, and I don't see where you got that from. Meta [1], OpenAI [2] and everybody else is being sued as we speak.
1: https://petapixel.com/2025/01/10/lawsuit-alleges-mark-zucker...
2: https://www.reuters.com/legal/litigation/openai-hit-with-new...
- they indeed seem to have trained on movies/subtitles
- you absolutely positively must use Voice Activity Detection (VAD) in front of whisper
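A toy illustration of what VAD buys you, gating fixed-size frames on RMS energy so only speech-bearing spans reach the transcriber. Production pipelines use a trained VAD (e.g. Silero VAD or WebRTC VAD) rather than a fixed energy threshold; all names and numbers here are illustrative:

```python
import math

def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def speech_frames(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Toy energy-based VAD: yield (start, end) sample ranges whose
    RMS exceeds `threshold`. Frames below the threshold (silence)
    are dropped, so they never get fed to the ASR model."""
    frame_len = sample_rate * frame_ms // 1000
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        if frame_rms(frame) > threshold:
            yield (i, i + frame_len)
```

Feeding Whisper only the returned ranges means the trailing silence, where the "Translated by ..." hallucinations appear, is simply never transcribed.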
It would produce seemingly ok output until you started paying attention.
One example: it insisted that Biggie Smalls sings "Puttin five carrots in my baby girl ear" (it's "carats").
It's apparently not useful in transcription as it don't reason [sic].
That's an example I gave after having used Whisper, the topic of discussion.
I suspect, as others mentioned, that these were extracted from torrented movies.