The Arabic text "ترجمة نانسي قنقر" translates to English as: "Nancy Qanqar's translation" or "Translation by Nancy Qanqar"
"ترجمة" means "translation" and "نانسي قنقر" is the name "Nancy Qanqar"
I am pretty sure they didn't get asked.
[1] https://en.wikipedia.org/wiki/ARD_ZDF_Deutschlandradio_Beitr...
Back in 2011, Tagesschau openly rallied against Muslims, and wanting public broadcasting gone was a leftist position. The whole thing is completely asinine to anyone who remembers.
> We have a public service mandate, which means that we have very clear responsibilities according to the state media treaty. For us, this means that our top priority is actually reaching our target audience, namely approximately 15 million people living in Germany between the age of 14 and 29 who have internet access
It's not a binding contract for sure, but I don't think that OpenAI or other AI scrapers are their target.
Obviously a rhetorical question. The AI grifters of this decade take what they want and laugh at your pitiful future
The MPA must be so proud.
The AI industry - soaking up every bit of media available online for commercial purposes, often reproducing it nearly identically - has enough money and capital to influence things its way. And only its way, in case anyone was hoping this might change anything at all for the little guy.
I don't think that there are any clear examples of cases where ONLY downloading has resulted in huge fines. All the big bankrupting level fines have been for both downloading and sharing.
You mention that 'torrenting' could bankrupt you, and that is true, but the main reason for the huge fines are that you are taking part in distribution rather than just 'downloading for personal use'.
They [1, and others] have been hunting and fining downloaders for over a decade now, with the only "evidence" being IP addresses connected with the torrent [2].
1: https://www.njordlaw.com/filesharing-and-downloading-films/q...
2: https://admin.ovpn.com/en/blog/online-integrity-new-threats-...
Hint: there is a distinction.
Copying from another comment I wrote here:
> These are two separate things:
> * Making content available for unauthorized distribution
> * Distributing unauthorized content that someone else already made available
> Seeding isn't making content available, it's keeping content available.
Is that an unreasonable assumption? As much as people like to come up with excuses like "I had open wifi!" or "I was running a TOR node", judges don't seem inclined to believe them, probably for the same reason they don't seem inclined to believe excuses like "somebody took my car on a joyride and then returned it!" for parking tickets. Remember, both non-commercial copyright infringement lawsuits and parking tickets are tried in civil court, which means the standard is "preponderance of evidence", not "beyond reasonable doubt".
How hard could it be to keep DHCP logs? Assuming they exist at all, what would cause them to be incorrect?
For all intents and purposes, participating in the torrent almost guarantees that you seeded, because all torrent clients upload as you download.
* Making content available for unauthorized distribution
* Distributing unauthorized content that someone else already made available
Seeding isn't making content available, it's keeping content available.
That still doesn't make them the same thing. There are different shades of grey, etc.
> Moreover isn't AI companies also "keeping content available"?
I don't know what you mean by that.
The whole point of the thread is that AI companies are getting away with piracy but individuals aren't. But the reality is that AI companies aren't getting away with it (a judge ruled that Anthropic must face trial over their use of pirated books).
More specific to this thread is the claim that "ONLY downloading" hasn't resulted in fines for anyone. So far as I can tell, this is true. People are just quibbling over how someone who's torrenting somehow counts as "only downloading", even though their client is uploading.
But then if I download a file, create a copy, and share it with you, have I done anything wrong?
To all intents and purposes, seeding is an act of reproduction. You, while keeping your copy, create copies of (parts of) the file and share them with someone else to allow them to assemble a new, second copy.
Whether this is, or should be, a crime is a different question altogether. The main point I was making is that it’s the copying/sharing to other people which seems to be a crucial element in these prosecutions.
That’s likely intentional: the last thing the *AA folks want is a decision that creating a copy of a copyrighted work for your own personal use is not a crime. But it does seem the courts have decided: making a copy for someone else is indeed illegal.
If you don't understand how torrents work on a technical level, I suggest at least some shallow reading. Property rights holders don't care about details; as long as you tick the box of sending a single packet to somebody, off to court with ya.
If this is true, I have been unable to find any. Can you please share? In all of the cases I was able to find, the huge fines were based on also uploading.
> If you don't understand how torrents work on a technical level I suggest at least some shallow reading
This is a bit patronising, and I'm not sure what point you're trying to make. My point is that the only prosecutions I've been able to find are where they were able to prove uploading as well as downloading (and yes, the fact that someone used BitTorrent makes it a slam-dunk, because the protocol makes it impossible to download without also uploading). Are you trying to argue that someone who torrents a copyrighted work doesn't also share it?
The fights over digitized media for personal (entertainment / informational) use were fought in the early aughts. The precedents crafted then don't immediately translate to these cases (novel transformative work from protected materials), and the new precedents have to account for the fact that universities have been training via "piracy" for ages.
(The magic of money factors in to the extent that they can afford the lawyers to remind the court that this isn't settled law yet).
These regurgitations combined with proof that a model is familiar with a work could be sufficient evidence to force discovery to determine if the work was pirated.
I think this would have some unpalatable consequences. Let's say an author is writing a modestly successful book series: it's not going to make them rich, but it's commercially viable and they care a lot about it for its own sake. Under this system, if the author declares a value commensurate with the (quite small) pure economic value of the IP, they have to live in fear of their right to continue working on their creation being abruptly taken away from them at any point. If they instead declare a value commensurate with the economic value + the extra value that it has to them personally, the resulting tax liability could easily tip the balance and destroy their ability to pursue their writing as a career.
There are always some cases on the edge. The question is if saving them is worth the cost of the major players running rampant.
We shouldn't abandon the line of investigation, however. We should continue thinking of ways to do this until we find one that works well.
There's a chance it ends up being something that requires a judge to interpret each individual case...
Most jurisdictions that have "property tax" only apply it on certain types of property, most commonly real estate. So it's not that weird that IP isn't taxed.
It's the LLM equivalent of thinking that an out-of-office reply is the translation: https://www.theguardian.com/theguardian/2008/nov/01/5
"Translated by Nancy Qanfar"
I'm not sure this is really overfitting; the network does exactly what the training data demands. According to the training data, silence at the end transcribes to a copyright notice or subtitle credits.
What do you think overfitting is, if not that?
But in this case the behavior seems to generalize over multiple languages, with the model choosing representative "outro silence" captions depending on the language. Which is consistent with the training data showing that outro silence is captioned.
If the model was generalizing perfectly it would show something like "[subtitle credits here]" but that'd be demanding a bit much.
Transcribing outro silence as silence despite the training data consistently transcribing outro silence differently from regular silence would be underfitting
- This behavior damages the model's performance on out of sample data; every word you predict during silence increases the transcript's Word Error Rate.
- These translation credits are an artifact of our training data, and not a reflection of the process we are modeling (spoken language).
So, while you are correct about the mechanism at work here, it is still correct to call learning a spurious pattern which damages our performance "overfitting".
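To make the WER point concrete, here is a minimal sketch (made-up reference/hypothesis strings, plain word-level edit distance) showing how a hallucinated credit line inflates the error rate even when every spoken word was transcribed correctly:

```
# Minimal word-level WER sketch; the strings below are illustrative, not real Whisper output.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

reference = "thanks for listening"
clean     = "thanks for listening"
padded    = "thanks for listening translated by nancy qanqar"

print(wer(reference, clean))   # 0.0   -- perfect transcript
print(wer(reference, padded))  # ~1.33 -- four inserted words over a three-word reference
```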
This is just wrong training data.
Side-note: it's also yet more evidence that AI companies hoover up all data with no regard for legality or copyright status, the very same offences that have landed other people in jail or saddled them with heavy fines.
Instead, it reverted to what it has seen before (in the training data), hence the overfit.
But the way you phrase it, it’s just “the model is not properly able to generalize”, ie it doesn’t understand the concept of silence also makes sense.
But couldn’t you then argue that any type of mistake / unknown could be explained as “overfitting” ? Where do you draw the line ?
Where do you draw the line between “overfitting to training data” and “incorrect data” ?
Not really; getting 94381294*123=... wrong, but close to the actual answer, cannot be overfitting since it wasn't in the training data.
No it doesn't, for instance some errors would be caused by underfitting. The data could also be correct but your hyperparameters (such as the learning rate or dropout rate) could cause your model to overfit.
> Where do you draw the line between “overfitting to training data” and “incorrect data” ?
There's no need to draw a line between two explanations that aren't mutually exclusive. They can (as in this case) both be true. Overfitting is the symptom; dirty data is the cause.
Silence is never put in the subtitles of a film, since it isn't necessary. The viewers can tell that nothing is being said if there are actors on the screen. And in situations where there are no actors, then there will be a subtitle to indicate what is going on, like "[rock music plays]".
Subtitle authors use this silence to fit in meta information and have done so since the closed captions era.
The proper data cleaning procedure would be to strip this metadata from any subtitle sources. Since this wasn't done, this is fundamentally a classification issue. It may also be an over-fitting issue, but that is secondary to the classification problem.
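As a rough illustration of that kind of cleaning, here is a minimal sketch (the regex patterns and the sample cues are assumptions, not an actual production pipeline) that drops subtitle cues which look like translator credits or ads before the text is used as training labels:

```
import re

# Illustrative patterns that mark subtitle metadata rather than speech;
# a real pipeline would need a much longer, per-language list.
CREDIT_PATTERNS = [
    r"subtitles?\s+by",
    r"translated\s+by",
    r"opensubtitles",
    r"like\s+and\s+subscribe",
    r"ترجمة",  # Arabic "translation (by) ..." credit lines
]
CREDIT_RE = re.compile("|".join(CREDIT_PATTERNS), re.IGNORECASE)

def clean_cues(cues):
    """Keep only cues that don't look like credits or ads."""
    return [c for c in cues if not CREDIT_RE.search(c)]

# Hypothetical cues pulled from an .srt file:
cues = [
    "I'll see you tomorrow.",
    "[door closes]",
    "Subtitles by some_username",
    "ترجمة نانسي قنقر",
]
print(clean_cues(cues))  # -> ["I'll see you tomorrow.", "[door closes]"]
```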
How would the AI know that a series of zero-amplitude audio samples should generate the string "[silence]"?
It can only know that if the vast majority of silent audio segments in the training set are consistently labelled with that string. But that doesn't seem to be the case: silence is either not labeled at all, labeled with all kinds of different markers, or labeled with unrelated things like copyright credits.
So even if the model successfully learns a generalized representation of the concept of "silence", it's not clear at all which of all the different labels it should use for that concept.
So what might happen is that the model then starts to overfit on the tiny variations of the individual silence segments, in a desperate attempt to devise some kind of system behind all the different "silence" labels - which will of course go wrong spectacularly, as such a system doesn't exist. (Or if it does, it is entirely accidental and not something that should be learned.)
overfitting means that the model is too closely aligned to the training data, picked up noise, and does not generalize well to *new, unseen* data. think of students that learn to reproduce questions and their answers for a test instead of learning concepts and transferring knowledge to new questions that involve the same concepts.
while this sounds like overfitting, I'd just say it's garbage in, garbage out; wrong classification. the training data is shit and didn't have (enough) correct examples to learn from.
in romanian, i’ve noticed multiple instances where the transcript ends with “nu uitati sa da-ti like si subscribe” which, as you might easily infer, translates to “don’t forget to like and subscribe”.
Way to go Nancy! Keep up the good work, ya crazy bastard!
"Big AI" is transparent and open about the fact they use all sorts of copyrighted material to train the data. How would "we see an exact chunk of text from our copyrighted material" add to that?
So not only are they training on copyrighted material, but they didn't even pay for it once, and then they didn't even do minimal data cleaning before training. Which, by the way, is the type of cleaning their LLMs could have done.
This is the key part. And it's not certain this happened. Not defending AI data gobbling, but if we truly and honestly want to fight big-AI use of content, we cannot just presume bad faith. OpenSubtitles.org has a large dataset that is "public". It is a dataset perfectly suitable, intended for, and therefore used for, training and data analysis.
I've used it for data analysis.
Having models hallucinate copyright notices shows that some content is being copypasted as is, which kind of goes against the transformative argument.
(Note: I think that trying to litigate AI with current copyright laws is weird. They were created before LLMs were even imagined, so of course they can't handle them clearly. New laws are needed around this, not trying to bend over backwards to think about what a lawmaker a century ago would have thought about how transformative a thing they couldn't have imagined is.)
Indeed a good example. We've seen several examples of code snippets where this happens too, mentioned on HN.
But it does not prove that they infringed copyright by ingesting "illegal" stuff, as GP tried to argue. Seeing a verbatim string only "proves" that it came from a specific source, not whether that source was illegally acquired, which was my point.
The videos I tried to transcribe were also Mandarin Chinese, using whisper-large-v3. Besides the usual complaints that it would phonetically "mishear" things and generate nonsense, it was still surprisingly good, compared to other software I played around with.
That said, it would often invent names for the speakers and prefix their lines, or randomly switch between simplified and traditional Chinese. For the videos I tested, intermittent silence would often result in repeating the last line several times, or occasionally, it would insert direction cues (in English for some reason). I've never seen credits or anything like that.
In one video I transcribed, somebody had a cold and was sniffling. Whisper decided the person was crying (transcribed as "* crying *", a cough was turned into "* door closing *"). It then transcribed the next line as something quite unfriendly. It didn't do that anymore after I cut the sniffling out (but then the output switched back to traditional Chinese again).
It's even more important in audio DSP: processing near-zeroes can end up being extremely CPU intensive, look up denormal/subnormal floats.
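For anyone curious, here is a small sketch (NumPy, illustrative constants) of the usual workaround: flush anything below the smallest normal float to exact zero so a decaying tail never sits in subnormal territory:

```
import numpy as np

# A decaying tail (think: the end of a reverb) eventually drops into the
# subnormal float range, where many CPUs process each sample far more slowly.
tail = (0.5 * np.exp(-np.arange(100_000) / 1_000.0)).astype(np.float32)

tiny = np.finfo(np.float32).tiny  # smallest *normal* positive float32
n_subnormal = np.count_nonzero((tail != 0) & (np.abs(tail) < tiny))
print("subnormal samples before flush:", n_subnormal)

# Flush-to-zero: the usual DSP fix (hardware FTZ/DAZ flags do the same job).
tail[np.abs(tail) < tiny] = 0.0
```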
Quite a lot of algorithms use some form of division, and zero is the only number in our typical structures (Z, Q, R, C) that cannot be used as a divisor.
Error: division by please upvote, share and like!
What good is a speech recognition tool that literally hears imaginary voices?
I'd really appreciate it.
Well, if it is supposed to work after silence detection, then it is good for speech recognition, I guess. It's like blaming a wheel for being circular because you can't sit on it. It's a part of a larger machine.
Show us a technology with better results that does not use VAD. If you can’t, then I’m not sure what you’re arguing against except superficialities so inconsequential that I can’t comprehend the condescension. The results speak for themselves.
Do you also moan that you have to prepare a surface before applying glue or it won't stick? Or that you need to drill a guiding hole before making a larger one in wood? Or that you need to use truly prime numbers for a security key to actually be safe?
On the other hand, I can imagine that when things get quiet and the signal-to-noise ratio gets close to zero, random background audio (or randomness introduced in the transcription model) will be enough to tickle a critical number of neurons and elicit hallucinations.
The related thought exercise is this: Try scanning across the band with an AM or sideband radio, and after a while your brain will start to wonder "was that a voice I just heard, or music perhaps?" when in reality it was just environmental static.
I agree their products could be better "end to end" integrated. Meanwhile there is a continuously-improving field of work for detecting speech (which Whisper is incapable of). They offer official "cookbooks" with guidance on an approach they recommend: https://cookbook.openai.com/examples/whisper_processing_guid...
> At times, files with long silences at the beginning can cause Whisper to transcribe the audio incorrectly. We'll use Pydub to detect and trim the silence.
(Official OpenAI quote)
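For reference, a minimal sketch along the lines of that cookbook advice (file names are placeholders; assumes the pydub package and ffmpeg are installed) that trims leading silence before handing the audio to Whisper:

```
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

audio = AudioSegment.from_file("input.mp3")  # placeholder path

# Milliseconds of leading audio quieter than -50 dBFS.
lead_ms = detect_leading_silence(audio, silence_threshold=-50.0, chunk_size=10)

trimmed = audio[lead_ms:]                    # drop the silent head
trimmed.export("trimmed.wav", format="wav")  # then transcribe trimmed.wav with Whisper
```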
Say if I wanted to use it for Voice Nav, or Voice Input, but not piss off random people speaking the wrong language.
*Although it used to be more common for AVI files in the olden days.
It only highlights how the world really works. If you have money you get to do whatever the fuck you want. If you're just a normal person you get to spend years in jail or worse.
Reminds me of https://www.youtube.com/watch?v=8GptobqPsvg
If you owe the bank $100,000,000 the bank has a problem.
We live in an era where the president of the United States uses his position to pump crypto scams purely for personal profit.
And the US is not the only jurisdiction
"Has been argued" -- sure, but never successfully; in fact, in HiQ v. LinkedIn, the 9th Circuit ruled (twice, both before and on remand again after and applying the Supreme Court ruling in Van Buren v. US) against a cease and desist on top of robots.txt to stop accessing data on a public website constituting "without authorization" under the CFAA.
[1] https://www.thefederalcriminalattorneys.com/unauthorized-rec...
Would it be a "fair use" to download pirated papers for research instead of buying?
Also I was gradually migrating from obtaining software from questionable sources to open source software, thinking that this is going out of trend and nobody torrents apps anymore, but it seems I was wrong?
Or another example: if someone wants to make contributions to Wine but needs Windows to develop the patch, what would be the right choice, buy it or download a free copy from a questionable source?
It's more that the law about "one guy decides to pirate twelve movies to watch them at home and share with his buddies" is already well-settled, but the law about "a company pirates 10,000,000 pieces to use as training data for an AI model (a practice that the law already says is legal in an academic setting, i.e. universities do this all the time and nobody bats an eye)" is more complicated and requires additional trials to resolve. And no, even though the right answer may be self-evident to you or me, it's not settled law, and if the force of law is applied poorly suddenly what the universities are doing runs afoul of it and basically nobody wants that outcome.
Training on copyrighted material is a separate claim from skirting payment for it.
Which pretty much boils down to: "If they put it out there for everyone to see, it's probably OK to train on it, if they put it behind a paywall and you don't pay, the training part doesn't matter, it's a violation."
Because it's important to grasp the scale of these copyright violations:
* They downloaded, and admitted to using, Anna's Archive: Millions of books and papers, most of which are paywalled but they pirated it instead
* They acquired Movies and TV shows and used unofficial subtitles distributed by websites such as OpenSubtitles, which are typically used for pirated media. Official releases such as DVDs tend to have official subtitles that don't sign off with "For study/research purpose only. Please delete after 48 hours" or "Subtitles by %some_username%"
If you skirt payment, it's a violation. If it's free, but still copyrighted, it's likely not a violation.
By comparison, someone here brought up that it might be transformative fair use to write a play heavily based on Blood Meridian, but you still need to buy a copy of the book. It would still be infringement to pirate the e-book for your writing process, even if the end result was legal.
The only thing I've been able to find is the note that since copyright is federal law, state contract law actually can't supersede it, to wit: if you try to put a clause in the contract that says the contract is void if I use your work to make transformative fair-use works (or I owe you a fee), that clause is functionally unenforceable (for the same reason that I don't owe you a fee if I make transformative fair-use works of your creations in general).
Or they can negotiate a deal at scale with whatever price / restrictions make sense to both parties.
I don’t see a way they could be “trapped”. Worst case they pay retail price.
Clearly Bonnie and Clyde shouldn’t have been prosecuted. Imagine they were just robbing banks for literary research purposes. They could have then used the learnings to write a book and sell it commercially…
Or imagine one cracks 10000 copyrighted DVDs and then sells 30 second clips… (a derived work).
To me, for profit companies and universities have a huge difference — the latter is not seeking to directly commercially profit from copyrighted data.
Seems fair.
We wish we lived in a world where change was reliably positive for our lives. Often changes are sold that way, but they rarely are.
But when new things introduce dramatic capabilities that former things couldn't match (every chatbot before LLMs), it is as clear of an objective technological advance as has ever happened.
--
Not every technical advance reliably or immediately makes society better.
But whether or when technology improves the human condition is far more likely to be a function of human choices than the bare technology. Outcomes are strongly dependent on the trajectories of who has a technology, when they do, and how they use it. And what would be the realistic (not wished for) outcome of not having or using it.
For instance, even something as corrosive as social media, as it is today, could have existed in strongly constructive forms instead, if society had viewed private surveillance, unpermissioned collation across third parties, and the weaponizing of dossiers via personalized manipulation of media, increased ad impact, and addictive-type responses as ALL being violations of human rights to privacy and freedom from coercion or manipulation, and worth legally banning.
Ergo, if we want tech to more reliably improve lives, we need to ban obviously perverse human/corporate behaviors and conflicts of interest.
(Not just shade tech. Which despite being a pervasive response, doesn't seem to improve anything.)
Either both AI teams cheated, in which case there's nothing to worry about, or they didn't, in which case you've set a pretty high bar. Where is that bar, exactly? What exactly does it take to justify blowing off copyright law in the larger interest of progress? (I have my own answers to that question, including equitable access to the resulting models regardless of how impressive their performance might be, but am curious to hear yours.)
Social networks as they exist today represent technology that didn't exist decades ago. I wouldn't call it an "advancement" though. I think social media is terrible for humans in aggregate.
I'm pretty bullish on ML progress in general, but I'm finding it harder every day to disagree with recursive's take on social media.
Everyone I know has stories about their ISP sending nastygrams threatening legal action over torrenting, but now that corporations (whose US legal personhood appears to matter only when it benefits them) are doing it as part of the development of a commercial product that they expect to charge people for, that's fine?
And in any case, my argument had nothing to do with copyright (though I do hate the hypocrisy of the situation), and whether or not it's "nothing to worry about" in the long run, it seems like it'll cause a lot of harm before the benefits are felt in society at large. Whatever purported benefits actually come of this, we'll have to deal with:
- Even more mass layoffs that use LLMs as justification (not just in software, either). These are people's livelihoods; we're coming off of several nearly-consecutive "once-in-a-generation" financial crises, a growing affordability crisis in much of the developed world, and stagnating wages. Many people will be hit very hard by layoffs.
- A seniority crisis as companies increasingly try to replace entry-level jobs with LLMs, meaning that people in a crucial learning stage of their jobs will have to either replace much of the learning curve for their domain with the learning curve of using LLMs (which is dubiously a good thing) or face unemployment, leaving industries to deal with the aging-out of their talent pools
- We've already been heading towards something of an information apocalypse, but now it seems more real than ever, and the industry's response seems to broadly be "let's make the lying machines lie even more convincingly"
- The financial viability of these products seems... questionable right now, at best, and given that the people running the show are opening up data centres in some of the most expensive energy markets around (and in the US's case, one that uniquely disincentivizes the development of affordable clean energy), I'm not sure that anyone's really interested in a path to financial sustainability for this tech
- The environmental impact of these projects is getting to be significant. It's not as bad as Bitcoin mining yet, AFAIK, but if we keep on, it'll get there.
- Recent reports show that the LLM industry is starting to take up a significant slice of the US economy, and that's never a good sign for an industry that seems to be backed by so much speculation rather than real-world profitability. This is how market crashes happen.
>If you're just a normal person you get to spend years in jail or worse.
Not that I'm a big fan of the criminalization of copyright infringement in the United States, but who has ever spent years in jail for this?
Besides, if it really bothered you, then we might not see this weird tone-switch from one sentence to the next, where you seem to think that piracy is shocking and "something should be done" and then "it's not good that someone should spend time in jail for it". What gives?
What a weirdly condescending way to interpret my post. My point boils down to: Either prosecute copyright infringement or don't. The current status quo of individuals getting their lives ruined while companies get to make billions is disgusting.
This is the absolute core of the issue. Technical people see law as code, where context can be disregarded and all that matters is specifying the outputs for a given set of inputs.
But law doesn’t work that way, and it should not work that way. Context matters, and it needs to.
If you go down the road of “the law is the law and billion dollar companies working on product should be treated the same as individual consumers”, it follows that individuals should do SEC filings (“either require 10q’s or don’t!”), and surgeons should be jailed (“either prosecute cutting people with knives or don’t!”).
There is a lot to dislike about AI companies, and while I believe that training models is transformative, I don’t believe that maintaining libraries of pirated content is OK just because it’s an ingredient to training.
But insisting that individual piracy to enjoy entertainment without paying must be treated exactly the same as datasets for model training is the absolute weakest possible argument here. The law is not that reductive.
Copyright laws target everyone. SEC laws don't.
As Anatole France famously quipped:
"The law, in its majestic equality, forbids the rich and poor alike to sleep under bridges, to beg in the streets, and to steal bread."
Aaron Swartz?
EDIT: apparently he wasn't in jail, he was on bail while the case was ongoing - but the shortest plea deal would still have had him in jail for 6 months, and the potential penalty was 35 to 50 years.
As for actually gathering the copyrighted material: I believe the jury hasn't even been empaneled for that yet (in the OpenAI case), but the latest ruling from the court is that copyright may have been violated in the creation of their training corpus.
They can. I don't think anyone got prosecuted for using an illegal streaming site or downloading from sci-hub, for instance. What people do get sued for is seeding, which counts as distribution. If anything AI companies are getting prosecuted more aggressively than "ordinary people", presumably because of their scale. In a recent lawsuit Anthropic won on the part about AI training on books, but lost on the part where they used pirated books.
Same goes for recording: I'm just training my skills of recording. Or maybe I'm just recording it so I can rewatch it later, for training purposes, of course.
None of this is relevant because Anthropic was only let off the hook for training, and not for pirating the books itself. So far as the court cases are playing out, there doesn't appear to be a special piracy exemption for AI companies.
>Same goes for recording: I'm just training my skills of recording. Or maybe I'm just recording it so I can rewatch it later, for training purposes, of course.
You can certainly use that as a defense. That's why we have judges, otherwise there's going to be some smartass caught with 1KG of coke and claiming it's for "personal consumption" rather than distribution.
None of this matters in reality, though. If you're caught with AV gear in a movie theater once, you'd likely be ejected and banned from the establishment/chain, not have the FBI/MPAA go after you for piracy. If you come again, you'd likely be prosecuted for trespassing. In the cases where they're going after someone in particular for making these rips, they usually have a dossier of evidence, like surveillance/transaction history showing that the same individual has been repeatedly recording movies, and watermarks correlating the screenings that the person has been in to files showing up on torrent sites.
Good example, because this is exactly what websites are doing with LLM companies, who are doing their damnedest to evade the blocks. Which brings us back around to "trespassing" or the CFAA or whatever.
That argument is pretty much dead after https://en.wikipedia.org/wiki/Van_Buren_v._United_States and https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
I'll leave all other jurisdictions up to you.
You put in 2+2 - the right figures. The machine says 4 - the right answer. If you put in the wrong figures, like 3+3, will the machine still say 4? It's easy to make a machine that always says 4.
The people who asked him that question, however, probably had a different scam demonstrated to them regularly. Remember the Mechanical Turk? Babbage's reply paints him very honestly. It shows that he couldn't even conceive that someone might try to trick the royal court (or whoever it was) into accepting a fake device.
If it couldn't understand it, it was "foreign" for the longest time.
I also noticed a couple of months ago that YouTube seems to have quietly rolled out a new auto-transcription model that can make reasonable guesses at where capitalization, punctuation, and sentence boundaries should go. It seems to have degraded even more rapidly than the old one, falling victim to the same kinds of transcription errors, although the new one has a different hallucination for silence and noise it isn't able to classify (and, incidentally, its ability to recognize things like music and applause seems worse than the old one's): where the old model would have hallucinated the word "foreign", the new one thinks it's hearing the word "heat", often repeated ("Heat. Heat.").
To be fair, there is a difference between when subtitles match the source language and when they don't. The former are often verbatim.
Netflix sometimes takes the cake with what I consider the most outrageous option: writing "[in English]" when they mean "in whatever language the protagonist considers native", which is mind-bogglingly wrong and hilarious at the same time.
They do this with the English subtitles of the German production "Die Kaiserin" ("The Empress"): whenever Sisi is speaking in another language, say French, the subtitles will say "[in French] I love you...", and when she switches back to German they will say "[in English] I love you...". WTF, Netflix? Note this is unrelated to understanding German; it's mostly Netflix looking down on its customers and assuming they cannot comprehend there are people in the world for whom their native tongue is different to the viewer's native tongue.
This has happened in more shows, enough to know it's not a fluke, though Netflix is inconsistent about it.
They trained the model on every YouTube video they could, and hoped the aggregate was useful data.
My revelation was that machine translation needs a corpus of bilingual documents to learn from, and if the language is sufficiently obscure, there may not be any bilingual documents except for the Bible, which missionaries have translated into just about every language on Earth.
violets are blue
unregistered hypercam 2
Silence is golden,
Translated by Nancy,
To copyright, we aren't beholden
"[ sub by sk cn2 ]"
or
"Anyways, thanks for watching! Please subscribe and like! Thanks for watching! Bye!"
or
"This is the end of the video. Thank you for watching. If you enjoyed this video, please subscribe to the channel. Thank you."
leaving personal comments, jokes, reactions, intros in subtitles is very common in eastern cultures.
Turkish readers will probably remember “esekadam iyi seyirler diler” :)
I suppose the cause is the same, generally subtitle creators adding all kinds of stuff during the credits that is NOT a transcript.
Seems to me it could have been filtered out relatively easily during training, by clipping the first and last few minutes of all the audio. But I guess that's just in hindsight.
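Something like the following would be a crude version of that filter (pydub again, with made-up file names; the two-minute margin is an arbitrary assumption):

```
from pydub import AudioSegment

MARGIN_MS = 2 * 60 * 1000  # clip two minutes from each end (arbitrary choice)

audio = AudioSegment.from_file("episode.wav")  # hypothetical training clip

# Only clip when enough material is left in the middle.
if len(audio) > 3 * MARGIN_MS:
    audio = audio[MARGIN_MS:-MARGIN_MS]

audio.export("episode_trimmed.wav", format="wav")
```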
Whisper also likes to transcribe cut off speech or unintelligible noise as "Thank you". I have no idea where that is coming from, but I guess it's a very polite model...
I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used. But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?
I disagree with this conclusion. I've used e.g. the opensubtitles dataset for some data analysis in the past. It's a huge dataset, freely available and precisely intended for such use. Now, whether all the data in the opensubtitles dataset is legal is another point.
So one might argue that using this opensubtitles dataset makes one complicit in the illegal activities of opensubtitles themselves, IDK; IANAL.
Indeed, the captioning is copyrighted work and you are not legally allowed to copy and redistribute it.
> But isn't it already known and admitted (and allowed?)
No, and I don't see where you got that from. Meta [1], OpenAI [2] and everybody else is being sued as we speak.
1: https://petapixel.com/2025/01/10/lawsuit-alleges-mark-zucker...
2: https://www.reuters.com/legal/litigation/openai-hit-with-new...
Using copyrighted materials and then meaningfully transforming it isn’t infringement. LLMs only recreate original work in the same way I am when I wrote the first sentence of this paragraph because it probably exists word for word somewhere else too
It’s been determined by the judge in the Meta case that training on the material is fair use. The suit in that case is ongoing to determine the extent of the copyright damages from downloading the material. I would not be surprised if there is an appeal to the fair use ruling but that hasn’t happened yet, as far as I know. Just saying that there is good reason for them to think it’s been allowed because it kind of has; that can be reversed but it happened.
There haven't been any trials yet about the millions of copyrighted books, movies and other content they evidently used.
> But isn't it already known and admitted (and allowed?)
You seemed to be confused about why this person believed that:
> No, and I don't see where you got that from.
And I wrote a comment intended to dispel your confusion. The above commenter thought that it was allowed because a judge said it was allowed; that can be appealed but that's the reason someone thinks it's allowed.
Trial court rulings aren't binding precedent even on the same court in different cases, so it's quite possible that different cases at the trial level can reach different conclusions on fair use on fairly similar facts, given the lack of appellate precedent directly on point with AI training.
A single verdict about a specific case (13 authors vs META) does not mean it's legal for companies to steal IP from other companies which has evidently been going on for some years now.
Those other companies have lawyers powerful enough to change jurisdiction in many countries in order to "protect their IP".
The contention is that the specific translated text appears largely from illegal translations (i.e., fansubs) and not from authorized translations. And from a legal perspective, that would basically mean there's no way they could legally have appropriated that material.
> But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?
Technically, everything is copyrighted. But your question is really about permission. Some of the known corpuses for AI training include known pirate materials (e.g., libgen), but it's not known whether or not the AI companies are filtering out those materials from training. There's a large clutch of cases ongoing right now about whether or not AI training is fair use or not, and the ones that have resolved at this point have done so on technical grounds rather than answering the question at stake.
In other words there are activities that are legal or not depending on whether you have authorization from the state. That describes many things. For instance you synthesize meth without a license from the DEA/FDA, you're a "drug cartel" or whatever. But if you do it with a license you're a "pharmaceutical company", and you're not making "meth", you're making "desoxyn".
Legally, why wouldn't they be able to do the piracy parts in one of those jurisdictions and then ship the outputs back to the mothership?
- they indeed seem to have trained on movies/subtitles
- you absolutely positively must use Voice Activity Detection (VAD) in front of whisper
It would produce seemingly ok output until you started paying attention.
One example: it insisted that Biggie Smalls sings "Puttin five carrots in my baby girl ear" (it's "carats").
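As one example of what "VAD in front of Whisper" looks like in practice, here is a minimal sketch using the faster-whisper package's built-in Silero VAD filter (model size, file name and parameters are just illustrative defaults):

```
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="auto")

# vad_filter=True runs a Silero VAD pass first, so silent stretches
# never reach the decoder and can't be "transcribed" into credits.
segments, info = model.transcribe(
    "input.wav",                                      # placeholder path
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},  # illustrative setting
)

for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```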
It's apparently not useful in transcription as it don't reason [sic].
That's an example I gave after having used Whisper, the topic of discussion.
I suspect, as others mentioned, these were extracted from torrented movies.
Well now I know how I’m going to start filling awkward silences in meetings.
Big data. Machine learning. Blockchain. Artificial intelligence. Digital manufacturing. Big data analysis. Quantum communication and…Internet of things.
This time the hype cycle won’t be a massive exaggerated disappointment, for real this time.
But honestly, this is the AI equivalent of “please send for translating” in Welsh on a Welsh street sign.
``` text = "helo helo hello ." target_phrase = "ترجمة نانسي قنقر" replacement = ""
updated_text = text. Replace(target_phrase, replacement)
print(updated_text) ```