One is when copyrighted material is "pirated" for use in training, i.e. you torrent "the pile" instead of paying to acquire the books.
The other is when a user uses an LLM to generate a work that violates copyright.
Training itself isn't a violation; that's common sense. I am aware of lots of copyrighted things, and I could generate a work that violates copyright. My knowing this in and of itself isn't a violation.
The fact that an LLM agrees to help someone violate copyright is a failure mode, on par with telling them how to make meth or whatever other things their creators don't want them doing. There's a good argument for hardening them against requests to generate copyrighted content, and this already happens.
The situation is similar with image generation. An artist can draw a picture of Mickey Mouse without any issue. But if you pay an artist to draw you the same picture, that would also be a violation.
With generative tools, the users are not themselves writers or artists using tools - they are effectively commissioners, commissioning custom artwork from an LLM and paying the operator.
If someone built a machine that you put a quarter in, cranked a handle, and it then printed out pictures of the Disney character you chose, then Disney would be right in demanding that they stop (or, more likely, in forcing a license deal). Whatever technology drives the machine, whether an AI model or an image database or a mechanical turk, is largely immaterial.
I don't believe that's correct. The issue is not money changing hands but rather the reproduction itself. Even if I give it away for free I'm still violating IP law.
There's also a fundamental issue with your argument - LLMs aren't recognized as having legal agency. If I pay an artist to violate IP law then the artist, being a human, is presumably at fault in addition to myself. Same for a company (owned by people).
But tools are different. If I vandalize someone's car with a hammer, the hardware store isn't at fault for selling it to me. I'm at fault for how I chose to use the tool that I purchased (or rented access to, in the case of a hosted LLM).
> If someone built a machine that you put a quarter in
This is a flawed example because the machine was designed with the specific intention of reproducing a copyrighted work. That is different from a general purpose tool which can potentially be misused by the wielder.
It's not a perfect analogy by any means but it does serve to illustrate the difference in intent between distributing a particular work versus creating something that happens to incorporate copyrighted material verbatim but doesn't have any inherent need to or purpose in doing so.
It depends. Is it just under copyright, or is the featured location trademarked too? Is the photograph for commercial purposes? Is the featured location generally accepted as being part of a cityscape / landscape?
* Eiffel tower: https://wiki.gettyimages.com/897/
* Millennium-wheel: https://wiki.gettyimages.com/british-airways-london-eye-mill...
* Pro sports venues: https://wiki.gettyimages.com/pro-sport-stadiums-and-venues/
* Hollywood sign: https://wiki.gettyimages.com/hollywood-sign/ https://www.youtube.com/watch?v=KUdQ7gxU6Rg
I don't think anyone denies that frontier models were trained on copyrighted material - it's well documented and public knowledge. (Whether that falls under fair use, and how the material was acquired, are separate legal questions.)
I also don't think anyone denies that a model that strongly fits the training data approximates the copy-paste function. (Or at the very least, if A then B, consistently)
In practice, training resembles lossy compression of the data. Technically one could frame an LLM as a database of compressed training inputs.
This paper argues and demonstrates that "extraction is evidence of memorization", which affirms the above.
In terms of LLM output (the valuable product customers are paying for) this is familiar, albeit grey, legal territory.
https://en.wikipedia.org/wiki/Substantial_similarity
When a customer pays for an AI service, they're paying for access to a database of compressed training data - the additional layers of indirection sometimes produce novel output, and many times do not.
Unless you advocate for discarding the whole regime of intellectual property or you can argue for a better model of IP laws, the question stands: why shouldn't LLM services trained on copyrighted material be held responsible when their product violates "substantial similarity" of said copyrighted works? Why should failure to do so be immune from legal action?
If by “you” you mean Google or OpenAI or Microsoft, etc., you use your much much deeper pockets to pay lawyers to act in your interests.
All authors, publishers, etc. are outgunned. Firepower is what resolves civil cases in one party’s favor and a day in court is easily a decade or more away.
Page 9: There is no deterministic path from model memorization to outputs of infringing works. While we’ve used probabilistic extraction as proof of memorization, to actually extract a given piece of 50 tokens of copied text often takes hundreds or thousands of prompts. Using the adversarial extraction method of Hayes et al. [54], we’ve proven that it can be done, and therefore that there is memorization in the model [16, 27]. But this is where, even though extraction is evidence of memorization, it may become important that they are not identical processes (Section 2). Memorization is a property of the model itself; extraction comes into play when someone uses the model [27]. This paper makes claims about the former, not the latter. Nevertheless, it’s worth mentioning that it’s unlikely anyone in the real world would actually use the model in practice with this extraction method to deliberately produce infringing outputs, because doing so would require huge numbers of generations to get non-trivial amounts of text in practice
I think this is the key insight. It differs from something like say, JPEG (de)compression, in that it also produces novel but sensible combinations of both copyrighted and non-copyrighted data, independent of their original context. In fact, I'd argue that is its main purpose. To describe it as just a lossy compressed natural-language-queryable database as a result would be reductive to its function and a mischaracterization. It can recall extended segments of its training data as demonstrated by the paper, yes, but it also cannot plagiarize the entirety of a given source data, as also described by the paper.
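To put the "hundreds or thousands of prompts" figure from the quoted passage in perspective, here is a rough back-of-the-envelope sketch. It assumes each prompt is an independent attempt that extracts the passage with a fixed probability `p`; that independence assumption, the function names, and any value of `p` you plug in are mine for illustration and are not the paper's method.

```python
import math


def expected_prompts(p: float) -> float:
    """Mean number of independent prompts until the first successful extraction.

    Assumes a fixed per-prompt success probability p (geometric distribution).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p


def prompts_for_confidence(p: float, confidence: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - p)**n >= confidence."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if p == 1.0:
        return 1
    # Solve (1 - p)**n <= 1 - confidence for n and round up.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


# Hypothetical per-prompt probability, chosen only to illustrate the scale.
p_example = 0.001
mean_attempts = expected_prompts(p_example)
attempts_95 = prompts_for_confidence(p_example, 0.95)
```

Under these assumptions, a per-prompt probability on the order of one in a thousand means on the order of a thousand prompts on average for a single 50-token passage, and roughly three times that to be 95% sure of getting it at least once.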
> why shouldn't LLM services trained on copyrighted material be held responsible when their product violates "substantial similarity" of said copyrighted works?
Because they on their own are not producing the output that is substantially similar. They (possibly) do it on user input. You could make a case that they should perform filtering and detection, but I'm not sure that's a good idea, since the user might totally have the rights to create a substantially similar work to something copyrighted, such as when they themselves own the rights to that thing. At which point, you can only hold the user themselves responsible. I guess detection on its own might be reasonable to require, in order to provide the user with the capability to not incriminate themselves, should that indeed not be their goal.
This isn't to say they shouldn't be held responsible for pirating these copyrighted bits of content in the first place though. And if they perform automated generation of substantially similar content, that would still be problematic following this logic. Not thinking of chain-of-thought here mind you, but something more silly, like writing a harness to scrape sentiment and reactively generate things based on that. Or to use, idk, weather or current time and their own prompts as the trigger.
Let me give you a possibly terrible example. Should Blizzard be held accountable in Germany when users there, on servers located there, stand in the shape of a Nazi swastika in-game, and then publish screenshots and screen recordings of this on the internet? I don't think so. User action played a crucial role in the reproduction of the hate symbol in question there. Likewise, LLMs aren't just spouting off whatever, they're prompted. The researchers in the paper had to put in focused efforts to perform extraction. Despite popular characterization, these are not copycat machines, and they're not just pulling all their answers out of a magic basket because we all ask obvious things answered before on the internet. Maybe if the aforementioned detections were added, people would finally stop coping about them this way.
jrm4•6h ago
It's simple. If you put the works into the LLM, it can later make immediately identifiable, if imperfect, copies of the work. If you didn't put the work in, it wouldn't be able to do that.
The fact that you can't "see the copy" inside is wildly irrelevant.
orionsbelt•5h ago
If you ask OpenAI to generate an image of your dog as Superman, it will often start to do so, and then it will realize it is copyrighted, and stop. This seems sensible to me.
Isn’t it the ultimate creative result that is copyright infringement, and not merely that a model was trained to understand something very well?
jrm4•5h ago
Remember, we can only target humans. So we're not likely to target your guy; but we ARE likely to target "the guy that definitely fed a complete unauthorized copy of the thing into the LLM."
regularfry•4h ago
If I download harry_potter_goblet_fire.txt off some dodgy site, then let's assume that the owner of that site has infringed copyright by distributing it. If I upload it again to some other dodgy site, I would also infringe copyright in a similar way. But that would be naughty so I'm not going to do that.
Let's say instead that I feed it into a bunch of janky pytorch scripts with a bunch of other text files, and out pops a bunch of weights. Twice.
The first model I build is a classifier. Its output is binary: is this text about wizards, yes/no.
The second model I build is an LLM. Its output is text, and (as in the article) you can get imperfect reproductions of parts of the training file out of it with the right prompts.
Now, I upload both those sets of weights to HuggingFace.
How many times am I supposed to have infringed copyright?
Is it:
A) Twice (at least), because the act of doing anything whatsoever with harry_potter_goblet_fire.txt without permission is verboten;
B) Once, because only one of the models is capable of reproducing the original (even if only approximately);
C) Zero, because neither model is capable of a reproduction that would compete with the original;
or
D) Zero, because I'm not the distributor of the file, and merely processing it - "format shifted" from the book, if you like - is not problematic in itself.
Logically I can see justifications for any of B) (tenuously), C), or D). Obviously publishers would want us to think that A) is right, but based on what? I see a lot of moral outrage, but very little actual argument. That makes me think there's nothing there.
bodhi•4h ago
You have, I assume, a licensed copy of Harry Potter. That license restricts you from doing certain activities, like making (distributing? Let's go with distributing) derived works. Your models are derived works. Thus when you distribute your models, you’re violating the licence terms you “agreed” to when you acquired your copy of Harry Potter.
This is no judgement by me about whether that’s reasonable or not, just my understanding of the mechanics.
perching_aix•4h ago
Or at least that's how I read it. I'm sure GP will clarify shortly.
fc417fc802•1h ago
SubiculumCode•1h ago
michaelt•4h ago
If I download an MP3 to my computer, the act of copying it to my hard disk is making a copy. The act of playing it copies it from my hard disk to RAM, then from RAM to the audio output; these are also copies.
That means you need either: a license (e.g. terms and conditions when you buy a legal MP3), or an "implied license" (this is what makes it legal to play CDs), or a right explicitly conferred by law (e.g. 17 U.S. Code § 117(a) lets you copy computer programs into RAM to run them), or a right of "fair use" (which allows e.g. Google to copy web pages into their index).
So I'm afraid you've infringed copyright countless times - if you run several thousand epochs of training, you've infringed copyright several thousand times.
You may note that this is completely at odds with reality, and says almost nothing can happen without a copyright lawyer's permission. A cynic would say this suits the interests of copyright lawyers, if not reality, and nobody else gets to create legal theories about copyright infringement.
MengerSponge•5h ago
nick__m•5h ago
michaelt•4h ago
Rockets? Pretty cool. Wernher von Braun? Not cool.
fc417fc802•1h ago
bandrami•1h ago
But me drawing Superman would absolutely be violating DC's copyright. They probably wouldn't care since my drawing would suck, but that's not the legal issue.
perching_aix•5h ago
Good thing they were public (?) works, wouldn't wanna get sued [0] for possibly being a two-legged copyright infringement. Or should I say having been, since naturally I immediately erased all of these works from my mind just days after these tests, even without any legal impetus.
Edit: thinking about it a bit more, you also remind me of our midterm tests from the same class. We had to produce multi-page essays on the spot, analyzing a select work... from memory. Bonus points for being able to quote from it, of course. Needless to say, not many original thoughts were featured in those essays, not in mine, not in others' - the explicit expectation was that you'd peruse the current and older textbooks to "learn (memorize) the analysis" from, and then you'd "write about it in your own words", but still using technical terms. They were pretty much just tests in jargon use, composition, and memorization, which is definitely a choice of all time for a class on literature. But I think it offers an interesting perspective. At no point did we ever have to actually demonstrate a capability in literary analysis of our own, nor was that capability ever graded, for example. But if you only read our essays, you'd think we were great at it. It was mimicry. Doesn't mean we didn't end up developing such a capability though.
[0] https://youtu.be/-JlxuQ7tPgQ
thaumasiotes•3h ago
WinForms is still around. There have been further technologies, but as far as I can tell the current state of things is basically just a big tire fire and about the best you can do is to ignore all of them and develop in WinForms.
Is there a successor now?
perching_aix•3h ago
singleshot_•3h ago
"Subject to sections 107 through 122, the owner of copyright under this title has the exclusive rights to do and to authorize any of the following:
(1) to reproduce the copyrighted work in copies"
perching_aix•3h ago
Edit: I guess the proper term is public domain works, not just public works. Maybe that's our issue here.
rockemsockem•5h ago
LLMs are OBVIOUSLY not a replacement for the books and works that they're trained on, just like Google Books isn't.
tossandthrow•4h ago
crmd•58m ago
1. For example earlier this month: https://www.reddit.com/r/fantasyromance/comments/1ktrwxj/fan...
anothernewdude•5h ago
tossandthrow•4h ago
cortesoft•4h ago
fluidcruft•4h ago
The only real question is whether it's possible to prevent the system from generating the copyrighted content.
A strange analogy would be some sort of magical BluRay that plays novel movies unless you enter the decryption key. And somehow you would have to prevent entering those keys.
echelon•4h ago
Not so fast! That hasn't been tested in court or given any sort of recommendation by any of the relevant IP bodies.
And to play devil's advocate here: your brain also contains an enormous amount of copyrighted content. I'm glad the lawyers aren't lobotomizing us and demanding licensing fees on our memories.
I'm pretty sure if you asked me to sit under an MRI and recall scenes from movies like "Jurassic Park", my visual cortex would reconstruct scenes with some amount of fidelity to the original. I shouldn't owe that to Universal. (In a perfect world, they would owe me for imprinting and storing their memetic information in my mind. Ad companies and brands should for sure.)
If I say, "One small step for man", I'm pretty confident that the lot of you would remember Armstrong's exact voice, the exact noise profile of the recording, with the precise equipment beeps. With almost exacting recall.
I'm also pretty sure your brains can remember a ginormous amount of music. I know I can. And when I'm given a prediction task (eg. listening to a song I already know), I absolutely know what's coming before it hits.
fluidcruft•2h ago
Movies are maybe less intuitive because most people won't reproduce a movie. There supposedly are people with "photographic memories", for example. And infringement is possibly not so much the capacity as what is actually done. Someone with a photographic memory could duplicate a movie, but unless they actually do, their capacity is not infringement. But you also have to consider that if it was viewed "lawfully" in the first place then the copyright holder has decided to let people view it. So the act of a brain seeing Jurassic Park is what the copyright holder authorized. Generally I don't think copyright holders have agreed to have LLMs ingest their works.
Plays and opera are maybe more similar, because people do copy productions etc. But those don't feature encryption so the blob of unknown data doesn't feature in the analogy.
fc417fc802•1h ago
But then you're no longer talking about "contains copyrighted material" you're talking about "actively reproduces copyrighted material in an immediately recognizable form".
> Generally I don't think copyright holders have agreed to have LLM ingest their works.
Aren't you essentially speculating about whether or not the training data was obtained in compliance with IP law?
echelon•48m ago
You can train on Disney, but you can't produce Disney outputs.
Likewise, if you create some new IP using AI, the US will confer copyright so long as there was sufficient user input (eg. inpainting, editing, turning it into a long-form movie, etc.) Those new, sufficiently novel works should also be copyrightable.
Copyright should compel people to make good stuff and reward them when they do so. But it shouldn't hinder technological progress. There's a balance that can be struck.
jonplackett•4h ago
Last I remember, whether it is ‘transformative’ is what’s important.
https://en.m.wikipedia.org/wiki/Transformative_use
E.g. Google got away with ‘transforming’ books with Google Books.
https://www.sgrlaw.com/google-books-fair-transformative-use/
karaterobot•4h ago
> Our results complicate current disputes over copyright infringement, both by rejecting easy claims made by both sides about how models work and by demonstrating that there is no single answer to the question of how much a model memorizes
I wonder if this is the sort of article that people will claim supports their side, or even that it ends the debate with a knockout blow to the other side, when the actual article itself isn't making any such claim.
I'm sure you read the entire article before commenting, but I would strongly recommend everyone else does as well.
bdbenton5255•4h ago
flir•4h ago
Size of training data: fuck knows, but The Pile alone is 880GB. Public GitHub's gonna be measured in TB. A Common Crawl is about 250TB.
There's physically not enough space in there to store everything it was trained on. The vast majority of the text the chatbot was exposed to cannot be pulled out of it, as this paper makes clear.
I'm guessing that the cases where great lumps of copyrighted text can be extracted verbatim are down to repetition in the training data? There's probably a simple fix for that.
(I'm only talking about training here. The initial acquisition of the data clearly involved massive copyright infringement).
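One guess at what a "simple fix" for repetition could look like is dropping exact duplicates before training. The sketch below is only an illustration of that idea: the `normalize` and `dedupe` helpers are hypothetical names, the sample corpus is made up, and this is not a description of how any real training pipeline works (catching near-duplicates and long quoted excerpts would need something fuzzier than an exact hash).

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedupe(documents: Iterable[str]) -> Iterator[str]:
    """Yield each document the first time its normalized content is seen."""
    seen: set[str] = set()
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


# Made-up sample corpus: the second entry differs only in case and spacing.
corpus = [
    "Call me Ishmael.",
    "call  me   ishmael.",
    "An unrelated sentence.",
]
unique = list(dedupe(corpus))
```

Here the second entry is skipped because it normalizes to the same string as the first, so a text that appears many times in the crawl would be seen once instead.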
jrm4•14m ago
morkalork•3h ago
dylan604•3h ago
I think the exact examples used in the past were Indiana Jones, but the point is the same.
hnthrowaway_989•2h ago
I don't have an alternative argument against AI art either, but I don't think you are going to like this outcome.