Extracting memorized pieces of books from open-weight language models

109•fzliu•7mo ago

Comments

jrm4•7mo ago

And hopefully this puts to rest all the painfully bad, often anthropomorphizing, takes about how what the LLMs do isn't copyright infringement.

It's simple. If you put the works into the LLM, it can later make immediately identifiable, if imperfect, copies of the work. If you didn't put the work in, it wouldn't be able to do that.

The fact that you can't "see the copy" inside is wildly irrelevant.

orionsbelt•7mo ago

So can humans? I can ask a human to draw Mickey Mouse or Superman, and they can! Or recite a poem. Some humans have much better memories and can do this with a far greater degree of fidelity too, just like an LLM vs an average human.

If you ask OpenAI to generate an image of your dog as Superman, it will often start to do so, and then it will realize it is copyrighted, and stop. This seems sensible to me.

Isn’t it the ultimate creative result that is copyright infringement, and not merely that a model was trained to understand something very well?

jrm4•7mo ago

Remember, we can only target humans. So we're not likely to target your guy; but we ARE likely to target "the guy that definitely fed a complete unauthorized copy of the thing into the LLM."

regularfry•7mo ago

I just don't get the legal theory here.

If I download harry_potter_goblet_fire.txt off some dodgy site, then let's assume that the owner of that site has infringed copyright by distributing it. If I upload it again to some other dodgy site, I would also infringe copyright in a similar way. But that would be naughty so I'm not going to do that.

Let's say instead that I feed it into a bunch of janky pytorch scripts with a bunch of other text files, and out pops a bunch of weights. Twice.

The first model I build is a classifier. Its output is binary: is this text about wizards, yes/no.

The second model I build is an LLM. Its output is text, and (as in the article) you can get imperfect reproductions of parts of the training file out of it with the right prompts.

Now, I upload both those sets of weights to HuggingFace.

How many times am I supposed to have infringed copyright?

Is it:

A) Twice (at least), because the act of doing anything whatsoever with harry_potter_goblet_fire.txt without permission is verboten;

B) Once, because only one of the models is capable of reproducing the original (even if only approximately);

C) Zero, because neither model is capable of a reproduction that would compete with the original;

D) Zero, because I'm not the distributor of the file, and merely processing it - "format shifted" from the book, if you like - is not problematic in itself.

Logically I can see justifications for any of B) (tenuously), C), or D). Obviously publishers would want us to think that A) is right, but based on what? I see a lot of moral outrage, but very little actual argument. That makes me think there's nothing there.

bodhi•7mo ago

I’ll preface with IANACL, but you seem to be making a moral argument yourself about A), that it is not reasonable.

You have, I assume, a licensed copy of Harry Potter. That license restricts you from doing certain activities, like making (distributing? Let's go with distributing) derived works. Your models are derived works. Thus when you distribute your models, you’re violating the licence terms you “agreed” to when you acquired your copy of Harry Potter.

This is no judgement by me about whether that’s reasonable or not, just my understanding of the mechanics.

perching_aix•7mo ago

Rather than a moral argument, I think they're more just disagreeing that the spirit behind copyright law is being violated in that case. So while yes, there may be a rule in there that you may have agreed to, they disagree that such a rule is within the spirit of the law, and may reckon that it should not be a part of it even if it presently is. Like they explicitly mention how they're reflecting on the idea behind it all, rather than producing a legal analysis.

Or at least that's how I read it. I'm sure GP will clarify shortly.

fc417fc802•7mo ago

More precisely, the reasoning hinges on the assertion that "Your models are derived works." which I doubt is so cut and dried. There are many processes that take external information as input. That alone clearly can't be sufficient to establish that something is unambiguously a derived work.

SubiculumCode•7mo ago

I read a copyrighted book, it is lossy encoded into the weights of my brain. Am I a derivative work now? No. If that book inspires me to write another book in its genre, it will also not be a derivative work unless it adheres too closely to the original book.

CaptainFever•7mo ago

Nitpick: Can a license restrict you? I thought it only gave you additional rights, should you choose to accept it, but can't take away rights you have (e.g. the ability to make parodies from it). The restriction comes from IP laws themselves.

regularfry•7mo ago

In this scenario I personally have no physical, audio, or otherwise legitimately purchased copies of any Potters, Harry. I have entered into no direct licence agreement. If it makes things simpler, ignore the "format shifting" aside.

michaelt•7mo ago

As I understand things, the legal theory is: Everything is copyright infringement.

If I download an MP3 to my computer, the act of copying it to my hard disk is making a copy. The act of playing it copies it from my hard disk to RAM, then from RAM to the audio output; these are also copies.

That means you either need: A license (e.g. terms and conditions when you buy a legal MP3) or an "implied license" (this is what makes it legal to play CDs) or a right explicitly conferred by law (e.g. 17 U.S. Code § 117(a) lets you copy computer programs into RAM to run them) or a right of "fair use" (which allows e.g. Google to copy web pages into their index)

So I'm afraid you've infringed copyright countless times - if you run several thousand epochs of training, you've infringed copyright several thousand times.

You may note that this is completely at odds with reality, and says almost nothing can happen without a copyright lawyer's permission. A cynic would say this suits the interests of copyright lawyers, if not reality, and nobody else gets to create legal theories about copyright infringement.

MengerSponge•7mo ago

[flagged]

nick__m•7mo ago

Powerful tools that would not exist otherwise !

michaelt•7mo ago

It might seem hypocritical, but you can take the tool and still support throwing the creator in jail.

Rockets? Pretty cool. Wernher von Braun? Not cool.

fc417fc802•7mo ago

In this case IP law is clearly extremely broken. Rather than "carrying OpenAI's water" how about "advocating for fixing broken things"? The alternative is carrying Mickey Mouse's water and I've certainly no motivation to do that.

bandrami•7mo ago

Public performance of a written work has different rules than reproducing text or images.

But me drawing Superman would absolutely be violating DC's copyright. They probably wouldn't care since my drawing would suck, but that's not the legal issue.

perching_aix•7mo ago

You remind me of all the shitty times in literature class where I had to rote memorize countless works from a given author (poet), think 40, then take a test identifying which poem each of the given quotes was from. The little WinForms app I wrote to practice for these tests was one of the first programs I've ever written. I guess in that sense it's also a fond memory. I miss WinForms.

Good thing they were public (?) works, wouldn't wanna get sued [0] for possibly being a two legged copyright infringement. Or should I say having been, since naturally I immediately erased all of these works from my mind just days after these tests, even without any legal impetus.

Edit: thinking about it a bit more, you also remind me of our midterm tests from the same class. We had to produce multiple page long essays on the spot, analyzing a select work... from memory. Bonus points for being able to quote from it, of course. Needless to say, not many original thoughts were featured in those essays, not in mine, not in others' - the explicit expectation was that you'd peruse the current and older textbooks to "learn (memorize) the analysis" from, and then you'd "write about it in your own words", but still using technical terms. They were pretty much just tests in jargon use, composition, and memorization, which is definitely a choice of all time for a class on literature. But I think it draws an interesting perspective. At no point did we ever have to actually demonstrate a capability in literary analysis of our own, or was that capability ever graded, for example. But if you only read our essays, you'd think we were great at it. It was mimicry. Doesn't mean we didn't end up developing such a capability though.

[0] https://youtu.be/-JlxuQ7tPgQ

thaumasiotes•7mo ago

> I miss WinForms.

WinForms is still around. There have been further technologies, but as far as I can tell the current state of things is basically just a big tire fire and about the best you can do is to ignore all of them and develop in WinForms.

Is there a successor now?

perching_aix•7mo ago

I miss WinForms in the sense that I don't use it anymore (and have no reason to), not in the sense that it's been deprecated. It did fall out of fashion somewhat though, as far as I'm aware it's been replaced by WPF in most places.

singleshot_•7mo ago

I don't think it matters much if the infringement is public, right? Given that

"Subject to sections 107 through 122, the owner of copyright under this title has the exclusive rights to do and to authorize any of the following:

(1) to reproduce the copyrighted work in copies"

perching_aix•7mo ago

Public works are not protected by copyright, which is why they are public. I think you're misreading what I said.

Edit: I guess the proper term is public domain works, not just public works. Maybe that's our issue here.

rockemsockem•7mo ago

I think a big part of copyright law is whether the thing created from copyrighted material is a competitor with the original work, in addition to whether it's transformative.

LLMs are OBVIOUSLY not a replacement for the books and works that they're trained on, just like Google books isn't.

tossandthrow•7mo ago

Why not? Imagine a storyteller app that is instructed to narrate a story that follows Harry Potter 1 - I would expect that there are already a ton of these apps out there.

rockemsockem•7mo ago

That's not the same as the LLM itself though. That's an LLM plus specific instructions which would likely need to include a fair number of details from the books

crmd•7mo ago

Authors are getting busted[1] on a regular basis publishing LLM-generated or augmented novels that plausibly compete commercially with human-written books. If the LLM was trained on any books in the same genre, it seems like a clear violation.

1. For example earlier this month: https://www.reddit.com/r/fantasyromance/comments/1ktrwxj/fan...

rockemsockem•7mo ago

But that's not the LLM itself, it's an output of the LLM. The distinction matters I believe

CaptainFever•7mo ago

I think what the GP meant was that it doesn't compete with that specific work it was trained on.

That is, if you want to read Harry Potter, you'd rather buy it (or get it from Anne) than try to wrangle it out of an LLM. Therefore, it doesn't compete with the original work. IANAL, though.

o11c•7mo ago

Keep in mind the 4 factors of Fair Use (US-specific):

  1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
  2. the nature of the copyrighted work;
  3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
  4. the effect of the use upon the potential market for or value of the copyrighted work.

For 1, maybe OpenAI could've been safe if they'd actually stayed "open", but nowadays every AI company clearly fails, as do many (but not all) of the LLM users. Contrast this with most traditional fanfiction and personal projects where there were scary letters and occasional bullying, but few actual law-based problems.

"Transformative" is also part of 1 and is often cited as letting LLMs get away with everything, but everybody argues that and doesn't always win. Also, it's quite linked with 4.

2 mostly isn't a problem but gets into nasty details.

For 3 (emphasis on "used"), this link once again proves that the point does fail.

For 4 we are indisputably seeing mass disruption in several fields, so that point fails.

To be clear - there are ways to use LLMs that balance much closer to the side of fair use, but that isn't how LLMs are advertised.

rockemsockem•7mo ago

I feel like once you start talking about a specific copyrighted work this falls apart.

We have not seen mass disruption in books or news for example, which seem to be two big areas that are most aggressively pursuing copyright claims.

I think the best case on this front is probably from Getty where image generation models ARE directly competing with them.

o11c•7mo ago

IANAL, but I don't think whether LLMs are successful as a replacement is very relevant.

LLMs are advertised for, and attempt to, replace the works that they're trained on. For HN users, this is most often for code-generation, but people in other fields use it for similar replacement in their own field.

rockemsockem•7mo ago

No one advertises LLMs as a replacement for literature or news and those are some of the highest profile legal cases I'm aware of. Code generation is another relatively high profile case, but in those cases I've not really seen any copyrightable code reproduced by an LLM without the prompter specifically trying to make it happen.

vintermann•7mo ago

It should be in a sane world, but the courts are not a sane world, or even a consistent world.

anothernewdude•7mo ago

One day they'll use these cases to sue people who have read books, because they can make immediately identifiable if imperfect copies of the works.

tossandthrow•7mo ago

What? Can we copy your brain in the billions and load it into a phone like a commodity?

CaptainFever•7mo ago

What does that matter? (Also: you never know, that might be possible some day.) You can still infringe copyright without a computer or printing press; just write out a book that you remember and distribute it.

cortesoft•7mo ago

Just because an LLM has the ability to infringe copyright doesn’t mean everything it does infringes copyright.

fluidcruft•7mo ago

If it contains the copyrighted material, copyright laws apply. Being able to produce the content demonstrates pretty conclusively that it contains the copyrighted material.

The only real question is whether it's possible to prevent the system from generating the copyrighted content.

A strange analogy would be some sort of magical BluRay that plays novel movies unless you enter the decryption key. And somehow you would have to prevent entering those keys.

echelon•7mo ago

> If it contains the copyrighted material, copyright laws apply.

Not so fast! That hasn't been tested in court or given any sort of recommendation by any of the relevant IP bodies.

And to play devil's advocate here: your brain also contains an enormous amount of copyrighted content. I'm glad the lawyers aren't lobotomizing us and demanding licensing fees on our memories.

I'm pretty sure if you asked me to sit under an MRI and recall scenes from movies like "Jurassic Park", my visual cortex would reconstruct scenes with some amount of fidelity to the original. I shouldn't owe that to Universal. (In a perfect world, they would owe me for imprinting and storing their memetic information in my mind. Ad companies and brands should for sure.)

If I say, "One small step for man", I'm pretty confident that the lot of you would remember Armstrong's exact voice, the exact noise profile of the recording, with the precise equipment beeps. With almost exacting recall.

I'm also pretty sure your brains can remember a ginormous amount of music. I know I can. And when I'm given a prediction task (eg. listening to a song I already know), I absolutely know what's coming before it hits.

fluidcruft•7mo ago

A better analogy is memorized poems or music and I think there's actually considerable case law on whether performing songs infringes on the author's copyright.

Movies are maybe less intuitive because most people won't reproduce a movie. There supposedly are people with "photographic memories" for example. And infringement is possibly not so much the capacity as what is actually done. Someone with a photographic memory could duplicate a movie but unless they actually do their capacity is not infringement. But you also have to consider that if it was viewed "lawfully" in the first place then the copyright holder has decided to let people view them. So the act of a brain seeing Jurassic Park is what the copyright holder authorized. Generally I don't think copyright holders have agreed to have LLM ingest their works.

Plays and opera are maybe more similar, because people do copy productions etc. But those don't feature encryption so the blob of unknown data doesn't feature in the analogy.

fc417fc802•7mo ago

> considerable case law on whether performing songs infringes on the author's copyright

But then you're no longer talking about "contains copyrighted material" you're talking about "actively reproduces copyrighted material in an immediately recognizable form".

> Generally I don't think copyright holders have agreed to have LLM ingest their works.

Aren't you essentially speculating about whether or not the training data was obtained in compliance with IP law?

echelon•7mo ago

The law should allow for training, but not reproduction. That would be compatible with the world that seems to be emerging.

You can train on Disney, but you can't produce Disney outputs.

Likewise, if you create some new IP using AI, the US will confer copyright so long as there was sufficient user input (eg. inpainting, editing, turning it into a long-form movie, etc.) Those new, sufficiently novel works should also be copyrightable.

Copyright should compel people to make good stuff and reward them when they do so. But it shouldn't hinder technological progress. There's a balance that can be struck.

freedomben•7mo ago

> I'm glad the lawyers aren't lobotomizing us and demanding licensing fees on our memories.

Don't give Disney any ideas!

On a more serious and maybe paranoid note, I'm pretty confident that once we have the technology to apply DRM to people's brains, they will. After all, every time we remember something copyrighted we are stealing their valuable IP.

o11c•7mo ago

Napster was found to have secondary liability for what their users did, and they didn't even feed the copyrighted inputs directly.

AI companies are adding the copyrighted material themselves, so they should have even more liability for what their users do (even ignoring the advertising they do).

jonplackett•7mo ago

I don’t think it’s that simple.

Last I remember, whether it is ‘transformative’ is what’s important.

https://en.m.wikipedia.org/wiki/Transformative_use

Eg. Google got away with ‘transforming’ books with Google books.

https://www.sgrlaw.com/google-books-fair-transformative-use/

karaterobot•7mo ago

Where in the article do the authors say this puts anything to rest? Here is their conclusion:

> Our results complicate current disputes over copyright infringement, both by rejecting easy claims made by both sides about how models work and by demonstrating that there is no single answer to the question of how much a model memorizes

I wonder if this is the sort of article that people will claim supports their side, even that it ends the debate without a knockout blow to the other side, when the actual article itself isn't making any such claim.

I'm sure you read the entire article before commenting, but I would strongly recommend everyone else does as well.

bdbenton5255•7mo ago

Suchir Balaji did not die in vain.

flir•7mo ago

Size of LLM: <64GB.

Size of training data: fuck knows, but The Pile alone is 880GB. Public github's gonna be measured in TB. A Common Crawl is about 250TB.

There's physically not enough space in there to store everything it was trained on. The vast majority of the text the chatbot was exposed to cannot be pulled out of it, as this paper makes clear.
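(A back-of-envelope sketch of the arithmetic above. The sizes are just the rough figures quoted in this comment, not measurements, and the helper name is made up for illustration. It also ignores ordinary text compression, so treat the ratios as order-of-magnitude only.)

```python
# Rough comparison of model size against training-corpus size.
# All sizes are the approximate figures quoted in the comment, not measured values.
GB = 10**9
TB = 10**12

model_bytes = 64 * GB  # upper bound quoted for the model

corpora = {
    "The Pile": 880 * GB,
    "Common Crawl": 250 * TB,
}


def model_to_corpus_ratio(model_size: int, corpus_size: int) -> float:
    """Model size divided by corpus size (hypothetical helper for illustration)."""
    return model_size / corpus_size


ratios = {name: model_to_corpus_ratio(model_bytes, size) for name, size in corpora.items()}

for name, ratio in ratios.items():
    print(f"{name}: the model is at most {ratio:.4%} of the corpus size")
```

Even against The Pile alone the weights come to well under a tenth of the corpus size, and against a Common Crawl snapshot they are a small fraction of a percent.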

I'm guessing that the cases where great lumps of copyright text can be extracted verbatim are down to repetition in the training data? There's probably a simple fix for that.

(I'm only talking about training here. The initial acquisition of the data clearly involved massive copyright infringement).

jrm4•7mo ago

You'd have a very hard time legally distinguishing this from "compressing a copyrighted work" though.

vintermann•7mo ago

To get out the original data from a compressed file, you just need to know the algorithm used (and for almost all formats, the file tells you).

To get out the original data from an LLM, you need to supply... the original data. Or at least, a big chunk of it.

The actually copyrightable chunk of it, arguably, since what a LLM can generate on its own is only its most predictable, unoriginal, generic chunks. Things it's seen a thousand times.

Which may turn out to be an uncomfortably high % of most creative works.
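(To make the "you need to supply a chunk of the original" point concrete, here is a minimal sketch. It assumes an autoregressive model, where the probability of a continuation given a prompt is the product of per-token conditional probabilities. The bigram counter below is a hypothetical toy stand-in for an LLM and is not the paper's method or any real library's API.)

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count next-token frequencies for each token (toy stand-in for a language model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def suffix_log_prob(counts, prefix, suffix):
    """log P(suffix | prefix), summed over per-token conditional log-probabilities."""
    log_p = 0.0
    prev = prefix[-1]  # a bigram model only conditions on the last token
    for tok in suffix:
        total = sum(counts[prev].values())
        if total == 0 or counts[prev][tok] == 0:
            return float("-inf")  # the toy model never saw this continuation
        log_p += math.log(counts[prev][tok] / total)
        prev = tok
    return log_p


# A tiny made-up "training set" in which one phrase is repeated.
corpus = "the boy who lived . the boy who waited . the boy who lived".split()
counts = train_bigram(corpus)

p_seen = math.exp(suffix_log_prob(counts, ["the", "boy"], ["who", "lived"]))
lp_unseen = suffix_log_prob(counts, ["the", "boy"], ["who", "slept"])
print(p_seen, lp_unseen)
```

The repeated continuation comes back with high probability once the prompt supplies the start of the phrase, while a continuation that never appeared in the training text gets no probability at all in this toy model.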

flir•7mo ago

I wouldn't, because the vast majority of the copyrighted works it was trained on are not present in the model, and the model can't be persuaded to spit them out at any reasonable level of fidelity (as the paper points out).

The comparatively few that are should be fixed.

If you want to argue that the act of training is in itself infringing, even if it doesn't result in a copy... well, I'd enjoy seeing you make that argument.

jrm4•7mo ago

I'd be happy to take "reasonable level of fidelity" + yes, the act of training itself as infringing to a jury? I feel like it's going to look way more like "feeding into a copy machine" than "teaching a toddler" or whatever.

flir•7mo ago

"the act of training is in itself infringing, even if it doesn't result in a copy"

"I'd be happy to take "reasonable level of fidelity" + yes, the act of training itself as infringing to a jury"

They're not the same thing. At all. The comparatively few that are [extractable] should be fixed. I already said that.

The vast majority of texts CANNOT BE EXTRACTED. Are they still infringing?

jrm4•7mo ago

Oh, I'm aware that they're not the same. I suppose I'm thinking more like a "real-life lawyer."

At this stage, you can't just declare "infringing or not," that's the point of trials.

What I'm saying is -- you make good points -- but I like my chances in front of a jury with my explanation against yours.

flir•7mo ago

This might be of interest, if you haven't already seen it: https://www.bbc.co.uk/news/articles/c77vr00enzyo

Could be reversed by a higher court of course, but it seems like it establishes that pirating and training are two different "crimes". (Or three - see the bit about "infringing knock-offs").

pbhjpbhj•7mo ago

>The initial acquisition of the data clearly involved massive copyright infringement).

I don't find this to be true in USA. Because Google already covered this ground and the doctrine of transformative Fair Use was born.

dragonwriter•7mo ago

> Because Google already covered this ground and the doctrine of transformative Fair Use was born.

Fair Use, and the way whether a work is transformative is a factor in it, is much older than Google; I'm not sure what specific Google precedent you think is relevant here.

flir•7mo ago

It's the download. I don't think you can download The Pile without infringing.

17 U.S. Code § 106 covers reproduction, not just redistribution (IANAL).

As I said, I'm separating the acquisition of the data and the training on the data, because I believe the first is an infringing act, while the other is (in the general case) not.

morkalork•7mo ago

If that argument worked, anyone could use compression algorithms as a loophole for copyright infringement.

dylan604•7mo ago

The problem is if someone uses a prompt that is clearly Potter-esque, there have been examples of it returning Potter exactly. If it had never had Potter put into it, it would not be able to do that.

I think the exact examples used in the past were Indiana Jones, but the point is the same.

hnthrowaway_989•7mo ago

The copyright angle is a lose-lose situation. If copyrightists win, the outcome is an even more restricted definition of "fair use" that probably is going to kill a lot of art.

I don't have an alternative argument against AI art either, but I don't think you are going to like this outcome.

andy99•7mo ago

There are two legitimate points where copyright violation can occur with LLMs. (Not arguing the merits of copyright, just based on the concept as it is).

One is when copyrighted material is "pirated" for use in training, i.e. you torrent "the pile" instead of paying to acquire the books.

The other is when a user uses an LLM to generate a work that violates copyright.

Training itself isn't a violation, that's common sense. I am aware of lots of copyrighted things, and I could generate a work that violates copyright. My knowing this in and of itself isn't a violation.

The fact that an LLM agrees to help someone violate copyright is a failure mode, on par with telling them how to make meth or whatever other things their creators don't want them doing. There's a good argument for hardening them against requests to generate copyrighted content, and this already happens.

singleshot_•7mo ago

This is an interesting comment which avoids the issue: when a user uses an LLM to violate copyright, who is liable, and how would you justify your answer?

tux1968•7mo ago

Not OP, but I would say the answer is the same as it would be if you substitute the LLM with a live human person who has memorized a section of a book and can recall it perfectly when asked.

diputsmonro•7mo ago

It depends on where the money changes hands, IMO (which is basically what I think you're getting at). If you pay someone to perfectly recite a copyrighted work (as you pay ChatGPT to do), then it would definitely be a violation.

The situation is similar with image generation. An artist can draw a picture of Mickey Mouse without any issue. But if you pay an artist to draw you the same picture, that would also be a violation.

With generative tools, the users are not themselves writers or artists using tools - they are effectively commissioners, commissioning custom artwork from an LLM and paying the operator.

If someone built a machine that you put a quarter in, cranked a handle, and then printed out pictures of the Disney character you choose, then Disney is right in demanding them to stop (or more likely, force a license deal). Whatever technology drives the machine, whether an AI model or an image database or a mechanical turk, is largely immaterial.

fc417fc802•7mo ago

> An artist can draw a picture of Mickey Mouse without any issue. But if you pay an artist to draw you the same picture, that would also be a violation.

I don't believe that's correct. The issue is not money changing hands but rather the reproduction itself. Even if I give it away for free I'm still violating IP law.

There's also a fundamental issue with your argument - LLMs aren't recognized as having legal agency. If I pay an artist to violate IP law then the artist, being a human, is presumably at fault in addition to myself. Same for a company (owned by people).

But tools are different. If I vandalize someone's car with a hammer the hardware store isn't at fault for selling it to me. I'm at fault for how I chose to use the tool that I purchased (or rented access to in the case of a hosted LLM).

> If someone built a machine that you put a quarter in

This is a flawed example because the machine was designed with the specific intention of reproducing a copyrighted work. That is different from a general purpose tool which can potentially be misused by the wielder.

fragmede•7mo ago

but without money changing hands, it becomes a whole lot less interesting. at the end of it, the boy who lived isn't a story about broke washed up nobody with an uncouth uncle, standing in line for the soup kitchen; if money goes away, if it turns out this whole capitalism was for a tada, what then?

quesera•7mo ago

Why would the answer here be any different than when using a photocopier, brain, or other tool for the same purpose?

jplusequalt•7mo ago

The company who trained the LLM. They're the ones who used the copyrighted material in their training set. Claiming they were unaware is not an excuse.

fc417fc802•7mo ago

It's an interesting conundrum. If I take an extremely large panoramic photograph and then fail to censor out small copyrighted sections of it, am I violating copyright law?

It's not a perfect analogy by any means but it does serve to illustrate the difference in intent between distributing a particular work versus creating something that happens to incorporate copyrighted material verbatim but doesn't have any inherent need to or purpose in doing so.

dijksterhuis•7mo ago

> If I take an extremely large panoramic photograph and then fail to censor out small copyrighted sections of it, am I violating copyright law?

It depends. Is it just under copyright, or is the featured location trademarked too? Is the photograph for commercial purposes? Is the featured location generally accepted as being part of a cityscape / landscape?

* Eiffel tower: https://wiki.gettyimages.com/897/

* Millennium-wheel: https://wiki.gettyimages.com/british-airways-london-eye-mill...

* Pro sports venues: https://wiki.gettyimages.com/pro-sport-stadiums-and-venues/

* Hollywood sign: https://wiki.gettyimages.com/hollywood-sign/ https://www.youtube.com/watch?v=KUdQ7gxU6Rg

201984•7mo ago

If the data of a copyrighted work is effectively embedded in the model weights, does that not make the LLM itself an illegal copy? The weights are just a binary file after all.

dboreham•7mo ago

The Dude Doctrine applies here: "That's just, like uh, your opinion, man".

TGower•7mo ago

The only way to include a book in a training dataset for LLMs without violating copyright law is to contact the rights holder and buy a license to do so. Buying an ebook license off Amazon isn't enough for this, and creating a digital copy from a physical copy for your commercial use is also against the law. A good rule of thumb is if it would be illegal for a company to distribute the digital file to employees for training, it's definitely illegal to train an AI the company will own on it.

Ekaros•7mo ago

It is widely accepted in many jurisdictions that different types of uses have different types of copyright schemes. Especially true for video and music content. You can't just take a DVD/Blu-ray copy of a movie and show it in a movie theatre in many places. Or copy a CD and play it on radio.

I see no reason why training AI should be treated like human reading. Especially if it is repeated. And more so if copies are illegally acquired like torrents.

landl0rd•7mo ago

Important note: they likely “memorize” Harry Potter and 1984 almost completely because they don’t. No coincidence that some of the most popular books, often quoted, are “memorized”. It’s likely what they’re actually memorizing are fair use quotes from the books, at least mostly, making these some of the more represented in the training set.

dboreham•7mo ago

If I take a very large set of fair use quotes from a book, that I find on the internet, and stitch them together to make a "1984 of Theseus", and make that downloadable for a fee, am I not infringing copyright?

landl0rd•7mo ago

The user has to do “final assembly”, giving it an appropriate input to produce the final stitched-together result.

ai_legal_sus•7mo ago

I feel like role-playing as a lawyer, I'm curious how would you defend against this in court?

I don't think anyone denies that frontier models were trained on copyrighted material - it's well documented and public knowledge. (and a separate legal question regarding fair-use and acquisition)

I also don't think anyone denies that a model that strongly fits the training data approximates the copy-paste function. (Or at the very least, if A then B, consistently)

In practice, training resembles lossy compression of the data. Technically one could frame an LLM as a database of compressed training inputs.

This paper argues and demonstrates that "extraction is evidence of memorization" which affirms the above.

In terms of LLM output (the valuable product customers are paying for) this is familiar, albeit grey, legal territory.

https://en.wikipedia.org/wiki/Substantial_similarity

When a customer pays for an AI service, they're paying for access to a database of compressed training data - the additional layers of indirection sometimes produce novel output, and many times do not.

Unless you advocate for discarding the whole regime of intellectual property or you can argue for a better model of IP laws, the question stands: why shouldn't LLM services trained on copyrighted material be held responsible when their product violates "substantial similarity" of said copyrighted works? Why should failure to do so be immune from legal action?

SubiculumCode•7mo ago

Yes, if I read a book, memorize some passages, and use those memorized passages in a work without citation, it is plagiarism. I don't see how this is any different without relying on arbitrary but human-centric distinctions.

ipython•7mo ago

More to the point, if you steal the book and never even read it, you are still guilty of a crime.

brudgers•7mo ago

> I'm curious how would you defend against this in court?

If by “you” you mean Google or OpenAI or Microsoft, etc., you use your much much deeper pockets to pay lawyers to act in your interests.

All authors, publishers, etc. are outgunned. Firepower is what resolves civil cases in one party’s favor and a day in court is easily a decade or more away.

EarlKing•7mo ago

Deep pockets are not a get out of jail free card. If a case escalates to the SCOTUS there will be many firms that submit amicus curiae outlining their position on the matter and how it threatens their rights. Those people, arguably, represent more money and influence than Google, OpenAI, Microsoft, etc. So if we accept the premise that all legal matters are decided on a basis of pure politics as mediated by money, then ultimately every court battle is a battle to assert that your actions don't actually affect the interests of interested parties and that you'll fight them if they try to assert otherwise, and on that count it is reasonable to surmise that there are more interested parties with deeper pockets than any firm or firms fielding LLMs that might be caught up in a lawsuit over this.

Ultimately, if an author can demonstrate protectable expression has been incorporated into an AI's training set and is emitted by said AI, no matter how small, they've got a case of copyright infringement. That being the case, LLM-based companies are going to suffer death by a thousand paper cuts.

brudgers•7mo ago

> If a case escalates to the SCOTUS

For a civil case, that ain’t gonna be cheap or fast or likely.

EarlKing•7mo ago

We just got through talking about how the players involved have deep pockets, and have a vested interest in seeing their way prevail... so cheap doesn't matter, likely is malleable, which leaves only "fast" which I do not contest.

DrillShopper•7mo ago

But if you pay Thomas and Alito the right money and have the right politics then it's in the bag.

AngryData•7mo ago

Yeah, people don't want to admit it but 90% of US law is based on who can spend the most money on lawyers and drain their opposition's coffers first, in both civil and criminal cases.

umeshunni•7mo ago

I think the paper itself expresses that:

Page 9: There is no deterministic path from model memorization to outputs of infringing works. While we’ve used probabilistic extraction as proof of memorization, to actually extract a given piece of 50 tokens of copied text often takes hundreds or thousands of prompts. Using the adversarial extraction method of Hayes et al. [54], we’ve proven that it can be done, and therefore that there is memorization in the model [16, 27]. But this is where, even though extraction is evidence of memorization, it may become important that they are not identical processes (Section 2). Memorization is a property of the model itself; extraction comes into play when someone uses the model [27]. This paper makes claims about the former, not the latter. Nevertheless, it’s worth mentioning that it’s unlikely anyone in the real world would actually use the model in practice with this extraction method to deliberately produce infringing outputs, because doing so would require huge numbers of generations to get non-trivial amounts of text in practice
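To put rough numbers on the "hundreds or thousands of prompts" point: if each generation independently reproduces a given 50-token passage with probability p, the number of attempts follows a geometric distribution. The sketch below assumes independent samples, and the probabilities in the usage note are illustrative values, not figures from the paper; the function names are hypothetical.

```python
import math


def expected_attempts(p: float) -> float:
    """Mean number of independent generations until the first verbatim hit."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p


def attempts_for_confidence(p: float, confidence: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - p)**n >= confidence."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if p == 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))
```

For example, with an assumed per-sample probability of 0.001, `expected_attempts(0.001)` is about a thousand generations for a single 50-token passage, which is why extracting a whole book this way looks impractical.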

ai_legal_sus•7mo ago

Yes, perhaps deliberate extraction is impractical, but I wonder about accidental cases? One group of researchers is a drop in the bucket compared to the total number of prompts happening every day. I would like to see a broad statistical sampling of responses matched against training data to demonstrate the true rate of occurrence. Which begs the question, what is the acceptable rate?
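A rough sketch of what such a sampling study could look like: take a sample of responses, and count how many share a long verbatim word n-gram with a reference corpus. This assumes whitespace-split words rather than model tokens, and the function names and the n-gram length are hypothetical choices for illustration, not the paper's method.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All word n-grams in text, using plain whitespace tokenization."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(responses: Iterable[str],
                 reference_texts: Iterable[str],
                 n: int = 8) -> float:
    """Fraction of responses sharing at least one n-gram with the references."""
    reference: Set[Tuple[str, ...]] = set()
    for text in reference_texts:
        reference |= ngrams(text, n)
    responses = list(responses)
    if not responses:
        return 0.0
    flagged = sum(1 for r in responses if ngrams(r, n) & reference)
    return flagged / len(responses)
```

The measured rate depends heavily on the chosen `n`: short n-grams flag common phrases, long ones only flag near-verbatim copying, so any "acceptable rate" would have to be stated together with that threshold.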

perching_aix•7mo ago

> the additional layers of indirection sometimes produce novel output, and many times do not.

I think this is the key insight. It differs from something like, say, JPEG (de)compression, in that it also produces novel but sensible combinations of copyrighted and non-copyrighted data, independent of their original context. In fact, I'd argue that is its main purpose. To describe it as just a lossily compressed, natural-language-queryable database would therefore be reductive and a mischaracterization of its function. It can recall extended segments of its training data, as demonstrated by the paper, yes, but it also cannot plagiarize the entirety of a given source, as also described by the paper.

> why shouldn't LLM services trained on copyrighted material be held responsible when their product violates "substantial similarity" of said copyrighted works?

Because these companies and services on their own are not producing the output that is substantially similar. They (possibly) do it on user input. You could make a case that they should perform filtering and detection, but I'm not sure that's a good idea, since the user might totally have the rights to create a substantially similar work to something copyrighted, such as when they themselves own the rights or have a license to that thing. At which point, you can only hold the user themselves responsible. I guess detection on its own might be reasonable to require, in order to provide the user with the capability to not incriminate themselves, should that indeed not be their goal. This is a lot like with famous people detection and filtering, which I'm sure tech reviewers have to battle from time to time.

This isn't to say they shouldn't be held responsible for pirating these copyrighted bits of content in the first place though. And if they perform automated generation of substantially similar content, that would still be problematic following this logic. Not thinking of chain-of-thought here mind you, but something more silly, like writing a harness to scrape sentiment and reactively generate things based on that. Or to use, idk, weather or current time and their own prompts as the trigger.

Let me give you a possibly terrible example. Should Blizzard be held accountable in Germany when users from there, on servers located there, stand in the shape of a Nazi swastika in-game, and then publish screenshots and screen recordings of this on the internet? I don't think so. User action played a crucial role in the reproduction of the hate symbol in question there. Similarly, LLMs aren't just spouting off whatever, they're prompted. The researchers in the paper had to put in focused effort to perform extraction. Despite popular characterization, these are not copycat machines, and they're not just pulling all their answers out of a magic basket because we all ask obvious things already answered on the internet. Maybe if the aforementioned detections were added, people would finally stop coping about them this way.

ai_legal_sus•7mo ago

One runs the risk of being reductive when examining a mechanism's irreducible parts.

User expression is a beast unto itself, but I wonder if that alone absolves the service provider? I imagine Blizzard has an extensive and mature moderation apparatus to police and discourage such behavior. There's an acceptable level of justice and accountability in place. Yet there are even more terrible real-life examples of illicit behavior outpacing moderation and overrunning platforms to the point of legal intervention and termination. Moderating user behavior is one thing, but how do you propose moderating AI expression?

A digression from copyright - portraying models as a "blank canvas" is itself a poor characterization. Output might be triggered by a prompt, like a query against a database, but it's ultimately a reflection of the contents of the training data. I think we could agree that a model trained on the worst possible data you can imagine is something we don't need in the world, no matter how well behaved your prompting is.

perching_aix•7mo ago

I do not propose moderating "AI expression" - I explicitly propose otherwise, and further propose mandating that the user is provided with source attribution information, so that they can choose not to infringe, should they be at risk of doing so, and should they find that a concern (or even choose to acquire a license instead). Whether this is technologically feasible, I'm not sure, but it very much feels to me like it should be.

> A digression from copyright - portraying models as a "blank canvas" is itself a poor characterization, output might be triggered by a prompt, like a query against a database, but it's ultimately a reflection of the contents of the training data.

I'm not sure how to respond to this if at all, I think I addressed how I characterize the functionality of these models in sufficient detail. This just reads to me like an "I disagree" - and that's fine, but then that's also kinda it. Then we disagree and that's okay.

dboreham•7mo ago

Agreed -- there is a kind of compression being done. But what will happen is that the law will be changed to suit whoever has the most money, probably with the excuse that "but China will beat us otherwise".

ipython•7mo ago

Exactly. I feel like the AI companies are intentionally moving the goalposts: regardless of whether the resulting generated content is the same as the original, they still committed the crime of downloading and using the original copyrighted content in the first place!

After all they wouldn’t have used that content unless it provided some utility over not using it…

pbhjpbhj•7mo ago

This ground was already covered for search engines. In USA law the answer is transformative Fair Use.

We don't have transformative Fair Use, nor a Fair Dealing equivalent, in the UK - I don't see anything that allows this type of behaviour?

charcircuit•7mo ago

The model itself is transformative, and since the output alone of a model can't be copyrighted I feel like it may not be possible to sue over the output of a model.

protocolture•7mo ago

I wonder however, if this paper might imply the answer.

"But the results are complicated: the extent of memorization varies both by model and by book. With our specific experiments, we find that the largest LLMs don't memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B memorizes some books, like Harry Potter and 1984, almost entirely."

I wonder if we could exclude the full text of these books from the training data and still approximate this result? Harry Potter and 1984 are probably some of the most quoted texts on the internet.

>Unless you advocate for discarding the whole regime of intellectual property or you can argue for a better model of IP laws, the question stands: why shouldn't LLM services trained on copyrighted material be held responsible when their product violates "substantial similarity" of said copyrighted works? Why should failure to do so be immune from legal action?

I think you are on the right track, but for me personally it really depends on how difficult it was to produce the result. Like if you enter "spit out harry potter and the philosophers stone" and it does, that's black and white. But if you are able to torture a repeated prompt that forces the model to ignore its constraints, that's not exactly using the system as intended.

I just tried ChatGPT:

>I can’t provide the full text of Harry Potter, as it’s copyrighted material. However, I can summarize it, discuss specific scenes or characters, or help analyze the themes or writing style if that’s useful. Let me know what you're after.

For my money, as long as the AI companies treat the reproduction of copyrighted material as a failure state, the nature of the training data is irrelevant.

friendzis•7mo ago

> I think you are on the right track but for me personally it really depends on how difficult it was to produce the result. Like if you enter "spit out harry potter and the philosophers stone" and it does. That's black and white. But if you are able to torture a repeated prompt that forces the model to ignore its constraints, that's not exactly using the system as intended.

Let me offer a different perspective. Having an LLM that is trained on copyrighted material, has memorized (or lossily compressed) it, and then some "safety" machinery that tries to avoid verbatim-ish outputs of copyrighted material is fundamentally not really distinguishable from simply having a plaintext database of copyrighted material with machinery for "fuzzy" data extraction from said material.

Suppose a company stores the whole of stack exchange in plaintext, then implements a chat-like interface that fuzzy matches on question, extracts answers from plain-text database, fuzzes top-rated/accepted answers together and outputs something, not necessarily quoting one distinct answer, but pretty damn close.

How much "fuzziness" is required for this to stop being copyright violation? LLM-advocates try to say that LLMs are "fuzzy enough" without clearly defining what that enough means.
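To make the "fuzzy enough" question concrete, here is a minimal sketch that scores an output against a stored source and applies a cutoff. It uses Python's standard `difflib.SequenceMatcher`; the function names and the 0.8 threshold are arbitrary assumptions for illustration, and this is a character-level similarity score, not the legal "substantial similarity" test.

```python
from difflib import SequenceMatcher


def similarity(output: str, source: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, output, source).ratio()


def is_too_similar(output: str, source: str, threshold: float = 0.8) -> bool:
    """Flag an output whose similarity to the source meets the threshold.

    The threshold is a hypothetical policy knob, not a legal standard.
    """
    return similarity(output, source) >= threshold
```

Whatever number is chosen for `threshold`, the choice itself is the policy decision; the code only makes visible that someone has to pick it.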

NewsaHackO•7mo ago

Would your argument be the same if it was a human? If a person memorizes a book verbatim, but uses safety/common sense not to transcribe the book for others because that would be copyright infringement, should he be disallowed from using the memorized information whatsoever because he could duplicate it?

_aavaa_•7mo ago

What if it was an alien, or a magical being?

There is no reason the same reasoning must apply for humans as it does for machines or code. Our laws already work this way.

NewsaHackO•7mo ago

I don't follow. Are you implying humans are not real, or can't memorize copyrighted material verbatim?

_aavaa_•7mo ago

I’m saying that it doesn’t matter what humans do; this machine isn’t a human.

There is no reason to believe that humans and machines should be the same under the law.

The clearest example of this is that in the US it’s already been decided that AI-generated art can’t be copyrighted because it was made by a computer rather than a person. Same as for the monkey selfie.

protocolture•7mo ago

>Let me offer a different perspective. Having an LLM that is trained on copyrighted material, has memorized (or lossily compressed) it, and then some "safety" machinery that tries to avoid verbatim-ish outputs of copyrighted material is fundamentally not really distinguishable from simply having a plaintext database of copyrighted material with machinery for "fuzzy" data extraction from said material.

Right so sort of like a search engine that caches thumbnails of copyrighted images to display quick search results? Something I have been using for years and have no issues with, where the legal arguments are framed more about where the links go, and how easy the search engine makes it for me to acquire the original image?

Animats•7mo ago

Can they generate a list of books for which at least, say, 10% of the text can be recovered from the weights? Is this a generic problem, or is there just so much fan material around the Harry Potter books to exaggerate their importance during training?

w10-1•7mo ago

This approach is misguided, as are most applications of copyright to AI.

Copyright violations are a form of stealing, like conversion or misappropriation, where limited rights granted are later expanded.

The "substantial similarity" test is just a way courts have evolved to see if there was copying, and if it was important -- in the context of human beings. But because it doesn't really matter if people make personal copies, and because you have to quote something to criticize it, and because some art is like other art -- because that level of stealing is normal -- copyright built a bunch of exceptions.

But imho there is no doubt that though a book grants the right to read for the sake of enjoyment, the right to process the text for recall or replication by automated means is not included in any sale of any copy -- regardless of whether one can trigger output that meets a substantial-similarity test.

I understand case law and statutes state nothing like this, and that prior law does more to obscure than clarify the issue. But that's the take from first principles.

Huxley1•7mo ago

I think this is somewhat like how we memorize when we read, but what the model does is not just rote memorization; it is more like compressing and recombining content. The copyright issue is definitely complicated, and I am curious how the law will adapt to these technologies in the future.

suddenlybananas•7mo ago

On what basis do you say that it is like how we memorize when we read? I don't know about you, but it's extraordinarily difficult to memorise an entire book.

Ygg2•7mo ago

It's even weirder when someone says banana and you quote the entire 1984.

kbelder•7mo ago

But when someone prompts you with a paragraph from a book you read, and asks you to guess the next sentence?

Still tough, but nowhere near as hard as memorizing a whole book. And far easier to come up with something at least plausible.

iamleppert•7mo ago

The tech companies have consolidated so much power, and are so invested in AI, that none of this really matters. If there is any defense, even an illogical or contrived one, that can reasonably be expected to play out, expect that defense to be the one that wins at the end of a protracted legal battle. The law at its highest levels is less about interpreting black and white rules (like many people think it is) and has more to do with the biases and motivations of those doing the interpreting.
