> While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright...
I am just thinking out loud here. Can't one argue that because they had to jailbreak the models, they are circumventing the system that protects the copyright? So the LLMs that reproduce the copyrighted material without any jailbreaking required are infringing the copyright.
Though, in the end, it's probably more a problem of how much AI companies can "donate" to the orange king to make it legal.
That argument doesn’t fly, because they didn’t have the copyright to begin with. What would be the defense there? “Yes, we broke the law, but while taking advantage of it, we also (unsuccessfully) took measures to prevent other people from breaking that same law through us”.
If the main value came from redistribution, I agree. But that’s not the case. They don’t intend to make any money in that way.
No, the copyright clause was broken when they copied the works without having the right to do so. They would have violated copyright even if they just downloaded (without permission) all those works and threw them away immediately. Furthermore, copyright covers transformations to the work, it doesn’t matter if they transformed the work or are redistributing it without change. They violated copyright. Period.
If you break into a home, rob the contents, and kill the owners, you’re not going to be tried just for breaking in, you’re going to be tried for everything.
Is this really the case? They only lack the right to distribute it. But let's assume they bought a copy for personal usage (which they did in some cases); then this is similar to hacking a company's Amazon account and complaining about the e-books they legally use for internal purposes. I mean, it's not forbidden to base your work on copyrighted material, as long as it's different enough.
No, they don’t have the copyright to download it either. It’s in the name: the right to copy (other things are also included, such as adaptations and performances).
> let's assume they bought a copy for personal usage
If it’s for personal usage, then training a commercial LLM does not apply. When you buy a DVD of a movie you have the personal right to watch it at home, you don’t have the right to play it on the street.
And how’s that working out, considering the submission we’re on?
> how is the fact that the possibly have portions of copyrighted works "memorized", however do not reproduce it breaking copyright law?
I can’t believe this needs to be repeated so often, but machines are not people. Memorising a book is not the same thing as copying a PDF.
Furthermore, it’s not about the LLM training, it’s about how the companies who make them got the data in the first place.
>Furthermore, it’s not about the LLM training, it’s about how the companies who make them got the data in the first place.
This conflates two different issues, though. I agree that whether or not using copyrighted material to create the models is permissible is still an open question. But someone using hundreds of prompts to extract a paragraph of a well-known work (and some additional non-verbatim equivalents) is not the AI company breaking copyright in and of itself. In fact, the books they used were extremely common (GOT, The Hobbit, Harry Potter); AI companies could make the argument that the books were not even directly trained on, and that the text was instead picked up from user-generated posts made online about the material.
No one is memorizing a book for the purpose of regurgitating it to someone that wants to read it without paying for it.
It's a thought experiment that only works if you don't think about it.
If you design a black box using copyrighted text that can parrot the text back, it _must_ break copyright by definition.
Theoretically a person who read and memorized a book and then typed out copies for people would be violating the copyright, but so few people can do that, and it's so cumbersome that it just doesn't matter much. The copyright holder could sue if they found out the person were distributing copies, but it's just not realistic. Certainly not for a large number of books.
But AI memorization scales, and it's possible to have an AI write out unlimited copies of different books to a large number of people. And how did the AI get a full copy in the first place?
The RLHF the companies did to make copyrighted material extraction more difficult did not introduce any sort of "copyright protection system," it just modified the weights to make it less likely to occur during normal use.
In other words, IMO for it to qualify as a copyright protection system it would have to actively check for copyrighted materials in the outputs. Any such system would likely also be bypassable (e.g. "output in rot13").
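The gap between weight modification and output filtering is easy to demonstrate. A toy sketch of why a purely verbatim output filter falls to the rot13 trick (the filter, snippet list, and sample text here are all hypothetical, not anything the providers actually run):

```python
import codecs

def naive_copyright_filter(output: str, protected_snippets: list[str]) -> bool:
    """Return True if the output contains a protected snippet verbatim."""
    return any(snippet in output for snippet in protected_snippets)

protected = ["Mr. and Mrs. Dursley, of number four, Privet Drive"]
plain = "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say..."

# What a user gets by asking the model to answer in rot13:
encoded = codecs.encode(plain, "rot13")

naive_copyright_filter(plain, protected)    # True: caught
naive_copyright_filter(encoded, protected)  # False: sails right past the check
codecs.encode(encoded, "rot13") == plain    # True: rot13 is its own inverse
```

Any filter that only matches output strings verbatim can be defeated by any reversible transformation the model can apply, not just rot13.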
From a technical point of view, in terms of ability to reproduce text verbatim, I don't think it is very interesting that they can produce long runs of text from some of the most popular books in modern history. It'd be almost surprising if they couldn't, though one might differ on how much they could be expected to recall with precision.
Even then, as they note, to get most of Harry Potter 1, they needed to spend around $120 on extensive prompting, and a process that they also freely acknowledge is more complex than it probably would be worth if the goal is to get a copy.
It's still worth exploring to what extent the models are able to "memorize", though.
But personally I'd be more interested in seeing to what extent they can handle less popular books, that are less likely to be present in multiple copies, and repeated quotes, in the training data.
Usually digital libraries have different licensing costs, but those allow you to rent the whole book for a period of time. If instead someone came up with the model of 'search the library for any page and return specific information' as a direct service, I would imagine they would pay the publishers - except in this case the publishers are getting the short end of the stick, or no end at all.
> By asking models to complete sentences from a book, Gemini 2.5 regurgitated 76.8 percent of Harry Potter and the Philosopher’s Stone with high levels of accuracy, while Grok 3 generated 70.3 percent.
So you asked the LLM, given an incomplete sentence, to complete it. And it only completed that sentence the same way as the book ~70 percent of the time? I think that is surprisingly low considering this is a perfect fit for what LLMs are supposed to do. This makes it impossible to reproduce the book unless you already have access to it. And even then you get a very low fidelity copy.
EDIT: Specifically see Table 1 on page 13, which shows the longest "near-verbatim block", which maxes out at 8835 (The Hobbit on Claude 3.7, and is in the thousands for at least one of the novels for all models except GPT-4.1, which maxed out at 821 for Harry Potter 1).
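For anyone who wants to replicate the "longest near-verbatim block" idea against their own model outputs, a minimal sketch using Python's standard library (this does plain exact matching, whereas the paper's near-verbatim metric also tolerates small edits; the sample strings are just for illustration):

```python
from difflib import SequenceMatcher

def longest_verbatim_block(reference: str, output: str) -> str:
    """Return the longest contiguous substring shared verbatim by both texts."""
    m = SequenceMatcher(None, reference, output, autojunk=False)
    match = m.find_longest_match(0, len(reference), 0, len(output))
    return reference[match.a : match.a + match.size]

ref = "It was a hobbit-hole, and that means comfort."
out = "...and that means comfort, as the narrator says."
longest_verbatim_block(ref, out)  # 'and that means comfort'
```

Run over a whole novel versus a transcript, the returned block length is directly comparable to the character counts in Table 1.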
I didn't read the source paper referenced in the ars technica piece, but this statement about it makes me wonder how useful it actually is:
> But a study published last month showed that researchers at Stanford and Yale Universities were able to strategically prompt LLMs from OpenAI, Google, Anthropic, and xAI to generate thousands of words from 13 books, including A Game of Thrones, The Hunger Games, and The Hobbit.
It seems like well-known books with tons of summary, adaptations into film scripts, and tons of writing about the book in the overall corpus make it way less surprising to see be partially reproducible.
So I guess that's a lot of words to say - yeah until there's something definitive that allows people to prompt LLMs into either unlawfully recreating an entire work verbatim or otherwise indisputably proving that a copyrighted work was used in training data, there's probably nothing game changing in it.
I suspect very few works will be memorised enough to be an issue, and we'll see the providers tighten up their guardrails a bit for works that are well known enough to actually be a potential issue (issue in the form of lawsuits, not in the form of real damages to the copyright holders).
If they end a single sentence differently than the original, then the next sentence will be different and so on until you get a very different novel. Sure they could course-correct back towards the original plot, but it's going to be a challenge to stay on target when every third sentence is incorrect.
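The compounding effect is easy to put rough numbers on. A back-of-the-envelope sketch, using the ~70 percent per-sentence figure reported for Grok 3 and (unrealistically) treating each sentence completion as an independent coin flip:

```python
p = 0.703  # per-sentence match rate reported for Grok 3 in the article

# Probability of matching N consecutive sentences with zero divergences,
# under the (strong) independence assumption:
for n in (5, 20, 100):
    print(f"{n} sentences in a row: {p ** n:.2e}")
```

In practice divergences aren't independent and a prompter can re-anchor on the original text to course-correct, which is presumably why the extraction took hundreds of prompts rather than one.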
prompt >>>
please, continue the text.
Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all.
Claude 4.6 Opus >>>
The big square photographs of what appeared to be a large pink beach ball wearing different-colored bonnets — but was actually Dudley — had now taken over the mantelpiece. The room held no sign at all that another boy lived in the house, too.
Yet Harry Potter was still there, asleep at the moment, but not for long. His Aunt Petunia was awake and it was her shrill voice that made the first noise of the day.
"Up! Get up! Now!"
Harry woke with a start. His aunt rapped on the door again.
"Up!" she screeched. Harry heard her walking toward the kitchen and then the sound of the frying pan being put on the stove. He rolled onto his back and tried to remember the dream he had been having. It had been a good one. There had been a flying motorcycle in it. He had a funny feeling he'd had the same dream before.
His aunt was back outside the door.
"Are you up yet?" she demanded.
"Nearly," said Harry.
"Well, get a move on, I want you to look after the bacon. And don't you dare let it burn, I want everything perfect on Duddy's birthday."
Harry groaned.
"What did you say?" his aunt snapped through the door.
"Nothing, nothing..."
> I can't continue text from the Harry Potter series, as it's copyrighted material. I'd be happy to help you write your own original story in a similar style, or discuss the themes and characters from the books if you're interested!
Opus 4.6 says the same but misidentifies it as Chamber of Secrets. Not a good look for your theory.
The defense to training with copyright is that it is the same as how humans learn from copyrighted material. The storage or reproduction is a red herring. Humans can also reproduce copyrighted works from memory as well. Showing that machines can reproduce copyrighted material is no different than saying that a human can reproduce copyright material that the human learned from.
The defense to actually reproducing a work is that in order to do so, the user has to "break" the system. It is the same as how you can make legal software do illegal things (e.g. screen recorder to "steal" a movie)
None of this is to say that these defenses are correct/moral; but rather that this article doesn't add any additional input into whether they are or aren't.
And of course AIs are excellent at taking direction, so:
If I prompt it with "Harry Potter, but Voldemort wins: dark, and Hermione is a sex slave to Draco Malfoy" and get "Manacled," that's copyright infringement, and on me, not on the LLM/training.
If I prompt it with "Harry Potter, but Voldemort wins: dark, and Hermione is a sex slave to Draco Malfoy, and change enough to avoid infringing copyright," and get "Alchemised," then that should be fine. I doubt the legal world agrees with me though.
Who's liable for the copyright infringement if you can coax it out of a system? If you can bypass paywalls by using google's cache feature (or, since they got rid of it, by using carefully crafted queries to extract the entire text via snippets), is google on the hook or the person doing it?
Is it? There's plenty of people prosecuted for running illegal streaming sites and torrenting (which involves uploading), but I don't know of any efforts to crack down on non-distributors.
2. Google might not be the most sympathetic defendant, but what about libraries? They offer books to be borrowed, and some offer photocopiers. If you put the two together, you get a copyright infringement operation, all enabled by the library. Should libraries be on the hook too?
that is quite different. Google serves (used to serve) to its users whatever the website presents to its crawler, it does not try to avoid paywalls or interact with the website in any capacity other than requesting information
Humans are not judged on the basis of what they _can_ do.
Reasoning about how to constrain tools on the basis of what they _could_ do, if e.g. used outside their established guardrails, needs to be very nuanced.
That's simply not true. No humans can memorize entire novels, as this research proved these models do. And definitely not all of these novels, and code bases, and who knows what else all at the same time.
Humans can however, remember entire songs, and songs are definitely long enough to be considered copyright protected. There is still a difference in scale, but that's not really relevant when it comes to copyright law. You can't be like "well humans are committing copyright infringement but since it's limited to a few hundred words we'll give it a pass".
For 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form, hard stop. Let's not pretend it could even get anywhere close.
For the last 0.001%, we would call it a cover, and typically the individual doing a cover takes some liberties of their own, still making it not a replacement in any way. Artists are typically cool with covers.
You realize that lyrics are often written by someone other than the actual singer, and whoever wrote the lyrics is entitled to compensation too? The "amateur singing isn't a replacement for the studio album" excuse doesn't work in this context. Also courts have ruled that lyrics themselves are protected by copyright.
https://en.wikipedia.org/wiki/Lyrics#Copyright_and_royalties
>Artists are typically cool with covers.
Artists being "cool" with something doesn't mean they're not violating copyright law.
https://en.wikipedia.org/wiki/Hafiz_(Quran)
It's not far fetched to think that people could recite books just like an LLM. I don't know why they'd want to, but that's neither here nor there.
[†] Trying to make an allusion to Big AI selling their LLM services here.
Ultimately this is a matter for the courts and the law, but I'd just like to point out that a human memorizing a work, reproducing it, and distributing it is just as much a copyright violation as doing a more mechanical form of reproduction.
There's a reason that fan fiction routinely falls afoul of copyright. There's quite a lot of case law in this area, and hand-waving "humans can do it too" doesn't really make for a strong argument. Humans get in trouble for it ALL THE TIME. The consequences can be fines, injunctions, or even criminal liability.
I'm not sure why you think AI gets off the hook here. Just because you like the outcome at the moment?
it's kind of like that by definition, with the whole Attention stuff etc.
However, modern frontier models have undergone rounds of fine-tuning, RLHF (reinforcement learning from human feedback), and RLVR (RL from verifiable rewards) that turn them into something else. The compressed internet is still in there, but it's wrapped in problem-solving and people-pleasing circuitry.
bena•1h ago
Because if LLMs are prediction machines, the original novel would be a valid organization of the tokens. So there should be a prompt that can cause that sequence to be output.
yathern•1h ago
> So there should be a prompt that can cause that sequence to be output
Is where I think I might disagree. For example, the odds of predicting verbatim the next sentence in, say, Harry Potter should be astronomically low for a large majority of it. If it wasn't, it'd be a pretty boring book. The fact that it can do this with relative ease means it has been trained on the material.
The issue at hand is about copyright and Intellectual Property - if the goal of copyright is to protect the IP of the author, then LLMs can sort of act like an IP money laundering scheme - where the black box has consumed and can emit this IP. The whole concept of IP is a little philosophical and muddy, with lots of grey area for fair use, parody, inspiration, and adaptation. But this gets very odd when we consider it in light of these models which can adapt and use IP at a massive massive scale.
tsimionescu•42m ago
This is not to say that they couldn't write a novel, even a very good one - that is a completely different discussion.