Not 42% of the book.
It's a pretty big distinction.
What is the distinction between understanding and memorization? What is the chance that understanding results in memorization (maybe in the case of humans)?
It should break copyright laws as written now but too much money involved.
not just next token.
This is like: tell it a random sentence in the book, it will give you the next sentence 42% of time.
Guess the next word: Not all heroes wear _____
https://en.wikipedia.org/wiki/Artificial_intelligence_and_co...
> See for example OpenAI's comment in the year of GPT-2's release: OpenAI (2019). Comment Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation (PDF) (Report). United States Patent and Trademark Office. p. 9. PTO–C–2019–0038. “Well-constructed AI systems generally do not regenerate, in any nontrivial portion, unaltered data from any particular work in their training corpus”
https://copyrightalliance.org/kadrey-v-meta-hearing/
> During the hearing, Judge Chhabria said that he would not take into account AI licensing markets when considering market harm under the fourth factor, indicating that AI licensing is too “circular.” What he meant is that if AI training qualifies as fair use, then there is no need to license and therefore no harmful market effect.
I know this is arguing against the point that this copyright lobbyist is making, but I hope so much that this is the case. The “if you sample, you must license” precedent was bad, and it was an unfair taking from the commons by copyright holders, imo.
The paper this post is referencing is freely available:
Could it be plausible that an LLM ingested parts of the book by scraping web pages like this, rather than the full copyrighted book, and still got results similar to those of the linked study?
[1] https://www.goodreads.com/work/quotes/4640799-harry-potter-a...
[2] ~30 portions x 68 pages
https://www.reddit.com/r/DataHoarder/comments/1entowq/i_made...
https://github.com/shloop/google-book-scraper
That Meta torrented Books3 and other datasets has been admitted by the Meta employees who did the work or oversaw those who did, so it is not really in dispute or ambiguous.
https://torrentfreak.com/meta-admits-use-of-pirated-book-dat...
The pictures are the same. All roads lead to Rome, so they say.
They also use data from the previous models, so I'm not sure how "clean" it really is
Which of the major commercial models discloses its dataset? Or are you just trusting some unfalsifiable self-serving PR characterization?
https://www.wired.com/story/new-documents-unredacted-meta-co...
While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere. Leave Harry Potter for a different "Harry Potter LLM".
Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder.
It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?
> Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder
Plenty of in-stealth companies approaching LLMs via this approach ;)
For those of us who studied the natural sciences and CS in the 2000s and early 2010s, there was a bit of a trend where certain PIs would simply translate German and Russian papers from the early-to-mid 20th century and attribute them to themselves in fields like CS (especially in what became ML).
Should it be? Different question.
First of all, we don't really know how the brain works. I get that you're being a snarky physicalist, but there are plenty of substance dualists, panpsychists, etc. out there. So, some might say, this is a reductive description of what happens in our brains.
Second of all, yes, if you tried to publish Harry Potter (even if it was from memory), you would get in trouble for copyright violation.
My question is… is that in itself a violation of copyright?
If not then as long as LLMs don’t make a publication it shouldn’t be a copyright violation right? Because we don’t understand how it’s encoded in LLMs either. It is literally the same concept.
If you compressed a copy of HP as a .rar, you couldn't read that as is, but you could press a button and get HP out of it. To distribute that .rar would clearly be a copyright violation.
Likewise, you can't read whatever of HP exists in the LLM model directly, but you seemingly can press a bunch of buttons and get parts of it out. For some models, maybe you can get the entire thing. And I'm guessing you could train a model whose purpose is to output HP verbatim and get the book out of it as easily as de-compressing a .rar.
So, the question in my mind is, how similar is distributing the LLM model, or giving access to it, to distributing a .rar of HP. There's likely a spectrum of answers depending on the LLM
I can record myself reciting the full Harry Potter book then distribute it on YouTube.
Could do the exact same thing with an LLM. The potential for distribution exists in both cases. Why is one illegal and the other not?
Not legally you can't. Both of your examples are copyright violations
At this point you've created an entirely new copy in an audio/visual digital format and took the steps to make it available to the masses. This would almost certainly cross the line into violating copyright laws.
> Could do the exact same thing with an LLM. The potential for distribution exists in both cases. Why is one illegal and the other not?
To my knowledge, the legality of LLMs is still being tested in the courts, like in the NYT vs Microsoft/OpenAI lawsuit. But your video copy and distribution on YouTube would be much more similar to how LLMs are being used than your initial example of reading and memorizing HP just by yourself.
if you trained an LLM on real copyrighted data, benchmarked it, wrote up a report, and then destroyed the weights, that's transformative use and legal in most places.
if you then put up that gguf on HuggingFace for anyone to download and enjoy, well... IANAL. But maybe that's a bit questionable, especially long term.
Personally I’m assuming the worst.
That being said, Harry Potter was such a big cultural phenomenon that I wonder to what degree might one actually be able to reconstruct the books based solely on publicly accessible derivative material.
To address this point, and not other concerns: the benefits would be (1) pop culture knowledge and (2) having a variety of styles of edited/reasonably good-quality prose.
> the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time
As I understand it, it means if you prompt it with some actual context from a specific subset that is 42% of the book, it completes it with 50 tokens from the book, 50% of the time.
So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own. To allege a true copyright violation you'd still need to show that you can chain those together or use some other method to build actual substantial portions of the book. And if it only gets it right 50% of the time, that seems like it would be very hard to do with high fidelity.
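To put a rough number on the chaining point, here is a back-of-the-envelope sketch. It assumes each 50-token chunk is reproduced independently with probability 0.5; both the independence and the flat 0.5 are illustrative assumptions of mine, not results from the paper.

```python
# Back-of-the-envelope: chance of chaining n consecutive 50-token chunks
# verbatim, ASSUMING each chunk succeeds independently with probability p.
# Independence and p = 0.5 are illustrative assumptions, not paper results.

def chain_probability(p: float, n_chunks: int) -> float:
    """Probability that n_chunks consecutive chunks are all reproduced."""
    return p ** n_chunks


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20):
        tokens = 50 * n
        print(f"{tokens:5d} tokens: {chain_probability(0.5, n):.6f}")
```

Under those assumptions, ten chunks in a row (500 tokens) already comes out below one in a thousand, which is why high-fidelity reconstruction of long passages looks hard from this figure alone.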
Having said all that, what is really interesting is how different the latest Llama 70b is from previous versions. It does suggest that Meta maybe got a bit desperate and started over-training on certain materials that greatly increased its direct recall behaviour.
That’s what I was thinking as I read the methodology.
If they dropped the same prompt fragment into Google (or any search engine) how often would they get the next 50 tokens worth of text returned in the search results summaries?
It sounds like a ridiculous way to measure it. Producing 50-token excerpts absolutely doesn't translate to "recall X percent of Harry Potter" for me.
(Edit: I read this article. Nothing burger if its interpretation of the original paper is correct.)
To clarify, they look at the probability a model will produce a verbatim 50-token excerpt given the preceding 50 tokens. They evaluate this for all sequences in the book using a sliding window of 10 characters (NB: not tokens). Sequences from Harry Potter have substantially higher probabilities of being reproduced than sequences from less well-known books.
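A minimal sketch of that kind of measurement, as I understand the description above. The model interface (`logprob(context, token)`), the toy model, and the token-based stride are all hypothetical stand-ins; the paper reportedly slides by characters, and this is not its actual code.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: returns log P(token | context) under some model.
LogProbFn = Callable[[Sequence[int], int], float]


def continuation_logprob(logprob: LogProbFn,
                         prefix: Sequence[int],
                         suffix: Sequence[int]) -> float:
    """Log-probability of producing `suffix` verbatim after `prefix`."""
    context: List[int] = list(prefix)
    total = 0.0
    for tok in suffix:
        total += logprob(context, tok)  # chain rule: sum of per-token log-probs
        context.append(tok)
    return total


def memorized_fraction(logprob: LogProbFn, tokens: Sequence[int],
                       prefix_len: int = 50, suffix_len: int = 50,
                       stride: int = 10, threshold: float = 0.5) -> float:
    """Fraction of sliding windows whose suffix has probability >= threshold."""
    hits = windows = 0
    last_start = len(tokens) - (prefix_len + suffix_len)
    for start in range(0, last_start + 1, stride):
        prefix = tokens[start:start + prefix_len]
        suffix = tokens[start + prefix_len:start + prefix_len + suffix_len]
        if continuation_logprob(logprob, prefix, suffix) >= math.log(threshold):
            hits += 1
        windows += 1
    return hits / windows if windows else 0.0


def make_toy_model(p_correct: float) -> LogProbFn:
    """Toy stand-in: gives p_correct to 'previous token + 1', tiny mass otherwise."""
    def logprob(context: Sequence[int], token: int) -> float:
        return math.log(p_correct) if token == context[-1] + 1 else math.log(1e-6)
    return logprob


if __name__ == "__main__":
    text = list(range(200))  # toy "book": token i is always followed by i + 1
    print(memorized_fraction(make_toy_model(0.99), text))
    print(memorized_fraction(make_toy_model(0.98), text))
```

The toy numbers show how strict the bar is: 50 tokens at 99% per-token confidence multiply out to about 0.6 and clear the 0.5 threshold, while 98% per token multiplies out to about 0.36 and does not.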
Whether this is "recall" is, of course, one of those tricky semantic arguments we have yet to settle when it comes to LLMs.
Sure. But imagine this: In a hypothetical world where LLMs never ever exist, I tell you that I can recall 42 percent of the first Harry Potter book. What would you assume I can do?
It's definitely not "this guy can predict next 10 characters with 50% accuracy."
Of course the semantics of 'recall' aren't the point of this article. The point is that Harry Potter was in the training set. But I still think it's a nothing burger. It would be very weird to assume Llama was trained on copyright-free materials only. And afaik there isn't a legal precedent saying training on copyrighted materials is illegal.
It can produce the next sentence or two, but I suspect it can’t reproduce anything like the whole text. If you were to recursively ask for the next 50 tokens, the first time it’s wrong the output would probably cease matching because you fed it not-Harry-Potter.
It seems like chopping Harry Potter up into 2 sentences at a time on Post-its and tossing those in the air. It does contain Harry Potter, in a way, but without the structure is it actually Harry Potter?
Generally speaking, exceptions to copyright are based on the appropriateness of the amount of copied content for the given allowed use, so the shorter it is, the more likely it is for copying to be permitted. European copyright law isn't much different from fair use in that respect.
Where it does differ is that the allowed uses are more explicitly enumerated. So Meta would have to argue e.g. based on the exception for scientific works specifically, rather than more general principles.
There's also the question of how many bits of originality there actually are in Harry Potter. If trained strictly on text up to the publishing of the first book, how well would it compress it?
EDIT Actually, on rereading, I see I replied to the wrong comment.
This does not appear to happen with other models they tested to the same degree
I’m personally more in favor of significantly reducing the length of copyright. I think 20-30 years is an interesting range. Artists get roughly a career length of time to profit off their creations, but there is much less incentive for major corporations to buy and hoard IP.
At the moment, there's also a huge difference between who does and who doesn't pay. If I put the HP collection on my website, you betcha Joanne Rowling's team is going to try to take it down. However, because OpenAI designed an AI system where content cannot be removed from its knowledge base and because their pockets are lined with cash for lawyers, it's practically free to violate whatever copyright rules it wants.
As a full-time professional musician, I'm convinced I'll benefit much more from its deprecation than continuing to flog it into posterity. I don't think I know any musicians who believe that IP is career-relevant for them at this point.
(Granted, I play bluegrass, which has never fit into the copyright model of music in the first place)
It's sold 120 million copies over 30 years. I've gotta think literally every passage is quoted online somewhere else a bunch of times. You could probably stitch together the full book quote-by-quote.
LLMs have limited capacity to memorize, under ~4 bits per parameter[1][2], and are trained on terabytes of data. It's physically impossible for them to memorize everything they're trained on. The model memorized chunks of Harry Potter not just because it was directly trained on the whole book, which the article also alludes to:
> For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That’s a tiny fraction of the 42 percent figure for Harry Potter.
In case it isn't obvious, both Harry Potter and Sandman Slim are parts of books3 dataset.
[1] -- https://arxiv.org/abs/2505.24832 [2] -- https://arxiv.org/abs/2404.05405
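A quick sanity check of the capacity argument, using the ~4 bits per parameter upper bound from above and the 15 trillion training tokens quoted elsewhere in this thread for the 70B model. Treating capacity as simply parameters times bits is a simplification, so read this as an order-of-magnitude sketch.

```python
# Order-of-magnitude check: how much can a 70B-parameter model memorize
# relative to its training set? The 4 bits/parameter figure is the upper
# bound cited above; 15e12 tokens is the training-set size quoted in the
# thread. Both are taken as given, not verified here.

def capacity_bits(n_params: float, bits_per_param: float = 4.0) -> float:
    """Upper bound on memorized information, in bits."""
    return n_params * bits_per_param


def bits_per_training_token(n_params: float, n_tokens: float,
                            bits_per_param: float = 4.0) -> float:
    """Memorization budget available per training token, in bits."""
    return capacity_bits(n_params, bits_per_param) / n_tokens


if __name__ == "__main__":
    params, tokens = 70e9, 15e12
    print(f"capacity: {capacity_bits(params) / 8 / 1e9:.0f} GB")
    print(f"budget:   {bits_per_training_token(params, tokens):.4f} bits/token")
```

That works out to about 35 GB of capacity, or under 0.02 bits per training token, so whatever gets memorized verbatim has to be a tiny, heavily repeated slice of the corpus.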
https://www.theguardian.com/technology/2025/jan/10/mark-zuck...
Sure there are just ~75,000 words in HP1, and there are probably many times that amount in direct quotes online. However the quotes aren’t evenly distributed across the entire text. For every quote of charming the snake in a zoo there will be a thousand “you’re a wizard harry”, and those are two prominent plot points.
I suspect the least popular of all direct quotes from HP1 aren’t using the quotes in fair use, and are just replicating large sections of the novel.
Or maybe it really is just so popular that super nerds have quoted the entire novel arguing about the aspects of wand making, or the contents of every lecture.
archiveofourown.org has 500 thousand; some, but probably not the majority, of those are duplicated from fanfiction.net. 37 thousand of these are over 40 thousand words.
I.e. Harry Potter and its derivatives presumably appear a million times in the training set, and it's hard to imagine a model that could discuss this cultural phenomenon well without knowing quite a bit about the source material.
> Or maybe Meta added third-party sources—such as online Harry Potter fan forums, consumer book reviews, or student book reports—that included quotes from Harry Potter and other popular books.
> “If it were citations and quotations, you'd expect it to concentrate around a few popular things that everyone quotes or talks about,” Lemley said. The fact that Llama 3 memorized almost half the book suggests that the entire text was well represented in the training data.
And yes, I read the article before commenting. I don't appreciate the baseless insinuation to the contrary.
It's essentially the same thing: they are copying from a source that is violating copyright, whether that's a pirated book directly or a pirated book via fanfiction.
Is this specific fact required to make my beliefs consistent... Yes I think it is, but if you disagree with me in other ways it might not be important to your beliefs.
Legally (note: not a lawyer) I'm generally of the opinion that
A) Torrenting these books was probably copyright infringement on Meta's part. They should have done so legally by scanning lawfully acquired copies like Google did with Google Books.
B) Everything else here that Meta did falls under the fair use and de minimis exceptions to copyright's prohibition on copying copyrighted works without a license.
And if it was copying significant amounts of a work that appeared only once in its training set into the model the de minimis argument would fall apart.
Morally I'm of the opinion that copyright law's prohibition on deeply interacting with our cultural artifacts by creating derivative works is incredibly unfair and bad for society. This extends to a belief that the communities that do this should not be excluded from technological developments because their entire existence is unjustly outlawed.
Incidentally I don't believe that browsing a site that complies with the DMCA and viewing what it lawfully serves you constitutes piracy, so I can't agree with your characterization of events either. The fanfiction was not pirated just because it was likely unlawful to produce in the US.
Accusations of not reading the article are fair when someone brings up a “related” anecdote that was in the article. It’s not fair when someone is just disagreeing.
- the first result is a pdf of the full book
- the second result is a txt of the full book
- the third result is a pdf of the complete harry potter collection
- the fourth result is a txt of the full book (hosted on github funny enough)
Further down there are similar copies from the internet archive and dozens of other sites. All in the first 2-3 pages.
I get that copyright is a problem, but let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.
Well, luckily the article points out what people are actually alleging:
> There are actually three distinct theories of how training a model on copyrighted works could infringe copyright:
> Training on a copyrighted work is inherently infringing because the training process involves making a digital copy of the work.
> The training process copies information from the training data into the model, making the model a derivative work under copyright law.
> Infringement occurs when a model generates (portions of) a copyrighted work.
None of those claim that these models are a substitute for buying the books. That's not what the plaintiffs are alleging. Infringing on a copyright is not only a matter of piracy (piracy is one of many ways to infringe copyright).
Is that fair use, or is that compression of the verbatim source?
No, it really couldn't. In fact, it's very persuasive evidence that Llama is straight up violating copyright.
It would be one thing to be able to "predict" a paragraph or two. It's another thing entirely to be able to predict 42% of a book that is several hundred pages long.
Repeat for every copyrighted work and you end up with publishers reasonably arguing Meta would not be able to produce their LLM without copyrighted work, which they did not pay for.
It's an argument for the courts, of course.
There is no moral or just ground to stand on when the system is designed to funnel wealth toward a few recipients.
Harry Potter is a great piece of artistic work, and it's nice that its author could make her way out of a precarious position. But not having anyone in such a situation in the first place would be what a great society should strive to produce.
Rowling already received more than all she needs to thrive I guess. I'm confident that there are plenty of other talented authors out there that will never have such a broad avenue of attention grabbing, which is okay. But that they are stuck in terrible economic situations is not okay.
The copyright lottery and the startup lottery are not that different from the standard lottery; they just put so much pressure on the players that they get stuck in the narrative that merit and hard work are the key to the wealth gained.
First-order systems drive outcomes. "Did it make money?" "Did it increase engagement?" "Did it scale?" These are tight, local feedback loops. They work because they close quickly and map directly to incentives. But they also hide a deeper danger: they optimize without questioning what optimization does to the world that contains it.
Second-order cybernetics reason about systems. It doesn’t ask, "Did I succeed?" It asks, "What does it mean to define success this way?" "Is the goal worthy?"
That’s where capital breaks.
Capitalism is not simply incapable of reflection. In fact, it's structured to ignore it. It has no native interest in what emerges from its aggregated behaviors unless those emergent properties threaten the throughput of capital itself. It isn't designed to ask, "What kind of society results from a thousand locally rational decisions?" It asks, "Is this change going to make more or less money?"
It's like driving by watching only the fuel gauge. Not speed, not trajectory, or whether the destination is the right one. Just how efficiently you’re burning gas. The system is blind to everything but its goal. What looks like success in the short term can be, and often is, a long-term act of self-destruction.
Take copyright. Every individual rule, term length, exclusivity, royalty, can be justified. Each sounds fair on its own. But collectively, they produce extreme wealth concentration, barriers to creative participation, and a cultural hellscape. Not because anyone intended that, but because the emergent structure rewards enclosure over openness, hoarding over sharing, monopoly over multiplicity.
That’s not a bug. That's what systems do when you optimize only at the first-order level. And because capital evaluates systems solely by their extractive capacity, it treats this emergent behavior not as misalignment but as a feature. It canonizes the consequences.
A second-order system would account for the result by asking, "Is this the kind of world we want to live in?" It would recognize that wealth generated without regard to distribution warps everything it touches: art, technology, ecology, and relationships.
Capitalism, as it currently exists, is not wise. It does not grow in understanding. It does not self-correct toward justice. It self-replicates. Cleverly, efficiently, with brutal resilience. It's emergently misaligned and no one is powerful enough to stop it.
it conjures up pictures of two dragons fighting each other instead of attacking us, but make no mistake they are only fighting for the right to attack us. whoever wins is coming for us afterwards
Those are completely different phenomena. Removing copyright will not suddenly open the floodgates of creativity because anyone can already create anything.
But - and this is the key point - most work is me-too derivative anyway. See for example the flood of magic school novels which were clearly loosely derivative of Harry Potter.
Same with me-too novels in romantasy. Dystopian fiction. Graphic novels. Painted art. Music.
It's all hugely derivative, with most people making work that is clearly and directly derivative of other work.
Copyright doesn't stop this, because as a minimum requirement for creative work, it forces it to be different enough.
You can't directly copy Harry Potter, but if you create your own magic school story with some similar-ish but different-enough characters and add dragons or something you're fine.
In fact under capitalism it is much harder to sell original work than to sell derivative work. Capitalism enforces exactly this kind of me-too creative staleness, because different-enough work based on an original success is less of a risk than completely original work.
Copyright is - ironically - one of the few positive factors that makes originality worthwhile. You still have to take the risk, but if the risk succeeds it provides some rewards and protections against direct literal plagiarism and copying that wouldn't exist without it.
Another key point is that you might download a Llama model and implicitly get a ton of copyright-protected content. Versus with a search engine you’re just connected to the source making it available.
And would the LLM deter a full purchase? If the LLM gives you your fill for free, then maybe yes. Or, maybe it’s more like a 30-second preview of a hit single, which converts into a $20 purchase of the full album. Best to sue the LLM provider today and then you can get some color on the actual consumer impact through legal discovery or similar means.
Music artists get in trouble for using more than a sample without permission — imagine if they just used 45% of a whole song instead…
I’m amazed AI companies haven’t been sued to oblivion yet.
This utter stupidity only continues because we named a collection of matrices “Artificial Intelligence” and somehow treat it as if it were a sentient pet.
Amassing troves of copyrighted works illegally into a ZIP file wouldn’t be allowed. The fact that the meaning was compressed using “Math” makes everyone stop thinking because they don’t understand “Math”.
LLMs are in reality the artifacts of lossy compression of significant chunks of all of the text ever produced by humanity. The "lossy" quality makes them able to predict new text "accurately" as a result.
>compressed using “Math”
This is every compression algorithm.
What's the work here? If it's the output of the LLM, you have to feed in the entire book to make it output half a book so on an ethical level I'd say it's not an issue. If you start with a few sentences, you'll get back less than you put in.
If the work is the LLM itself, something you don't distribute is much less affected by copyright. Go ahead and play entire songs by other artists during your jam sessions.
A ZIP file of a book is also in direct competition with the book, because you could open the ZIP file and read it instead of the book.
A model that can take 50 tokens and give you a greater than 50% probability for the 50 next tokens 42% of the time is not in direct competition with the book, since starting from the beginning you'll lose the plot fairly quickly unless you already have the full book, and unlike music sampling from other music, the model output isn't good enough to read it instead of the book.
AI can reproduce individual sentences 42% of the time but it can't reproduce a summary.
The question, however, is: is that inherent in the design of AI tools, or is it a limitation of current models? What if future models get better at this and are able to produce summaries?
Under the hood they are 100% deterministic, modulo quantization and rounding errors.
So yes, it is very much possible to use LLMs as a lossy compressed archive for texts.
It's just a form of compression.
If I train an autoencoder on an image, and distribute the weights, that would obviously be the same as distributing the content. Just because the content is commingled with lots of other content doesn't make it disappear.
Besides, where did the sections of text from the input works that show up in the output text come from? Divine inspiration? God whispering to the machine?
Possibly copying the content to train the model could be infringing if it doesn't fall under fair use, but the weights themselves are not simply compressed content. For one thing, they are probabilistic, so you wouldn't get the same content back every time like you would with a compression algorithm.
Your second point concedes the argument.
You don't seem to be in a very good position to judge what is and is not obtuse.
I see this absolute non-argument regurgitated ad infinitum in every single discussion on this topic, and at this point I can't help but wonder: doesn't it say more about the person who says it than anything else?
Do you really consider your own human speech no different than that of a computer algorithm doing a bunch of matrix operations and outputting numbers that then get turned into text? Do you truly believe ChatGPT deserves the same rights to freedom of speech as you do?
The question is whether the model weights constitute a copy of the work. I contend that they do not; if they did, then so do the analogous weights (reinforced neural pathways) in your brain, which is clearly absurd. The analogy is intended to demonstrate the absurdity of considering a probabilistic weighting that produces similar text to be a copy.
No, but it gives you the right to quote a line from a movie or TV show without being charged with copyright infringement. You argued that an LLM deserves that same right, even if you didn't realize it.
> then so do the analogous weights (reinforced neural pathways) in your brain
Did your brain consume millions of copyrighted books in order to develop into what it is today? Would your brain be unable to exist in its current form if it had not consumed those millions of books?
An LLM is not a person and does not deserve any rights. People have rights, including the right to use tools like LLMs without having to grease the palm of every grubby rights holder (or their great-great-grandchild) just because it turns out their work was so trite and predictable it could be reproduced by simply guessing the next most likely token.
Well, if you have no idea how LLMs work, you could've just said so.
this is literally why i don't like to work on proprietary code. because when i need to create a similar solution for someone else i have to go out of my way to make sure i do it differently. people have been sued over this.
There is nothing inherently probabilistic in a neural network. The neural net always outputs the exact same value for the same input. We typically use that value in a larger program as the probability of a certain token, but that is not required to get data out. You could just as easily deterministically take the output with the highest value, and add some extra rule for when multiple outputs have the exact same value (e.g. pick the one from the output neuron with the lowest index).
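A small sketch of that deterministic readout. The `step_fn` interface (token list in, one score per vocabulary entry out) and the toy model are hypothetical stand-ins for a real network's forward pass.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: maps the tokens so far to one score per vocab entry.
StepFn = Callable[[Sequence[int]], Sequence[float]]


def greedy_pick(scores: Sequence[float]) -> int:
    """Index of the highest score; exact ties go to the lowest index."""
    if not scores:
        raise ValueError("scores must be non-empty")
    best = 0
    for i in range(1, len(scores)):
        if scores[i] > scores[best]:  # strict '>' keeps the earliest maximum
            best = i
    return best


def greedy_decode(step_fn: StepFn, prompt: Sequence[int], n_new: int) -> List[int]:
    """Extend `prompt` by n_new tokens with no randomness anywhere."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(greedy_pick(step_fn(tokens)))
    return tokens


def toy_step(tokens: Sequence[int]) -> List[float]:
    """Toy 'network' over a 5-token vocab: favors (last token + 1) mod 5."""
    scores = [0.0] * 5
    scores[(tokens[-1] + 1) % 5] = 1.0
    return scores


if __name__ == "__main__":
    print(greedy_decode(toy_step, [0], 6))
```

Run twice with the same prompt and the output is identical; the randomness people associate with LLMs comes from the sampling step layered on top, not from the network itself.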
> Llama 3 70B was trained on 15 trillion tokens
That's roughly a 200x "compression" ratio, compared to 3-7x for traditional lossless text compression like bzip and friends.
LLMs don't just compress, they generalize. If they could only recite Harry Potter perfectly but couldn’t write code or explain math, they wouldn’t be very useful.
Anyway, it is not the same. While one points you to a pirated source on specific request, the other uses it to create other content, not just on direct request, as it was part of the training data. Nihilists would then point out that 'people do the same', but they don't, as we do not have the same capabilities for processing the content.
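For reference, one way to arrive at a figure in that range. Reading "200x" as training tokens per parameter is my interpretation of the comment; the byte-level variant below additionally assumes ~4 bytes of text per token and 2 bytes per parameter (16-bit weights), both of which are rough guesses.

```python
# Two rough ways to express the "compression" ratio for a 70B model trained
# on 15e12 tokens. The bytes-per-token and bytes-per-parameter defaults are
# assumptions for illustration only.

def tokens_per_parameter(n_tokens: float, n_params: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def byte_ratio(n_tokens: float, n_params: float,
               bytes_per_token: float = 4.0,
               bytes_per_param: float = 2.0) -> float:
    """Approximate training-text bytes divided by model-weight bytes."""
    return (n_tokens * bytes_per_token) / (n_params * bytes_per_param)


if __name__ == "__main__":
    print(f"tokens/param: {tokens_per_parameter(15e12, 70e9):.0f}")
    print(f"bytes ratio:  {byte_ratio(15e12, 70e9):.0f}")
```

Either way the ratio lands in the hundreds, far beyond what a lossless compressor manages on text, which is consistent with the model keeping only a small part of its input verbatim.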
Dropping the novels into a machine‑learning corpus is a fundamentally different act. The text is not being resold, and the resulting model is not advertised as “official Harry Potter.” The books are just statistical nutrition. One ingredient among millions. Much like a human writer who reads widely before producing new work. No consumer is choosing between “Rowling’s novel” and “the tokens her novel contributed to an LLM,” so there’s no comparable displacement of demand.
In economic terms, the merch market is rivalrous and zero‑sum; the training market is non‑rivalrous and produces no direct substitute good. That asymmetry is why copyright doctrine (and fair‑use case law) treats toy knock‑offs and corpus building very differently.
No one is claiming this.
The corporations developing LLMs are doing so by sampling media without their owners' permission and arguing this is protected by US fair use laws, which is incorrect - as the late AI researcher Suchir Balaji explained in this other article:
Vibe-arguing "because corporations111" ain't it.
https://copyrightalliance.org/faqs/what-is-fair-use/
- The purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes; (commercial least wiggle room)
- The nature of the copyrighted work; (fictional work least wiggle room)
- The amount and substantiality of the portion used in relation to the copyrighted work as a whole; (42% is considered a huge fraction of a book) and
- The effect of the use upon the potential market for or value of the copyrighted work. (Best argument as it’s minimal as a piece of entertainment. Not so as a cultural icon. Someone writing a book report or fan fiction may be less likely to buy a copy.)
Those aren’t the only factors, but I’m more interested in the counter argument here than trying to say they are copyright infringing.
If you photocopy a book you haven't paid for, you've infringed copyright. If you scan it, you've infringed copyright. If you OCR the scan, you've infringed copyright.
There's legal precedent in going after torrenters and z-lib etc.
So when Zuckerberg told the Meta team to do the same, he was on the wrong side of precedent.
Arguing otherwise is literally arguing that huge corporations are somehow above laws that apply to normal people.
Obviously some people do actually believe this. Especially the people who own and work for huge corporations.
But IMO it's far more dangerous culturally and politically than copyright law is.
> The amount and substantiality of the portion used in relation to the copyrighted work as a whole; (42% is considered a huge fraction of a book)
For AI models as they currently exist… I'm not sure about typical or average, but Llama 3 is 15e12 tokens for all model sizes up to 405 billion parameters (~37 tokens per parameter), so a 100,000 token book (~75,000 words) is effectively contributing about 2700 parameters to the whole model.
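A quick sketch of that back-of-the-envelope arithmetic, for anyone who wants to plug in other numbers. The token counts are the ones quoted in the comment; the parameter count is passed in as an argument, and 405e9 is used here only as an illustrative value. The function names are made up for this example, and the "share" is a naive even split of capacity per training token, not a claim about where a model actually stores anything.

```python
def tokens_per_parameter(training_tokens: float, parameters: float) -> float:
    """Training tokens seen per model parameter."""
    return training_tokens / parameters


def parameter_share(book_tokens: float, training_tokens: float, parameters: float) -> float:
    """Parameters 'attributable' to one book if capacity were split evenly per token.

    This is only a proportional share (an assumption of the comment above),
    not a measurement of what the model memorized.
    """
    return book_tokens * parameters / training_tokens


TRAINING_TOKENS = 15e12   # figure quoted in the comment
PARAMETERS = 405e9        # illustrative value for the largest model size
BOOK_TOKENS = 100_000     # figure quoted in the comment

ratio = tokens_per_parameter(TRAINING_TOKENS, PARAMETERS)
share = parameter_share(BOOK_TOKENS, TRAINING_TOKENS, PARAMETERS)

print(f"{ratio:.1f} tokens per parameter")
print(f"{share:.0f} parameters per book")
```

The point of the exercise is just the order of magnitude: under an even split, a single book's share is thousands of parameters, not millions.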
The average book is condensed into a summary of that book, and of the style of that book. This is also why, when you ask it for specific details of stuff in the training corpus, what you get back only sounds about right rather than being an actual quote, and why LLMs need to have access to a search engine to give exact quotes.
Conversely, for this part:
> The effect of the use upon the potential market for or value of the copyrighted work. (Best argument but as it’s minimal as a piece of entertainment. Not so as a cultural icon. Someone writing a book report or fan fiction may be less likely to buy a copy. )
The current uses alone should make it clear that the effect on the potential market is catastrophic.
People are using them to write blogs (directly from the LLM, not a human who merely used one as a copy-editor), and to generate podcasts. My experiments suggest current models are still too flawed to be worth listening to over e.g. the opinion of a complete stranger who insists they've "done their own research": https://github.com/BenWheatley/Timeline-of-the-near-future
LLMs are not yet good enough to write books, but I have tried using them to write short stories to keep track of capabilities, and o1 is already better than similar short stories on Reddit (not "good", just "better"): https://github.com/BenWheatley/Studies-of-AI/blob/main/Story...
But things do change, and I fully expect the output of various future models (not necessarily Transformer based) to increase the fraction of humans whose writings they surpass. I'm not sure what counts as "professional writer", but the U.S. Bureau of Labor Statistics says there's 150,000 "Writers and Authors"* out of a total population of about 340 million, so when AI is around the level of the best 0.04% of the population then it will start cutting into such jobs.
On the basis that current models seem (to me) to write software at about the level of a recent graduate, and with the potentially incorrect projection that this is representative across domains, and given there are about 1.7 million software developers and 100k new software developer graduates each year, LLMs today would be around the 100k worst of the 1.7 million best out of 340 million people, i.e. all software developers are the top 0.5% of the population, and LLMs are on par with the bottom 0.03 percentage points of that (roughly the bottom 6% of developers). (This says nothing much about how soon the models will improve).
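For reference, the percentages in the two paragraphs above can be reproduced with a few lines. The headcounts are the rough figures quoted in the comments (not fresh data), and the variable and function names are invented for this sketch.

```python
US_POPULATION = 340e6    # approximate, as quoted above
WRITERS = 150e3          # BLS "Writers and Authors", as quoted above
DEVELOPERS = 1.7e6       # approximate, as quoted above
NEW_GRADUATES = 100e3    # approximate yearly figure, as quoted above


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


writers_pct = percent(WRITERS, US_POPULATION)
developers_pct = percent(DEVELOPERS, US_POPULATION)
graduates_pct_of_population = percent(NEW_GRADUATES, US_POPULATION)
graduates_pct_of_developers = percent(NEW_GRADUATES, DEVELOPERS)

print(f"writers:    {writers_pct:.2f}% of the population")
print(f"developers: {developers_pct:.2f}% of the population")
print(f"graduates:  {graduates_pct_of_population:.2f}% of the population")
print(f"graduates:  {graduates_pct_of_developers:.1f}% of developers")
```

Note the two different denominators for the graduate cohort: it is a tiny slice of the whole population but a noticeably larger slice of working developers.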
But of course, some of that copyrighted content is about software development, and we're having conversations here on HN about the trouble fresh graduates are having and if this is more down to AI, the change of US R&D taxation rules (unlikely IMO, I'm in Germany and I think the same is happening here), or the global economy moving away from near-zero interest rates.
* https://www.bls.gov/ooh/media-and-communication/writers-and-...
The LLMs I've used don't randomly start spouting Harry Potter quotes at me, they only bring it up if I ask. They aren't aiming to undermine copyright. And they aren't a very effective tool for it compared to the very well developed networks for pirating content. It seems to be a non-issue that will eventually be settled by the raw economic force that LLMs are bringing to bear on society in the same way that the movie industry ultimately lost the battle against torrents and had to compete with them.
Having said that I think the cat is very much out of the bag on this one and, personally, I think that LLMs should be allowed to be trained on whatever.
Actually no, that could be copyright infringement. Badly singing a recent pop song in public also qualifies as copyright infringement. Public performances count as copying here.
For commercial purposes only. If someone sells a recreation of the Harry Potter book, it’s illegal regardless of whether it was by memory, directly copying the book, or using an LLM. It’s the act of broadcasting it that’s infringing on copyright, not the content itself.
But just for clarification, selling a recreation isn’t required for copyright infringement. The copying itself can be problematic so you can’t defend yourself by saying you haven’t yet sold any of the 10,000 copies you just printed. There are some exceptions that allow you to make copies for specific purposes, skip protection on a portable CD player for example, but that doesn’t apply to the 10k copies situation.
Although frankly, as has been pointed out many times, the law is also stupid in what it prohibits and that should be fixed first as a priority. It's done some terrible damage to our culture. My family used to be part of a community choir until it shut down basically for copyright reasons.
This kind of argument keeps popping up usually to justify why training LLMs on protected material is fair, and why their output is fair. It's always used in a super selective way, never accounting for confounding factors, just because superficially it sort of supports that idea.
Exceptional humans are exceptional, rare. When they learn, or create something new based on prior knowledge, or just reproduce the original they do it with human limitations and timescales. Laws account for these limitations but still draw lines for when some of this behavior is not permitted.
The law didn't account for computer "software" that can ingest the entirety of human creation, which no human could ever do, and then reproduce the original or create an endless number of variations in the blink of an eye.
Recent decisions such as Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith have walked a fine line with this. I feel like the Supreme Court got this one wrong, because the work is far more notable as a Warhol than as a copy of a photograph. Perhaps that substitution rule should be a two-way street: if the original work cannot substitute for the copy, then clearly the copy must be transformative.
LLMs generating works verbatim might be an infringement of copyright (probably not); distributing those verbatim works without a licence certainly would be. In either case, it is probably considered a failure of the model: OpenAI have certainly said that such reproductions shouldn't happen and that they consider it a failure mode when they do. I haven't seen similar statements from other model producers, but it would not surprise me if this were the standard sentiment.
Humans looking at works and producing things in a similar style is allowed; indeed, this is precisely what art movements are. The same transformative threshold applies. If you draw a cartoon mouse, that's ok, but if people look at it and go "It's Mickey Mouse" then it's not. If it's Mickey to Tiki Tu Meke, it clearly is Mickey but it is also clearly transformative.
Models themselves are very clearly transformative. Copyright itself was conceived at a time when generated content was not considered possible so the notion of the output of a transformative work being a non transformative derivative of something else was never legally evaluated.
The threshold for transformative for fictional works is fairly high unfortunately. Fan fiction and reasonably distinct works with excessive inspiration are both copyright infringing. https://en.wikipedia.org/wiki/Tanya_Grotter
> Models themselves are very clearly transformative.
A near word for word copy of large sections of a work seems nowhere near that threshold. An MP3 isn’t a 1:1 copy of a piece of music but the inherent differences are irrelevant, a neural network containing and allowing the extraction of information looks a lot like lossy compression.
You don't get to say that. Copyright protects the author of a work, but does not bind them to enforce it in any instance. Unlike a trademark, a copyright holder does not lose their protection by allowing unlicensed usage.
It is wholly at the copyright holder's discretion to decide which usages they allow and which they do not.
So it's fine as long as it's old piracy? How did you arrive at that conclusion?
You are completely missing the point. Have you read the actual article? Piracy isn't mentioned a single time.
Herman Goldstine wrote "One of his remarkable abilities was his power of absolute recall. As far as I could tell, von Neumann was able on once reading a book or article to quote it back verbatim; moreover, he could do it years later without hesitation. He could also translate it at no diminution in speed from its original language into English. On one occasion I tested his ability by asking him to tell me how A Tale of Two Cities started. Whereupon, without any pause, he immediately began to recite the first chapter and continued until asked to stop after about ten or fifteen minutes."
Maybe it’s just an unavoidable side effect of extreme intelligence?
If the book was obtained legitimately, letting an LLM read it is not an issue.
They know. An LLM is a novel compression format for text (holographic memory or whatever). The question is whether the rest of the world accepts this technology as it is or not.
While limited quoting can be (and usually is) considered fair use, quoting significant portions of a book (much less 42% of it) has never been fair use, in the U.S., Europe, or any other nation.
Yes, information wants to be free, yada yada. That means facts. Whether creative works are free is up to their creators.
Edit: seems the first part is a memory about being bullied by Dudley. The second is where he's been elected to the Quidditch team. Possibly they are just boring passages, compared to the surrounding ones. So probably just training bias.
I'm gonna bet that Llama 3.1 can recall a significant portion of Pride and Prejudice too.
With examples of this magnitude, it's normal and entirely expected this can happen - as it does with people[0] - the only thing this is really telling us is that the model doesn't understand its position in the society well enough to know to shut up; that obliging the request is going to land it, or its owners, into trouble.
In some way, it's actually perverted.
EDIT: it's even worse than that. What the research seems to be measuring is that the models recognize sentence-sized pieces of the book as likely continuations of an earlier sentence-sized piece. Not whether it'll reproduce that text when used straightforwardly - just whether there's an indication it recognizes the token patterns as likely.
By that standard, I bet there's over a billion people right now who could do that to 42% of the first Harry Potter book. By that standard, I too have memorized the Bible end-to-end, as have most people alive today, whether or not they're Christian; works this popular bleed through into common language usage patterns.
--
[0] - Even more so when you relax your criteria to accept an occasional misspelling or paraphrase - then each of us likely knows someone who could piece together a chunk of an HP book from memory.
Yes, there is no problem when a person reads some book and recalls pieces[0] of it in a suitable context. How that in any way addresses the case where certain people create and distribute commercial software, providing it that piece as input, to perform such recall on demand and at scale, laundering and/or devaluing copyright, is unclear.
Notably, the above is being done not just to a few high-profile authors, but to all of us no matter what we do (be it music, software, writing, visual art).
What’s even worse is that, imaginably, they train (or would train) the models not to output those things verbatim, specifically to thwart attempts to detect the presence of said works in the training dataset (which would naturally reveal the model and its output to be a derivative work).
Perhaps one could find some way of justifying that (people justified all sorts of stuff throughout history), but let it be something better than “the model is assumed to be a thinking human when it comes to IP abuse but unthinking tool when it comes to using it for personal benefit”.
[0] Of course, if you find me a single person on this planet capable of recalling 42% of any Harry Potter book, I’d be very impressed if I ever believed it.
edit: never mind, I’ll just ask ChatGPT
What they are actually saying: Given one correct quoted sentence, the model has 42% chance of predicting the next sentence correctly.
So, assuming you start with the first sentence and tell it to keep going, it has 0.42^n odds of staying on track, where n is the number of sentences so far.
It seems to me that if they didn't keep correcting it over and over again with real quotes, it wouldn't even get to the end of the first page without descending into wild fanfiction territory, with errors accumulating and growing as the length of the text progressed.
EDIT: As the article states, for an entire 50 token excerpt to be correct the probability of each output has to be fairly high. So perhaps it would be more accurate to view it as 0.985^n where n is the number of tokens so far. Still the same result long term. Unless every token is correct, it will stray further and further from the correct source.
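To make the compounding argument concrete, here is a minimal sketch. It assumes every token has the same, independent probability of being correct, which is the simplification used in the comment above and not necessarily how the paper measures things; the function names are hypothetical.

```python
import math


def sequence_probability(per_token_p: float, n_tokens: int) -> float:
    """Probability that all n tokens are correct, assuming independent tokens."""
    return per_token_p ** n_tokens


def required_per_token_probability(target: float, n_tokens: int) -> float:
    """Per-token probability needed for an n-token run to reach `target` overall."""
    return target ** (1.0 / n_tokens)


def tokens_until_below(per_token_p: float, threshold: float) -> int:
    """Smallest n for which per_token_p ** n drops below `threshold`."""
    return math.floor(math.log(threshold) / math.log(per_token_p)) + 1


# Chance of getting a whole 50-token excerpt right at 98.5% per token.
p50 = sequence_probability(0.985, 50)

# Per-token probability needed for a 50-token excerpt to be right half the time.
needed = required_per_token_probability(0.5, 50)

# How many tokens until an uncorrected run has under a 1% chance of being intact.
n_1pct = tokens_until_below(0.985, 0.01)

print(f"P(50 tokens correct at 0.985 each) = {p50:.3f}")
print(f"per-token p needed for 50% over 50 tokens = {needed:.4f}")
print(f"tokens until under 1% chance of staying on track = {n_1pct}")
```

Under these assumptions even a very high per-token accuracy decays quickly without correction, which is the point being made: a long verbatim run needs every token to land.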
It would be nice to know that at least our literature might survive the technological singularity.
Right now they're working on recreating the famous sequence with the troll in the dungeon. It might cost them another few billion in training, but the end results will speak for themselves.
If you've seen as many magnet links as I have, with your subconscious similarly primed with the foreknowledge of Meta having used torrents to download/leech (and possibly upload/seed) the dataset(s) to train their LLMs, you might scroll down to see the first picture in this article from the source paper, and find uncanny the resemblance of the chart depicted to a common visual representation of torrent block download status.
Can't unsee it. For comparison (note the circled part):
https://superuser.com/questions/366212/what-do-all-these-dow...
Previously, related:
Extracting memorized pieces of books from open-weight language models - https://news.ycombinator.com/item?id=44108926 - May 2025