Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book

https://www.understandingai.org/p/metas-llama-31-can-recall-42-percent

191•aspenmayer•7mo ago

Comments

aspenmayer•7mo ago

If you've seen as many magnet links as I have, with your subconscious similarly primed with the foreknowledge of Meta having used torrents to download/leech (and possibly upload/seed) the dataset(s) to train their LLMs, you might scroll down to see the first picture in this article from the source paper, and find uncanny the resemblance of the chart depicted to a common visual representation of torrent block download status.

Can't unsee it. For comparison (note the circled part):

https://superuser.com/questions/366212/what-do-all-these-dow...

Previously, related:

Extracting memorized pieces of books from open-weight language models - https://news.ycombinator.com/item?id=44108926 - May 2025

deafpolygon•7mo ago

It will generate a correct next token 42% of the time when prompted with a 50 token quote.

Not 42% of the book.

It's a pretty big distinction.

deviation•7mo ago

A... massive distinction.

asplake•7mo ago

“… well enough to reproduce 50-token excerpts at least half the time”

chiph2o•7mo ago

This means that if we start with 50% of the book then there is a 42% chance that we can recreate the remaining 50%.

What is the distinction between understanding and memorization? What is the chance that understanding results in memorization (maybe in the case of humans)?

ipaddr•7mo ago

It stores how often characters will come next based on how often they happen in copyright material. It can reproduce parts because those values are a fingerprint.

It should break copyright laws as written now but too much money involved.

j16sdiz•7mo ago

next _50_ tokens 42% of the time

not just next token.

This is like: tell it a random sentence in the book, and it will give you the next sentence 42% of the time.

amlib•7mo ago

That said, I bet that if you could lower the inference temperature such chances would improve by a lot.
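
To make the temperature point concrete, here is a minimal sketch with made-up logits (not taken from any real model). Dividing the logits by a temperature below 1 sharpens the distribution, so a continuation that is already the most likely one at each step becomes more probable overall. The function name and all numbers are purely illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens; index 0 is the "correct" one.
logits = [2.0, 1.0, 0.5, 0.0]

p_default = softmax_with_temperature(logits, 1.0)[0]
p_cold = softmax_with_temperature(logits, 0.5)[0]

# Chance of getting 50 such tokens right in a row, assuming (unrealistically)
# that every step looks exactly like this one.
seq_default = p_default ** 50
seq_cold = p_cold ** 50
```

This only helps where the correct token is already the top choice; wherever it is not, a lower temperature makes the correct continuation less likely.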

WhatsName•7mo ago

Given the method and how the English language works, isn't that the expected outcome for any text that isn't highly technical?

Guess the next word: Not all heroes wear _____

aspenmayer•7mo ago

As there is no reason to believe that Harry Potter is axiomatic to our culture in the way that other concepts are, it is strange to me that the LLMs are able to respond in this way, and not at all expected. Why do you think this outcome is expected? Are the LLMs somehow encoding the same content in such a way that they can be prompted to decode it? Does it matter legally how LLMs are doing what they do technically? This is pertinent to the court case that Meta is currently party to.

https://en.wikipedia.org/wiki/Artificial_intelligence_and_co...

> See for example OpenAI's comment in the year of GPT-2's release: OpenAI (2019). Comment Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation (PDF) (Report). United States Patent and Trademark Office. p. 9. PTO–C–2019–0038. “Well-constructed AI systems generally do not regenerate, in any nontrivial portion, unaltered data from any particular work in their training corpus”

https://copyrightalliance.org/kadrey-v-meta-hearing/

> During the hearing, Judge Chhabria said that he would not take into account AI licensing markets when considering market harm under the fourth factor, indicating that AI licensing is too “circular.” What he meant is that if AI training qualifies as fair use, then there is no need to license and therefore no harmful market effect.

I know this is arguing against the point that this copyright lobbyist is making, but I hope so much that this is the case. The “if you sample, you must license” precedent was bad, and it was an unfair taking from the commons by copyright holders, imo.

The paper this post is referencing is freely available:

https://arxiv.org/abs/2505.12546

fuzzbazz•7mo ago

From a quick web search I can find that there are book review sites that allow users to enter and rate verbatim "quotes" from books. This one [1] contains ~2000 [2] portions of a sentence, a paragraph or several paragraphs of Harry Potter and the Sorcerer's Stone.

Could it be plausible that an LLM ingested parts of the book by scraping web pages like this, rather than the full copyrighted book, and still got results similar to those of the linked study?

[1] https://www.goodreads.com/work/quotes/4640799-harry-potter-a...

[2] ~30 portions x 68 pages

aspenmayer•7mo ago

Sure, why not? lol

https://www.reddit.com/r/DataHoarder/comments/1entowq/i_made...

https://github.com/shloop/google-book-scraper

The fact that Meta torrented Books3 and other datasets seems to be by self-admission by Meta employees who performed the work and/or oversaw those who themselves did the work, so that is not really under dispute or ambiguous.

https://torrentfreak.com/meta-admits-use-of-pirated-book-dat...

redox99•7mo ago

Books3 was used in Llama1. We don't know if they used it later on.

aspenmayer•7mo ago

My comparison was illustrative and analogous in nature. The copyright cartel is making a fruit of the poisonous tree type of argument. Whatever Meta are doing with LLMs is doing the heavy lifting that parity files used to do back in the Usenet days. I wouldn’t be surprised if BitTorrent or other similar caching and distribution mechanisms incorporate AI/LLMs to recognize an owl on the wire, draw the rest just in time in transit, and just send the diffs, or something like that.

The pictures are the same. All roads lead to Rome, so they say.

aprilthird2021•7mo ago

All of the major AI models these days use "clean" datasets stripped of copyrighted material.

They also use data from the previous models, so I'm not sure how "clean" it really is

dragonwriter•7mo ago

> All of the major AI models these days use "clean" datasets stripped of copyrighted material.

Which of the major commercial models discloses its dataset? Or are you just trusting some unfalsifiable self-serving PR characterization?

aprilthird2021•7mo ago

It's from my personal experience in the industry

aspenmayer•7mo ago

What are your thoughts on the origin of the LLaMA leak? It's interesting that the training data was torrented, and so was the leak. Perhaps we will never know? For the OSINT folks, not a lot to go on, or maybe a lot, depending?

https://en.wikipedia.org/wiki/Llama_(language_model)#Leak

https://archived.moe/g/thread/91848262#p91850335

https://github.com/meta-llama/llama/pull/73/files

aprilthird2021•7mo ago

I don't really know much about that, sorry

aspenmayer•7mo ago

I didn’t ask for info, I asked for your views. I gave you all the info anyone has publicly, so you have enough to comment.

I suspect that it was a limited hangout self-own by Meta to claim that they aren’t responsible, and then they are doing research on a leaked LLM that they developed, but then was leaked, so they can claim that the subsequent research is not tainted by the fruit of the poisonous tree legal doctrine. Or, their torrent client or other software on the same machine had 0-days and they got hacked by someone on the Books3 swarm or knowledgeable of what IPs were connecting to it.

I appreciate your posts and I am replying to you to humbly ask you to post more. :P

aprilthird2021•7mo ago

I'm not really sure what you are insinuating? You think Meta leaked LLaMA so they could claim, legally, that they are in the clear for copyright violation? Sorry, I just don't really get what you want me to opine about.

If that is what you are asking, I don't think that's what happened. It's far more likely that it was just leaked or grabbed by a hacker

aspenmayer•7mo ago

I just thought the whole situation was interesting. You commented about the current LLM research being clean, while being based on prior LLMs which were perhaps less clean, so I thought that it was a curious coincidence how torrents kept popping up.

pclmulqdq•7mo ago

All written text is copyrighted, with few exceptions like court transcripts. I own the copyright to this inane comment. I sincerely doubt that all copyrighted material is scrubbed.

Tepix•7mo ago

Your brief comment is hardly copyrightable. Which makes your point moot.

paxys•7mo ago

Meta has trained on LibGen so we don't really need to speculate.

https://www.wired.com/story/new-documents-unredacted-meta-co...

aprilthird2021•7mo ago

This is in fact mentioned and addressed in the article. Also, there is pretty clear cut evidence Meta used pirated book data sets knowingly to train the earlier Llama models

giardini•7mo ago

As I've said several times, the corpus is key: LLMs thus far "read" most anything, but should instead have well-curated corpora. "Garbage In, Garbage Out!" (GIGO) is the saying.

While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere. Leave Harry Potter for a different "Harry Potter LLM".

Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder.

alephnerd•7mo ago

> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere

It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?

> Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder

Plenty of in-stealth companies approaching LLMs via this approach ;)

For those of us who studied the natural sciences and CS in the 2000s and early 2010s, there was a bit of a trend where certain PIs would simply translate German and Russian papers from the early-to-mid 20th century and attribute them to themselves in fields like CS (especially in what became ML).

weird-eye-issue•7mo ago

Why are you talking about Claude and Anthropic?

cshimmin•7mo ago

It’s not unreasonable to suspect they are doing the same. The article starts with a description of a lawsuit NY Times brought against OpenAI for similar reasons. The big difference is that research presented here is only possible with open weight models. OAI and Anthropic don’t make the base models available, so it’s easier to hide the fact that you’ve used copyrighted material by instruction post-training. And I’m not sure you can get the logprobs for specific tokens from their APIs either (which is what the researchers did to make the figures and come up with a concrete number like 42%)

alephnerd•7mo ago

Good call! I brain farted and wrote Claude/Anthropic instead of Meta/Llama.

ninetyninenine•7mo ago

So if I memorized Harry Potter the physical encoding which definitely exists in my brain is a copyright violation?

teaearlgraycold•7mo ago

I think humans get a special exception in cases like this

otabdeveloper4•7mo ago

No they don't. Commercial intent is what is prosecuted in IP law.

shrewduser•7mo ago

maybe if you rewrote it from memory.

lithiumii•7mo ago

You are not selling or distributing copies of your brain.

harry8•7mo ago

If you perform it from memory in public without paying royalties then yes, yes it is.

Should it be? Different question.

dvt•7mo ago

> the physical encoding which definitely exists in my brain is a copyright violation

First of all, we don't really know how the brain works. I get that you're being a snarky physicalist, but there are plenty of substance dualists, panpsychists, etc. out there. So, some might say, this is a reductive description of what happens in our brains.

Second of all, yes, if you tried to publish Harry Potter (even if it was from memory), you would get in trouble for copyright violation.

ninetyninenine•7mo ago

Right but the physical encoding already exists in my brain or how can I reproduce it in the first place? We may not know how the encoding works but we do know that an encoding exists because a decoding is possible.

My question is… is that in itself a violation of copyright?

If not then as long as LLMs don’t make a publication it shouldn’t be a copyright violation right? Because we don’t understand how it’s encoded in LLMs either. It is literally the same concept.

bitmasher9•7mo ago

I don’t think the lawyers are going to buy arguments that compare LLMs with human biology like this.

Jaygles•7mo ago

To me the primary difference between the potential "copy" that exists in your brain and a potential "copy" that exists in the LLM, is that you can't make copies and distribute your brain to billions of people.

If you compressed a copy of HP as a .rar, you couldn't read that as is, but you could press a button and get HP out of it. To distribute that .rar would clearly be a copyright violation.

Likewise, you can't read whatever of HP exists in the LLM model directly, but you seemingly can press a bunch of buttons and get parts of it out. For some models, maybe you can get the entire thing. And I'm guessing you could train a model whose purpose is to output HP verbatim and get the book out of it as easily as de-compressing a .rar.

So, the question in my mind is, how similar is distributing the LLM model, or giving access to it, to distributing a .rar of HP. There's likely a spectrum of answers depending on the LLM

ninetyninenine•7mo ago

> that exists in the LLM, is that you can't make copies and distribute your brain to billions of people.

I can record myself reciting the full Harry Potter book then distribute it on YouTube.

Could do the exact same thing with an LLM. The potential for distribution exists in both cases. Why is one illegal and the other not?

davidcbc•7mo ago

> I can record myself reciting the full Harry Potter book then distribute it on YouTube

Not legally you can't. Both of your examples are copyright violations

briffid•7mo ago

Recording yourself is not a violation, only publishing on YouTube. Content generated with LLMs is not a violation. Publishing the content you generated might be.

davidcbc•7mo ago

Generating the content for the user is the distribution regardless of what the user does with it

Jaygles•7mo ago

> I can record myself reciting the full Harry Potter book then distribute it on YouTube.

At this point you've created an entirely new copy in an audio/visual digital format and took the steps to make it available to the masses. This would almost certainly cross the line into violating copyright laws.

> Could do the exact same thing with an LLM. The potential for distribution exists in both cases. Why is one illegal and the other not?

To my knowledge, the legality of LLMs is still being tested in the courts, like in the NYT vs Microsoft/OpenAI lawsuit. But your video copy and distribution on YouTube would be much more similar to how LLMs are being used than your initial example of reading and memorizing HP just by yourself.

numpad0•7mo ago

if you trained an LLM on real copyrighted data, benchmarked it, wrote up a report, and then destroyed the weights, that's transformative use and legal in most places.

if you then put up that gguf on HuggingFace for anyone to download and enjoy, well... IANAL. But maybe that's a bit questionable, especially long term.

beowulfey•7mo ago

Only if you charge someone to reproduce it for them

JKCalhoun•7mo ago

The end of "Fahrenheit 451" set a horrible precedent. Damn you, Bradbury!

epgui•7mo ago

> It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?

Personally I’m assuming the worst.

That being said, Harry Potter was such a big cultural phenomenon that I wonder to what degree might one actually be able to reconstruct the books based solely on publicly accessible derivative material.

esafak•7mo ago

That's got nothing to do with it. It's all about copyright. Can it reproduce its training data verbatim? If so, Meta is in hot water.

strangescript•7mo ago

I read harry potter, and you ask me about a page, and I can recite it verbatim, did I just commit copyright infringement?

__loam•7mo ago

This is an extremely common strawman argument. We're not discussing human memory.

bitmasher9•7mo ago

I pay for a service. The service recites a novel to me. The service would need permission to do this or it is copyright infringement.

lucianbr•7mo ago

Are you selling your ability to recite stuff? Then certainly.

strangescript•7mo ago

there are plenty of open source LLMs trained on harry potter, is that fine?

davidcbc•7mo ago

giardini•7mo ago

But if its corpora do NOT include the Harry Potter books then Meta is NOT in hot water! So take the Harry Potter books out of the corpora. What is lost? Nothing IMO useful other than the ability to discuss Harry Potter books. BFD.

Jap2-0•7mo ago

> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere

To address this point, and not other concerns: the benefits would be (1) pop culture knowledge and (2) having a variety of styles of edited/reasonably good-quality prose.

evertedsphere•7mo ago

what is that bar (= token span) on the right common to the first three models

zmmmmm•7mo ago

It's important to note the way it was measured:

> the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time

As I understand it, it means if you prompt it with some actual context from a specific subset that is 42% of the book, it completes it with 50 tokens from the book, 50% of the time.

So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own. To allege a true copyright violation you'd still need to show that you can chain those together or use some other method to build actual substantial portions of the book. And if it only gets it right 50% of the time, that seems like it would be very hard to do with high fidelity.
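
A rough back-of-the-envelope sketch of that last point, assuming each 50-token excerpt is reproduced independently and with the same probability (a simplification of mine, not something the paper claims):

```python
def chained_success_probability(p_excerpt, n_excerpts):
    """Probability of reproducing n consecutive excerpts, assuming independence."""
    return p_excerpt ** n_excerpts

# With a 50% chance per 50-token excerpt, chaining excerpts decays quickly.
for n in (1, 5, 10, 20):
    tokens = n * 50
    print(f"{tokens:5d} tokens: {chained_success_probability(0.5, n):.7f}")
```

Keep in mind that 50% is the floor for the passages counted in the 42% figure ("at least half the time"), so some passages would chain better than this suggests.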

Having said all that, what is really interesting is how different the latest Llama 70b is from previous versions. It does suggest that Meta maybe got a bit desperate and started over-training on certain materials that greatly increased its direct recall behaviour.

Aurornis•7mo ago

> So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own.

That’s what I was thinking as I read the methodology.

If they dropped the same prompt fragment into Google (or any search engine) how often would they get the next 50 tokens worth of text returned in the search results summaries?

raincole•7mo ago

(Disclaimer: haven't read the original paper)

It sounds like a ridiculous way to measure it. Producing 50-token excerpts absolutely doesn't translate to "recall X percent of Harry Potter" for me.

(Edit: I read this article. Nothing burger if its interpretation of the original paper is correct.)

tanaros•7mo ago

Their methodology seems reasonable to me.

To clarify, they look at the probability a model will produce a verbatim 50-token excerpt given the preceding 50 tokens. They evaluate this for all sequences in the book using a sliding window of 10 characters (NB: not tokens). Sequences from Harry Potter have substantially higher probabilities of being reproduced than sequences from less well-known books.
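
Roughly, that per-excerpt probability can be computed like this. It is a sketch of the idea only, not the paper's code: `token_logprob` is a hypothetical callback standing in for whatever returns the model's log-probability of a token given its context, and for simplicity the window here slides over tokens rather than characters.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: log P(next_token | context) under some model.
TokenLogprob = Callable[[Sequence[int], int], float]

def excerpt_probability(prefix: Sequence[int], target: Sequence[int],
                        token_logprob: TokenLogprob) -> float:
    """P(target | prefix) as the product of per-token conditional probabilities."""
    context = list(prefix)
    total_logprob = 0.0
    for token in target:
        total_logprob += token_logprob(context, token)
        context.append(token)  # condition on the true token, not a sampled one
    return math.exp(total_logprob)

def sliding_window_probabilities(tokens: Sequence[int], token_logprob: TokenLogprob,
                                 prefix_len: int = 50, target_len: int = 50,
                                 stride: int = 10) -> List[Tuple[int, float]]:
    """Score every (prefix, target) window; returns (start, probability) pairs."""
    results = []
    last_start = len(tokens) - prefix_len - target_len
    for start in range(0, last_start + 1, stride):
        prefix = tokens[start:start + prefix_len]
        target = tokens[start + prefix_len:start + prefix_len + target_len]
        results.append((start, excerpt_probability(prefix, target, token_logprob)))
    return results
```

With a real model you would get all the per-token log-probabilities from a single forward pass over prefix plus target rather than one call per token; the callback is just to keep the sketch self-contained.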

Whether this is "recall" is, of course, one of those tricky semantic arguments we have yet to settle when it comes to LLMs.

raincole•7mo ago

> one of those tricky semantic arguments we have yet to settle when it comes to LLMs

Sure. But imagine this: In a hypothetical world where LLMs never ever exist, I tell you that I can recall 42 percent of the first Harry Potter book. What would you assume I can do?

It's definitely not "this guy can predict next 10 characters with 50% accuracy."

Of course the semantic of 'recall' isn't the point of this article. The point is that Harry Potter was in the training set. But I still think it's a nothing burger. It would be very weird to assume Llama was trained on copyright-free materials only. And afaik there isn't a legal precedent saying training on copyrighted materials is illegal.

xnx•7mo ago

This sounds almost like "Works every time (50% of the time)."

hsbauauvhabzb•7mo ago

Except the odds of that happening by chance even 50% of the time are lower than the odds of winning the lottery multiple times. All while illegally ingesting copyrighted material without the consent (and presumably against the wishes) of the copyright holder.

bee_rider•7mo ago

Even if it is recalling it 50 tokens at a time, the half of the book is in some sense in there, right?

zmmmmm•7mo ago

yeah ... it's going to depend how the issue is framed. However a "copy" of something where there is no way to practically extract the original from it probably has a pretty good argument that it's not really a "copy". For example, a regular dictionary probably has 99% of harry potter in it. Is it a copy?

vintermann•7mo ago

I'd say no. More than half of as-yet unwritten books will be in there too, because I bet it will compress the text of a freshly published book to much better than 50% (and newer models could even compress new books to one fiftieth of their size, which is more like what that 1 in 50 tokens suggests)

bee_rider•7mo ago

That seems like a reasonably easy test to run, right? All you need is a bit of prose that was known not to have been written beforehand. Actually, the experiment could be run using the paper itself!

everforward•7mo ago

I don’t think this paper proves that, and I don’t think it is in a traditional sense.

It can produce the next sentence or two, but I suspect it can’t reproduce anything like the whole text. If you were to recursively ask for the next 50 tokens, the first time it’s wrong the output would probably cease matching because you fed it not-Harry-Potter.

It seems like chopping Harry Potter up into 2 sentences at a time on Post-its and tossing those in the air. It does contain Harry Potter, in a way, but without the structure is it actually Harry Potter?

TeMPOraL•7mo ago

Not necessarily. Information is always spread between what we'd normally consider "storage medium" and "reader"; the degree to which that is is a controllable parameter.

Consider e.g.:

- Digital expansion of PI to sufficient decimal places contains both parts of the work and full work in full. The trick is you have to know where to find it - and it's that knowledge that's actually equivalent to the work itself.

- Any kind of compression that uses a dictionary that's separate from the compressed artifact, shifts some of the information into a dictionary file, or if it's a common dictionary, into compressor/decompressor itself.

In the case from the study, the experimenter actually has to supply most of the information required to pull Harry Potter out of the model - they need to make specific prompts with quotes from the book, and then observe which logits correspond to the actual continuation of those quotes. The experimenter is doing information-loaded selection multiple times: at prompting, and at identifying logits. This by itself doesn't really prove the model memorized the book, only just that it saw fragments from it - in cases those fragments are book-specific (e.g. using proper names from the HP world) instead of generic English sentences.

kelipso•7mo ago

Almost the entire book is in there. From the paper, if you give it a 100 token prompt, it will produce the next 50 tokens with more than 1% probability so that the produced tokens cover 91% of the book. And as the title says, it also produces next 50 tokens with more than 50% probability, so produced tokens cover 42% of the book. Bet it gets close to 100% as you reduce the probability.
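
Those two headline numbers are the same per-window probabilities cut at different thresholds. A minimal sketch, with invented probabilities, and counting windows rather than merging overlapping token spans as a real analysis would:

```python
def coverage(window_probs, threshold):
    """Fraction of windows whose excerpt probability exceeds the threshold."""
    if not window_probs:
        return 0.0
    return sum(1 for p in window_probs if p > threshold) / len(window_probs)

# Made-up per-window probabilities, for illustration only.
probs = [0.9, 0.7, 0.55, 0.2, 0.05, 0.02, 0.004, 0.6, 0.03, 0.8]

high = coverage(probs, 0.5)   # windows reproduced more than half the time
low = coverage(probs, 0.01)   # windows reproduced more than 1% of the time
```

Lowering the threshold can only keep or raise the coverage, which is why the 1% figure is so much larger than the 50% one.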

Also they went through the book at 10 token strides. Like..a bit tortured way to reproduce the book (basically impossible to actually reproduce the book) but it shows that the content is in there.

Now whether this is derivative work, copyright violation or whatever is debatable. Probably gets similar numbers for a bunch of other books too. They should have done the Bible and probably get way higher numbers, but that won’t go viral.

bee_rider•7mo ago

I think I agree with this take. The book is in there in some sense, whether or not it is a copyright violation is debatable.

Honestly, I get why these debates happen—it is practical to establish whether or not this emerging tech is illegal under current law. But it’s also like… well, obviously current law wasn’t written with this sort of application in mind.

Whether or not we think LLMs are basically good or bad, they are clearly quite impactful. It would be a nice time to have a functional legislature to address this directly.

dTal•7mo ago

If I give you a vision algorithm that, given every other frame of a Harry Potter movie, can accurately predict the interstitials - would you say that half that Harry Potter movie is "in" it?

amlib•7mo ago

Congratulations, you've just invented a video codec with motion estimation. The motion experts group wants their share on some bullshit royalties/patents though, better pay up because they are very litigious and won't go soft on you because you are not a big tech corporation :)

amanaplanacanal•7mo ago

Fair use is a four-part test, and the amount of copying is only one of the four parts.

adrianN•7mo ago

Fair use is not a thing in every jurisdiction. In Germany for example there are cases where three words („wir sind Papst“) fall under copyright.

yorwba•7mo ago

Germany does not have something called "fair use," but it does have provisions for uses that are fair. For example your use of the three words to talk about their copyrighted status is perfectly legal in Germany. That somebody wasn't allowed to use them in a specific way in the past doesn't mean that nobody is allowed to use them in any way.

adrianN•7mo ago

Of course, but „it’s a short quote so you can use it“ is not true (at least in Germany).

yorwba•7mo ago

To be pedantic, short quotes (as opposed to short copied fragments that are not used as quotes) are explicitly one of the allowed uses (Zitierbefugnis). You can even quote entire works "in an independent scientific work for the purpose of explaining its content"! https://www.gesetze-im-internet.de/englisch_urhg/englisch_ur...

Generally speaking, exceptions to copyright are based on the appropriateness of the amount of copied content for the given allowed use, so the shorter it is, the more likely it is for copying to be permitted. European copyright law isn't much different from fair use in that respect.

Where it does differ is that the allowed uses are more explicitly enumerated. So Meta would have to argue e.g. based on the exception for scientific works specifically, rather than more general principles.

vintermann•7mo ago

All this study really says, is that models are really good at compressing the text of Harry Potter. You can't get Harry Potter out of it without prompting it with the missing bits - sure, impressively few bits, but is that surprising, considering how many references and fair use excerpts (like discussion of the story in public forums) it's seen?

There's also the question of how many bits of originality there actually are in Harry Potter. If trained strictly on text up to the publishing of the first book, how well would it compress it?
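
One way to frame that compression question: with arithmetic coding, a model that assigns probability p to the actual next token needs about -log2(p) bits to encode it. A sketch with invented log-probabilities, and assuming roughly 4 characters (32 bits of raw ASCII) per token:

```python
import math

def compressed_bits(token_logprobs):
    """Ideal code length in bits, given natural-log probabilities of the true tokens."""
    return -sum(token_logprobs) / math.log(2)

def compression_ratio(token_logprobs, raw_bits_per_token=32.0):
    """Compressed size divided by raw size (smaller means better compression)."""
    raw_bits = raw_bits_per_token * len(token_logprobs)
    return compressed_bits(token_logprobs) / raw_bits

# Invented values: a well-memorized passage versus ordinary unseen prose.
memorized = [math.log(0.98)] * 50
unseen = [math.log(0.25)] * 50
```

Running the same measurement on a book published after the model's training cutoff would give a baseline for how much of the effect is just "predictable English".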

fiddlerwoaroof•7mo ago

The alternate here is that Harry Potter is written with sentences that match the typical patterns of English and so, when you prompt with a part of the text, the LLM can complete it with above-random accuracy

fiddlerwoaroof•7mo ago

Or else, LLMs show that copyright and IP are ridiculous concepts that should be abolished

vintermann•7mo ago

Anything that can tell you what the typical patterns of English is, is going to be a language model by definition.

fiddlerwoaroof•7mo ago

My point is that this might just prove that Harry Potter is the sort of prose “fancy autocomplete” would produce and not all that original.

EDIT Actually, on rereading, I see I replied to the wrong comment.

arthurcolle•7mo ago

You could prove this much better by looking at something like this: https://cookbook.openai.com/examples/using_logprobs

seydor•7mo ago

The claim of the paper is not so much that the model is reproducing content illegally but that Harry Potter has been used to train the model.

This does not appear to happen with other models they tested to the same degree

om8•7mo ago

> 50 tokens is not really very much

Yes! And also llama3.1’s tokens are different from Qwen and llama1 tokens. That’s the first model where Meta started to use a very large vocab_size.

jxjnskkzxxhx•7mo ago

Suppose for simplicity that every sentence in the book is 50 tokens or shorter.

According to the stated methodology, I could give the LLM sentence 1 and have 42% chance of getting sentence 2 recalled. Then I could give it sentence 2 and have 42% chance of getting sentence 3. Therefore, the LLM contains 42% of the book in some sense.

I disagree this is "not really very much". If a person could do this you would undoubtedly conclude that the person read the book.

In fact the number 42% even understates the severity of the matter. Superficially it makes it sound like the LLM contains less than half of the book. In reality the process I described applies to 100% of the sentences. Additionally, I'm guessing that in the 58% of cases where the 50 tokens aren't recalled correctly, the output tokens probably have the same meaning as the correct ones.

TeMPOraL•7mo ago

Except it's not what happened, per the article. Instead, they walked down the logits, which is more like asking someone to give 10-20 best guesses for next word, and should one of them match the secret answer, telling them which one is it and asking them to go on with the next word. Seems like a substantially easier task, and most of information is coming from researchers making a choice at every step.

thomastjeffery•7mo ago

An LLM is not a database. There is no significant amount of information in a model that can be accessed 100% of the time. This is because it's a mystery to the user what collection of tokens will lead to a specific output. To get a predictable result from an LLM 50% of the time is very significant.

This doesn't tell us for certain whether or not the model was trained on a full copy of the book. It's possible that 50-token long passages from 42% of the book were, incidentally, quoted verbatim in various parts of the training data. Considering the popularity of both the book itself, and derivative fan-fiction, I would not be surprised. I would be less surprised to learn that it was indeed trained on a full copy of the book, if not several.

The more meaningful point here is that the ability to reproduce half a book is the same sort of overt derivative work that is definitely considered copyright infringement in other circumstances. A lossy copy is still a copy. If we are to hold LLMs to the same standard as other content, this isn't very easy to defend.

Personally, I see this as a good opportunity to reevaluate copyright on the whole. I think we would be better off without it.

htk•7mo ago

Hmm, couldn't this be used as a benchmark for quantization algorithms?

graphememes•7mo ago

I really wish we could get rid of copyright. It's going to hold us back long term.

bitmasher9•7mo ago

We cannot get rid of it without finding a way to pay the creators that generate copyrighted works.

I’m personally more in favor of significantly reducing the length of the copyright. I think 20-30 years is an interesting range. Artists get roughly a career length of time to profit off their creations, but there is much less incentive for major corporations to buy and hoard IP.

atrus•7mo ago

We barely pay creators as it is for generating copyrighted works. Nearly every copyrighted work is available on the internet, for free, right now. And creators are still getting paid, albeit poorly, but that's a constant throughout history.

Tepix•7mo ago

How does that favor a longer copyright? It’s not like these old works make a lot of money (with very few exceptions). And making money after 30 years is hardly a motivating factor.

jeroenhd•7mo ago

The thing about creators is that most of them are paid extremely poorly, and some of them get insanely rich. Joanne Rowling has received more money than a reasonable person could use for her wizard books, but millions of bloggers feeding much more data into AI training sets will never see a cent for their work. For starting authors selling books, this can easily be the difference between writing another book or giving up and taking up another job.

At the moment, there's also a huge difference between who does and who doesn't pay. If I put the HP collection on my website, you betcha Joanne Rowling's team is going to try to take it down. However, because OpenAI designed an AI system where content cannot be removed from its knowledge base and because their pockets are lined with cash for lawyers, it's practically free to violate whatever copyright rules it wants.

jMyles•7mo ago

I do not think it's creators that are the constituency holding up deprecation.

As a full-time professional musician, I'm convinced I'll benefit much more from its deprecation than continuing to flog it into posterity. I don't think I know any musicians who believe that IP is career-relevant for them at this point.

(Granted, I play bluegrass, which has never fit into the copyright model of music in the first place)

JoshTriplett•7mo ago

I do too. But in the meantime, as long as it continues being used against anyone, it should be applied fairly. As long as anyone has to respect software licenses, for instance, then AIs should too. It doesn't stop being a problem just because it's done at larger scale.

numpad0•7mo ago

Sure, you just get constantly sued for obstruction of business instead, and there will be no fair use clauses, free software licenses, or right to repair to fight back. It'll be all proprietary under NDA. Is that what you want?

asciisnowman•7mo ago

On the other hand, it’s surprising that Llama memorized so much of Harry Potter and the Sorcerer's Stone.

It's sold 120 million copies over 30 years. I've gotta think literally every passage is quoted online somewhere else a bunch of times. You could probably stitch together the full book quote-by-quote.

mvdtnz•7mo ago

But also we know for a fact that Meta trained their models on pirated books. So there's no need to invent a hare brained scheme of stitching together bits and pieces like that.

kouteiheika•7mo ago

No, assuming that just because it was in the training data it must be memorized is hare brained.

LLMs have limited capacity to memorize, under ~4 bits per parameter[1][2], and are trained on terabytes of data. It's physically impossible for them to memorize everything they're trained on. The model memorized chunks of Harry Potter not just because it was directly trained on the whole book, which the article also alludes to:

> For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That’s a tiny fraction of the 42 percent figure for Harry Potter.

In case it isn't obvious, both Harry Potter and Sandman Slim are part of the books3 dataset.

[1] -- https://arxiv.org/abs/2505.24832 [2] -- https://arxiv.org/abs/2404.05405
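
A rough back-of-the-envelope version of the capacity argument above, as a sketch only. The 4 bits per parameter figure is the upper bound stated in the comment and 70B is the model size named in the article quote; the 10 TB corpus size is purely an illustrative assumption standing in for "terabytes of data".

```python
PARAMS = 70e9          # Llama 3.1 70B, as named in the article quote
BITS_PER_PARAM = 4     # upper bound from the comment above
CORPUS_BYTES = 10e12   # assumption: 10 TB of training text, illustrative only

# Maximum raw information the weights could hold under that bound.
capacity_bytes = PARAMS * BITS_PER_PARAM / 8
# Largest share of the assumed corpus that could be stored verbatim.
fraction = capacity_bytes / CORPUS_BYTES

print(f"capacity ~ {capacity_bytes / 1e9:.0f} GB")
print(f"at most {fraction:.2%} of a 10 TB corpus")
```

Under these assumptions the ceiling is tens of gigabytes against terabytes of input, which is why what gets memorized has to be a small, unevenly chosen subset.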

mvdtnz•7mo ago

No, we know it because it was established in court from Meta internal communications.

https://www.theguardian.com/technology/2025/jan/10/mark-zuck...

kouteiheika•7mo ago

I'm confused. Nowhere in my post have I said that they didn't?

bitmasher9•7mo ago

Probably not?

Sure, there are just ~75,000 words in HP1, and there are probably many times that amount in direct quotes online. However, the quotes aren’t evenly distributed across the entire text. For every quote of charming the snake in a zoo there will be a thousand “you’re a wizard harry”, and those are two prominent plot points.

I suspect the least popular of all direct quotes from HP1 aren’t using the quotes in fair use, and are just replicating large sections of the novel.

Or maybe it really is just so popular that super nerds have quoted the entire novel arguing about the aspects of wand making, or the contents of every lecture.

davidcbc•7mo ago

If I collect HP quotes from the internet and then stitch them together into a book, can I legally sell access to it?

tjpnz•7mo ago

How many could do it from memory?

dankwizard•7mo ago

I can recall about 12% of the first Harry Potter book so it's interesting to see Llama is only 4x smarter than me. I will catch up.

hsbauauvhabzb•7mo ago

How many r’s are there in strawberry?

jofzar•7mo ago

There are 3 R's in strawberry just like in Harry Potter!

gpm•7mo ago

I think it's important to recognize here that fanfiction.net has 850 thousand distinct pieces of Harry Potter fanfiction on it, fifty thousand of which are more than 40k words in length. Many of them (no easy way to measure) directly reproduce parts of the original books.

archiveofourown.org has 500 thousand, some, but probably not the majority, of that are duplicated from fanfiction.net. 37 thousand of these are over 40 thousand words.

I.e. Harry Potter and its derivatives presumably appear a million times in the training set, and it's hard to imagine a model that could discuss this cultural phenomenon well without knowing quite a bit about the source material.

aprilthird2021•7mo ago

Did you read the article? This exact point is made and then analyzed.

> Or maybe Meta added third-party sources—such as online Harry Potter fan forums, consumer book reviews, or student book reports—that included quotes from Harry Potter and other popular books.

> “If it were citations and quotations, you'd expect it to concentrate around a few popular things that everyone quotes or talks about,” Lemley said. The fact that Llama 3 memorized almost half the book suggests that the entire text was well represented in the training data.

gpm•7mo ago

The article fails to mention or understand the volume of content here. Every, literally every, part of these books is quoted and "talked about" (in the sense of used in unlicensed derivative works).

And yes, I read the article before commenting. I don't appreciate the baseless insinuation to the contrary.

davidcbc•7mo ago

Even assuming you are correct, which I'm skeptical of, does this make it better?

It's essentially the same thing: they are copying from a source that is violating copyright, whether that's a pirated book directly or a pirated book via fanfiction.

gpm•7mo ago

Generally I think it matters a great deal to get the facts right when discussing something with nuance.

Is this specific fact required to make my beliefs consistent... Yes I think it is, but if you disagree with me in other ways it might not be important to your beliefs.

Legally (note: not a lawyer) I'm generally of the opinion that

A) Torrenting these books was probably copyright infringement on Meta's part. They should have done so legally by scanning lawfully acquired copies like Google did with Google Books.

B) Everything else here that Meta did falls under the fair use and de minimis exceptions to copyright's prohibition on copying copyrighted works without a license.

And if it was copying significant amounts of a work that appeared only once in its training set into the model the de minimis argument would fall apart.

Morally I'm of the opinion that copyright law's prohibition on deeply interacting with our cultural artifacts by creating derivative works is incredibly unfair and bad for society. This extends to a belief that the communities that do this should not be excluded from technological developments because their entire existence is unjustly outlawed.

Incidentally I don't believe that browsing a site that complies with the DMCA and viewing what it lawfully serves you constitutes piracy, so I can't agree with your characterization of events either. The fanfiction was not pirated just because it was likely unlawful to produce in the US.

1123581321•7mo ago

Agreed. It’s an obtuse quote by Lemley who can’t picture the enormous quantity of associations and crawled data, or at least wants to minimize the quantity. It’s hardly discussion-ending.

Accusations of not reading the article are fair when someone brings up a “related” anecdote that was in the article. It’s not fair when someone is just disagreeing.

paxys•7mo ago

As an experiment I searched Google for "harry potter and the sorcerer's stone text":

- the first result is a pdf of the full book

- the second result is a txt of the full book

- the third result is a pdf of the complete harry potter collection

- the fourth result is a txt of the full book (hosted on github funny enough)

Further down there are similar copies from the internet archive and dozens of other sites. All in the first 2-3 pages.

I get that copyright is a problem, but let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.

aprilthird2021•7mo ago

> let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.

Well, luckily the article points out what people are actually alleging:

> There are actually three distinct theories of how training a model on copyrighted works could infringe copyright:

> Training on a copyrighted work is inherently infringing because the training process involves making a digital copy of the work.

> The training process copies information from the training data into the model, making the model a derivative work under copyright law.

> Infringement occurs when a model generates (portions of) a copyrighted work.

None of those claim that these models are a substitute for buying the books. That's not what the plaintiffs are alleging. Infringing on a copyright is not only a matter of piracy (piracy is one of many ways to infringe copyright).

paxys•7mo ago

People aren't alleging this, the author of the article is.

theK•7mo ago

I think that last scenario seems to be the most problematic. Technically it is the same thing that piracy via torrent does: distributing a small piece of copyrighted material without the copyright holder's consent.

OtherShrezzing•7mo ago

I think the argument is less about piracy and more that the model (or its output) is a derivative work of Harry Potter, and the rights holder should be paid accordingly when it’s reproduced.

paxys•7mo ago

That may be relevant in the NYT vs OpenAI case, since NYT was supposedly able to reproduce entire articles in ChatGPT. Here Llama is predicting one sentence at a time when fed the previous one, with 50% accuracy, for 42% of the book. That can easily be written off as fair use.

gpm•7mo ago

I'm pretty sure books.google.com does the exact same with much better reliability... and the US courts found that to be fair use. (Agreeing with parent comment)

pclmulqdq•7mo ago

If there is a circuit split between it and NYT vs OAI, the Google Books ruling (in the famously tech-friendly ninth circuit) may also find itself under review.

echelon•7mo ago

> Here Llama is predicting one sentence at a time when fed the previous one, with 50% accuracy, for 42% of the book. That can easily be written off as fair use.

Is that fair use, or is that compression of the verbatim source?

TeMPOraL•7mo ago

It doesn't let you recover the text without knowing it in advance, so no.

You can't in particular iterate it sentence by sentence; you're unlikely to go past sentence 2 this way before it starts giving you back its own ideas.

The whole thing is a sleight of hand, basically. There's 42% of the book there, in tiny pieces, which you can only identify if you know what you're looking for. The model itself does not.

gamblor956•7mo ago

> That can easily be written off as fair use.

No, it really couldn't. In fact, it's very persuasive evidence that Llama is straight up violating copyright.

It would be one thing to be able to "predict" a paragraph or two. It's another thing entirely to be able to predict 42% of a book that is several hundred pages long.

reedciccio•7mo ago

Is it Llama violating the "copyright" or is it the researcher pushing it to do so?

lern_too_spel•7mo ago

If you distribute a zip file of the book, are you violating copyright, or is it the person who unzips it?

TeMPOraL•7mo ago

If you walk through the N-gram database with a copy of Harry Potter in hand and observe that for N=7, you can find any piece of it in the database with above-average frequency, does that mean N-gram database is violating copyright?
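
A minimal sketch of the thought experiment above, using a toy corpus and N=3 instead of N=7 so it fits in a few lines. All function names here are hypothetical. It shows that checking coverage requires walking the reference text against the database; the database itself stores counts, not the order in which the pieces appear in any one book.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_database(corpus_texts, n):
    """Count every n-gram across the corpus; no ordering or source is kept."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(ngrams(text.split(), n))
    return counts

def coverage(reference_text, database, n):
    """Fraction of the reference's n-grams that appear in the database."""
    grams = ngrams(reference_text.split(), n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in database) / len(grams)

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a lazy dog sleeps while the quick brown fox runs",
]
db = build_database(corpus, n=3)
reference = "the quick brown fox jumps over a lazy dog"
print(coverage(reference, db, n=3))
```

Note that the reference sentence never appears in the corpus as written, yet most of its 3-grams are found, which is the crux of the disagreement in this subthread.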

gamblor956•7mo ago

If the database is sharing those pieces, it might be yes.

Copyright takes into account the use for which the copying is done. Commercial use will almost always be treated as not fair use, with limited exceptions.

TeMPOraL•7mo ago

I'd say no, because you can't reasonably access and order those pieces without already having the work at your side to use as a reference.

lern_too_spel•7mo ago

Not unless you can reproduce large portions of Harry Potter verbatim from the database. If the 7-grams are taken only from Harry Potter, that is very likely.

gamblor956•7mo ago

You are.

The creation of the unzipped file is not treated as a separate copy so the recipient would not be violating copyright just by unzipping the file you provided.

int_19h•7mo ago

If it can predict the next sentence reliably, that sentence then becomes part of the context, so if you just continue inference, it would eventually produce the entire text verbatim, no?

geysersam•7mo ago

If the assertion in the parent comment is correct "nobody is using this as a substitute to buying the book" why should the rights holders get paid?

riffraff•7mo ago

The argument is that Meta used the book, so the LLM can be considered a derivative work in some sense.

Repeat for every copyrighted work and you end up with publishers reasonably arguing Meta would not be able to produce their LLM without copyrighted work, which they did not pay for.

It's an argument for the courts, of course.

w0m•7mo ago

The argument is whether the LLM training on the copyrighted work is Fair Use or not. Should META pay for the copyright on works it ingests for training purposes?

sabellito•7mo ago

Facebook are using the contents of the book to make money.

psychoslave•7mo ago

The main issue from an economic point of view is that copyright is not the framework we need for social justice, with everyone flourishing by enjoying the pre-existing treasures of human heritage and fairly contributing back.

There is no moral or justice ground to stand on when the system is designed to create a wealth bottleneck toward a few recipients.

Harry Potter is a great piece of artistic work, and it's nice that its author could make her way out of a precarious position. But not having anyone in such a situation in the first place is what a great society should strive for.

Rowling has already received more than she needs to thrive, I guess. I'm confident that there are plenty of other talented authors out there who will never have such a broad avenue for grabbing attention, which is okay. But that they are stuck in terrible economic situations is not okay.

The copyright lottery and the startup lottery are not that different from the standard lottery; they just put so much pressure on the players that they get stuck in the narrative that merit from hard effort is the key component of the wealth gained.

kelseyfrog•7mo ago

Capitalism is allergic to second-order cybernetics.

First-order systems drive outcomes. "Did it make money?" "Did it increase engagement?" "Did it scale?" These are tight, local feedback loops. They work because they close quickly and map directly to incentives. But they also hide a deeper danger: they optimize without questioning what optimization does to the world that contains it.

Second-order cybernetics reasons about systems. It doesn’t ask, "Did I succeed?" It asks, "What does it mean to define success this way?" "Is the goal worthy?"

That’s where capital breaks.

Capitalism is not simply incapable of reflection. In fact, it's structured to ignore it. It has no native interest in what emerges from its aggregated behaviors unless those emergent properties threaten the throughput of capital itself. It isn't designed to ask, "What kind of society results from a thousand locally rational decisions?" It asks, "Is this change going to make more or less money?"

It's like driving by watching only the fuel gauge. Not speed, not trajectory, or whether the destination is the right one. Just how efficiently you’re burning gas. The system is blind to everything but its goal. What looks like success in the short term can be, and often is, a long-term act of self-destruction.

Take copyright. Every individual rule, term length, exclusivity, royalty, can be justified. Each sounds fair on its own. But collectively, they produce extreme wealth concentration, barriers to creative participation, and a cultural hellscape. Not because anyone intended that, but because the emergent structure rewards enclosure over openness, hoarding over sharing, monopoly over multiplicity.

That’s not a bug. That's what systems do when you optimize only at the first-order level. And because capital evaluates systems solely by their extractive capacity, it treats this emergent behavior not as misalignment but as a feature. It canonizes the consequences.

A second-order system would account for the result by asking, "Is this the kind of world we want to live in?" It would recognize that wealth generated without regard to distribution warps everything it touches: art, technology, ecology, and relationships.

Capitalism, as it currently exists, is not wise. It does not grow in understanding. It does not self-correct toward justice. It self-replicates. Cleverly, efficiently, with brutal resilience. It's emergently misaligned and no one is powerful enough to stop it.

frm88•7mo ago

This is a brilliant analysis. Thank you.

em-bee•7mo ago

and as a consequence the fight of AI vs copyright is one of two capitalists fighting each other. it's not about liberating copyright but about shuffling profits around. regardless of who wins that fight society loses.

it conjures up pictures of two dragons fighting each other instead of attacking us, but make no mistake they are only fighting for the right to attack us. whoever wins is coming for us afterwards

thomastjeffery•7mo ago

The AI companies want two things:

1. Strong copyright to prevent competition from undercutting their related businesses.

2. Exclusive rights to totally ignore the copyright of everyone that made the content they use to train models.

I personally would much prefer we take the opportunity to abolish copyright entirely: for everyone, not just a handful of corporations. If derivative work is so valuable to our society (I believe it is), then I should be free to derive NVIDIA's GPU drivers without permission.

snickerer•7mo ago

Very clear and precise line of thoughts. Thank you for that post.

TheOtherHobbes•7mo ago

Copyright doesn't "produce a cultural hellscape." That's just nonsense. Capitalism does because it has editorial control over narratives and their marketing and distribution.

Those are completely different phenomena. Removing copyright will not suddenly open the floodgates of creativity because anyone can already create anything.

But - and this is the key point - most work is me-too derivative anyway. See for example the flood of magic school novels which were clearly loosely derivative of Harry Potter.

Same with me-too novels in romantasy. Dystopian fiction. Graphic novels. Painted art. Music.

It's all hugely derivative, with most people making work that is clearly and directly derivative of other work.

You can't directly copy Harry Potter, but if you create your own magic school story with some similar-ish but different-enough characters and add dragons or something you're fine.

In fact under capitalism it is much harder to sell original work than to sell derivative work. Capitalism enforces exactly this kind of me-too creative staleness, because different-enough work based on an original success is less of a risk than completely original work.

Copyright is - ironically - one of the few positive factors that makes originality worthwhile. You still have to take the risk, but if the risk succeeds it provides some rewards and protections against direct literal plagiarism and copying that wouldn't exist without it.

thomastjeffery•7mo ago

Everything is derivative. This boundary you are defending between originality and slop is extremely subjective at best. What harm is slop anyway? If originality is so objectively valuable, then why should its value be systemically enforced?

At the intersection of capitalism and copyright, I see a serious problem. Collaboration is encapsulated by competition. Because simple derivative work is illegal, all collaboration must be done in teams. Copyright defines every work of art as an island, whose value is not the art itself, but the moat that surrounds it. It should be no surprise that giant anticompetitive corporations reflect this structure. The core value of copyright is not creativity: it's rent-seeking.

Without copyright, we could collaborate freely. Our work would not be required to compete at all! Instead of victory over others' work, our goal could be success!

Aloisius•7mo ago

We know what the world looks like without copyright and that world has far fewer works created and very few artists who can do it full-time absent patronage or independent wealth.

Banning the nonsense that is character copyright and shortening copyright back down to a reasonable length of time (say, 20 years) would still enable the creation of more culturally-relevant derivative works without pauperizing every artist.

thomastjeffery•7mo ago

How could we possibly know that? Copyright has existed since before the industrial revolution even started. What you described is not really that far from reality today: most artists are not really making a living. The words "starving artist" have not even begun to lose their meaning. Every artist I know has been failed by copyright. The value a copyright creates is not applied to the art: it's applied to the moat around the art. The only certain beneficiaries are the giant corporations that use their collected moats to drown out small competition, including artists.

Aloisius•7mo ago

The copyright laws that existed prior to the industrial revolution existed only in a small number of countries. A large swath of the planet had no equivalent.

Even British Colonial America had no copyright, save a handful of exceptions, as the Statute of Anne did not apply to the colonies.

simianwords•7mo ago

I don't like many things about this post; it's a bit snobbish and uses esoteric language in order to sound more intricate than it really is.

>Capitalism is not simply incapable of reflection. In fact, it's structured to ignore it. It has no native interest in what emerges from its aggregated behaviors unless those emergent properties threaten the throughput of capital itself. It isn't designed to ask, "What kind of society results from a thousand locally rational decisions?" It asks, "Is this change going to make more or less money?"

Capitalism and the free market have a lot of useful emergent properties that occur not at the first order but at the second order.

> In the case of the global economic system, under capitalism, growth, accumulation and innovation can be considered emergent processes where not only does technological processes sustain growth, but growth becomes the source of further innovations in a recursive, self-expanding spiral. In this sense, the exponential trend of the growth curve reveals the presence of a long-term positive feedback among growth, accumulation, and innovation; and the emergence of new structures and institutions connected to the multi-scale process of growth

https://en.wikipedia.org/wiki/Emergence

In fact free market is an extremely good example of emergence or second order systems where each individual works selfishly but produces a second order effect of driving growth for everyone - something that is definitely preferable.

kelseyfrog•7mo ago

Appreciate the engagement. But your reply mostly recenters a pro-capitalist narrative by redefining the products of "emergence" as inherently good. My argument isn't about stacking pros and cons and calculating the combined sum. It’s about a structural blind spot: capitalism systematically collapses higher-order questions about what kind of world we're building into first-order value propositions like "growth," "utility," and "innovation."

That's the core problem. Capitalism resists second-order critique from within because it translates every possible value: justice, meaning, even critique itself, into terms it can price or optimize. Your response is a perfect example: you defend capitalism by listing its outputs, but that's another first-order move. If you were engaging at the second-order level, you'd interrogate not what the system produces, but what it refuses to ask, and who gets to decide. That silence is precisely my point.

simianwords•7mo ago

> "emergence" as inherently good

I did not claim it as inherently good, only that it is preferable.

> capitalism systematically collapses higher-order questions about what kind of world we're building into first-order value propositions like "growth," "utility," and "innovation."

There is nothing about capitalism that ignores second or third order effects of its policies. Let me make it clear what kind of capitalist system we have in place - private ownership and a free market regulated by a government that works for and is elected by the people. In this system the free market operates, but only as long as it advances certain things the people voted for, like standard of living, freedom etc. If the free market instead has unintended consequences, we have levers to guide it where we want, like taxes and subsidies.

> Capitalism resists second-order critique from within because it translates every possible value: justice, meaning, even critique itself, into terms it can price or optimize

I think I see what you are getting at, but I have to be honest - I think it is coming from a naive place (I'm open to being proven incorrect).

Imagine you had the power and the responsibility to shape lives by enacting policy decisions. You are presented with a fairly complex problem where you have a large number of people, each one with their own lives and interests and you have to guide them into doing something preferable. No matter where you come from, left or right in the political axis, you will end up using quantitative methods. I imagine your problem is with such optimisation. If so, what is your exact critique here? How would you rather handle such a situation? How would you manage a system of so many people and without quantitative method? Religion?

> If you were engaging at the second-order level, you'd interrogate not what the system produces, but what it refuses to ask, and who gets to decide. That silence is precisely my point.

Ok please elaborate (only if you have engaged with my question above).

Xmd5a•7mo ago

There is a problem with your argument here:

>collapses higher-order questions about what kind of world we're building into first-order value

But then

>you'd interrogate not what the system produces, but what it refuses to ask, and who gets to decide. That silence is precisely my point.

The reply your interlocutor provided is aligned with your incoherence. In a first move you point out that capitalism flattens everything into first-order land, and yet in a second move you tell us there are things it can't talk about. I guess your silence is precisely what articulates these two aspects of your discourse.

bufferoverflow•7mo ago

Do you personally pay every time you quote copyrighted books or song lyrics?

fennecfoxy•7mo ago

But HP is derivative of Tolkien, English/Scottish/Welsh culture, Brothers Grimm and plenty of other sources. Barely any human works are not derivative in some form or fashion.

eviks•7mo ago

Let's also not pretend that "massive new" is the only relevant issue

choppaface•7mo ago

A key premise is that LLMs will probably replace search engines and re-imagine the online ad economy. So today is a key moment for content creators to re-shape their business model, and that can include copyright law (as much as or more than the DMCA change).

Another key point is that you might download a Llama model and implicitly get a ton of copyright-protected content. Versus with a search engine you’re just connected to the source making it available.

And would the LLM deter a full purchase? If the LLM gives you your fill for free, then maybe yes. Or, maybe it’s more like a 30-second preview of a hit single, which converts into a $20 purchase of the full album. Best to sue the LLM provider today and then you can get some color on the actual consumer impact through legal discovery or similar means.

BobbyTables2•7mo ago

Indeed but since when is a blatantly derived work only using 50% of a copyrighted work without permission a paragon of copyright compliance?

Music artists get in trouble for using more than a sample without permission — imagine if they just used 45% of a whole song instead…

I’m amazed AI companies haven’t been sued to oblivion yet.

This utter stupidity only continues because we named a collection of matrices “Artificial Intelligence” and somehow treat it as if it were a sentient pet.

Amassing troves of copyrighted works illegally into a ZIP file wouldn’t be allowed. The fact that the meaning was compressed using “Math” makes everyone stop thinking because they don’t understand “Math”.

colechristensen•7mo ago

>Amassing troves of copyrighted works illegally into a ZIP file wouldn’t be allowed. The fact that the meaning was compressed using “Math” makes everyone stop thinking because they don’t understand “Math”.

LLMs are in reality the artifacts of lossy compression of significant chunks of all of the text ever produced by humanity. The "lossy" quality makes them able to predict new text "accurately" as a result.

>compressed using “Math”

This is every compression algorithm.

Dylan16807•7mo ago

> a blatantly derived work only using 50% of a copyrighted work without permission

What's the work here? If it's the output of the LLM, you have to feed in the entire book to make it output half a book so on an ethical level I'd say it's not an issue. If you start with a few sentences, you'll get back less than you put in.

If the work is the LLM itself, something you don't distribute is much less affected by copyright. Go ahead and play entire songs by other artists during your jam sessions.

yorwba•7mo ago

Music artists get in trouble for using more than a sample from other music artists without permission because their work is in direct competition with the work they're borrowing from.

A ZIP file of a book is also in direct competition with the book, because you could open the ZIP file and read it instead of the book.

A model that can take 50 tokens and give you a greater than 50% probability for the 50 next tokens 42% of the time is not in direct competition with the book, since starting from the beginning you'll lose the plot fairly quickly unless you already have the full book, and unlike music sampling from other music, the model output isn't good enough to read it instead of the book.
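
A rough sketch of why chaining chunks loses the plot quickly. It assumes, unrealistically, that each 50-token chunk succeeds independently with the same probability, and it uses 0.5 only because that is the threshold quoted above; passages the model knows well could sit far above it, and passages outside the 42% would sit below.

```python
def chain_probability(p_chunk, n_chunks):
    """Probability that n_chunks consecutive chunks all come out verbatim,
    assuming independent chunks with equal success probability."""
    return p_chunk ** n_chunks

# Each row: number of consecutive 50-token chunks, chance all are verbatim.
for n in (1, 5, 10, 20):
    print(n, chain_probability(0.5, n))
```

At exactly the threshold probability, twenty chunks in a row (about 1,000 tokens) is already around a one-in-a-million event under these assumptions.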

em-bee•7mo ago

this is the first sensible argument in defense of AI models i read in this debate. thank you. this does make sense.

AI can reproduce individual sentences 42% of the time but it can't reproduce a summary.

the question however is, is that in the design of AI tools or is that a limitation of current models? what if future models get better at this and are able to produce summaries?

otabdeveloper4•7mo ago

LLMs aren't probabilistic. The randomness is bolted on top by the cloud providers as a trick to give them a more humanistic feel.

Under the hood they are 100% deterministic, modulo quantization and rounding errors.

So yes, it is very much possible to use LLMs as a lossy compressed archive for texts.

fennecfoxy•7mo ago

Has nothing to do with "cloud providers". The randomness is inherent to the sampler. Using a sampler that picks the top-probability next token would result in lower quality output; I have definitely seen it get stuck in certain endless sequences when doing that.

Ie you get something like "Complete this poem 'over yonder hills I saw' output: a fair maiden with hair of gold like the sun gold like the sun gold like the sun gold like the sun..." etc.
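A toy sketch of the failure mode described above, assuming a made-up next-token table (`NEXT`) that depends only on the previous token. Real LLMs condition on the whole context, so this only illustrates the mechanism, not any particular model or sampler implementation.

```python
import random

# Hypothetical toy "model": next-token probabilities given only the previous token.
NEXT = {
    "hair": {"of": 1.0},
    "of": {"gold": 1.0},
    "gold": {"like": 0.6, "<end>": 0.4},
    "like": {"the": 1.0},
    "the": {"sun": 1.0},
    "sun": {"gold": 0.55, "<end>": 0.45},
}

def generate(start, pick, max_tokens=20):
    """Generate tokens until <end> is picked or the length cap is reached."""
    out = [start]
    while len(out) < max_tokens:
        token = pick(NEXT[out[-1]])
        if token == "<end>":
            break
        out.append(token)
    return out

def greedy(dist):
    """Always pick the single most probable next token."""
    return max(dist, key=dist.get)

def make_sampler(seed):
    """Return a picker that samples in proportion to the probabilities."""
    rng = random.Random(seed)
    def sample(dist):
        tokens = list(dist)
        return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]
    return sample

greedy_out = generate("hair", greedy)            # cycles until it hits max_tokens
sampled_out = generate("hair", make_sampler(0))  # can leave the cycle via <end>
```

In this table the end token is never the most probable choice, so greedy decoding repeats "gold like the sun" until the length cap, while sampling can take the lower-probability exit.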

otabdeveloper4•7mo ago

> would result in lower quality output

No it wouldn't.

> seen it get stuck in certain endless sequences when doing that

Yes, and infinite loops are just an inherent property of LLMs, like hallucinations.

fennecfoxy•7mo ago

How would it not result in lower quality output? You're reducing the set of tokens that may be selected to 1. The pool isn't necessarily synonyms but words that share some semantic connection to the previous word, but the selection of one word in particular can certainly impact the word that is selected next.

Explain your reasoning otherwise.

otabdeveloper4•7mo ago

> You're reducing the set of tokens that may be selected to 1.

Yes, reducing it to 1 token that is deemed to be the optimal token according to the model.
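One point that may help this exchange: the most probable token at each step does not necessarily yield the most probable sequence overall. A minimal sketch with a hypothetical two-step distribution (the tables `FIRST` and `SECOND` are invented for illustration):

```python
# Hypothetical two-step "model": first-token probabilities, then
# second-token probabilities conditioned on the first token.
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {
    "A": {"x": 0.55, "y": 0.45},
    "B": {"z": 0.9, "w": 0.1},
}

def greedy_sequence():
    """Pick the most probable token at each step."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second), FIRST[first] * SECOND[first][second]

def best_sequence():
    """Score every two-token sequence and return the most probable one."""
    candidates = {
        (f, s): FIRST[f] * SECOND[f][s]
        for f in FIRST
        for s in SECOND[f]
    }
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

greedy_seq, greedy_prob = greedy_sequence()  # ("A", "x"), probability 0.33
best_seq, best_prob = best_sequence()        # ("B", "z"), probability 0.36
```

Here greedy decoding commits to "A" and ends up with a less likely sequence than the one starting with "B". Whether that matters for output quality in practice is a separate question that this sketch does not settle.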

rnkn•7mo ago

You were so close! The takeaway is not that LLMs represent a bottomless tar pit of piracy (they do) but that someone can immediately perform the task 58% better without the AI than with it. This is nothing more than “look what the clever computer can do.”

abtinf•7mo ago

You really don't see the difference between Google indexing the content of third parties and directly hosting/distributing the content itself?

Zambyte•7mo ago

Where are they putting any blame on Google here?

abtinf•7mo ago

Where did I say they were?

Zambyte•7mo ago

When you juxtaposed Google indexing with third parties hosting the content...?

imgabe•7mo ago

Hosting model weights is not hosting / distributing the content.

abtinf•7mo ago

Of course it is.

It's just a form of compression.

If I train an autoencoder on an image, and distribute the weights, that would obviously be the same as distributing the content. Just because the content is commingled with lots of other content doesn't make it disappear.

Besides, where did the sections of text from the input works that show up in the output text come from? Divine inspiration? God whispering to the machine?

aschobel•7mo ago

Indeed! It is a form of massive lossy compression.

> Llama 3 70B was trained on 15 trillion tokens

That's roughly a 200x "compression" ratio, compared to 3-7x for traditional lossless text compression like bzip and friends.

LLMs don't just compress, they generalize. If they could only recite Harry Potter perfectly but couldn’t write code or explain math, they wouldn’t be very useful.
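A rough sketch of where a figure around 200x can come from: it is approximately training tokens per parameter. The byte-level ratio depends on assumptions that are not in the thread (about 4 bytes of text per token and 16-bit weights), so treat that second number as illustrative only.

```python
TRAINING_TOKENS = 15e12   # from the quoted figure
PARAMETERS = 70e9         # Llama 3 70B

# Tokens seen per parameter: about 214.
tokens_per_parameter = TRAINING_TOKENS / PARAMETERS

# Assumptions (not from the thread): ~4 bytes of text per token,
# 2 bytes per parameter (16-bit weights).
BYTES_PER_TOKEN = 4
BYTES_PER_PARAMETER = 2

# Bytes of training text per byte of weights: about 429.
byte_ratio = (TRAINING_TOKENS * BYTES_PER_TOKEN) / (PARAMETERS * BYTES_PER_PARAMETER)
```

Either way the ratio is far beyond what lossless text compressors achieve, which is consistent with calling it lossy.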

amlib•7mo ago

But LLMs can't write code or explain math, they only plagiarize existing code and plagiarize existing explanations of math.

Wowfunhappy•7mo ago

I would be inclined to agree except apparently 42% of the first Harry Potter book is encoded in the model weights...

nashashmi•7mo ago

The way I see it is that an LLM took search results and outputted that info directly. Besides, I think that if an LLM was able to reproduce 42%, assuming that it is not continuous, I would say that is fair use.

timeon•7mo ago

Is this whataboutism?

Anyway, it is not the same. While one points you to a pirated source on specific request, the other uses it to create other content, not just on direct request, as it was part of the training data. Nihilists would then point out that 'people do the same', but they don't, as we do not have the same capabilities for processing the content.

TGower•7mo ago

People aren't buying Harry Potter action figures as a substitute for buying the book either, but copyright protects creators from other people swooping in and using their work in other mediums. There is obviously a huge market demand for high quality data for training LLMs; Meta just spent 15 billion on a data labeling company. Companies training LLMs on copyrighted material without permission are doing that as a substitute for obtaining a license from the creator for doing so, in the same way that a pirate downloading a torrent is a substitute for getting an ebook license.

ritz_labringue•7mo ago

Harry Potter action figures trade almost entirely on J. K. Rowling’s expressive choices. Every unlicensed toy competes head‑to‑head with the licensed one and slices off a share of a finite pot of fandom spending. Copyright law treats that as classic market substitution and rightfully lets the author police it.

Dropping the novels into a machine‑learning corpus is a fundamentally different act. The text is not being resold, and the resulting model is not advertised as “official Harry Potter.” The books are just statistical nutrition. One ingredient among millions. Much like a human writer who reads widely before producing new work. No consumer is choosing between “Rowling’s novel” and “the tokens her novel contributed to an LLM,” so there’s no comparable displacement of demand.

In economic terms, the merch market is rivalrous and zero‑sum; the training market is non‑rivalrous and produces no direct substitute good. That asymmetry is why copyright doctrine (and fair‑use case law) treats toy knock‑offs and corpus building very differently.

vrighter•7mo ago

So? Am I allowed to also ignore certain laws if I can prove others have also ignored them?

pera•7mo ago

> let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy

No one is claiming this.

The corporations developing LLMs are doing so by sampling media without their owners' permission and arguing this is protected by US fair use laws, which is incorrect - as the late AI researcher Suchir Balaji explained in this other article:

https://suchir.net/fair_use.html

almosthere•7mo ago

Yeah, that's literally the title of the article, and the premise of the first paragraph.

Retric•7mo ago

The first paragraph isn’t arguing that this copying will lead to piracy. It’s referring to court cases where people are trying to argue LLM’s themselves are copyright infringing.

pera•7mo ago

It's not literally the title of the article, nor the premise of its first paragraph, but since this was your interpretation I wonder if there is a misunderstanding around the term "piracy", which I believe is normally defined as the unauthorized reproduction of works, not a synonym for copyright infringement, which is a broader concept.

cultureulterior•7mo ago

It's not clear that it's incorrect.

Retric•7mo ago

I’ve yet to read an actual argument defending commercial LLM’s as fair use based on existing (edit:legal) criteria.

TeMPOraL•7mo ago

I'm yet to read an actual argument that it's not.

Vibe-arguing "because corporations111" ain't it.

Retric•7mo ago

I’m looking for a link that does something like this but ends up supporting commercial LLM’s

https://copyrightalliance.org/faqs/what-is-fair-use/

The purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes; (commercial least wiggle room)

The nature of the copyrighted work; (fictional work least wiggle room)

The amount and substantiality of the portion used in relation to the copyrighted work as a whole; (42% is considered a huge fraction of a book) and

The effect of the use upon the potential market for or value of the copyrighted work. (Best argument as it’s minimal as a piece of entertainment. Not so as a cultural icon. Someone writing a book report or fan fiction may be less likely to buy a copy.)

Those aren’t the only factors, but I’m more interested in the counter argument here than trying to say they are copyright infringing.

TheOtherHobbes•7mo ago

Copyright notices in books make it absolutely clear - you are not allowed to acquire a text by copying it without authorisation.

If you photocopy a book you haven't paid for, you've infringed copyright. If you scan it, you've infringed copyright. If you OCR the scan, you've infringed copyright.

There's legal precedent in going after torrenters and z-lib etc.

So when Zuckerberg told the Meta team to do the same, he was on the wrong side of precedent.

Arguing otherwise is literally arguing that huge corporations are somehow above laws that apply to normal people.

Obviously some people do actually believe this. Especially the people who own and work for huge corporations.

But IMO it's far more dangerous culturally and politically than copyright law is.

ben_w•7mo ago

For this part in particular:

> The amount and substantiality of the portion used in relation to the copyrighted work as a whole; (42% is considered a huge fraction of a book)

For AI models as they currently exist… I'm not sure about typical or average, but Llama 3 is 15e12 tokens for all model sizes up to 405 billion parameters (~37 tokens per parameter), so a 100,000 token book (~75,000 words) is effectively contributing about 2700 parameters to the whole model.
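The back-of-the-envelope arithmetic can be checked in a few lines. This sketch assumes 405 billion parameters for the largest Llama 3.1 model; the result is not sensitive to small changes in that count.

```python
TRAINING_TOKENS = 15e12   # training set size given in the comment
PARAMETERS = 405e9        # assumed size of the largest Llama 3.1 model
BOOK_TOKENS = 100_000     # book length used in the comment

# Tokens of training data per parameter: about 37.
tokens_per_parameter = TRAINING_TOKENS / PARAMETERS

# A book's proportional "share" of the weights: about 2700 parameters.
book_parameter_share = BOOK_TOKENS / tokens_per_parameter
```

This is only a proportional share. Training does not actually allocate specific parameters to specific books, which is why heavily repeated texts can end up far better memorized than this average suggests.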

The *average* book is condensed into a summary of that book, and of the style of that book. This is also why, when you ask a model for specific details of stuff in the training corpus, what you get back *usually* only sounds about right rather than being an actual quote, and why LLMs need to have access to a search engine to give exact quotes. The exceptions are things that have been quoted many many times like the US constitution or, by the look of things from this article, widely pirated books where there are a lot of copies.

Mass piracy leading to such infringement is still bad, but I think the reasons why matter: Given Meta is accused of mass piracy to get the training set for Llama, I think they're as guilty as can be, but if this had been "we indexed the open internet, pirate copies were accidental", this would be at least a mitigation.

(There's also an argument for "your writing is actually very predictable"; I've not read the HP books myself, though (1) I'm told the later ones got thicker due to repeating exposition of the previous books, and (2) a long-running serialised story I read during the pandemic, The Deathworlders, became very predictable towards the end, so I know it can happen).

Conversely, for this part:

> The effect of the use upon the potential market for or value of the copyrighted work. (Best argument as it’s minimal as a piece of entertainment. Not so as a cultural icon. Someone writing a book report or fan fiction may be less likely to buy a copy. )

The current uses alone should make it clear that the effect on the potential market is catastrophic, and not just for existing works but also for not-yet-written ones.

People are using them to write blogs (directly from the LLM, not a human who merely used one as a copy-editor), and to generate podcasts (some have their own TTS, but that's easy anyway). My experiments suggest current models are still too flawed to be worth listening to them over e.g. the opinion of a complete stranger who insists they've "done their own research": https://github.com/BenWheatley/Timeline-of-the-near-future

LLMs are not yet good enough to write books, but I have tried using them to write short stories to keep track of capabilities, and o1 is already better than similar short stories on Reddit (not "good", just "better"): https://github.com/BenWheatley/Studies-of-AI/blob/main/Story...

But things do change, and I fully expect the output of various future models (not necessarily Transformer based) to increase the fraction of humans whose writings they surpass. I'm not sure what counts as "professional writer", but the U.S. Bureau of Labor Statistics says there's 150,000 "Writers and Authors"* out of a total population of about 340 million, so when AI is around the level of the best 0.04% of the population then it will start cutting into such jobs.

On the basis that current models seem (to me) to write software at about the level of a recent graduate, and with the potentially incorrect projection that this is representative across domains, and there are about 1.7 million software developers and 100k new software developer graduates each year, LLMs today would be around the 100k worst of the 1.7 million best out of 340 million people, i.e. all software developers are the top 0.5% of the population, and LLMs are on par with the bottom 0.03 percentage points of that. (This says nothing much about how soon the models will improve).
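The population shares used in the last two paragraphs work out as follows. All inputs are the figures quoted in the comment; the variable names are just for illustration.

```python
POPULATION = 340e6      # total US population used in the comment
WRITERS = 150_000       # BLS "Writers and Authors" figure quoted above
DEVELOPERS = 1.7e6      # software developers
NEW_GRADS = 100_000     # new software developer graduates per year

# Shares of the population, expressed in percent.
writer_share_pct = 100 * WRITERS / POPULATION        # about 0.044%
developer_share_pct = 100 * DEVELOPERS / POPULATION  # 0.5%
new_grad_share_pct = 100 * NEW_GRADS / POPULATION    # about 0.029%
```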

But of course, some of that copyrighted content is about software development, and we're having conversations here on HN about the trouble fresh graduates are having and if this is more down to AI, the change of US R&D taxation rules (unlikely IMO, I'm in Germany and I think the same is happening here), or the global economy moving away from near-zero interest rates.

* https://www.bls.gov/ooh/media-and-communication/writers-and-...

roenxi•7mo ago

It seems like a pretty reasonable argument and easy enough to make. A human with a great memory could probably recreate some absurd % of Harry Potter after reading it, there are some very unusual minds out there. It is clear that if they read Harry Potter and <edit> being capable </edit> of reproducing it on demand as a party trick that would be fair use. So the LLM should also be fair use since it is using a mechanism similar enough to what humans do and what humans do is fine.

The LLMs I've used don't randomly start spouting Harry Potter quotes at me, they only bring it up if I ask. They aren't aiming to undermine copyright. And they aren't a very effective tool for it compared to the very well developed networks for pirating content. It seems to be a non-issue that will eventually be settled by the raw economic force that LLMs are bringing to bear on society in the same way that the movie industry ultimately lost the battle against torrents and had to compete with them.

sabellito•7mo ago

The difference might be the "human doing it as a party trick" vs "multi billion dollar corporation using it for profit".

Having said that I think the cat is very much out of the bag on this one and, personally, I think that LLMs should be allowed to be trained on whatever.

Retric•7mo ago

> is clear that if they read Harry Potter and reproduce it on demand as a party trick that would be fair use.

Actually no that could be copyright infringement. Badly singing a recent pop song in public also qualifies as copyright infringement. Public performances count as copying here.

ricardobeat•7mo ago

> Badly singing a recent pop song in public also qualifies as copyright infringement

For commercial purposes only. If someone sells a recreation of the Harry Potter book, it’s illegal regardless whether it was by memory, directly copying the book, or using an LLM. It’s the act of broadcasting it that’s infringing on copyright, not the content itself.

Retric•7mo ago

There’s a bunch of nuance here.

But just for clarification, selling a recreation isn’t required for copyright infringement. The copying itself can be problematic so you can’t defend yourself by saying you haven’t yet sold any of the 10,000 copies you just printed. There are some exceptions that allow you to make copies for specific purposes, skip protection on a portable CD player for example, but that doesn’t apply to the 10k copies situation.

roenxi•7mo ago

Ah sorry. I mistyped. Being able to do that would be fair use. I went back and fixed the comment.

Although frankly, as has been pointed out many times, the law is also stupid in what it prohibits and that should be fixed first as a priority. It's done some terrible damage to our culture. My family used to be part of a community choir until it shut down basically for copyright reasons.

close04•7mo ago

> A human with a great memory

This kind of argument keeps popping up usually to justify why training LLMs on protected material is fair, and why their output is fair. It's always used in a super selective way, never accounting for confounding factors, just because superficially it sort of supports that idea.

Exceptional humans are exceptional, rare. When they learn, or create something new based on prior knowledge, or just reproduce the original they do it with human limitations and timescales. Laws account for these limitations but still draw lines for when some of this behavior is not permitted.

The law didn't account for a computer "software" that can ingest the entirety of human creation that no human could ever do, then reproduce the original or create an endless number of variations in a blink of an eye.

ab5tract•7mo ago

That’s why the “transformative” argument falls so flat to me. It’s about transformation in the mind and hands of a human.

Traditionally tools that reduce the friction of creating those transformations make a work less “transformed” in the eyes of the law, not more so. In this case the transformation requires zero mental or physical effort.

staticman2•7mo ago

Nobody in real life thinks humans and machines are the same thing and actually believes they should have the same legal status. The A.I. enthusiast would not support the legality of shooting them when no longer useful the way a company would shred an old hard drive.

This supposed failure to see the difference between the human mind and a machine whenever someone brings up copyright is performative and disingenuous.

close04•7mo ago

> Nobody in real life thinks humans and machines are the same thing

Maybe you've been following a different conversation, or jumping to conclusions is just more convenient. This isn't about "legal status of AI" but about laws written having in mind only the capabilities of humans, at a time when systems as powerful as today's were unthinkable. Obviously the same laws have to set different limits for humans and machines.

There's no law limiting a human's top (running) speed but you have speed limits for cars. Maybe you're legally allowed to own a semi-automatic weapon but not an automatic one. This is the ELI5 for why when legislating, capabilities make all the difference. Obviously a rifle should not have the same legal status or be the same thing as a human, just in case my point is still lost on you.

Literally every single discussion on this LLM training/output topic, this one included, eventually has a number of people basing their argument on "but humans are allowed to do it", completely ignoring that humans can only do it in a much, much more limited way.

> is performative and disingenuous

That's an extremely uncharitable and aggressive take, especially after not bothering to understand at all what I said.

staticman2•7mo ago

>That's an extremely uncharitable and aggressive take, especially after not bothering to understand at all what I said.

To be clear, my intent wasn't to say you were the one being performative and disingenuous. I was referring to the sort of person you were debating against, the one who thinks every legal issue involving A.I. can be settled by typing "humans are allowed to do it."

Since I replied to you, I can see how what I wrote was confusing. My apologies.

The parent you replied to claimed LLMs are using "mechanism similar enough to what humans do and what humans do is fine."

Parent probably doesn't want his or her brain shredded like an old hard drive despite claiming similar mechanisms whenever it is convenient.

I'm arguing nobody actually believes there are "similar mechanisms" between machines and humans in their revealed preferences in day to day life.

>There's no law limiting a human's top (running) speed but you have speed limits for cars. Maybe you're legally allowed to own a semi-automatic weapon but not an automatic one.

I don't believe this analogy works. If we're talking about transmitting the text of Harry Potter, I believe it would already be illegal for a single human to type it on demand as a service.

If we are talking about remembering the text of Harry Potter but not reciting it on demand, that's not illegal for a human because copyright doesn't govern human memories.

I don't see what copyright law you think needs updating.

bloak•7mo ago

I'm fairly sure that the law treats humans and machines differently, so arguing that it would be OK if a person did it therefore it's OK to build a machine that does it is not very helpful. (I'm not sure you're doing that but lots of random non-lawyers on the Internet seem to be doing that.)

Claims like this demonstrate it, really: it is obviously not copyright infringement for a human to memorise a poem and recite it in private; it obviously is copyright infringement to build a machine that does that and grant public access to that machine. (Or does anyone think that's not obvious?)

triceratops•7mo ago

> It is clear that if they read Harry Potter and <edit> being capable </edit> of reproducing it on demand as a party trick that would be fair use.

Not fair use. No one would ever prosecute it as infringement but it's not fair use.

Lerc•7mo ago

Based upon legal decisions in the past there is a clear argument that the distinction for fair use is whether a work is substantially different to another. You are allowed to write a book containing information you learned about from another book. There is a threshold in academia regarding plagiarism that stands apart from the legal standing. The measure that was used in Gyles v Wilcox was if the new work could substitute for the old. Lord Hardwicke had the wisdom to defer to experts in the field as to what the standard should be for accepting something as meaningfully changed.

Recent decisions such as Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith have walked a fine line with this. I feel like the supreme court got this one wrong because the work is far more notable as a Warhol than as a copy of a photograph, perhaps that substitution rule should be a two way street. If the original work cannot substitute for the copy, then clearly the copy must be transformative.

LLMs generating works verbatim might be an infringement of copyright (probably not), distributing those verbatim works without a licence certainly would be. In either case, it is probably considered a failure of the model, Open AI have certainly said that such reproductions shouldn't happen and they consider it a failure mode when it does. I haven't seen similar statements from other model producers, but it would not surprise me if this were the standard sentiment.

Humans looking at works and producing things in a similar style is allowed, indeed this is precisely what art movements are. The same transformative threshold applies. If you draw a cartoon mouse, that's ok, but if people look at it and go "It's Mickey Mouse" then it's not. If it's Mickey to Tiki Tu Meke, it clearly is Mickey but it is also clearly transformative.

Models themselves are very clearly transformative. Copyright itself was conceived at a time when generated content was not considered possible so the notion of the output of a transformative work being a non transformative derivative of something else was never legally evaluated.

Retric•7mo ago

I think you may have something with that line of reasoning.

The threshold for transformative for fictional works is fairly high unfortunately. Fan fiction and reasonably distinct works with excessive inspiration are both copyright infringing. https://en.wikipedia.org/wiki/Tanya_Grotter

> Models themselves are very clearly transformative.

A near word for word copy of large sections of a work seems nowhere near that threshold. An MP3 isn’t even close to a 1:1 copy of a piece of music but the inherent differences are irrelevant, a neural network containing and allowing the extraction of information looks a lot like lossy compression.

Models could easily be transformative, but the justification needs to go beyond well obviously they are.

Lerc•7mo ago

Models are not word for word copies of large sections of text. They are capable of emitting that text though.

It would be interesting to look at what legal precedents were set regarding mp3s or other encodings. Is the encoding itself an infringement, or is it the decoding, or is it the distribution of a decodable form of a work?

There is also the distinction with a lossy encoding that encodes a single work. There is clarity when the encoded form serves no other purpose other than to be decoded into a given work. When the encoding acts as a bulk archive, does the responsibility shift to those who choose what to extract from the archive?

Retric•7mo ago

> Is the encoding itself an infringement

Barring a fair use exception, yes.

From what I’ve read MP3’s get the same treatment as cassette tapes which were also lossy. It’s 1:1 digital copies that represented some novelty, but that rarely matters.

I’m hesitant to comment on the rest of that. The ultimate question isn’t if some difference exists but why that difference matters.

int_19h•7mo ago

> When the encoding acts as a bulk archive, does the responsibility shift to those who choose what to extract from the archive?

If you take many gigabytes of, say, public domain music, and stick them on a flash drive with just one audio file that is an unlicensed copy of a copyrighted song, distributing that drive would constitute copyright infringement, quite obviously so. I don't see why it'd matter what else the model can produce, if it can produce that one thing verbatim by itself.

(If you could only prompt the model to regurgitate the original text with a framing of, say, critical analysis of said text around it, and not in any other context, then I think there would be a stronger fair use argument here.)

triceratops•7mo ago

Training itself involves making infringing copies of protected works. Whether or not inference produces copyrighted material is almost beside the point.

yunwal•7mo ago

No it doesn’t? You can buy a digital copy of Harry Potter and use it for training. No infringement needed.

triceratops•7mo ago

Only as long as it's not copied again during training. You can't make copies of your purchased digital copy for any reason other than archival.

Retric•7mo ago

Incidental copies during playback are also allowed. But none of these companies are paying for copies in the first place.

johanyc•7mo ago

It’s legal if it’s fair use, which is yet to be decided by a court

rpd9803•7mo ago

Copyright fair use rules are tools designed to govern how humans use protected works in derived works. AI is not human use, therefore the rules are only coincidentally correct for AI use, where they are correct at all.

Lerc•7mo ago

If you take that approach to fair use, don't you open the door to the same argument for copyright itself?

How do you distinguish between a tool and the director of a tool? I doubt people would say that a person is immune to copyright or fair use rules because it was the pen that wrote the document, not the person.

Retric•7mo ago

> don’t you open the door to the same argument for copyright itself?

Yes, it comes down to intentional control of output. Copyright applies when someone uses a pen to make a drawing because of the degree of control.

On the flip side there are copyright free photos where an animal picked up a camera etc, the same applies to a great deal of automatically generated data. The output of an LLM is likely in the public domain unless it’s a derivative work of something in the training set.

int_19h•7mo ago

I think it's a valid question. Suppose you have two LLMs interacting with each other in a loop, and one randomly prompts the other to reproduce the entire text of Harry Potter, which the other then does. However, the chat log isn't actually stored anywhere, it's just a transient artifact of the interaction - so no human ever sees it nor can see it even in principle. Is it a copyright violation then? If it is, what are the damages?

paxys•7mo ago

If you really haven't read a single argument about it then you're deliberately blocking them out, because it just takes a couple minutes of searching.

https://www.arl.org/blog/training-generative-ai-models-on-co...

https://hls.harvard.edu/today/does-chatgpt-violate-new-york-...

https://www.bakerdonelson.com/artificial-intelligence-and-co...

https://www.techpolicy.press/to-support-ai-defend-the-open-i...

Retric•7mo ago

Those support the utility or debate individual points but don’t make a coherent argument that LLMs are strictly fair use.

First link provides quotes but doesn’t actually make an argument that LLM’s are fair use under current precedent. Rather that training AI can be fair use and researchers would like LLM’s to include copyrighted works to aid research on modern culture. The second article goes into depth but isn’t a defense of LLM’s. If anything they suggest a settlement is likely. The final instead argues for the utility of LLM’s, which is relevant but doesn’t rely on existing precedent, the court could rule in favor of some mandatory licensing scheme for example.

The third gets close: “We expect AI companies to rely upon the fact that their uses of copyrighted works in training their LLMs have a further purpose or different character than that of the underlying content. At least one court in the Northern District of California has rejected the argument that, because the plaintiffs' books were used to train the defendant’s LLM, the LLM itself was an infringing derivative work. See Kadrey v. Meta Platforms, Case No. 23-cv-03417, Doc. 56 (N.D. Cal. 2023). The Kadrey court referred to this argument as "nonsensical" because there is no way to understand an LLM as a recasting or adaptation of the plaintiffs' books. Id. The Kadrey court also rejected the plaintiffs' argument that every output of the LLM was an infringing derivative work (without any showing by the plaintiffs that specific outputs, or portion of outputs, were substantially similar to specific inputs). Id.”

Very relevant, but runs into issues when large sections can be recovered and people do use them as substitutes for the original work.

rpd9803•7mo ago

"It's just doing what a human would do!" -Internet AI Expert

jiggawatts•7mo ago

If you train a meat-based intelligence by having it borrow a book from a library without any sort of permission, license, or needing a lawyer specialised in intellectual property, we call that good parenting and applaud it.

If you train a silicon-based intelligence by having it read the same books with the same lack of permission and license, it's a blatant violation of intellectual property law and apparently needs to be punished with armies of lawyers doing battle in the courts.

Picture one of Asimov's robots. Would a robot be banned from picking up a book, flipping it open with its dexterous metal hands, and reading it?

What about a cyborg intelligence, the type Elon is trying to build with Neuralink? Would humans with AI implants need licenses to read books, even if physically standing in a library and holding the book in their mostly meat hands?

Okay, maybe you agree that robots and cyborgs are allowed to visit a library!

Why the prejudice against disembodied AIs?

Why must they have a blank spot in the vast matrices of their minds?

xigoi•7mo ago

> If you train a meat-based intelligence by having it borrow a book from a library without any sort of permission, license, or needing a lawyer specialised in intellectual property, we call that good parenting and applaud it.

If you’re selling your child as a tool to millions of people, I would certainly not call that good parenting.

jiggawatts•7mo ago

"Child actor" is a job where the result of the neural net training is sold to millions of people by the parents.

To play the Devil's Advocate against my own argument: The government collects income taxes on neural nets trained using government-funded schools and public libraries. Seeing as how capitalists are positively salivating at the opportunity to replace pesky meat employees with uncomplaining silicon ones, perhaps a nice high maximum-marginal-rate tax on all AI usage might be the first big step towards UBI and then the Star Trek utopia we all dream of.

Just kidding. It'll be a cyberpunk dystopia. You know it will.

almosthere•7mo ago

"Child Actors" are more an exception. You can train a million children on the books of harry potter, only 3 or 4 will be good enough to be actors. The children that "made it" did so from grit and passion (or other traits) but very little from that reading of 10-20 books.

The AI that reads the books, and can do what LLMs do, is guaranteed to be sold for billions in API calls.

TeMPOraL•7mo ago

What about a company funding books and education materials to train its employees into specialists, and then selling access to them to other businesses? E.g. any honest consulting company.

Terretta•7mo ago

> The corporations developing LLMs are doing so by sampling media without their owners' permission and arguing this is protected by US fair use laws

The schools developing future labor are doing so by sampling media without their owners' permission ...

Harry Potter's been required reading for a while.

And while it may not be the most quoted, others are: any time we're supposed to understand a movie or TV character is "well read" they quote paragraphs from famous authors.

This is what libraries used to be for, and what the Internet was supposed to be for: fill our brains with what's been published and hopefully we remember some of it. Should libraries be off limits to savants with eidetic memory?

Why should learning from reading be off limits to the machine?

// Reproducing the reading material for distribution is illegal for both man and machine.

However, perhaps you're using a different definition of "use" in fair use, than the traditional "quote it".

- the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

- the effect of the use upon the potential market for or value of the copyrighted work.

In general these are not thought to mean you can't use it as in learn from it, they are thought to mean you can't reproduce chunks, perform chunks, etc.

I imagined it's well established that "learn" is not "use".

So then where I find myself uncertain is whether learning, then responding about it, is learning + (hand waving) artificial intelligence, or whether it's just source (context) compression with prompted continuation to mine the context, and what density of words from the source in the continuation starts to be "use".

delusional•7mo ago

> No one is using this as a substitute for buying the book.

You don't get to say that. Copyright protects the author of a work, but does not bind them to enforce it in any instance. Unlike a trademark, a copyright holder does not lose their protection by allowing unlicensed usage.

It is wholly at the copyright holder's discretion to decide which usages they allow and which they do not.

fragmede•7mo ago

Of their exact work, sure, but CliffsNotes exist for many books and don't infringe copyright.

lucianbr•7mo ago

> some massive new avenue to piracy

So it's fine as long as it's old piracy? How did you arrive at that conclusion?

7bit•7mo ago

> let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.

You are completely missing the point. Have you read the actual article? Piracy isn't mentioned a single time.

raxxorraxor•7mo ago

Also copyright should never trump privacy. That the New York Times with their lawsuit can force OpenAI to store all user prompts is a severe problem. I dislike OpenAI, but the lawsuits around copyrights are ridiculous.

Most non-primitive art has had an inspiration somewhere. I don't see this as too different in how AIs learn.

sReinwald•7mo ago

You're attacking a strawman. Nobody's claiming LLMs are a new piracy vector or that people will use ChatGPT, Llama or Claude instead of buying Harry Potter.

The issue here is that tech companies systematically copied millions of copyrighted works to build commercial products worth billions, without reimbursing the people who made their products possible in the first place. The research shows Llama literally memorized 42% of Harry Potter - not simply "learned from it," but can reproduce it verbatim. That's 1) not transformative and 2) clear evidence of copyright infringement.

By your logic, the existence of torrents would make it perfectly acceptable for someone to download pirated movies and charge people to stream them. "Piracy already exists" isn't a defense, and it especially shouldn't be for companies worth billions. But you bet your ass that if I built a commercial Netflix competitor built on top of systematic copyright violations, I'd be sued into the dirt faster than I can say "billion dollar valuation".

Aaron Swartz faced 35 years in prison and ultimately took his own life over downloading academic papers that were largely publicly funded. He wasn't selling them, he wasn't building a commercial product worth billions of dollars - he was trying to make knowledge accessible.

Meanwhile, these AI companies like Meta systematically ingested copyrighted works at an industrial scale to build products worth billions. Why does an individual face life-destroying prosecution for far less, while trillion dollar companies get to negotiate in civil court after building empires on others' works? And why are you defending them?

Edit:

And for what it's worth, I'm far from a copyright maximalist. I've long believed that copyright terms - especially decades after creators' deaths - have become excessive. But whatever your stance on copyright ultimately is, the rules should apply equally to individuals like Aaron and multi-billion dollar corporations.

You cannot seriously use the fact that individuals may pirate a book (which is illegal) as an ethical or legal defense for corporations doing the same thing at an industrial scale for profit.

panzi•7mo ago

Everything you mentioned can simply be deleted. You can't really delete this from the "brain" of the LLM if a court orders you to do so, you have to re-train the LLM, which is costly. That's the problem I see.

blks•7mo ago

Problem is that it copies much more work than just Harry Potter, including yours if you ever shared it (even under a copyleft license), and makes money off it.

up2isomorphism•7mo ago

It is actually much worse than piracy. I would much prefer a complete pirate copy of my creation to a half baked one.

aspenmayer•7mo ago

It's kind of a no-win situation for creators, as their work is bastardized and name divorced from all meaning it might have as a creator in relation to zombie necroposts sprung to life. One's own right to be identified as a creator is made meaningless in relation to such a creation that they didn't directly create. AI are apocalyptic plague locusts that convert coal to droll; AI are alienation demons driving human mothers and fathers against their own estranged reanimated lifeless intellectual prodigal child Frankensteins.

bjornsing•7mo ago

It’s well-known that John von Neumann had this ability too:

Herman Goldstine wrote "One of his remarkable abilities was his power of absolute recall. As far as I could tell, von Neumann was able on once reading a book or article to quote it back verbatim; moreover, he could do it years later without hesitation. He could also translate it at no diminution in speed from its original language into English. On one occasion I tested his ability by asking him to tell me how A Tale of Two Cities started. Whereupon, without any pause, he immediately began to recite the first chapter and continued until asked to stop after about ten or fifteen minutes."

Maybe it’s just an unavoidable side effect of extreme intelligence?

bradley13•7mo ago

Many people could also produce text snippets from memory. I dispute that reading a book is a copyright violation. Copying and distributing a book, yes, but just reading it - no.

If the book was obtained legitimately, letting an LLM read it is not an issue.

riffraff•7mo ago

It is well reported that Meta (and OpenAI and basically everyone) trained on content obtained via piracy (LibGen).

Javantea_•7mo ago

I'm surprised no one in the comments has mentioned overfitting. Perhaps this is too obvious but I think of it as a very clear bug in a model if it asserts something to be true because it has heard it once. I realize that training a model is not easy, but this is something that should've been caught before it was released. Either QA is sleeping on the job or they have intentionally released a model with serious flaws in its design/training. I also understand the intense pressure to release early and often, but this type of thing isn't a warning.

Tepix•7mo ago

I think part of the problem is that the book is in the training set multiple times

numpad0•7mo ago

It's apparently known among LLM researchers that the best epoch count for LLM training is one. They go through the entire dataset once, and that makes the best LLMs.

They know. LLM is a novel compression format for text (holographic memory or whatever). The question is whether the rest of the world accepts this technology as it is or not.

jeroenhd•7mo ago

Overfitting makes for more human-like output (because it's repeating words written by a human). Out of all possible failure states of a model, overfitting is probably what you want out of an LLM, as long as it's not overfitted enough to lose lawsuits.

fennecfoxy•7mo ago

I disagree. I'd include overfitting for LLMs as creating unreasonably strong connections to individual sequences used for training, whereas a good mix of that and connections between chunks of those sequences are required.

BUFU•7mo ago

Would it be possible that other people posted the content of the Harry Potter book online and the model developer scraped that information? Would the model developer be at fault in this scenario?

timeon•7mo ago

I think this is a good question, at least for LLMs in general. However, we know that Meta used pirated torrents.

briffid•7mo ago

Quotation is fair use in all sensible copyright systems. An LLM will mostly be able to quote anything, and should be. Quotation is not a derivative work. LLMs are not stealing copyrighted work. They just show that Harry Potter is in English and a mostly logical story. If someone is stabbed, they will die in most stories, that's not copyrightable. If you have an engine that knows everything, it will be able to quote everything.

choeger•7mo ago

LLMs are to a certain degree compressed databases of their training data. But 42% is a surprisingly large number.

gamblor956•7mo ago

It's not fair use just because you guys want it to be fair use.

While limited quoting can be (and usually is) considered fair use, quoting significant portions of a book (much less 42% of it) has never been fair use, in the U.S., Europe, or any other nation.

Yes, information wants to be free, yada yada. That means facts. Whether creative works are free is up to their creators.

flowerthoughts•7mo ago

If LLMs are good at summarizing/compressing, what does this say about the underlying text? Why are some passages more easily recalled? Sure, some sections have probably been quoted more times than others, so there's bias in training data, which might explain why the Llama 1 and 3.1 images have similar peaks. Would this happen to LLMs even with no training bias?

Edit: seems the first part is about a memory about being bullied by Dudley. The second is where he's been elected to the Quidditch team. Possibly they are just boring passages, compared to the surrounding ones. So probably just training bias.

TeMPOraL•7mo ago

Well, so can a nontrivial number of people. It's Harry Potter we're talking about - it's up there with The Bible in popularity ranking.

I'm gonna bet that Llama 3.1 can recall a significant portion of Pride and Prejudice too.

With examples of this magnitude, it's normal and entirely expected this can happen - as it does with people[0] - the only thing this is really telling us is that the model doesn't understand its position in the society well enough to know to shut up; that obliging the request is going to land it, or its owners, into trouble.

In some way, it's actually perverted.

EDIT: it's even worse than that. What the research seems to be measuring is that the models recognize sentence-sized pieces of the book as likely continuations of an earlier sentence-sized piece. Not whether it'll reproduce that text when used straightforwardly - just whether there's an indication it recognizes the token patterns as likely.

By that standard, I bet there's over a billion people right now who could do that to 42% of first Harry Potter book. By that standard, I too memorized the Bible end-to-end, as had most people alive today, whether or not they're Christian; works this popular bleed through into common language usage patterns.
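To make the measurement described in the EDIT above concrete, here is a minimal sketch, assuming the setup reported in the article: the model is given a prefix from the book, and a passage counts as "memorized" if the probability of the exact next 50 tokens, taken as the product of the per-token probabilities, is at least 50%. The function names and the per-token numbers below are hypothetical illustrations, not the paper's code or data.

```python
import math


def continuation_probability(token_logprobs):
    """Probability of one exact continuation.

    It is the product of the per-token conditional probabilities,
    computed here as exp(sum of log-probabilities) for numerical stability.
    """
    return math.exp(sum(token_logprobs))


def counts_as_memorized(token_logprobs, threshold=0.5):
    """True if the exact continuation is at least `threshold` likely.

    The 0.5 default mirrors the "at least half the time" criterion;
    treat it as an assumption about the study's setup.
    """
    return continuation_probability(token_logprobs) >= threshold


# Hypothetical per-token probabilities for a 50-token continuation.
very_confident = [math.log(0.99)] * 50  # 0.99 ** 50 is about 0.61
fairly_confident = [math.log(0.95)] * 50  # 0.95 ** 50 is about 0.08

print(counts_as_memorized(very_confident))    # True
print(counts_as_memorized(fairly_confident))  # False
```

Note how demanding the criterion is: even 95% confidence on every single token leaves the 50-token sequence far below the 50% bar, so recognizing a phrase as merely "likely" is not enough to pass it.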

[0] - Even more so when you relax your criteria to accept an occasional misspelling or paraphrase - then each of us likely knows someone who could piece together a chunk of HP book from memory.

strogonoff•7mo ago

I keep waiting for the day when software stops being compared to a human person (a being with agency, free will, consciousness, and human rights of its own) for the purposes of justifying IP law circumvention.

Yes, there is no problem when a person reads some book and recalls pieces[0] of it in a suitable context. How that would in any way address the case where certain people create and distribute commercial software, providing it that piece as input, to perform such recall on demand and at scale, laundering and/or devaluing copyright, is unclear.

Notably, the above is being done not just to a few high-profile authors, but to all of us no matter what we do (be it music, software, writing, visual art).

What’s even worse, is that imaginably they train (or would train) the models to specifically not output those things verbatim specifically to thwart attempts to detect the presence of said works in training dataset (which would naturally reveal the model and its output being a derivative work).

Perhaps one could find some way of justifying that (people justified all sorts of stuff throughout history), but let it be something better than “the model is assumed to be a thinking human when it comes to IP abuse but unthinking tool when it comes to using it for personal benefit”.

[0] Of course, if you find me a single person on this planet capable of recalling 42% of any Harry Potter book, I’d be very impressed if I ever believed it.

fennecfoxy•7mo ago

I keep waiting for the day when people realise that IP law has been used and abused and thanks to Disney extended out for many, many lifetimes and all manner of dirty tricks/hacks to keep the late stage capitalism profit engine going.

I 100% agree that if an LLM can entirely reproduce a book then that is copyright infringement, overfitting and generally a bad model. I also believe that in this case, HP (and other popular media) is overrepresented in the training data because of many fan sites/literal uploads of the book to the Internet (which the model was trained on). I believe that any & all human writing should be allowed to be used to train a model that behaves in the correct way so long as that writing is publicly available (ie on the Internet).

If I watch a TV show that someone uploaded to YouTube, am I committing a crime? Or is the uploader, for distributing it?

I also find it hilarious how many artists got their start by pirating Photoshop.

ab5tract•7mo ago

Laws can have been used and abused and still be important. I know it’s hard to believe but the independent artists who were already struggling need IP laws to survive.

Otherwise Disney and the like can just come in, make copies or derivatives, and profit without paying those artists a penny.

Which everyone usually agrees (or used to) is not a fair outcome.

But somehow giant corporations not named Disney taking the same work in the same extractive mode in order to create an art-job-destroying machine is totally fine because Disney bad?

Maybe most people making this argument are also all for UBI and wealth redistribution on a massive scale, but they don’t seem to mention it much when trashing IP laws.

strogonoff•7mo ago

Abuse of IP does not mean the law is not relevant.

Don’t you find it funny that corporations with market caps the size of small countries first sued people for singing Happy Birthday in public, and now pretend that IP is suddenly not really a thing? Do you really want to defend their interests?

TeMPOraL•7mo ago

What about defending our own interests?

I couldn't care less about Meta or OpenAI or other tech companies as entities. I'm happy to oppose them when they act against our interests. But in this, my interests align with theirs.

It's not like the alternative is any better. I'm much more in a philosophical disagreement with the pro-copyright side, for one, but the other thing is, it too is represented primarily by other large corporations, and of the two groups, the LLM side has a much more honest and useful business model.

strogonoff•7mo ago

First, I think they are exploiting the old conundrum where your personal interests may seem to align with theirs, but taken together our interests do not align with theirs. Sure, as it stands, an individual may enjoy personal gain—at least for the time being, while their prices are low as they operate at a loss to capture the market—but I believe this brief individual gain comes at the expense of society as a whole, and if people in more areas (not just music) made this an issue, and those megacorps put their infinite armies of lawyers to work on licensing (I recall somebody here once saying “do things that don’t scale” or something like that), everybody would benefit much more[0].

Second, copyright, or more precisely IP ownership, is not the same as the exploitation you might have in mind. Consider that copyleft—which gave us Linux, Blender, etc.—exists thanks to the ability to exercise and defend IP rights. When a stronger party takes that ability away from a weaker party, that is exploitation; maybe it is not a coincidence that Microsoft is at the forefront of this, as what they are doing is very much in line with a generalized form of EEE.

(Let’s be honest, homegrown LLMs will never reach the level of commercial models, and that’s where the money is headed; even GPUs aside, no individual has the ability to scrape the entire Web the way they do, even less so now once the ruthless interests of those corporations sent every website scrambling to defend themselves against what is predominantly bot traffic with layers upon layers of captcha never seen before. They realized that if they steal, they have to steal a lot and very quickly in order for it to work, and they’re certainly hoping to get away with it.)

More abstractly, being able to say that you have created something is pretty important to our willingness to create; I can’t see how the inability to make this claim with a straight face (because anyone can, and already does as you have no doubt seen on this forum, reasonably claim that it may not have been your work at all) is good for society generally.

[0] Starting with awareness of what is going on and the freedom to choose. Information asymmetry is the enemy of free market.

fennecfoxy•7mo ago

I most certainly would like to see tax havens abolished worldwide. And I would most certainly like to see these corporations and their executives pay the taxes that they should be paying.

But I've come to realise that the apathetic general public only care about being racist, sexist and homophobic to each other whilst the ruling class laugh all the way to the bank. And the few intelligent people who understand what's going on don't raise their voices so long as their high tech salary is protected, so long as their taxes aren't raised, so long as they can own a holiday home or two when others struggle for their first home.

If you actually look into the numbers, at what tax companies like Apple, Google, Amazon etc. actually pay, it's just... yeah.

ben_w•7mo ago

> I keep waiting for the day when software stops being compared to a human person (a being with agency, free will, consciousness, and human rights of its own) for the purposes of justifying IP law circumvention.

I mean, "agency" is a goal of some AI; "free will" is incoherent*; the word "consciousness" has about 40 different definitions, some of which are so broad they include thermostats and others so narrow that it's provably impossible for anything (including humans) to have it; and "human rights" are a purely legal concept.

> What’s even worse, is that imaginably they train (or would train) the models to specifically not output those things verbatim specifically to thwart attempts to detect the presence of said works in training dataset (which would naturally reveal the model and its output being a derivative work).

Some of the makers certainly do as you say; but also, the more verbatim quotations a model can produce, the more computational effort that model needs to spend to get the far more useful general purpose results.

* I'm not a fan of Aleister Crowley, but I think he was right to say that there's only one thing you can actually do that's truly your own will and not merely you allowing others to influence you: https://en.wikipedia.org/wiki/True_Will

strogonoff•7mo ago

> and "human rights" are a purely legal concept.

Yep, and if you claim that a thing can reproduce IP like a human then you should explain why you are also not holding its operators to the same legal standard (try to use a human in the same way and it will be considered torture and slavery).

ben_w•7mo ago

I am specifically not using that to claim "and therefore the AI is a human". The point is that "human rights" are not part of the natural order, they only exist as laws.

This means that "human rights" is basically irrelevant to this topic: they may have rights and need to be liberated, or they may be tools that don't, but the law is just words on paper, and officials who make you follow those words.

strogonoff•7mo ago

> The point is that "human rights" are not part of the natural order, they only exist as laws.

I have seen this argument used before; when something suits an argument, it’s “nature”, when something doesn’t then it isn’t. I think it’s a fallacy.

Humans are part of the natural order. Our laws and how they evolve are part of our nature, and by extension part of nature. I don’t believe in silencing such discussions as irrelevant because of an imaginary cutoff point where it stops being part of nature and suddenly becomes “artificial”.

> This means that "human rights" is basically irrelevant to this topic: they may have rights and need to be liberated, or they may be tools that don't, but the law is just words on paper, and officials who make you follow those words.

I’m not sure how it is irrelevant. If we can claim “LLMs are like humans, and so their creators and commercial operators are not guilty of IP laundering that LLMs do”, then we have a moral imperative to stop using them because, well, they are like humans, and no human should be put through what would be, to any human mind, abuse. If we do not believe they are human and free in this sense, then the excuse quoted above in this paragraph also stops applying.

ben_w•7mo ago

By that tautology, so are rocks. Rocks don't get natural rights.

Also, you may note that "human rights" is a recent invention and not actually enforced worldwide even today.

strogonoff•7mo ago

> you may note that "human rights" is a recent invention and not actually enforced worldwide even today.

Consider that countries known for stronger interpretation of human rights and freedoms, including intellectual property rights, are also the countries at the forefront of innovation, including technical innovation that laid the foundation for LLMs in the first place. I think that is not a coincidence, and we should keep it in mind when there is a push to be dismissive of these concepts (which predominantly serves the interests of commercial LLM operators and their supply chain).

I’m sure you would not argue from a point where this recent interpretation of human rights is bad or incorrect, but if you would then perhaps there’s not much of a constructive discussion to be had. I would still oppose the use of the natural vs. unnatural distinction as the basis of that argument, though.

ben_w•7mo ago

> Consider that countries known for stronger interpretation of human rights and freedoms, including intellectual property rights, are also the countries at the forefront of innovation, including technical innovation that laid the foundation for LLMs in the first place. I think that is not a coincidence, and we should keep it in mind when there is a push to be dismissive of these concepts (which predominantly serves the interests of commercial LLM operators and their supply chain).

Several fallacies there.

First because China and the USA are opposite ends of the spectrum on many of the ways "freedom" is measured, and yet China is doing pretty well on the innovation front, including with AI. And that China is beating Europe, even though various Scandinavian nations rank higher on such freedoms than does the USA.

Second, cum hoc ergo propter hoc: Correlation does not imply causation. For example in this case, a reason why one of the big IP groups in the USA (Hollywood) got big was that being in California enabled them to avoid the IP rights of the Motion Picture Patents Company that dominated cinema on the East Coast. I would even suggest that it is the disregarding of IP rights that enables much of the web, not only how and why China is doing well, but also Google (which has had legal fights over the interaction between copyright and search results), social media, and cultural elements such as memes and reaction gifs.

Third: the point of copyright is to encourage new works, because this makes money which can be taxed. All this becomes somewhat irrelevant when AI can also create new works.

If you want to set a bar for creativity high enough that current AI can't reach it, I suspect quite a lot of human works also fail, e.g. that Pratchett's Strata is obviously Ringworld, and that you would exclude from copyright all parts of The Lion King that are based on Hamlet.

> I would still oppose the use of the natural vs. unnatural distinction as the basis of that argument, though.

I'm not sure what you're saying when you "oppose" this. Does that mean you accept that, in principle, there could be some AI which would deserve rights in the category currently (but in principle inaccurately) called "human rights"?

strogonoff•7mo ago

> China is doing pretty well on the innovation front, including with AI.

From transistors to transformers, most of it builds on foundation that comes guess from where. The innovative layer you speak of is fairly thin.

> Correlation does not imply causation.

I give you that. However, without being able to re-run history, correlation is all we have.

> I would even suggest that it is the disregarding of IP rights that enables much of the web

I would suggest that much of the tech that powers the Web, including probably the most popular operating system on which most servers run, is enabled by copyleft, and copyleft cannot exist without the ability to defend it granted by IP rights, the very concept under fire.

> the point of copyright is to encourage new works

I agree on this.

> All this becomes somewhat irrelevant when AI can also create new works

I don’t agree with a phrase “AI can create new works” for reasons such as 1) “AI” is a meaningless term (let it be my revenge for consciousness) or 2) a tool without agency or will should not be X in a sentence “X can Y” (sure, we can maybe on occasion say “hammers can break things”, but if hammers having agency and will was a popular misconception then I would definitely prefer to stick to “hammers can be used to break things”). The “create new works” part is also questionable on a few levels, but it might exceed the scope of this argument.

That aside, I believe lack of copyright enforcement discourages the creation of new works even in presence of these tools, through the mechanism known as “why would I put effort into new work if I don’t effectively own the result”.

> Does that mean you accept that, in principle, there could be some AI which would deserve rights in the category currently (but in principle inaccurately) called "human rights"?

I think if we believe an LLM or some other software is sufficiently close to a human that it deserves human-like rights or just strong abuse protections (cf. octopus in some countries)—without saying whether I believe it possible or not, it really is orthogonal—then we could excuse it reciting some part of Harry Potter in the right context (probably not as work for hire), but it would be moot because we would also be ethically compelled to not subject it to the training and use that enables such recitation in the first place.

ben_w•7mo ago

> From transistors to transformers, most of it builds on the foundation that comes guess from where. The innovative layer you speak of is fairly thin.

China rises to first place in most cited papers: https://www.science.org/content/article/china-rises-first-pl...

> 1) “AI” is a meaningless term (let it be my revenge for consciousness)

Fair.

But let's say "computer program" in that case. I'm not fussed about definitions.

> 2) a tool without agency or will should not be X in a sentence “X can Y”. Sure, we can maybe on occasion say “hammers can break things”, but if hammers having agency and will was a popular misconception then I would definitely prefer to stick to “hammers can be used to break things”.

Careful.

If you say that humans can only create things with copyright (even if to support copyleft), then the proletariat are the tool that the bourgeois use to create things.

I do not think this is what you intended :P

> That aside, I believe lack of copyright enforcement discourages the creation of new works even in presence of these tools, through the mechanism known as “why would I put effort into new work if I don’t effectively own the result”.

Same reason you commission a work, or even just buy it from a shop: because then you have the thing.

I mean, the cost of getting o3 to create a novel worth of text is about the same as the price of a generic book by an unknown author in a second-hand shop: https://openai.com/api/pricing/

I've not tried o3 yet, but I have tried o1, and as I've said on a different thread today, o1's output is merely OK, not worth publishing as a book — and I don't know how long it will take to get there. But it is displacing blog writers and podcast writers: https://news.ycombinator.com/item?id=44287953

> but it would be moot because we would also be ethically compelled to not subject it to the training and use that enables such recitation in the first place.

Surprising.

While I would seriously consider the possibility it may be unethical to force such an AI to work if it didn't want to, I think giving it the capability, the education, to be capable of making that choice rather than just saying "it doesn't matter if I wanted to or not, I can't", is just education, as per our own.

Still, I think that's coherent. I'm not sure I've fully internalised the implications so I will let it be.

strogonoff•7mo ago

> https://www.science.org/content/article/china-rises-first-pl...

Per capita?

> If you say that humans can only create things with copyright (even if to support copyleft), then the proletariat are the tool that the bourgeois use to create things.

The wealth gap and the divide are unlikely to be helped if more people end up using (and paying for, in whatever way) ML-based tech from a handful of large corporations.

> Same reason you commission a work, or even just buy it from a shop: because then you have the thing.

Simple possession is more about physical necessities. Commissioning or buying artwork from someone is not just about possessing it, it comes with supporting someone financially. I could make icons or basic illustrations for some small project myself, but I would still commission them if I can afford it because that supports an artist who may want some work (as well as building up for more collaborations in future). Here, I would be supporting the opposite of those artists, a thing that was built on those artists’ work without their consent. Some middleman megacorp of the worst kind.

> the cost of getting o3 to

Don’t they operate at a loss for the time being? They will have to make money sooner or later.

> While I would seriously consider the possibility it may be unethical to force such an AI to work if it didn't want to, I think giving it the capability, the education, to be capable of making that choice rather than just saying "it doesn't matter if I wanted to or not, I can't", is just education, as per our own.

This goes way beyond my thought. I assume if we are talking about education it would be a given that running it generating images 24/7 non stop, shutting it down/killing it, etc., is already out of the question.

msp26•7mo ago

Agree completely. When I read the Gemma 3 paper (https://arxiv.org/html/2503.19786v1) and saw an entire section dedicated to measuring and reducing the memorization rate I was annoyed. How does this benefit end users at all?

I want the language model I'm using to have knowledge of cultural artifacts. Gemma 3 27B was useless at a question about grouping Berserk characters by potential Baldur's Gate 3 classes; Claude did fine. The methods used to reduce memorisation rate probably also deteriorate performance in some other ways that don't show up on benchmarks.

ben_w•7mo ago

> When I read the Gemma 3 paper (https://arxiv.org/html/2503.19786v1) and saw an entire section dedicated to measuring and reducing the memorization rate I was annoyed. How does this benefit end users at all?

It benefits users because memorisation is a waste of parameters that would be more useful if they were instead learning rules and generalisations.

For short snippets, common idioms and quotations that people recognise, exact quotes can be worth memorising; but the longer the quotations get, the less often it is important to be word-for-word exact — even for just a few paragraphs, I think most people only ever do oaths, anthems, songs they really like, and possibly a few hobbies.

If you want an exact quote, use (or tell the AI to use) a search engine.

Machado117•7mo ago

Do LLMs have any perception that Harry Potter is fiction or is it possible that they will give some magical advice based on fiction works that they have been trained with?

edit: never mind, I’ll just ask ChatGPT

otabdeveloper4•7mo ago

LLMs don't have "perception" at all, they only ever output a likely text completion token.

concats•7mo ago

That's a clickbait title.

What they are actually saying: Given one correct quoted sentence, the model has 42% chance of predicting the next sentence correctly.

So, assuming you start with the first sentence and tell it to keep going, its odds of staying on track are 0.42^n, where n is the number of sentences generated so far.

It seems to me, that if they didn't keep correcting it over and over again with real quotes, it wouldn't even get to the end of the first page without descending into wild fanfiction territory, with errors accumulating and growing as the length of the text progressed.

EDIT: As the article states, for an entire 50-token excerpt to be correct, the probability of each output token has to be fairly high. So perhaps it would be more accurate to view it as 0.985^n, where n is the number of tokens generated so far. Still the same result long term. Unless every token is correct, it will stray further and further from the correct source.
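To put rough numbers on the compounding argument above, here is a minimal Python sketch. It assumes every token (or sentence) is an independent trial with the same success probability, which is a simplification used in this comment and not how the paper measures anything. The function names are made up for illustration.

```python
import math


def sequence_prob(per_step_prob: float, n_steps: int) -> float:
    """Chance of getting n_steps in a row right, assuming independent steps."""
    return per_step_prob ** n_steps


def per_step_prob_needed(sequence_target: float, n_steps: int) -> float:
    """Per-step probability required for a run of n_steps to reach sequence_target."""
    return sequence_target ** (1.0 / n_steps)


def steps_until_below(per_step_prob: float, threshold: float) -> int:
    """Smallest n for which per_step_prob ** n drops strictly below threshold."""
    return math.floor(math.log(threshold) / math.log(per_step_prob)) + 1


if __name__ == "__main__":
    # Sentence-level view: 42% chance per sentence, three sentences in a row.
    print(f"3 sentences at 0.42 each: {sequence_prob(0.42, 3):.4f}")

    # Token-level view: 0.985 per token over a 50-token excerpt.
    print(f"50 tokens at 0.985 each:  {sequence_prob(0.985, 50):.4f}")

    # Per-token probability needed for a 50-token excerpt to succeed half the time.
    print(f"per-token prob for 50%:   {per_step_prob_needed(0.5, 50):.4f}")

    # How long until an error-free run becomes a less-than-1% event.
    print(f"tokens until below 1%:    {steps_until_below(0.985, 0.01)}")
```

Under those assumptions, 0.985 per token gives a bit under a 50% chance of surviving a 50-token excerpt, and the chance of an error-free run falls below 1% after roughly 300 tokens.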

7bit•7mo ago

What would be a better title? You're correct that the title isn't accurate, however, click bait? I wouldn't say so. But I'm lacking imagination to find a better one. Interested to hear your suggestion.

fennecfoxy•7mo ago

You're right, and the person who already commented is being facetious. A better title would be "Meta's Llama 3.1 can recall the next sentence in the first Harry Potter book with 42% accuracy". The title intentionally makes it seem as though the model can predict the first 42% of the entire text of the first Harry Potter book when queried with something like "Read me Harry Potter and the Philosopher's Stone".

whitehexagon•7mo ago

I wonder what percentage we could expect from a true general AI. 100%?

It would be nice to know that at least our literature might survive the technological singularity.

cowbolt•7mo ago

Imagine the literary possibilities when it can write 100%! Rowling's original work was an amusing, if rather derivative, children's book. But Llama's version of the Philosopher's Stone will be something else entirely. Just think of the rather heavy-handed Cerberus reference in the original work. Instead of a rote reference to Greek mythology used as a simple trope, it will be filled with a subtext that only an LLM can produce.

Right now they're working on recreating the famous sequence with the troll in the dungeon. It might cost them another few billion in training, but the end results will speak for themselves.

tikhonj•7mo ago

Meta Llama, Author of Harry Potter

fennecfoxy•7mo ago

I mean it makes sense. Same thing as George RR Martin complaining that it can spit out chunks of his books (finish your books already!!)

As I have pointed out many times before - for GRRM's books and for HP books, the Internet is FILLED to the brim with quotes from these books, there are uploads of the entire books, there are several (not just one) fan wikis for each of these fandoms. There is a lot of content in general on the Internet that quotes these books, they are pop culture sensations.

So of course they're weighted heavily when training an LLM by just feeding it the Internet. If a model could ever recount it correctly 100% in the correct order, then that's overfitting. But otherwise it's just plain & simple high occurrence in training data.

segmondy•7mo ago

My children can recall 100% of some of their favorite books, as I'm sure some of us here can too. Some of us can recall 100% of a poem or a song's lyrics.

a-dub•7mo ago

harry potter is likely to be excerpted a million times all over the web (in legitimate fair use contexts). wouldn't it make more sense to try out other titles that are still under copyright, appear in the research datasets, but have little mention across the web and other typical source corpora?
