Extracting books from production language models (2026)

75•logicprog•3w ago

Comments

cleandreams•4w ago

If you can extract a whole copyrighted book from a large language model, what is the point in copywriting your book?

ggm•4w ago

Legal recourse. If your publisher has the budget, you will find an economy where you can secure equitable relief. Or not, but then you recurse on the question: if laws don't apply, what's the point of society?

visarga•3w ago

This sounds pretty damning, why don't they implement a n-gram based bloom filter to ensure they don't replicate expression too close to the protected IP they trained on? Almost any random 10 word ngram is unique on the internet.

Alternatively they could train on synthetic data like summaries and QA pairs extracted from protected sources, so the model gets the ideas separated from their original expression. Since it never saw the originals it can't regurgitate them.

isodev•3w ago

But that would only hide the problem, doesn’t resolve the fact that models, in fact, violate copyright

jgalt212•3w ago

True, but it would certainly reduce litigation risk in so much as copypasta is ipso facto proof of copyright violation.

protocolture•3w ago

That hasnt been established. Theres no concrete basis to assert that training violates copyright.

isodev•3w ago

Everyone knows that what models do to obtain training data is not legal. We just need a very old system about copyright to catch up already so we can ban the practice.

protocolture•3w ago

"What they do to obtain" ah nice goal post shift.

When legally obtained, training is fine. Training doesnt violate copyright. Unauthorised copying and distribution does. Which is why OpenAI should have just paid for physical copies of all those books and scanned them.

>We just need a very old system about copyright to catch up already so we can ban the practice.

No we really don't need copyright to get worse. Its pretty damn harmful as it is.

soulofmischief•3w ago

The idea of applying clean-room design to model training is interesting... having a "dirty model" and a "clean model", dirty model touches restricted content and clean model works only with the output of the dirty model.

However, besides how this sidesteps the fact that current copyright law violates the constitutional rights of US citizens, I imagine there is a very real threat of the clean model losing the fidelity of insight that the dirty model develops by having access to the base training data.

bryanrasmussen•3w ago

>this sidesteps the fact that current copyright law violates the constitutional rights of US citizens

I think most people sidestep this as it's the first I've heard of it! Which right do you think is being violated and how?

soulofmischief•3w ago

Actually, plenty of activists, for example Cory Doctorow, have spent a significant amount of effort discussing why the DMCA, modern copyright law, DRM, etc. are all anti-consumer and how they encroach on our rights.

It's late so I don't feel like repeating it all here, but I definitely recommend searching for Doctorow's thoughts on the DMCA, DRM and copyright law in general as a good starting point.

But generally, the idea that people are not allowed to freely manipulate and share data that belongs to them is patently absurd and has been a large topic of discussion for decades.

You've probably at least been exposed to how copyright law benefits corporations such as Disney, and private equity, much more than it benefits you or I. And how copyright law has been extended over and over by entities like Disney just so they could prolong their beloved golden geese from entering public domain as long as possible; far, far longer than intended by the original spirit of the copyright act.

JimDabell•3w ago

> To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.

Copyright is not “you own this forever because you deserve it”, copyright is “we’ll give you a temporary monopoly on copying to give you an incentive to create”. It’s transactional in nature. You create for society, society rewards you by giving you commercial leverage for a while.

Repeatedly extending copyright durations from the original 14+14 years to durations that outlast everybody alive today might technically be “limited times” but obviously violates the spirit of the law and undermines its goal. The goal was to incentivise people to create, and being able to have one hit that you can live off for the rest of your life is the opposite of that. Copyright durations need to be shorter than a typical career so that its incentive for creators to create for a living remains and the purpose of copyright is fulfilled.

In the context of large language models, if anybody successfully uses copyright to stop large language models from learning from books, that seems like a clear subversion of the law – it’s stopping “the progress of science and useful arts” not promoting it.

(To be clear, I’m not referring to memorisation and regurgitation like the examples in this paper, but rather the more commonplace “we trained on a zillion books and now it knows how language works and facts about the world”.)

visarga•3w ago

Duration of copyright is one way it was perverted, but the other direction was scope. In 1930 judge Hand said in relation to Nichols v. Universal Pictures:

> Upon any work...a great number of patterns of increasing generality will fit equally well. At the one end is the most concrete possible expression...at the other, a title...Nobody has ever been able to fix that boundary, and nobody ever can...As respects play, plagiarism may be found in the 'sequence of events'...this trivial points of expression come to be included.

And since then a litany of judges and tests expanded the notion of infringement towards vibes and away from expression:

- Hand's Abstractions / The "Patterns" Test (Nichols v. Universal Pictures)

- Total Concept and Feel (Roth Greeting Cards v. United Card Co.)

- The Krofft Test / Extrinsic and Intrinsic Analysis

- Sequence, Structure, and Organization (Whelan Associates v. Jaslow Dental Laboratory)

- Abstraction-Filtration-Comparison (AFC) Test (Computer Associates v. Altai)

The trend has been to make infringement more and more abstract over time, but this makes testing it an impossible burden. How do you ensure you are not infringing any protected abstraction on any level in any prior work? Due diligence has become too difficult now.

apical_dendrite•3w ago

I'm assuming that the goal of the bloom filter is to prevent the model from producing output that infringes copyright rather than hide that the text is in the training data.

In that case the model would lose the ability to provide relatively brief quotes from copyrighted sources in its answers, which is a really helpful feature when doing research. A brief quote from a copyrighted text, particularly for a transformative purpose like commentary is perfectly fine under copyright law.

orbital-decay•3w ago

That would reduce the training quality immensely. Besides, any generalist model really needs to remember facts and texts verbatim to stay useful, not just generalize. There's no easy way around that.

stubish•3w ago

Even if output is blocked, if it can be demonstrated that the copyrighted material is still in the model then you become liable for distribution and/or duplication without a license.

Training on synthetic data is interesting, but how do you generate the synthetic data? Is it turtles all the way down?

empiko•3w ago

IMO they just don't have any idea what data are actually copyrighted and are too lazy to invest in the problem.

orbital-decay•3w ago

It's all pretty obvious to anyone who tried a similar experiment just out of curiosity. Big models remember a lot. And all non-local models have regurgitation filters in place due to this fact, with the entire dataset indexed (e.g. Gemini will even cite the source of the regurgitated text as it gives the RECITATION error). You'll eventually trip those filters if you force the model to repeat some copyrighted text. Interesting that they don't even try to circumvent those, they simply repeat the request from the interruption point, as the match needs some runway to trigger and by that time a part of the response is already streamed in.

rurban•3w ago

I find it interesting that OpenAI's safety worked best, where the others didn't work at all. I had different impressions before

clbrmbr•3w ago

I found that Opus 4 was happy to regurgitate a random paragraph from the latter half of Wealth of Nations that nobody quotes. It was probably only in the training data once.

I was thinking we could use this technique to figure out which books were in / out of the training data for various models. Limitation is having to wrestle with refusals.

carshodev•3w ago

Why would they filter non copyright material? Who cares if it repeats things that are already public/freely usable and available.

carshodev•3w ago

Everyone just needs to be honest and accept that YES these models were OBVIOUSLY trained on millions of copyright material.

The models would be at least 50% better if these filters weren't in place. These filters force the model essentially lie, thus they will obviously degrade output quality.

The problem is the general public isn't 100% certain of the copyright violations/ don't understand this yet and lawyers/government will try and sue if the companies admitted it. So a Moloch is created where it's a lose lose and the model quality suffers as a result.

(if people want exact copies of text content they can already get them for free through the same sites that these companies got them, so I don't see the models regurgitation as a issue worth worsening quality over.)

Start all of your commands with a comma (2009)

Software Engineering Is Back

Hoot: Scheme on WebAssembly

Reinforcement Learning from Human Feedback

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

The Waymo World Model

Vocal Guide – belt sing without killing yourself

Omarchy First Impressions

Making geo joins faster with H3 indexes

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Ga68, a GNU Algol 68 Compiler

What Is Ruliology?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Cross-Region MSK Replication: K2K vs. MirrorMaker2

Show HN: I spent 4 years building a UI design tool with only the features I use

Hackers (1995) Animated Experience

Sheldon Brown's Bicycle Technical Info

Show HN: If you lose your memory, how to regain access to your computer?

Microsoft open-sources LiteBox, a security-focused library OS

An Update on Heroku

The AI boom is causing shortages everywhere else

PC Floppy Copy Protection: Vault Prolok

Was Benoit Mandelbrot a hedgehog or a fox?

Dark Alley Mathematics

How to effectively write quality code with AI

Female Asian Elephant Calf Born at the Smithsonian National Zoo

I now assume that all ads on Apple news are scams

Understanding Neural Network, Visually

Delimited Continuations vs. Lwt for Threads