It has been all but admitted that this trial would have totally bankrupted Anthropic had they maintained their innocence and continued to fight the case.
But of course, there's too much money on the line. Even though Anthropic settled (in effect admitting guilt for profiting off of pirated books), they knew there was no way they could win that case, and it wasn't worth taking that risk.
> The pivotal fair-use question is still being debated in other AI copyright cases. Another San Francisco judge hearing a similar ongoing lawsuit against Meta ruled shortly after Alsup's decision that using copyrighted work without permission to train AI would be unlawful in "many circumstances."
The first of many.
It’s important in the fair use assessment to understand that the training itself is fair use; the pirating of the books is the issue at hand here, and is what Anthropic “whoopsied” into when acquiring the training data.
Buying used copies of books, scanning them, and training on them is fine.
Rainbows End was prescient in many ways.
That's how Google Books, the Internet Archive, and Amazon (their book preview feature) operated before ebooks were common. It's not scalable-in-a-garage but perfectly scalable for a commercial operation.
The books that are destroyed in scanning are a small minority compared to the millions discarded by libraries every year for simply being too old or unpopular.
Book burnings are symbolic (unless you're in the world of Fahrenheit 451). The real power comes from the political threat, not the fact that paper with words on it is now unreadable.
It is 500,000 books in total, so did they really scan all those books instead of using the pirated versions? Even when they did not have much money in the early phases of the model race?
Agreed. Great book for those looking for a read: https://www.goodreads.com/book/show/102439.Rainbows_End
The author, Vernor Vinge, is also responsible for popularizing the term 'singularity'.
What I'm wondering is whether they, or others, have trained models on pirated content that has flowed through their networks.
I’m surprised Google hasn’t hit its competitors harder with the fact that it actually got permission to scan books from its partner libraries while Facebook and OpenAI just torrented books2/books3. But I guess they have an aligned incentive to benefit from a legal framework that doesn’t look too closely at how you went about collecting source material.
https://www.reddit.com/r/libgen/comments/1n4vjud/megathread_...
This is to teach a lesson because you cannot prosecute all thieves.
The Yale Law Journal actually writes about this: the goal is to deter crime, because in most cases damages cannot be recovered, or the criminal will never be caught in the first place.
It’s crazy to imagine, but there was surely a document or Slack thread discussing where to get thousands of books, and they just decided to pirate them and that was OK. This was entirely a decision based on ease or cost, not on the assumption it was legal. Piracy can result in jail time IIRC, so honestly it’s lucky the employee who suggested this, or took the action, avoided direct legal liability.
Oh, and I’m pretty sure other companies (Meta) are in litigation over this issue, and the publishers knew that a settlement below the full legal limit would limit future revenue.
Well actively generating revenue at least.
Profits are still hard to come by.
In this specific case the settlement caps the lawyer fees at 25%, and even that is subject to the court's approval. In addition they will ask for $250k total ($50k per plaintiff) for the lead plaintiffs, also subject to the court's approval.
I think that this is a distinction many people miss.
If you take all the works of Shakespeare and reduce them to tokens and vectors, is it Shakespeare, or is it factual information about Shakespeare? It is the latter, and as much as organizations like MLB might want to be able to copyright a fact, you simply cannot do that.
Take this one step further. If you buy the work and vectorize it, that's fine. But if you feed in the vectors for Harry Potter so many times that the model can reproduce half of the book, it becomes a problem when it spits out that copy.
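To make "reduce it to tokens" concrete, here's a toy sketch using a hypothetical whitespace tokenizer (real models use subword vocabularies like BPE, but the idea is the same):

```python
# Toy tokenizer: map each distinct word to an integer ID.
# Hypothetical illustration only; real LLMs use subword (BPE-style) tokenizers.
text = "to be or not to be that is the question"

vocab = {}        # word -> ID, assigned in order of first appearance
token_ids = []
for word in text.split():
    if word not in vocab:
        vocab[word] = len(vocab)
    token_ids.append(vocab[word])

print(token_ids)  # [0, 1, 2, 3, 0, 1, 4, 5, 6, 7]
```

Note the mapping is reversible given the vocab, which is exactly why the regurgitation point matters: the encoding itself isn't the problem; whether the model can emit the original back out is.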
And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
If you use an LLM to generate your code at work, can you leave with that code when you quit? Does GPLv3 or something like the Elastic license even apply if there is no copyright?
I suspect we're going to be talking about court cases a lot for the next few years.
This seems too cute by half, courts are generally far more common sense than that in applying the law.
This is like saying `rails generate model Example` results in a bunch of code that isn't yours, because the tool generated it according to your specifications.
I’d guess the legal theory for `rails generate` is that you have a license to the template code (by way of how the tool is licensed), the template code was written by a human and is therefore copyrightable by them, and it's then only minimally modified by the tool.
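A minimal sketch of that "human-written template, tool does the substitution" model (a hypothetical generator, not the actual Rails internals):

```python
# Hypothetical scaffold generator. The copyrightable expression lives in the
# human-written template; the tool merely substitutes the user's chosen name.
TEMPLATE = '''class {name}:
    """Auto-generated model with an integer primary key."""
    def __init__(self, id: int):
        self.id = id
'''

def generate_model(name: str) -> str:
    return TEMPLATE.format(name=name)

print(generate_model("Example"))
```

Under that reading the output is a lightly filled-in copy of the template, so the template's license governs it, which is a much cleaner question than who owns freeform LLM output.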
So things get even darker, because what counts as distribution can have a really vague definition, and maybe the AI companies will only follow the law just barely, just enough to avoid getting hit with a lawsuit like this again. I wonder if all this case did was compensate the authors this one time. I doubt we will see a meaningful change in AI companies' attitudes towards fair use and, essentially, exploiting authors.
I feel like they would try to use as much legalspeak as possible to extract as much from authors as they (legally) can without compensating them, which I feel is unethical, but sadly the law doesn't run on ethics.
It remains deranged.
Everyone has more than a right to freely read everything stored in a library.
(Edit: in fact I initially wrote 'is supposed to' in place of 'has more than a right to', meaning that knowledge is there, we made it available: you are supposed to access it, with the fullest encouragement.)
They're merely doing what anyone is allowed to with the books that they own, loaning them out, because copyright law doesn't prohibit that, so no license is needed.
Buying used copies of books, scanning them, and printing them and selling them: not fair use
Buying used copies of books, scanning them, and making merchandise and selling it: not fair use
The idea that training models is considered fair use just because you bought the work is naive. Fair use is not a law to leave open usage as long as it doesn’t fit a given description. It’s a law that specifically allows certain usages like criticism, comment, news reporting, teaching, scholarship, or research. Training AI models for purposes other than purely academic fits into none of these.
Unless legislation changes, model training is pretty much analogous to that. Now of course if the employee in question - or the LLM - regurgitates a copyrighted piece verbatim, that is a violation and would be treated accordingly in either case.
Does this still hold true if multiple employees are "trained" from scanned copies at the same time?
Can only imagine the pitch: yes, please give us billions of dollars. We are going to make a huge investment, like paying off our lawsuits.
It basically does nothing for them besides that. Given the split decisions so far, I'm not sure what value the Alsup decision is going to bring to the industry, moving forward, when it's in the context of books that Anthropic physically purchased. The other AI cases are generally not fact patterns where the LLM was trained with copyrighted materials that the AI company legally purchased copies of.
> Although the payment is enormous, it is small compared with the amount of money that Anthropic has raised in recent years. This month, the start-up announced that it had agreed to a deal that brings an additional $13 billion into Anthropic’s coffers. The start-up has raised a total of more than $27 billion since its founding in 2021.
It's not the way we expect people to do business under normal circumstances, but in new markets with new products? I guess I don't see much actually wrong with this. Authors still get paid a price they were willing to accept, and Anthropic didn't need to wait years to come to an agreement (again, publishers weren't actually selling what AI companies needed to buy!) before training their LLMs.
Obviously there would be handling costs + scanning costs, so that’s the floor.
Maybe $20 million total (roughly $40 a book across 500,000 books)? Plus, of course, the time it would take to execute.
Book authors may see some settlement checks down the line. So might newspapers and other parties that can organize and throw enough $$$ at the problem. But I'll eat my hat if your average blogger ever sees a single cent.
More broadly, I think that's a goofy argument. The books were "freely available" too. Just because something is out there, doesn't necessarily mean you can use it however you want, and that's the crux of the debate.
Or what if they're not even distributing the model, but rather distributing the outputs of the LLM (as with a closed-source LLM like Anthropic's)?
I am genuinely curious whether there is some gray area that might be exploited by AI companies, as I am pretty sure they don't want to pay $1.5B yet still want to exploit the works of authors (let's call a spade a spade).
At least if you're a regular citizen.
1. A Settlement Fund of at least $1.5 Billion: Anthropic has agreed to pay a minimum of $1.5 billion into a non-reversionary fund for the class members. With an estimated 500,000 copyrighted works in the class, this would amount to an approximate gross payment of $3,000 per work. If the final list of works exceeds 500,000, Anthropic will add $3,000 for each additional work.
2. Destruction of Datasets: Anthropic has committed to destroying the datasets it acquired from LibGen and PiLiMi, subject to any legal preservation requirements.
3. Limited Release of Claims: The settlement releases Anthropic only from past claims of infringement related to the works on the official "Works List" up to August 25, 2025. It does not cover any potential future infringements or any claims, past or future, related to infringing outputs generated by Anthropic's AI models.
Edit: I'll get ratio'd for this, but it's the exact same thing Google did in its lawsuit with Epic. They delayed while the public and courts focused on Apple (oohh, EVIL Apple); Apple lost, and Google settled at a disadvantage before there was a legal judgment that couldn't be challenged later.
Indeed, it is not only the payout but also the destruction of the datasets. Although the article does quote:
> “Anthropic says it did not even use these pirated works,” he said. “If some other generative A.I. company took data from pirated source and used it to train on and commercialized it, the potential liability is enormous. It will shake the industry — no doubt in my mind.”
Even if true, I wonder how many cases we will see in the near future.
A settlement means the claimants no longer have a claim, which means if they're also part of, say, the New York Times-affiliated lawsuit, they have to withdraw. A neat way of kneecapping a nationwide decision that LLM training on copyrighted material is subject to punitive measures, don't you think?
In my experience and training at a fintech corp, accepting a settlement in any suit weakens your defense, but it prevents a judgment and future claims for the same claims from the same claimants (a la double jeopardy). So, again, at minimum this prevents an actual judgment, which likely would have been positive for the NYT (and adjacent) cases.
And they actually went and did that afterwards. They just pirated them first.
It looks like you'll be able to search this site if the settlement is approved:
> https://www.anthropiccopyrightsettlement.com/
If your work is there, you qualify for a slice of the settlement. If not, you're outta luck.
Is there a way to license your content on the web so that it is only free for human consumption?
I.e., effectively making AI crawlers' use of it piracy, and thus subject to the same kind of penalties as here?
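For what it's worth, the closest thing that exists today is robots.txt, which is advisory rather than a license, so ignoring it isn't piracy under current law. GPTBot and ClaudeBot are real, documented crawler user agents; the policy itself is just a sketch:

```
# robots.txt: ask known AI crawlers to stay out, allow everyone else.
# Advisory only; compliance is voluntary under current law.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
```

Turning that from a request into an enforceable license with statutory-damages teeth is exactly what doesn't exist yet.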
That curl script you use to automate some task could become infringing.
The purpose of copyright protection is to promote the progress of "science and useful arts," and the public utility of allowing academia to investigate all works(1) exceeds the benefits of letting authors declare their works unponderable to the academic community.
(1) And yet, textbooks are copyrighted and the copyright is honored; I'm not sure why the academic fair-use exception doesn't allow scholars to just copy around textbooks without paying their authors.
I'm not sure to what extent you can specify damages like these in a contract, ask the lawyer who is writing it.
https://www.anthropic.com/news/anthropic-raises-series-f-at-...
A judge making a ruling based on his opinion of how transformative a technology will be doesn't inspire confidence. There's an equivocation on the word "transformative" here: not just transformative in the fair use sense, but transformative as in world-changing, impactful, revolutionary. The latter shouldn't matter in a case like this.
> Companies and individuals who willfully infringe on copyright can face significantly higher damages — up to $150,000 per work
At 500,000 works, that would be up to $75 billion in statutory exposure. Settling for $1.5 billion, roughly 2% of that, is a steal.
> “In June, the District Court issued a landmark ruling on A.I. development and copyright law, finding that Anthropic’s approach to training A.I. models constitutes fair use,” Aparna Sridhar, Anthropic’s deputy general counsel, said in a statement.
This is the highest-order bit, not the $1.5B in settlement. Anthropic's guilty of pirating.
I feel it is insane that authors do not receive some sort of standard compensation for each training use. Say a few hundred to a few thousand dollars, depending on the complexity of their work.
I might find the argument comparing it to a human persuasive once it is a full legal person and cutting power to it or deleting it is treated as murder. Before that, it is just bullshit.
And fundamentally, the reason for copyright to exist is to support creators and encourage them to create more. In a world where massively funded companies can freely exploit their work, and in many cases fully substitute for it, that principle has failed.
You can follow the case here: https://www.courtlistener.com/docket/69058235/bartz-v-anthro...
You can see the motion for settlement (what the news article is about) here: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...
Imagine going to 500k publishers to buy the rights individually. $3k per book is way cheaper. The copyright system is turning into a data marketplace in front of our eyes.
The main cost of doing this would be the time - even if you bought up all the available scanning capacity it would probably take months. In the meantime your competition who just torrented everything would have more high-quality training data than you. There are probably also a fair number of books in libgen which are out of print and difficult to find used.
It's less than 1% of Anthropic's valuation, a valuation utterly dependent on all the hoovering up of others' copyrighted works.
AFAICT, if this settlement signals that the typical AI foundation model company's massive-scale commercial theft doesn't result in judgments that wipe out a company (and its execs), then we have confirmation that it's a free-for-all for all the other AI gold rush companies.
Then making deals to license rights, in sell-it-to-us-or-we'll-just-take-it-anyway fashion, becomes only a routine and optional corporate cost-reduction exercise, not anything the execs will lose sleep over if it's inconvenient.
The settlement is real money though. Valuation is imaginary.
Writers were the true “foundational” piece of LLMs, anyway.
If someone breaks into my house and steals my valuables, without my consent, then giving me stock in their burglary business isn't much of a deterrent to them and other burglars.
Deterrence/prevention is my real goal, not the possibility of getting a token settlement from whatever bastard rips me off.
We need the analogue of laws and police, or the analogue of homeowner has a shotgun.
I understand that intentional copyright infringement is a crime in the US, you just need to convince the DOJ to prosecute Anthropic for it...
TBH I'm just going to plow all that money back into Anthropic... might as well cut out the middleman.
You can search LibGen by author to see if your work is included. I believe this would make you a member of the class: https://www.theatlantic.com/technology/archive/2025/03/searc...
If you are a member of the class (or think you are) you can submit your contact information to the plaintiff's attorneys here: https://www.anthropiccopyrightsettlement.com/
Also, passing the cost on to consumers of generated content, since companies would now need to pay royalties on the back end, should increase the cost of generating slop and hopefully push back against that trend.
This shouldn't just be books, but all written content, like scholarly journals and essays, news articles and blogs, etc.
I realize this is just wishful thinking, but there's got to be some nugget of aspirational desire to pay it forward.
OpenAI and Google will follow soon now that the precedent has been set, and will likely pay more.
It will be a net win for Anthropic.
It’s not precedent setting but surely it’ll have an impact.
https://www.tomshardware.com/tech-industry/artificial-intell...
>During a deposition, a founder of Anthropic, Ben Mann, testified that he also downloaded the Library Genesis data set when he was working for OpenAI in 2019 and assumed this was “fair use” of the material.
Per the NYT article, Anthropic started buying physical books in bulk and scanning them for their training data, and they assert that no pirated materials were ever used in public models. I wonder if OpenAI can say the same.
I would not be surprised if investors made their last round of funding contingent on settling this matter out of court precisely to ensure no precedents are set.