Japan's largest paper, Yomiuri Shimbun, sues Perplexity for copyright violations

https://www.niemanlab.org/2025/08/japans-largest-newspaper-yomiuri-shimbun-sues-perplexity-for-copyright-violations/

199•aspenmayer•6mo ago

Comments

aspenmayer•6mo ago

Original title edited for length:

> Japan’s largest newspaper, Yomiuri Shimbun, sues AI startup Perplexity for copyright violations

ujkhsjkdhf234•6mo ago

Before someone mentions Japan effectively making all data fair use for AI training, Japan specifically forbids direct recreation which is what this lawsuit is about.

ants_everywhere•6mo ago

If they are copying and pasting news articles on their site, that's a pretty straightforward copyright case I would think.

In the US at least this should be pretty well covered by the case law on news aggregators.

AraceliHarker•6mo ago

It was the Yomiuri Shimbun, which boasts the world's largest circulation, that established the mass reproduction of not just article bodies, but even headlines, as a violation of copyright.

ants_everywhere•5mo ago

Thanks, I was aware of the distinction between article bodies and headlines for news aggregators, but I was not aware of Yomiuri Shimbun's role

ronsor•6mo ago

I don't know why Perplexity in particular gets everyone in a snit. It's not even particularly special: a user inputs a query, an AI model does a web search and fetches some pages on the user's behalf, and then it serves the result to the user.

Putting aside that other products, such as OpenAI's ChatGPT and modern Google Search have the same "AI-powered web search" functionality, I can't see how this is meaningfully different from a user doing a web search and pasting a bunch of webpages into an LLM chat box.

> But what about ad revenue?

The user could be using an ad blocker. If they're using Perplexity at all, they probably already are. There's no requirement for a user agent to render ads.

> But robots.txt!!!11

`robots.txt` is for recursive, fully automated requests. If a request is made on behalf of a user, through direct user interaction, then it may not be followed and IMO shouldn't be followed. If you really want to block a user agent, it's up to you to figure out how to serve a 403.
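To make the distinction above concrete, here is a minimal sketch in Python. It assumes a hypothetical crawler token `ExampleBot` and a made-up site `news.example`; neither comes from the article. The first function shows the advisory side (a polite client checking `robots.txt` with the standard-library parser), and the second shows the enforcement side (the server deciding to answer 403 based on the `User-Agent` header).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; "ExampleBot" is an invented token.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /articles/

User-agent: *
Allow: /
"""

# Hypothetical server-side blocklist of User-Agent substrings (lowercase).
BLOCKED_AGENT_TOKENS = ("examplebot",)


def robots_allows(user_agent: str, url: str) -> bool:
    """Advisory check: what a well-behaved client would ask before fetching."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


def status_for(user_agent: str) -> int:
    """Enforcement: the status code the server chooses to send for this User-Agent."""
    ua = user_agent.lower()
    return 403 if any(token in ua for token in BLOCKED_AGENT_TOKENS) else 200
```

Nothing forces a client to call something like `robots_allows` at all, and a client that spoofs its `User-Agent` gets past `status_for` too, so real blocking usually needs more than a header check.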

> It's breaking copyright by reproducing my content!

Yes, so does the user's browser. The purpose of a user agent is to fetch and display content how the user wants. The manner in which that is done is irrelevant.

jaredwiener•6mo ago

There's a difference between what is technically feasible and what is allowed, legally or even morally.

Just because it is possible -- or even easy -- to essentially steal from newspapers/other media outlets, doesn't make it right, or legal. The people behind it put in labor, financial resources, and time to create a product that, like almost every other service, has terms attached -- and those usually come with some form of monetization. Maybe it is a paywall, maybe it is advertisements -- but it is there.

Using an adblocker, or finding some loophole around a paywall, etc, are all very easy to do technically, as any reader of this site knows. That said, the media outlet doesn't have to allow it. And when it is violated on an industrial scale, like Perplexity, then they can be understandably upset and take legal action. And that includes any AI (or other technology, for that matter) that is a wrapper around plagiarism.

Sites opted in to Google originally because it fed them traffic. They most likely did not opt in to an AI rewriter that takes their work and republishes it without any compensation.

Alex4386•6mo ago

Well, some bots even spoof User-Agents and send tons of requests without proper rate limiting (looking at you, ByteSpider).

People weren't playing fair even before LLMs, which is why we now get PoW challenges everywhere.

And what kind of conclusion is that? Since ad blockers are used everywhere, it's OK for corporations to skip licensing the articles, just yank them, and put them into a curation service? Especially one without ads? That's a licensing issue. The author allowed you to view the article in exchange for monetary support (i.e. ads); they didn't allow you to reproduce and republish the work by default.

Also, calling the browser itself "reproducing"? Yes, the data might be copied in memory (though I wouldn't call that reproducing the material, more like a transfer from the server to another machine), but redistribution is the main point here.

It's like saying, "part of the variable is copied from the L2 cache into a register, so reproducing the whole file in DRAM is authorized." The kind of "reproducing" you describe can't be prevented unless you bring in non-Turing computers that don't use active memory.

kazinator•6mo ago

The only reason you can say "looking at you ByteSpider" is that it identifies itself. In 2025, that qualifies it as a nice bot.

The nasty bots make a single access from an IP, and don't use it again (for your server), and are disguised to look like a browser hit out of the blue with few identifying marks.

petesergeant•5mo ago

> I don't know why Perplexity in particular gets everyone in a snit

I suspect they seem easier to sue than OpenAI, Anthropic, Meta, Google, and literally anything coming out of China.

daedrdev•6mo ago

Japan's copyright laws are extremely favorable to rights holders. My understanding is that without explicit permission, there is no fair use and so any reproduction or modified work is only allowed as long as they don't request a takedown.

beepbooptheory•6mo ago

From tfa:

> Japan’s copyright law allows AI developers to train models on copyrighted material without permission. This leeway is a direct result of a 2018 amendment to Japan’s Copyright Act, meant to encourage AI development in the country’s tech sector. The law does not, however, allow for wholesale reproduction of those works, or for AI developers to distribute copies in a way that will “unreasonably prejudice the interests of the copyright owner.”

Alex4386•6mo ago

tl;dr: If you are not directly affecting the "sales" of the product, you are good to go. But it seems Perplexity did, and is (as they might call it) directly trying to compete as a news source.

Personally, I find their news service's summarization kind of misleading, with AI hallucinations in some places.

kazinator•6mo ago

Training a model isn't redistribution; only when you give someone a copy of the model can we think about there being a problem. At that point, you are not training, but redistributing a derived work.

stubish•6mo ago

I wonder if you can download the copyrighted material without permission though? The article specifically states 'the scraping has been used by Perplexity to reproduce the newspaper’s copyrighted articles in responses to user queries without authorization'. They don't seem to be complaining about the training (legal), but the scraping.

anticensor•6mo ago

Japanese copyright law still has a few statutory exceptions.

AraceliHarker•6mo ago

The belief that it's acceptable to copy or alter copyrighted material unless the rights holder objects is merely an assertion by those who violate copyright law. Barring a few exceptions such as citation or non-commercial use without internet distribution, you are generally prohibited from using someone else's creative work without their consent.

totetsu•6mo ago

The Japan Newspaper Publishers & Editors Association is very active lobbying about this area https://www.pressnet.or.jp/english/

charcircuit•6mo ago

It's best not to crawl Japanese newspapers. Japan does not have the same kind of fair use. Even reproducing facts from a newspaper can be infringing.

Hamuko•6mo ago

Most of the world doesn't have fair use.

pyrale•6mo ago

I suspect we'll see AI's claim to fair use be challenged even in the US. The claim to be transformative is mostly based on the "shape" of the information being delivered (i.e. the AI rephrases the information).

However, the transformative nature of derivative work is not only about its appearance. It also factors in whether the transformation changes the nature of the message, and whether the derivative work is in direct competition with the original work [1]. I suspect for e.g. news articles, there's a good case that people get information that way instead of going to the newspaper, which means the derivative work competes with the original. Also when it comes to reporting news, there aren't many ways to make the message different that don't make the AI service bad.

[1]: https://en.wikipedia.org/wiki/Andy_Warhol_Foundation_for_the...

charcircuit•5mo ago

Japan does have fair use.

https://www.cric.or.jp/english/clj/cl2.html#chapter2sect3sub...

Hamuko•5mo ago

That's not fair use.

SilverElfin•6mo ago

I don’t understand why corporations can violate copyright laws at hyper scale but individuals are banned from small scale piracy through authoritarian internet governance.

mlinhares•6mo ago

The law only exists for those without enough money and influence to control the enforcers.

wat10000•6mo ago

They don’t even need control. It’s a version of the old saying that if you owe the bank a million dollars then you have a problem, but if you owe a billion dollars then the bank has a problem. If your company is important enough then it’s not possible (at least not politically) to punish it significantly. See also: 2008 and “too big to fail.”

nradov•6mo ago

Perplexity is still a small startup. If Enron and Theranos could be punished then Perplexity can be punished. So far it's unclear whether they've done anything illegal.

wat10000•6mo ago

It’s very difficult to punish Perplexity without also hitting OpenAI, Grok, Google, Facebook, etc.

It’s plenty clear to me that they’ve broken copyright law a lot. They’ve downloaded copyrighted material without permission for their own use, which we’ve been assured is Not Good for us individual people. Some of them even redistributed it by seeding torrents, which is even more Not Good.

nradov•6mo ago

It only seems "plenty clear" to you because you're ignorant about the basics of copyright law in the USA and Japan. Fortunately we have actual courts to decide these issues. The applicable laws (including centuries of case law in the USA) are complex and whether particular actions are legal often depends on nuances that aren't covered in news articles.

wat10000•6mo ago

I’m not talking about Japan. In the US, seeding a torrent containing copyrighted material without authorization from the copyright holder is unambiguously a copyright violation.

freetime2•6mo ago

Presumably you are talking about this case, where Meta is accused of having torrented a bunch of copyrighted works. [1]

Of relevance here is the fact that 1) Meta denies having seeded the content, and there looks to be no hard evidence that they distributed the content to other users, 2) the case is ongoing, so a decision has not yet been reached about whether they broke any laws, and 3) the fact that Meta is being sued for this shows that even corporations worth trillions of dollars are not immune to the consequences of breaking the law.

[1] https://www.tomshardware.com/tech-industry/artificial-intell...

wat10000•6mo ago

Of course they’re not immune to consequences. It’s just that the consequences are so relatively small that they don’t really care. Reminds me of the quote about how the law treats everyone equally: both rich and poor are forbidden to sleep under bridges, beg, and steal bread.

aspenmayer•6mo ago

> Perplexity is still a small startup.

https://www.crunchbase.com/organization/perplexity-ai

I don’t know how to parse this. I don’t think of them as small. Though they were only founded in 2022 and may not have a huge number of employees, they have had 8 funding rounds. They’re private, so I don’t know what they have raised, but some say that the company could have an $18B valuation.

https://www.bloomberg.com/news/articles/2025-07-17/ai-startu... | https://archive.is/6DZpo

Is that small?

presentation•6mo ago

Maxwell Tabarrok has a take on this [1], basically in his words:

> The confusion of intellectual property and property rights is fair enough given the name, but intellectual property is not a property right at all. Property rights are required because property is rivalrous and exclusive: When one person is using a pair of shoes or an acre of land, other people’s access is restricted. This central feature is not present for IP: an idea can spread to an infinite number of people and the original author’s access to it remains untouched.

> There is no inherent right to stop an idea from spreading in the same way that there is an inherent right to stop someone from stealing your wallet. But there are good reasons why we want original creators to be rewarded when others use their work: Ideas are positive externalities.

> When someone comes up with a valuable idea or piece of content, the welfare maximizing thing to do is to spread it as fast as possible, since ideas are essentially costless to copy and the benefits are large.

> But coming up with valuable ideas often takes valuable inputs: research time, equipment, production fixed costs etc. So if every new idea is immediately spread without much reward to the creator, people won’t invest these resources upfront, and we’ll get fewer new ideas than we want. A classic positive externalities problem.

> Thus, we have an interest in subsidizing the creation of new ideas and content.

And so you can reframe whether or not IP rights should be assigned in this case, based on whether you believe that the welfare generated by making AI better by providing it with content is more valuable for society than the welfare generated by subsidizing copyright holders.

[1] https://open.substack.com/pub/maximumprogress/p/ai-copyright...

wat10000•6mo ago

You should also look at the welfare generated by showing that all are equal under the law, versus showing that companies can get away with blatant lawbreaking if they can convince people that it’s for the greater good.

The proper way to decide this would be to pass a law in the legislature. But of course our system in general and tech companies in particular don’t work that way.

presentation•5mo ago

The USA tends to invest a great deal of legislative power in the courts, and as you mention the legislature isn’t very responsive nor effective, so this is what we get.

greysphere•6mo ago

There's no inherent right to anything, really. The statements in whatever declaration or philosophy are just arbitrary lines. Physical property rights are just as arbitrary as the divine right of kings (and incredibly closely related when that property is inherited!)

The argument really isn't based on rights. It's based on the fact that the rules of the game have been that people who make things get to decide what folks get to do with those things via licensing agreements, except for a very small set of carve-outs that everyone knew about when they made the thing. The argument is consent. The counterargument is one or all of: AI training falls under one of those carve-outs, and/or it's undefined so it should default to whatever anyone wants, and/or we should pass laws that change the rules. Most of these are just as logical as saying that if someone invented resurrection tomorrow, then murder would no longer be a crime.

philipallstar•5mo ago

> the divine right of kings (and incredibly closely related when that property is inherited!)

These seem to be very different indeed. You only need to be able to own and give property to have inheritance.

If your property is owned by a monarch or de facto the state, and you work your lifetime to rent it from them, then you don't get inheritance.

greysphere•5mo ago

The similarity between divine right of kings and inheritance is that an unearned benefit is transferred via circumstances of birth.

Your statements seem to extend that further: If you rent an apartment, the property is owned by a landlord (lord is literally in the title!) and passed down by their wishes. Similarly if you work for Walmart for life, the company is owned and passed down by the Waltons. In these cases the property rights extend beyond life and are transferred via circumstances of birth, while the rights of labor end.

Interesting that IP rights are ended by death (or death+n years) as well. This line of reasoning suggests maybe that should apply to all property.

Hamuko•6mo ago

>the welfare generated by making AI better by providing it with content is more valuable for society than the welfare generated by subsidizing copyright holders.

Isn't the AI in this case also copyrighted intellectual property that benefits its owners and not the society? As far as I know, Perplexity is a private, for-profit corporation.

I don't see how improving Perplexity's proprietary models is any more beneficial to society than YouTube blocking ad blockers.

presentation•5mo ago

Because there is arguably more societal value in commercial AI being able to do tasks well than there is in users being able to avoid looking at ads on an ad-supported platform.

pjc50•6mo ago

That's the standard rubric, but it doesn't actually answer the question of differential enforcement, which comes down to the usual questions: money and power.

presentation•5mo ago

In this case, the money and power are there precisely because people perceive AI as having the potential to reshape society, resulting in its creators receiving money and power, so it’s a bit of a chicken and egg situation.

impossiblefork•6mo ago

I actually think physical property rights are much more problematic than copyright.

Works are so sparse, and there is such an explosion in how many texts there are that when someone has a right to the exclusive use of one of these huge numbers that are almost unrepresentable, you lose almost nothing.

If someone didn't announce that they had written, let's say, Harry Potter and there was a secret law forbidding you from distributing it, that would be really bad, but it would never matter.

Copyright infringement is a pure theft of service. You took it because it was there, because someone had already spent the effort to make it, and that was the only reason you took it.

Land, physical property, etc. meanwhile, is something that isn't created only by human effort.

For this reason copyright, rather than some fake pseudo-property of lower status than physical property, is actually much more legitimate than physical property.

aspenmayer•6mo ago

How one adjudicates ownership or authorship disputes under copyright is fundamentally different than disputes about land and property ownership. We can go to records and so on in each case, but a resolution would be different in each case, because they are different sorts of potential violations or transgressions.

I don’t think it’s as clear who is at fault if I mention “he who must not be named” in a hypothetical scenario where Harry Potter was never published, and then start telling people about the manuscript I found. If I violated someone’s rights to privacy or property to get or keep the original manuscript, that’s one thing, but merely having it even if the author didn’t want me to have it as a copy especially is another issue. If I never published it but merely described it to others, I’m not sure if I’m any less culpable, but it seems like I should be.

I’m not sure how much more I can explore your thought experiment, but I appreciate you for sharing it with me.

rr808•6mo ago

It's the same reason Uber could run a ride service without taxi medallions and Airbnb can open home stays in your neighborhood. If there is enough money involved, the VCs in Silicon Valley know who to pay to get what they want.

freetime2•6mo ago

> I don’t understand why corporations can violate copyright laws at hyper scale

Can they, though? Isn't that why Perplexity is being sued?

JimDabell•6mo ago

Learning isn’t copying and copyright only restricts copying. Are you comparing cases where individuals distribute copies to cases where corporations are not distributing copies? The difference seems clear.

pluto_modadic•6mo ago

Anthropic has lawyers and buys senators; Aaron Swartz was one dude corporations could make an example of via the courts.

charcircuit•6mo ago

Transformative usage of copyrighted material is very different from people consuming content the way it was meant to be consumed for free.

thrance•6mo ago

Is it? Bulk downloading every article of a journal is OK if I train a neural network on it later, but accessing a single one without paying is not?

mightysashiman•6mo ago

I'm not pirating, I'm AI model training. Got it!

t0lo•6mo ago

Yanis Varoufakis would like to have a word with you

eviks•6mo ago

Do you understand this for other laws?

hulitu•6mo ago

It is because corporations can pay lawmakers for this, just how they did in the case of copyright law. Welcome to "democracy".

thrance•6mo ago

Yes, you do understand why. In our societies, capital is king.

yorwba•6mo ago

Both corporations and individuals are banned from piracy, but both corporations and individuals can violate copyright laws at hyper scale until somebody stops them. Corporations are probably more likely to get sued, but also more likely to get a lawyer instead of completely losing their head over a legal threat.

suspended_state•5mo ago

Disclaimer: I am not a lawyer, this is just my interpretation of the situation from the comments above.

I don't have an answer to your question, which seems more general and doesn't correspond to the situation described by the article anyway: here the corporations have the right to use copyrighted materials to train their model, in the same way that you are allowed to learn from the same materials. You might even learn it by heart if you want to, but copyright laws forbid you from reproducing it, and in this instance the Japanese law tries to follow the same principle for AI models.

How the corporations should implement their training to prevent their models from reproducing the material verbatim is their problem, not the copyright holder's, in exactly the same way that if you learn an article by heart, it's on you to make sure you won't recite it to the public.

_DeadFred_•5mo ago

For-profit products are not individual humans putting in effort to learn. Stop making that comparison.

Humans are human. Humans can human when there is no profit motive without it being a copyright violation. Effectively infinitely scaling, for profit products, can't 'human' without it being a copyright violation. The two are much different cases, in no way comparable.

For profit products are PRODUCTS intended to make money for companies. AIs are scalable past an individual human.

Rules/concepts for humans are not relevant at all for for profit products.

prasadjoglekar•5mo ago

Two different issues IMO. Piracy is depriving someone of payment for an item for which payment was expected. Neither you nor Perplexity may pirate a DVD that you didn't buy.

Copyright usually doesn't prevent copying per se, it's the redistribution that is violative. You, as well as Perplexity are free to scrape public sites. You'll both be sued if you distribute it.

mattigames•6mo ago

I wish there was an open fund anyone could donate to with the exclusive aim of suing Perplexity, OpenAI and others for copyright violations, where a team of lawyers would help the cases with the most likelihood to win, and that would try to highlight that the way such systems are "learning" has little similarity to the intent of the law when it was written to give leeway for other artists/authors to create similar creations.

miohtama•6mo ago

I wish there would be an open fund that allows me to do the opposite, and the fund would countersue copyright holders for holding development back and demanding excessive mafia payments

bluefirebrand•6mo ago

People getting paid for the work they do is offensive to you?

wand3r•6mo ago

I personally find this argument really lazy. In a very reductionist reframing, independent artists who uploaded some art to the internet for fun believe that AI shouldn't be allowed to exist without them being paid, essentially alleging their contribution to AI is fundamental to its existence. I would be a lot more receptive to the idea that all humans generally contributed to the information this system consumed and we enact some democratic law that 15% of all profits flow into some public tax fund, rather than litigate every single instance of potential copyright on the per person or organizational level.

There are obviously laws that differ in every region but at a philosophical level I believe in the ideal of fair use. An AI is a distinctly different "work" than these originals and much like a human's own output is informed by all the information they have taken in over their lifetime, so is the output of a model.

sensanaty•6mo ago

If these AIs can't exist without also gobbling up those artists' work, then yes? You can't have it both ways, either their artwork is worthless for the purposes of training an AI (in which case there should be no problem not hoovering up their art, right?) or it's worth something and they should be compensated for it.

wand3r•5mo ago

You are entitled to your opinion. Personally, I would only be able to accept your worldview if these artists grew up on something like an island without books or internet and pursued their craft 100% intuitively without any external influence. Then they could make a claim their work was 100% original. Otherwise, I find all human output to be derivative and build off the body of work of the entire race. This is one of mankind's greatest advantages IMO.

edit: When many make this argument, what they are really saying is "big fucks small". This may not be what you are saying, but seems to be the general philosophy of many who make this argument. I am sympathetic to that which is why I believe we should have something like a 15% tax or 2% of revenue of AI paid into a general tax fund. I find it impossible to litigate how much a news article should be "worth" when 400 of the same news article were written the same day with the value immediately diminishing after the "news" was new.

lmm•6mo ago

CamperBob2•6mo ago

Amazing how many copyright maximalists there are on a site called "Hacker News."

Seems to be a fairly recent trend. Wonder what changed.

wat10000•6mo ago

What changed is that copyright violation used to be something individuals did quietly, and got punished for. Now it’s something big companies are doing openly and they’re getting tons of money for it and zero consequences.

CamperBob2•5mo ago

"Copyright violation?" That remains to be seen, doesn't it? Which court do you sit on, and how many trillions of dollars in future value do you feel comfortable tossing away?

The copyright industry has done all it can for us, even in the most charitable interpretation. They literally, by constitutional mandate, can't be allowed to stand in the way of progress. We're not talking Napster 2.0 here.

wat10000•5mo ago

You’re going to give me shit for calling out a clear copyright violation because I’m not a judge, and yet you feel comfortable saying that it’s unconstitutional(?!) to stand in their way? What court do you sit on?

CamperBob2•5mo ago

A literal, plain-language reading of the Constitution is sufficient. Article I, Section 8, Clause 8: [The Congress shall have Power . . . ] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.

Copyright doesn't promote the progress of science. Rather the opposite, as it allows journals that contribute nothing to progress to charge the rest of us to access research our taxes paid for.

As for "arts," useful and otherwise, those are secured these days via unbreakable permanent DRM, which overtly violates the constitutional basis of copyright law as a time-limited bargain with the public domain. You should be at least as outraged about that as you are about AI, but evidently you're not.

Meanwhile, you'd have to have rocks in your head to argue that AI doesn't constitute scientific progress at a bare minimum.

wat10000•5mo ago

Actual judges on actual courts seem to think DRM is fine. So I’m confused. Do you reject laymen interpreting the law and only accept the evaluation of a judge, as indicated by your first comment? Or do you reject what judges say and go with your own “plain reading”? Seems like you’re confused about who’s qualified to say what constitutes lawbreaking.

CamperBob2•5mo ago

You do understand who the Constitution was written for, right? It wasn't written primarily for interpretation by judges. Judicial review came along later. It was written for you and me, and for the legislators we elect.

I don't view any decision or legislation that grants unbreakable DRM the force of law as legitimate. A work should benefit from temporary legal protection or permanent technical protection, but not both. My position is that if the founders had meant something other than a "Limited Time," they would have said so. If you disagree, great, but that means we're done here.

Matters such as whether AI training is fair use are better subjects for judicial review, IMO, because there's no plain language to go by. Of course I reserve the right to disagree with that decision, and to subsequently ignore it, in keeping with the spirit of the times. :)

And a billion people in China will respect a copyright-maximalist decision even less than I will.

mattigames•5mo ago

Nothing changed in my case (or in many others'); it's that perhaps you never grasped the big picture of our view, which is that copyright law should be soft against consumers that violate it (for non-profit reasons) and hard against corporations that do.

CamperBob2•5mo ago

Let's see if training a model is actually considered a copyright violation. I don't know that, and neither do you.

If it is adjudicated to be a violation, well, that's the end of copyright, for better or worse. AI is more important. Don't fight to lock down information; fight for equitable access instead.

jongjong•6mo ago

IMO the legal system is in disarray due to extreme asymmetries in how the law is selectively applied.

First of all, the way certain platforms get sued for certain activities while others are left alone is unfair and creates significant market distortions.

Then there is the fact that wealthy individuals have much better legal representation than non-wealthy individuals.

Then there are tax loopholes which create market asymmetries above that.

The word 'fair' doesn't even make sense anymore. We've got to start asking: fair for who?

freetime2•6mo ago

So it sounds like they definitely scraped the content and used it for training, which is legal.

The article is almost completely lacking in details though about how the information was reproduced/distributed to the public. It could be a very cut-and-dry case where the model would serve up the entire article verbatim. Or it could be a much more nuanced case where the model will summarize portions of an article in its own words. I would need to read up on Japanese copyright law, as well as see specific examples of infringement, to be able to make any sort of conclusion.

It seems like a lot of people are very quick to jump to conclusions in the absence of any details, though, which I find frustrating.

stubish•6mo ago

> So it sounds like they definitely scraped the content and used it for training, which is legal

It certainly seems legal to train. But the case is about scraping without permission. Does downloading an article from a website, probably violating some small print user agreement in the process, count as distribution or reproduction? I guess the court will decide.

incompatible•6mo ago

According to the article, they are complaining that the downloaded content had "been used by Perplexity to reproduce the newspaper’s copyrighted articles in responses to user queries." Derived works.

staticautomatic•5mo ago

“Reproduce” in this context reads like “copy/republish”, which would not be a derivative work.

incompatible•5mo ago

Yes, if it's an exact copy, but I don't know if their system is actually presenting entire articles, or just fragments (copyrightable, perhaps) and perhaps mixing them with other text.

mvdtnz•5mo ago

Reproducing articles is not "deriving" anything. It's reproducing.

alexey-salmin•6mo ago

Generally the court practice so far has been that if you don't register or log in, you never accept the user agreement. If the website is still willing to serve content to non-registered users, you're free to archive it. How you can use it afterwards is a separate question.

bgwalter•5mo ago

LLMs are able to reproduce the entire IP. Sometimes it requires more than one prompt. I've seen examples in the wild where a single prompt was sufficient:

https://jskfellows.stanford.edu/theft-is-not-fair-use-474e11...

Therefore, their output is a derivative work and violates copyright. The 2018 amendment is driven by big capital and should be reverted. Machines can plagiarize at huge scale and should have no human rights.

freetime2•5mo ago

I'm aware of the fact that LLMs can reproduce IP used in training data, and consider the example NYT article in your link to be "a very cut-and-dry case" of copyright infringement. And commercial AI companies especially should be held liable for damages if they can't or won't implement effective guardrails to prevent this from happening.

I'm somewhat optimistic this problem can be solved, though, with filters and usage policies. YouTube, another platform with basically unlimited potential for copyright infringement, has managed to implement a system that is good enough at preventing infringement to keep lawsuits at bay.

It's also not clear if that's what Yomiuri Shimbun is alleging here. In their 2023 "Opinion on the Use of News Content by Generative AI" [1] they give this example:

> Newspaper companies have long provided databases containing past newspaper pages and articles for a fee, and in recent years, they have also sold article data for AI development. If AI imports large quantities of articles, photos, images, and other data from news organizations’ digital news sites without permission, commercial AI services for third parties developing it could conflict with the existing database sales market and “unreasonably prejudice the interests of the copyright owner” (Article 30-4 of the Act). Also, even if all or part of a particular article communicates nothing further than facts and hardly constitutes a copyright, many contents deserve legal protection because of the effort and cost invested by the newspaper companies. Even if an AI collects and uses only the factual part, it does not mean it will always be legal.

So they are basically arguing that the 2018 amendment, which allows the use of copyrighted works to train AI models without permission from the copyright holder, is not applicable because the use "would unreasonably prejudice the interests of the copyright owner in light of the nature or purpose of the work or the circumstances of its exploitation". [2]

... which I think is a much more nuanced argument. I don't think we can just lump all of these cases together and say "it's infringement" or "it's fair use" without actually considering the details in each case. Or the specific laws in each country.

[1] https://www.pressnet.or.jp/statement/20230517_en.pdf

[2] https://www.cric.or.jp/english/clj/cl2.html

Shaddox•6mo ago

The fundamental problem is that everyone is expected to pitch in to help train these AIs, but only a handful of people benefit from it.

lvl155•5mo ago

This is what I call the Zuckerberg business model.

ekianjo•6mo ago

> If quality news content, which underpins democracy, decreases, the public’s right to know may be hampered.

Quality news content has not been a thing for a long time now, so the public will not notice any change.

tjpnz•5mo ago

Off-topic: Yomiuri Shimbun operates its own theme park and it's an absolute delight, especially during winter months when there's a spectacular light show during the evenings. I prefer it to Tokyo Disneyland because there's plenty there to occupy young children but with reasonable waiting times.

Give it a try on your next visit to Tokyo. I recommend arriving on the cablecar - almost feels like you're descending into Jurassic Park by helicopter (wife gets quite annoyed when I predictably start humming John Williams).

https://www.yomiuriland.com/en/
