The AI-Scraping Free-for-All Is Coming to an End

https://nymag.com/intelligencer/article/ai-scraping-free-for-all-by-openai-google-meta-ending.html

70•geox•4mo ago

Comments

WaltPurvis•4mo ago

jmkni•4mo ago

It is a bit ironic that a paywalled article like this will have a top level comment with the archive link, which can then be easily scraped by AI (along with the comments)

tenuousemphasis•4mo ago

It's not ironic at all. The only reason the anti-paywall sites work is that the news companies in fact want some scrapers reading the full article.

mschuster91•4mo ago

Actually, the team behind archive dot today has premium accounts, at least on spiegel.de, which I presume were bought with anonymous credit cards.

You can see artifacts when their servers are at queue load and you see the URLs, a few resources have the JWT with the account details in the URL. IIRC the clearname of the account in the token is Masha Rabinovich, with an email account masha@dns.li, an identity that has cropped up in various investigations [1][2].

[1] https://gyrovague.com/2023/08/05/archive-today-on-the-trail-...

[2] https://webapps.stackexchange.com/questions/145817/who-owns-...

ec109685•4mo ago

Also interesting how sites like this are mainstream whereas a link to a site hosting an mp3 of pirated music wouldn’t be tolerated in discussion forums like this.

I think a big difference is that there are no microtransactions or compulsory licensing for content, so it always feels patently unfair to buy a subscription to read one article.

yencabulator•4mo ago

I'd argue it's more that RIAA has historically been much more aggressive at suing than newspapers or magazines.

ec109685•4mo ago

True. I think it has ended up a net good. People make a living on music, and licensed music is everywhere.

orbisvicis•4mo ago

Kinda hard to discuss the news when your members can't read the news.

JacobKfromIRC•4mo ago

In this case, it also seems like the paywall doesn't show up if you have JavaScript disabled, which I find strange, but lots of news sites are like that I think.

euroderf•4mo ago

Related: Has anyone trained an LLM strictly on HN comments and linked-to articles ? I for one would get a kick out of interrogating it.

1gn15•4mo ago

Biased TL;DR: Reddit (notable for having a high stock value from their "selling data" business [1]), Medium, Quora, and Cloudflare competitor Fastly created a standard to restrict what the reader can do with the data users created, called Really Simple Licensing (RSL). Basically robots.txt but with more details, notably with details on how much you should pay Reddit/Medium/Quora.

While this likely has no legal weight (except for EU TDM for commercial use, where the law does take into account opt-outs), they are betting on using services like CloudFlare and Fastly to enforce this.

[1] https://www.investors.com/research/the-new-america/reddit-st...
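For readers unfamiliar with the robots.txt comparison above, here is a minimal sketch of how a well-behaved crawler consults robots.txt, using Python's standard `urllib.robotparser`. It shows plain robots.txt only, not RSL's own syntax. The robots.txt content, the `example.com` URL, the `ExampleBrowser` agent name, and the `build_parser` helper are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site: one AI crawler is
# blocked everywhere, every other user agent is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
url = "https://example.com/r/some-thread"  # placeholder URL

# A compliant crawler asks before fetching; nothing forces it to.
ai_allowed = parser.can_fetch("GPTBot", url)
browser_allowed = parser.can_fetch("ExampleBrowser", url)
print(ai_allowed, browser_allowed)
```

The check is purely advisory: a crawler that never calls `can_fetch` is not stopped by the file, which is why the comment expects enforcement to come from CDN-level blocking rather than from the file itself.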

isodev•4mo ago

In other words, a lightweight form of DRM. Here come the reasons why we shouldn’t all deploy CloudFlare and similar as gatekeepers to the web.

Is there even one example of a “tech mega corp” that has grown to control more than 1/5 of its market without this circling back to hurt people in some way? A single example?

PhantomHour•4mo ago

> While this likely has no legal weight

I wouldn't be quite so sure about that. The AI industry has entirely relied on 'move fast and break things' and 'old fart judges who don't understand the tech' as their legal strategy.

The idea that AI training is fair use isn't so obvious, and quite frankly is entirely ridiculous in a world where AI companies pay for the data. If it's not fair use to take reddit's data, it's not fair use to take mine either.

On a technological level the difference to prior ML is straightforward: A classical classifier system is simply incapable of emitting any copyrighted work it was trained on. The very architecture of the system guarantees it to produce new information derived from the training data rather than the training data itself.

LLMs and similar generative AI do not have that safeguard. To be practically useful they have to be capable of emitting facts from training data, but have no architectural mechanism to separate facts from expressions. For them to be capable of emitting facts they must also be capable of emitting expressions, and thus, copyright violation.

Add in how GenAI tends to directly compete with the market of the works used as training data in ways that prior "fair use" systems did not and things become sketchy quickly.

Every major AI company knows this, as they have rushed to implement copyright filtering systems once people started pointing out instances of copyrighted expressions being reproduced by AI systems. (There are technical reasons why this isn't a very good solution to curtail copyright infringement by AI)

Observe how all the major copyright victories amount to judges dismissing cases on grounds of "Well you don't have an example specific to your work" rather than addressing whether such uses are acceptable as a collective whole.

janalsncm•4mo ago

> The very architecture of the system guarantees it to produce new information derived from the training data rather than the training data itself

A “classical” classifier can regurgitate its training data as well. It’s just that Reddit never seemed to care about people training e.g. sentiment classifiers on their data before.

In fact a “decoder” is simply autoregressive token classification.
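A toy sketch of that last point, assuming nothing about real LLM internals: a "classifier" that predicts the next token from the previous one, run in a loop so each prediction becomes the next input. The bigram counting model, the function names, and the training sentence are all invented for illustration. With this little data, the loop reproduces its training text verbatim.

```python
from collections import Counter, defaultdict


def train_next_token_classifier(tokens):
    """Count, for each token, which tokens followed it in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def classify_next(counts, prev):
    """One classification step: the most frequent successor, or None."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def decode(counts, start, max_tokens):
    """Autoregressive decoding: feed each predicted token back in as input."""
    out = [start]
    while len(out) < max_tokens:
        nxt = classify_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return out


# Made-up training sentence in which every token is unique.
training_text = "every scraped comment eventually returns as generated output"
tokens = training_text.split()

model = train_next_token_classifier(tokens)
generated = decode(model, tokens[0], max_tokens=len(tokens))
print(" ".join(generated))
```

This is a bigram counter, not a neural network, so it only illustrates the structure of the argument: repeated next-token classification is generation, and a model that fits its training data closely enough can emit that data back.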

orangecat•4mo ago

> 'old fart judges who don't understand the tech'

If this is intended to refer to Judge Alsup, it is extremely wrong.

PhantomHour•4mo ago

It is not.

visarga•4mo ago

> but have no architectural mechanism to separate facts from expressions

Sure they do. Every time a bot searches, reads your site and formulates an answer, it does not replicate your expression. First of all, it compares across 20 to 100 sources. Second, it only reports what is related to the user query. And third, it uses its own expression. It's more like asking a friend who read those articles and getting an answer.

LLMs' ability to separate facts from expression is quite well developed, maybe their strongest skill. They can translate, paraphrase, summarize, or reword forever.

PhantomHour•4mo ago

This is a baseless assertion of emergent behaviour.

> Every time a bot searches

We are talking about LLMs by themselves, not larger systems using them.

> LLMs ability to separate facts from expression is quite well developed

It is not. Whether you ask an LLM for an excerpt of the bible, or an excerpt of The Lord of the Rings, the LLM does not distinguish. It has no concept of what is, and what is not, under copyright.

squigz•4mo ago

> LLMs ability to separate facts from expression is quite well developed, maybe their strongest skill.

There should presumably be data showing the reliability of LLMs' knowledge to be quite high, then?

ndriscoll•4mo ago

I don't see how that follows. It can learn a false "fact" while not retaining the way that statement was expressed. It can also just make up facts entirely, which by definition then did not come from any training data.

HarHarVeryFunny•4mo ago

> The idea that AI training is fair use isn't so obvious

> Observe how all the major copyright victories amount to judges dismissing cases on grounds of "Well you don't have an example specific to your work" rather than addressing whether such uses are acceptable as a collective whole.

Well, all a judge can/should do is to apply current law to the case before them. In the case of generative AI then it seems that it's mostly going to be copyright and "right of publicity" (reproducing someone else's likeness/voice) that apply.

Copyright infringement is all about having published something based on someone else's work - AFAIK it doesn't have anything to say about someone/something having the potential to infringe (e.g. training an AI) if they haven't actually done it. It has to be about the generated artifact.

Of course copyright law wasn't designed with generative AI in mind, and maybe now that it is here we need new laws to protect creative content. For example, should OpenAI be able to copy Studio Ghibli's "trademark" style without requiring permission?

PhantomHour•4mo ago

> Well, all a judge can/should do is to apply current law to the case before them

This is true, and I do not mean to suggest it is bad. But rather, that it leaves uncertainty. These cases can all be struck down without reducing the possibility that if one does stick, the entire industry is at stake.

> Copyright infringement is all about having published something based on someone else's work - AFAIK it doesn't have anything to say about someone/something having the potential to infringe (e.g. training an AI) if they haven't actually done it. It has to be about the generated artifact.

A notable problem here is that AI models are not "standalone products" but tools provided as a service. This complicates the situation.

Take Disney/Universal's case against Midjourney, which is both about the models but also the provision of services.

Even if only the latter gets deemed illegal, that's ruinous for the big AI companies. What good is OpenAI if they can't provide ChatGPT? Who would license a LLM if the act of using it creates constant legal risks?

luckylion•4mo ago

Does that have any implications on liability for content? They're no longer just a provider, they are re-licensing and marketing content. Are they losing protection?

ec109685•4mo ago

It’s surprising Reddit doesn’t get pushback for reselling their users’ content.

The right thing would be for the end users to receive the compensation Reddit is getting from AI companies.

lotsofpulp•4mo ago

It is not clear what makes that the right thing. For example, I have probably saved a decent amount of time and money searching for solutions on Reddit, so would it have been “right” for me to compensate Reddit?

ec109685•4mo ago

You did via ads, and some of that value should go to the commenters.

deadbabe•4mo ago

Just ladder kicking at this point.

jsnell•4mo ago

The headline seems pretty aspirational.

The licensing standard they're talking about will achieve nothing.

Anti-bot companies selling scraping protections will run out of runway: there's a limited set of signals, and none of them are robust. As the signals get used, they're also getting burned. And it's politically impossible to expand the web platform to have robust counter-abuse capabilities.

Putting the content behind a login wall can work for large sites, but not small ones.

The free-for-all will not end until adversarial scraping becomes illegal.

carlosjobim•4mo ago

> Putting the content behind a login wall can work for large sites, but not small ones.

Syndication is the answer. Small artists are on Spotify, small video makers are on YouTube.

salawat•4mo ago

Yes. Conglomeration and centralization. More, more, more!

See the problem?

carlosjobim•4mo ago

You don't have to syndicate a million small creators to have a product worthwhile for consumers, it could be a thousand, a hundred, ten thousand creators in a syndicate. You can have a huge number of syndicates, which benefits creators and consumers.

orbisvicis•4mo ago

But in such an environment syndicates will have an incentive to centralize.

carlosjobim•4mo ago

I don't see why. In general, there are competing syndicates and businesses of every size in most sectors of the economy.

em-bee•4mo ago

which sectors would that be? not the tech sector, not the oil sector, not the car sector. i see companies buying up properties in real estate, i hear about companies buying up retirement homes (or some other kind of care facilities). retail? online retail? fast food? processed food? everywhere i see massive dominating brands. music labels? movies are consolidating in major studios, although they recently got some new players with netflix, apple and amazon. but those are still dominating companies.

carlosjobim•4mo ago

It is clear that no matter which examples I would give you, you would not acknowledge that there are any sectors with competition. Anybody can look at any sector of their interest and see that there is competition, and it is trivially easy to do so. Including in the sectors you gave as examples. If you don't believe there is competition within fast food, then please list all the fast food companies in your country below.

em-bee•4mo ago

i am not denying that there is competition. the problem is that you reject that there is an incentive to centralize. if that was true, then none of the consolidations we have seen would have happened.

carlosjobim•4mo ago

Yes, there are incentives to centralize. But since customers are such an incredibly diverse group, it will be very difficult to make any huge centralization unless one company delivers an incredibly good product for a very good price, which also satisfies creators. And if that happens, then great.

em-bee•4mo ago

> it will be very difficult to make any huge centralization unless one company delivers an incredibly good product for a very good price, which also satisfies creators.

not true. all they need to do is to buy up their competitors if they have any and remove them from the market, so that you end up with no choice. or take microsoft. they never had any competitors for a long time, and they defend their marketshare with all tricks they can think of.

here are just a few articles about this issue. they focus on tech companies, but the same is happening in every industry:

https://insights.som.yale.edu/insights/wave-of-acquisitions-...

https://www.cbsnews.com/colorado/news/rep-ken-buck-big-tech-...

https://www.bbc.com/news/business-54443188

https://www.library.hbs.edu/working-knowledge/how-big-compan...

https://www.vox.com/the-goods/22550608/how-big-business-expl...

https://reason.com/2021/07/07/how-big-business-uses-big-gove...

https://www.linkedin.com/pulse/big-lie-fair-share-how-compan...

carlosjobim•4mo ago

It was as I said. You would never acknowledge that competition exists or has at any time existed within any sector. So to keep arguing against you is like arguing against somebody who claims that everybody in town wears a hat.

You're only doing yourself a disservice by refusing to acknowledge reality, when it's right in front of your face.

em-bee•4mo ago

well, we apparently see two different realities.

i do acknowledge that competition exists, but i also argue that this is being overshadowed by big companies who may compete amongst themselves but use their power to prevent competition by smaller companies.

you seem to say it doesn't matter, people wouldn't buy from big companies if their products weren't good. and i disagree with that. people buy from big companies because they are cheaper, because their marketing is overwhelming, and because they are lured with free products that small companies can't afford to offer. creators are forced to be on youtube because the audience is on youtube. competition exists, but it doesn't matter. same goes for publishing books on amazon. i know one author who stated that he can't afford not to be exclusive on amazon because it would significantly reduce his revenue.

besides a few exceptions, small companies can not compete against big ones. it is not a fair playing field.

and i really don't understand why you keep arguing about competition, and claim that i don't acknowledge that competition exists. i didn't make such a claim.

the thing i am claiming is that competition does not counteract centralization.

atm3ga•4mo ago

As AI companies like Perplexity introduce AI enabled browsers like Comet, they will scrape web sites through the interaction of end-users with whatever site they are using. Therefore, indeed anti-bot companies are absolutely running out of runway.

thelittleone•4mo ago

Wow hadn't even considered this... so say I have a members only section of my site where I share high value content, one of the members browses using Comet, and that scrapes the private content and sends to perplexity?

lupire•4mo ago

This also happens with covert botnets running secretly on user machines.

datadrivenangel•4mo ago

Any user could manually download your data anyways. Access is access.

tempodox•4mo ago

And a browser can do it automated and behind user’s back.

kanemcgrath•4mo ago

Not sure if it's still an issue, but companies were buying popular web extensions, then auto-updating malware/spyware into them. I haven't heard much about this in a while, but I think Chrome still forces auto-updates for extensions, so I would expect this to be the biggest vector for scraping walled data now.

ec109685•4mo ago

The way comet browses the web is weird enough that it’s easily detectable.

atm3ga•4mo ago

Does detectability matter? Are we now entering an era of forced browser compliance? That is, if I use Comet exclusively as my browser; is my bank, insurance company, or news site going to force me to stop and use a "normal" browser and what will that look like as every browser also has AI capabilities? Maybe certain resources will only be available via apps? Seems like a very slippery slope and very user hostile.

orbisvicis•4mo ago

I really don't want AI to be able to produce my bank account balance and routing number on demand.

Aerroon•4mo ago

Great, but it won't stop there. You will use Chrome or else.

Well, with one alternative: Edge.

Incipient•4mo ago

Surely that's highly illegal, and no one would actually use a browser that sent your entire browsing DATA not just history, to a third party?

zbentley•4mo ago

I would hope so as well, but doubt it: if the user consents to their communications being MITM’d by the browser, basically, then I’m not sure there’s currently a legal basis for forbidding that behavior. Many sites/applications accessed by the browsing user may have terms that forbid that kind of data sharing though.

cwmoore•4mo ago

Gross. Terminate TOSs. We all need legal agents: perhaps they would (technically) time-travel back to when these kinds of intrusions began and retroactively disaggregate the prolonged and massive data theft from human beings' individual choicemaking efforts.

gdulli•4mo ago

Did you stop getting non-compliant spam when that became illegal?

Gigachad•4mo ago

It’s pretty easy. Most sites will get locked behind accounts, likely with phone number verification. Then they will be able to easily spot automated scraping.

chrsw•4mo ago

Every time I think the web is finally dead, it somehow gets deader.

observationist•4mo ago

Copyright law perpetuating the institutions that are no longer providing value to the commons means copyright law has completely and utterly failed.

We don't need these institutions. We don't need these publishing platforms.

It's ok for them to die. They no longer provide value.

Adversarial scraping is not a thing, and it can't hurt you.

Fair use, however, is a thing, and what we need to be doing is totally overhauling copyright law such that it maximizes protections for individual creative types, and does away with the exploitable corporatized loopholes and bureaucracy.

99% of all sales for nearly all copyrighted products are done within the first 4 years of a work hitting the market. Give ironclad copyright to the creator for 5 years. The creator can assign their rights, explicitly, in writing, to a third party, for any particular work, or any particular fraction of their work, but each and every assignment of rights has to be explicitly documented and notarized.

No more DMCA automated bullshit. The creator can submit a copyright claim. They need to provide evidence. If the evidence of wrongdoing is false, they should be fined. If a third party files a claim, they should be fined, zero exceptions, even if they have assigned rights.

Artists and creators and writers should get the recognition - if someone creates a thing, they attach a name to it, and they can lease rights to corporations or the like.

After 5 years, extend fair use to something liberal and generous, requiring both acknowledgments of source works and royalties, no more than 15%, paid to the creator/s. If multiple post-5 year "fair use" creators are involved, the 15% is split between them. From 5-15 years, you have to give credit and pay a fair use royalty. If you're a trillion dollar company, you're shelling out a lot of royalties. If you're an artist reusing other art, or writing fanfic for profit, or whatever, you're buying other artists a coffee in tribute.

After 15 years, it becomes public domain.

Anything older than 5 years becomes fair game for training AI or otherwise using in software. You set aside 15% for distribution and reimbursement once a year, and notify any creator of your use of their material.

We need something sane, that scales, that doesn't hand power to corrupt cadres of lawyers and middlemen who do nothing but leech from creatives and ruin innocent people's lives.

AI is here to stay. Let's set up a system in which they contribute back to the commons in a significant way, that doesn't favor byzantine licensing and gatekeeping schemes designed to keep lawyers fat and happy off the efforts of people actually contributing to the common good. Let's allow the corporate media platforms and publishing outfits to die off. We have much better ways of doing things and better ways of rewarding people for their work. We don't need lawyers sucking up 80% of the profits for "facilitating deals" or whatever it is they tell themselves to sleep at night.

Raze the old system and salt the ground. Simplify everything for the practical and creative people to maximize on the value all around, get people the credit and profit they deserve, and foster a vibrant public good. It doesn't need to be thousands of pages of technicalities and byzantine law and legal tradecraft. That game was built for the lawyers, and we should stop playing it.

freejazz•4mo ago

Yawn.

aaaggg•4mo ago

L - wish they'd stop posting articles that are paywalled...

janalsncm•4mo ago

> There was for years an experimental period, when ethical and legal considerations about where and how to acquire training data for hungry experimental models were treated as afterthoughts.

Those things were afterthoughts because for the most part the experimental methods sucked compared to the real thing. If we were in mid 2016 and your LSTM was barely stringing together coherent sentences, it was a curiosity but not a serious competitor to StackOverflow.

I say this not because I don’t think law/ethics are important in the abstract, but because they only became relevant after significant technological improvement.

Zigurd•4mo ago

Sites containing original content will adopt active measures against LLM scraper bots. Unlike search indexing bots, there's much less upside to allowing scraping for LLM training material. Openly adversarial actions, like serving up poisoned text that would induce LLMs to hallucinate, are much more defensible.

ath3nd•4mo ago

Next: the AI bubble is coming to an end. Also fingers crossed that the career and employment of Mark Zuckerberg also follow suit soon.

throwawayqqq11•4mo ago

AI companies struggle to turn a profit, Meta does not.

VC will eventually run out, then comes the burst.

xarope•4mo ago

I can see how the AI companies would work around this though:

user queries "static" training data in LLM; LLM guesses something, then searches internet in real-time for data to support the guesses. This would be classified as "browsing" rather than trawling.

(the searched data then get added back into the corpus, thus sadly sidestepping all the anti-AI trawling mechanisms)

Kind of like the way a normal user would.

The problem is, as others have already mentioned, how would the LLMs know what is a good answer versus a bad, when a "normal" user also has this issue?

ericdotlee•4mo ago

What a lot of these journalists don't realize is ai tools are the internet funnels of the future. People use ChatGPT not Google to source info. The way you get results is begging these tools to search for specific bits of info in order to get visibility.

rkozik1989•4mo ago

I've got a hard time believing most folks are clicking on the sources that LLM chatbots sometimes provide with their answers. Especially because they've been hard sold on the idea that AI chatbots are as smart as the smartest people on the planet. They're likely just going to do what normal folks do when smart people answer questions i.e. assume they're right and move on.

skarz•4mo ago

I do. I treat ChatGPT output the same as Wikipedia content. I find what I am looking for then immediately open the source to confirm. I would never in a million years take ChatGPT output or Wikipedia content and blindly reference it without doing my due diligence.

koolhead17•4mo ago

How has LinkedIn succeeded where the rest failed?

PolicyPhantom•4mo ago

Free-for-All was a natural assumption in the early internet, but in the age of AI, alignment with contracts and governance becomes essential. Technical capability alone is not enough — without mechanisms like licensing or audits to ensure legitimacy, such practices may prove socially unsustainable.
