The AI-Scraping Free-for-All Is Coming to an End

https://nymag.com/intelligencer/article/ai-scraping-free-for-all-by-openai-google-meta-ending.html
29•geox•2h ago

Comments

WaltPurvis•2h ago
http://archive.today/SqPCL
jmkni•32m ago
It is a bit ironic that a paywalled article like this will have a top level comment with the archive link, which can then be easily scraped by AI (along with the comments)
tenuousemphasis•22m ago
It's not ironic at all. The only reason the anti-paywall sites work is that the news companies in fact want some scrapers reading the full article.
1gn15•1h ago
Biased TL;DR: Reddit (notable for having a high stock value from their "selling data" business [1]), Medium, Quora, and Cloudflare competitor Fastly created a standard to restrict what the reader can do with the data users created, called Really Simple Licensing (RSL). Basically robots.txt but with more details, notably with details on how much you should pay Reddit/Medium/Quora.

While this likely has no legal weight (except for EU TDM for commercial use, where the law does take opt-outs into account), they are betting on using services like Cloudflare and Fastly to enforce this.

[1] https://www.investors.com/research/the-new-america/reddit-st...
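For readers unfamiliar with the robots.txt baseline that RSL is being compared to here, a minimal sketch using Python's standard `urllib.robotparser` follows. The robots.txt content, the `ExampleAIBot` user agent, and the `may_fetch` helper are made up for illustration; nothing here reflects RSL's actual syntax, and plain robots.txt carries no pricing or licensing terms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch("ExampleAIBot", "https://example.com/post/1"))
print(may_fetch("SomeBrowser", "https://example.com/post/1"))
```

Note that the check runs on the crawler's side: compliance is voluntary, which is presumably why the comment expects enforcement to come from CDN-level services instead.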

isodev•1h ago
In other words, a lightweight form of DRM. Here come the reasons why we shouldn’t all deploy Cloudflare and similar as gatekeepers to the web.

Is there even one example of a “tech mega corp” that has grown to control more than 1/5 of its market without this circling back to hurt people in some way? A single example?

PhantomHour•1h ago
> While this likely has no legal weight

I wouldn't be quite so sure about that. The AI industry has entirely relied on 'move fast and break things' and 'old fart judges who don't understand the tech' as their legal strategy.

The idea that AI training is fair use isn't so obvious, and quite frankly is entirely ridiculous in a world where AI companies pay for the data. If it's not fair use to take Reddit's data, it's not fair use to take mine either.

On a technological level the difference from prior ML is straightforward: A classical classifier system is simply incapable of emitting any copyrighted work it was trained on. The very architecture of the system guarantees that it produces new information derived from the training data rather than the training data itself.

LLMs and similar generative AI do not have that safeguard. To be practically useful they have to be capable of emitting facts from training data, but have no architectural mechanism to separate facts from expressions. For them to be capable of emitting facts they must also be capable of emitting expressions, and thus, copyright violation.

Add in how GenAI tends to directly compete with the market of the works used as training data in ways that prior "fair use" systems did not and things become sketchy quickly.

Every major AI company knows this, as they have rushed to implement copyright filtering systems once people started pointing out instances of copyrighted expressions being reproduced by AI systems. (There are technical reasons why this isn't a very good solution to curtail copyright infringement by AI)

Observe how all the major copyright victories amount to judges dismissing cases on grounds of "Well you don't have an example specific to your work" rather than addressing whether such uses are acceptable as a collective whole.

janalsncm•24m ago
> The very architecture of the system guarantees it to produce new information derived from the training data rather than the training data itself

A “classical” classifier can regurgitate its training data as well. It’s just that Reddit never seemed to care about people training e.g. sentiment classifiers on their data before.

In fact a “decoder” is simply autoregressive token classification.
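As a rough illustration of that framing, here is a toy sketch in which generation is nothing more than a next-token classifier called in a loop. It uses a bigram count table as the "classifier", which is an assumption made for brevity and not how any production LLM is built; all function names are hypothetical. Because each token in the tiny training text has exactly one successor, the loop reproduces the training text verbatim.

```python
from collections import Counter, defaultdict


def train_next_token_classifier(tokens):
    """Count which token follows each token (a toy bigram 'classifier')."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def classify_next(counts, context):
    """Classify the context into one class: the most frequent next token."""
    options = counts.get(context[-1])
    if not options:
        return None  # no known successor, so generation stops
    return options.most_common(1)[0][0]


def decode(counts, prompt, max_new_tokens):
    """Autoregressive decoding: classify, append the result, repeat."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        token = classify_next(counts, out)
        if token is None:
            break
        out.append(token)
    return out


training_text = "the quick brown fox jumps over a lazy dog".split()
model = train_next_token_classifier(training_text)
generated = decode(model, ["the"], max_new_tokens=20)
print(" ".join(generated))
```

With a larger corpus the counts would spread over many successors and exact reproduction would become less likely, but nothing in the decoding loop itself rules it out.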

orangecat•7m ago
> 'old fart judges who don't understand the tech'

If this is intended to refer to Judge Alsup, it is extremely wrong.

luckylion•1h ago
Does that have any implications for liability for content? They're no longer just a provider, they are re-licensing and marketing content. Are they losing protection?
deadbabe•1h ago
Just ladder kicking at this point.
jsnell•1h ago
The headline seems pretty aspirational.

The licensing standard they're talking about will achieve nothing.

Anti-bot companies selling scraping protections will run out of runway: there's a limited set of signals, and none of them are robust. As the signals get used, they're also getting burned. And it's politically impossible to expand the web platform to have robust counter-abuse capabilities.

Putting the content behind a login wall can work for large sites, but not small ones.

The free-for-all will not end until adversarial scraping becomes illegal.

carlosjobim•49m ago
> Putting the content behind a login wall can work for large sites, but not small ones.

Syndication is the answer. Small artists are on Spotify, small video makers are on YouTube.

salawat•31m ago
Yes. Conglomeration and centralization. More, more, more!

See the problem?

atm3ga•39m ago
As AI companies like Perplexity introduce AI-enabled browsers like Comet, they will scrape websites through the interaction of end users with whatever site they are using. So anti-bot companies are indeed running out of runway.
thelittleone•31m ago
Wow, hadn't even considered this... so say I have a members-only section of my site where I share high-value content, and one of the members browses using Comet. Does that scrape the private content and send it to Perplexity?
lupire•12m ago
This also happens with covert botnets running secretly on user machines.
aaaggg•21m ago
L - wish they'd stop posting articles that are paywalled...
janalsncm•13m ago
> There was for years an experimental period, when ethical and legal considerations about where and how to acquire training data for hungry experimental models were treated as afterthoughts.

Those things were afterthoughts because for the most part the experimental methods sucked compared to the real thing. If we were in mid 2016 and your LSTM was barely stringing together coherent sentences, it was a curiosity but not a serious competitor to StackOverflow.

I say this not because I don’t think law/ethics are important in the abstract, but because they only became relevant after significant technological improvement.
