
The Nonprofit Doing the AI Industry's Dirty Work

https://www.theatlantic.com/technology/2025/11/common-crawl-ai-training-data/684567/
9•kgwgk•3mo ago

Comments

zeech•3mo ago
Dupe: https://news.ycombinator.com/item?id=45810135
Aloisius•3mo ago
Calling archiving the web for researchers dirty work is a bit much.

Unless something has changed since I was there, the crawler didn't intentionally bypass any paywalls.

The crawler obeyed robots.txt, throttled itself when visiting slow sites to avoid overloading them, and clearly announced its user agent along with a URL explaining what it was and how to block it if desired.
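
For readers unfamiliar with what that kind of politeness looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Common Crawl's actual code. The user agent string, the `example.org` URL, the helper names, and the "wait twice the last response time" throttle rule are all assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identity: a real crawler would put its own name and info URL here.
USER_AGENT = "ExampleBot/1.0 (+https://example.org/bot-info)"
AGENT_TOKEN = "ExampleBot"  # the token site owners would target in robots.txt


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Return True only if robots.txt allows our agent token to fetch the URL."""
    return parser.can_fetch(AGENT_TOKEN, url)


def polite_delay(parser: RobotFileParser, response_seconds: float,
                 floor: float = 1.0, ceiling: float = 60.0) -> float:
    """Seconds to wait before the next request to the same host.

    Assumed rule: wait at least the site's Crawl-delay (if declared), at
    least `floor`, and at least twice the last response time, so slow
    sites are visited less often. The result is capped at `ceiling`.
    """
    declared = parser.crawl_delay(AGENT_TOKEN) or 0
    adaptive = 2.0 * response_seconds
    return float(min(ceiling, max(floor, declared, adaptive)))
```

A crawler built along these lines would send `USER_AGENT` in the `User-Agent` header of every request, skip any URL for which `may_fetch` returns `False`, and sleep for `polite_delay(...)` seconds between requests to the same host.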