He unfortunately had no choice but to put most of the content behind a login wall (you can only see parts of the articles/forum posts when logged out), and at this point he's strongly considering just hard-paywalling some of it... We're talking about someone who in good faith provided partial data dumps of content freely available for these companies to download. But caching / ETags? None of these AI companies, hiring "the best and the brightest," have ever heard of that. Rate limiting? LOL, what is that?
This is nuts, these AI companies are ruining the web.
I run a honeypot that generates URLs tagged with the source IP, so I'm pretty confident it's all one bot; in the past 48 hours I've had over 200,000 IPs hit the honeypot.
I'm pretty sure this is ByteDance; they occasionally hit these tagged honeypot URLs with their normal user agent, from their usual .sg datacenter.
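For anyone wondering how the IP tagging can work, here's a minimal sketch of the idea (the secret, URL scheme, and function names are all made up for illustration; it just embeds the requesting IP plus an HMAC into every generated link so the tag can't be forged):

```typescript
import { createHmac } from "node:crypto";

// Illustrative secret; in practice you'd load this from the environment.
const SECRET = process.env.HONEYPOT_SECRET ?? "change-me";

// Embed the requesting IP (hex-encoded) plus a truncated HMAC into each
// generated honeypot link, so tags can't be forged for IPs never seen.
function tagUrl(clientIp: string): string {
  const payload = Buffer.from(clientIp).toString("hex");
  const mac = createHmac("sha256", SECRET).update(payload).digest("hex").slice(0, 16);
  return `/archive/${payload}-${mac}.html`;
}

// When a tagged URL is fetched later, recover which IP originally received it.
function decodeTag(path: string): string | null {
  const m = path.match(/^\/archive\/([0-9a-f]+)-([0-9a-f]{16})\.html$/);
  if (!m) return null;
  const [, payload, mac] = m;
  const expected = createHmac("sha256", SECRET).update(payload).digest("hex").slice(0, 16);
  return mac === expected ? Buffer.from(payload, "hex").toString() : null;
}
```

If the IP that later fetches a tagged URL differs from the IP baked into it, the two addresses are almost certainly the same operator sharing a crawl queue.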
TLDR it's trivial to send fake info when you're the one who controls the info.
I think the eng teams behind those were just more competent / more frugal on their processing.
And since there wasn't any AWS equivalent, they had to be better citizens: their IP ranges were well known, so banning them was trivial for the crawled websites.
The search engines were also limited in resources, so they were judicious about what they fetched, when, and how often; optimizing their own crawlers saved them money, and in return it saved the websites money too. Even with a hundred crawlers actively indexing your site, they weren't going to index it more than, say, once a day, and 100 requests in a day isn't really that much, even back then.
Now, companies are pumping billions of dollars into AI; budgets are effectively infinite, limits are bypassed, and norms are ignored. If a company thinks it can benefit from indexing your site 30 times a minute, it will; and even if it doesn't benefit, there's no reason to stop, because it costs them nothing. They can't risk being anything other than up to date: if users are coming to you asking about current events and why Space Force is moving to Alabama, and your AI doesn't know but someone else's does, then you're behind the times.
So in the interests of maximizing short-term profit above all else - which is the only thing AI companies are doing in any way shape or form - they may as well scrape every URL on your site once per second, because it doesn't cost them anything and they don't care if you go bankrupt and shut down.
That's not my department! says Crawler von Braun
Sonnet responded: “Sorry, I have no access.” Then I asked it why and it was flummoxed and confused. I asked why Anthropic did not simply maintain mirrors of Wikipedia in XX different languages and run a cron job every week.
Still no cogent answer. Pathetic. Very much an Anthropic blindspot—to the point of being at least amoral and even immoral.
Do the big AI corporations that have profited greatly from the Wikimedia Foundation give anything back? Or are they just large internet bloodsuckers without ethics?
Dario and Sam et al.: Contribute to the welfare of your own blood donors.
Would be great if they did that and maybe seeded it too.
Even worse when you consider that you can download all of Wikipedia for offline use...
I'm still learning the landscape of LLMs, but do we expect an LLM to be able to answer that? I didn't think they had meta information about their own operation.
Cloudflare's solution to every problem is to allow them to control more of the internet. What happens when they have enough control to do whatever they want? They could charge any price they want.
Giving bots a cryptographic identity would allow good bots to meaningfully have skin in the game and crawl with their reputation at stake. It's not a complete solution, but could be part of one. Though you can likely get the good parts from HTTP request signing alone, Cloudflare's additions to that seem fairly extraneous.
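As a rough sketch of what "skin in the game" via request signing could look like (header names and signed fields here are invented for illustration; the real proposals build on HTTP Message Signatures, RFC 9421):

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// The crawler operator publishes the public key; every request carries a
// signature made with the private key, so reputation attaches to the key.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

// Crawler side: sign the method, path, and date of each request.
function signRequest(method: string, path: string): Record<string, string> {
  const date = new Date().toUTCString();
  const base = `${method.toUpperCase()} ${path}\ndate: ${date}`;
  return {
    Date: date,
    "Crawler-Signature": sign(null, Buffer.from(base), privateKey).toString("base64"),
  };
}

// Site (or shared reputation service) side: verify against the published key
// before deciding whether to serve, throttle, or block the request.
function verifyRequest(method: string, path: string, headers: Record<string, string>): boolean {
  const base = `${method.toUpperCase()} ${path}\ndate: ${headers["Date"]}`;
  return verify(null, Buffer.from(base), publicKey, Buffer.from(headers["Crawler-Signature"], "base64"));
}
```

A misbehaving but signed crawler can then be deprioritized everywhere at once, which is the "reputation at stake" part.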
I honestly don't know what a good solution is. The status quo is certainly completely untenable. If we keep going like we are now, there won't be a web left to protect in a few years. It's worth keeping in mind that there's an opportunity cost, and even a bad solution may be preferable to no solution at all.
... I say operating an independent web crawler.
You could combine that with some sort of IPFS/Bittorrent like system where you allow others to rehost your static content, indexed by the merkle hash of the content. That would allow users to donate bandwidth.
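The core of that scheme is just content addressing; a toy version, assuming a flat SHA-256 rather than the chunked merkle-DAG CIDs real systems like IPFS use:

```typescript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// The address of a static asset is the hash of its bytes, so the origin only
// has to publish (path, hash) pairs and volunteers can serve the bytes.
function contentAddress(filePath: string): string {
  return createHash("sha256").update(readFileSync(filePath)).digest("hex");
}

// A client that fetched the blob from a volunteer mirror verifies it against
// the address published by the origin before rendering it.
function verifyBlob(bytes: Buffer, expectedAddress: string): boolean {
  return createHash("sha256").update(bytes).digest("hex") === expectedAddress;
}
```

Because the bytes are verifiable, donated bandwidth doesn't require donated trust.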
I really don't like the idea that the way out of this is surveilling user agents more, or distinguishing between "good" and "bad" bots, which is a massive social problem.
Some ongoing recent discussion:
Cloudflare Radar: AI Insights
https://news.ycombinator.com/item?id=45093090
The age of agents: cryptographically recognizing agent traffic
https://news.ycombinator.com/item?id=45055452
That Perplexity one:
Perplexity is using stealth, undeclared crawlers to evade no-crawl directives
https://news.ycombinator.com/item?id=44785636
AI crawlers, fetchers are blowing up websites; Meta, OpenAI are worst offenders
I don't see this slowing down. If websites don't adapt to the AI deep search reality, the bot will just go somewhere else. People don't want to read these massive long form pages geared at outdated Google SEO techniques.
You are right that it doesn't look like it is slowing down, but the developing result of this will not be people posting a shorter recipe, it will be a further contraction of the public facing, open internet.
Made it when I was a teenager and got stuck running it the rest of my life.
Of course, the bots go super deep into the site and bust your cache.
Maybe they'll crawl less when it starts damaging models.
It's a statically generated React site I deploy on Netlify. About ten days ago I started incurring 30 GB of data per day from user agents indicating they're using Prerender. At this pace that traffic alone will push me past the 1 TB allotted for my plan, so I'm looking at an extra ~$500 USD a month for bandwidth boosters.
I'm gonna try the robots.txt options, but I'm doubtful this will be effective in the long run. Many other options aren't available if I want to continue using a SaaS like Netlify.
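For what it's worth, the robots.txt option mostly amounts to listing the user-agent tokens the big AI crawlers document, something like this (list illustrative and incomplete; compliance is entirely voluntary, which is why the doubt above is warranted):

```
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
User-agent: PerplexityBot
Disallow: /
```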
My initial thoughts are to either move to Cloudflare Pages/Workers where bandwidth is unlimited, or make an edge function that parses the user agent and hope it's effective enough. That'd be about $60 in edge function invocations.
I've got so many better things to do than play whack-a-mole on user agents and, when failing, pay this scraping ransom.
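The edge-function idea above might look roughly like this (a sketch only: the bot list is illustrative and the exact Netlify Edge Functions handler/config shape should be checked against their docs):

```typescript
import type { Context } from "@netlify/edge-functions";

// Rough sketch of a user-agent gate: refuse known AI/prerender bots cheaply
// at the edge instead of serving them the full prerendered pages.
const BOT_UA = /GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot|Prerender/i;

export default async (request: Request, context: Context) => {
  const ua = request.headers.get("user-agent") ?? "";
  if (BOT_UA.test(ua)) {
    return new Response("Automated access not permitted.", { status: 403 });
  }
  return context.next(); // pass everyone else through to the site
};

export const config = { path: "/*" };
```

Of course, anything keying on user agents only works until the scrapers stop announcing themselves, which is exactly the whack-a-mole you describe.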
Can I just say fuck all y'all AI harvesters? This is a popular free service that helps get people off of their Microsoft dependency and live their lives on a libre operating system. You wanna leech on that? Fine, download the data dumps I already offer on an ODbL license instead of making me wonder why I fucking bother.
Also, sue me, the cathedral has defeated the bazaar. This was predictable, as the bazaar is a bunch of stonecutters competing with each other to sell the best stone for building the cathedral with. We reinvented the farmer's market, and thought that if all the farmers united, they could take down Walmart. It's never happening.
It's not clear to me what taking down Cloudflare/Walmart means in this context. Nor how banding together wouldn't just incur the very centralization that is presumably so bad it must be taken down.
P.S. Thank you for ProtonDB, it has been so incredibly helpful for getting some older games running.
One of the worst takes I've seen. Yes, that's expensive, but the individuals doing insane amounts of unnecessary scraping are the problem. Let's not act like this isn't the case.
No, it's both.
The crawlers are lazy, apparently do no caching, and there is no immediately obvious way to instruct or force them to grab pages in a bandwidth-efficient manner. That said, I would not be surprised if someone here smugly contradicts me with instructions on how to do just that.
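To be that smug someone, but only partially: the plumbing for bandwidth-efficient fetching has existed forever (ETag / If-None-Match, Last-Modified / If-Modified-Since); the catch is that it only helps if the crawler bothers to use it, and you can't force that from the server side. A sketch of the conditional-request loop a polite crawler could run (illustrative code, standard fetch API):

```typescript
// Remember the ETag each URL returned last time and revalidate with it.
// A 304 response costs the origin almost nothing compared to a full page.
const etags = new Map<string, string>();

async function politeFetch(url: string): Promise<string | null> {
  const headers: Record<string, string> = {};
  const previous = etags.get(url);
  if (previous) headers["If-None-Match"] = previous;

  const res = await fetch(url, { headers });
  if (res.status === 304) return null; // unchanged since the last crawl

  const etag = res.headers.get("etag");
  if (etag) etags.set(url, etag);
  return await res.text();
}
```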
In the near term, if I were hosting such a site I'd be looking into slimming down every byte I could manage, using fingerprinting to serve slim pages to the bots and exploring alternative hosting/CDN options.
The images are from steamcdn-a.akamaihd.net, which I assume is already being hosted by a third-party (Steam)
I run a small-but-growing boutique hosting infrastructure for agency clients. The AI bot crawler problem recently got severe enough that I couldn't just ignore it anymore.
I'm stuck between, on one end, crawlers from companies that absolutely have the engineering talent and resources to do things right but still aren't, and on the other end, resource-heavy WordPress installations where the client was told it was a build-it-and-forget-it kind of thing. I can't police their robots.txt files; meanwhile, each page load can take a full 1 s round trip (most of that spent in MySQL), there are about six different pretty aggressive AI bots, and occasionally they'll get stuck on some site's product-variant or category pages and start hitting it at 1 request per second.
There's an invisible caching layer that does a pretty nice job with images and the like, so it's not really a bandwidth problem. The bots aren't even requesting images and other page resources very often; they're just doing tons and tons of page requests, and each of those is tying up a DB somewhere.
Cumulatively, it is close to having a site get Slashdotted every single day.
I finally started filtering out most bot and crawler traffic at nginx, before it gets passed off to a WP container. I spent a fair bit of time sampling traffic from logs, and at a rough guess, I'd say maybe 5% of web traffic is currently coming from actual humans. It's insane.
I've just wrapped up the first round of work for this problem, but that's just buying a little time. Now, I've gotta put together an IP intelligence system, because clearly these companies aren't gonna take "403" for an answer.
The Cathedral won. Full stop. Everyone, more or less, is just a stonecutter, competing to sell the best stone (i.e. content, libraries, source code, tooling) for building the cathedrals with. If the world is a farmer's market, we're shocked that the farmer's market is not defeating Walmart, and never will.
People want Cathedrals; not Bazaars. Being a Bazaar vendor is a race to the bottom. This is not the Cathedral exploiting a "tragedy of the commons," it's intrinsic to decentralization as a whole. The Bazaar feeds the Cathedral, just as the farmers feed Walmart, just as independent websites feed Claude, a food chain and not an aberration.
Let's say there are two competing options in some market. One option is fully commercialized; the other holds to open-source ideals (whatever those are).
The commercial option attracts investors, because investors like money. The money attracts engineers, because at some point "hacker" came to mean "comfortable lifestyle in a high COL area". The commercial option gets all the resources, it gets a marketing team, and it captures 75% of the market because most people will happily pay a few dollars for something they don't have to understand.
The open source option attracts a few enthusiasts (maybe; often just one), who labor at it in whatever spare time they can scrape together. Because it's free, other commercial entities use and rely on the open source thing, as long as it continues to be maintained under conditions that, if you squint, resemble slave labor. The open source option is always a bit harder to use, with fewer features, but it appeals to the 25% of the market that cares about things like privacy or ownership or self-determination.
So, one conclusion is "people want Cathedrals", but another conclusion could be that all of our society's incentives are aligned towards Cathedrals.
It would be insane, after all, to not pursue wealth just because of some personal ideals.
It's not about capitalism or incentives. Humans have cognitive limits, and technology is very low on the list for most. They want someone else to handle complexity so they can focus on their lives. Medieval guilds, religious hierarchies, tribal councils, your distribution's package repository: it's all cathedrals. Humans have always delegated complexity to trusted authorities.
The 25% who 'care about privacy or ownership' mostly just say they care. When actually faced with configuring their own email server or compiling their own kernel, 24% of that 25% immediately choose the cathedral. You know the type, the people who attend FOSDEM carrying MacBooks. The incentives don't create the demand for cathedrals, but respond to it. Even in a post-scarcity commune, someone would emerge to handle the complex stuff while everyone else gratefully lets them.
The bazaar doesn't lose because of capitalism. It loses because most humans, given the choice between understanding something complex or trusting someone else to handle it, will choose trust every time. Not just trust, but CYA (I'm not responsible for something I don't fully understand) every time. Why do you think AI is successful? I'd rather even trust a blathering robot than myself. It turns out, people like being told what to do on things they don't care about.
Isn't this the licensing problem? Berkeley releases BSD so that everyone can use it, people do years of work to make it passable, Apple takes it to make macOS and iOS because the license allows them to, and then they have both the community's work and their own work, so everyone uses that.
The Linux kernel is GPLv2, not GPLv3, so vendors distribute binary-blob drivers/firmware with their hardware, and the hardware becomes unusable as soon as they stop publishing new versions, because using it then means being stuck on an old kernel with known security vulnerabilities. Or they lock the bootloader, since v2 lacks the anti-Tivoization clause in v3.
If you use a license that lets the cathedral close off the community's work then you lose, but what if you don't do that?
The classic 80/20 rule applies. You can catch about 80% of lazy crawler activity pretty easily with something like this, but the remaining 20% will require a lot more effort. You start encountering edge cases, like crawlers that run out of AWS, while one of your customers somewhere is also syncing their WooCommerce orders to their in-house ERP system via a process that runs on AWS.
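For the easy 80%: AWS publishes its address space at https://ip-ranges.amazonaws.com/ip-ranges.json, so a crude first pass can be as simple as the sketch below (IPv4 only, and it's exactly the blunt instrument that also catches that customer's WooCommerce-to-ERP sync):

```typescript
// Crude IPv4 check against AWS's published ranges. Deliberately blunt: it
// flags every AWS-originated request, legitimate integrations included.
type Range = { base: number; mask: number };

const ipToInt = (ip: string): number =>
  ip.split(".").reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;

async function loadAwsRanges(): Promise<Range[]> {
  const res = await fetch("https://ip-ranges.amazonaws.com/ip-ranges.json");
  const data = await res.json();
  return data.prefixes.map((p: { ip_prefix: string }) => {
    const [base, bits] = p.ip_prefix.split("/");
    const mask = bits === "0" ? 0 : (~0 << (32 - Number(bits))) >>> 0;
    return { base: ipToInt(base) & mask, mask };
  });
}

const isAws = (ip: string, ranges: Range[]): boolean =>
  ranges.some((r) => (ipToInt(ip) & r.mask) === r.base);
```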
I guess it's a kind of soft login required for every session?
update: you could bake it into the cookie approval dialog (joke!)
I myself browse with cookies off, sort of, most of the time, and the number of times per day that I have to click a Cloudflare checkbox or help Google classify objects from its datasets is nuts.
You mean the peri-AI web? Or is AI already done and over and no longer exerting an influence?
Can't these responses still be cached by a reverse proxy as long as the user isn't logged in, which the bots presumably aren't?
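Yes, in principle; that's roughly what the "invisible caching layer" mentioned above could be extended to do for HTML too. A toy illustration of the idea (cookie name and TTL are assumptions; a real setup would do this in the reverse proxy, keyed on the WordPress login cookie):

```typescript
// Toy "cache for anonymous visitors" middleware: bots and logged-out readers
// get a shared cached copy; only requests with a session cookie hit the
// WordPress/MySQL backend. Cookie name and TTL are illustrative.
const cache = new Map<string, { body: string; expires: number }>();
const TTL_MS = 5 * 60 * 1000;

async function handle(
  request: Request,
  origin: (req: Request) => Promise<Response>,
): Promise<Response> {
  const loggedIn = (request.headers.get("cookie") ?? "").includes("wordpress_logged_in");
  const key = new URL(request.url).pathname;

  if (!loggedIn) {
    const hit = cache.get(key);
    if (hit && hit.expires > Date.now()) {
      return new Response(hit.body, { headers: { "x-cache": "HIT" } });
    }
  }

  const res = await origin(request);
  if (!loggedIn && res.status === 200) {
    cache.set(key, { body: await res.clone().text(), expires: Date.now() + TTL_MS });
  }
  return res;
}
```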
(If you choose to read this as, "WordPress is awful, don't use WordPress", I won't argue with you.)
It'd probably be easier to come at it from the other side and throw more resources at the DB or clean it up. I can't imagine what's going on that it's spending a full second on DB queries, but I also don't really use WP.
We had never had any issues before, and suddenly we got taken down three times in as many days. When I investigated, it was all Claude.
They were just pounding every route regardless of timeouts with no throttle. It was nasty.
They give web scrapers a bad rep.
Even if sites offered their content in a single downloadable file for bots, the bot operators wouldn't trust it to be fresh, so they'd keep scraping and ignore the easy method.
I help administer a somewhat popular railroading forum. We've had some of these AI crawlers hammering the site to the point that it became unusable to actual human beings. You design your architecture around certain assumptions, and one of those was definitely not "traffic quintuples."
We've ended up blocking lots of them, but it's a neverending game of whack-a-mole.
The obvious issues are: a) who would pay to host that database. b) Sites not participating because they don't want their content accessible by LLMs for training (so scraping will still provide an advantage over using the database). c) The people implementing these scrapers are unscrupulous and just won't bother respecting sites that direct them to an existing dumped version of their content. d) Strong opponents to AI will try poisoning the database with fake submissions...
Or does this proposed database basically already exist between Cloudflare and the Internet Archive, and we already know that the scrapers are some combination of dumb and belligerent and refuse to use anything but the live site?
Cloudflare has some large part of the web cached; IA takes too long to respond and couldn't handle the load. Google, OpenAI, and co. could cache these pages but apparently don't do it aggressively enough, or at all.
The attitude is visible in everything around AI, why would crawling be different?
Perhaps the AI crawlers can "click on some ads"
There is absolutely no need for the vast majority of websites to use databases and SSR; most of the web could be statically rendered and cost peanuts to host, but alas, WP is the most popular "framework".
No kidding. An increasing number of sites are putting up CAPTCHAs.
Problem? CAPTCHAs are annoying, they're a 50-times-a-day eye exam, and
> Google's reCAPTCHA is not only useless, it's also basically spyware [0]
> reCAPTCHA v3's checkbox test doesn't stop bots and tracks user data
[0] https://www.techspot.com/news/106717-google-recaptcha-not-on...
At least with what I'm doing, poorly configured or outright malicious bots consume about 5000x the resources of human visitors, so having no bot mitigation would mean I've basically given up and decided I should try to make it as a vegetable farmer instead of doing stuff online.
Bot mitigation in practice is a tradeoff between what's enough of an obstacle to keep most of the bots out, while at the same time not annoying the users so much they leave.
I think right now Anubis is one of the less bad options. Some users are annoyed by it (and it is annoying), but it's less annoying than clicking fire hydrants 35 times, and as long as you configure it right it seems to keep most of the bots out, or at least drives them to behave in a more identifiable manner.
Probably won't last forever, but I don't know what would, short of going full ancap and doing crypto microtransactions for each page request. That would unfortunately drive off not only the bots, but the human visitors as well.
> Our observations also highlight the vital role of open data initiatives like Common Crawl. Unlike commercial crawlers, Common Crawl makes its data freely available to the public, helping create a more inclusive ecosystem for AI research and development. With coverage across 63% of the unique websites crawled by AI bots, substantially higher than most commercial alternatives, it plays a pivotal role in democratizing access to large-scale web data. This open-access model empowers a broader community of researchers and developers to train and improve AI models, fostering more diverse and widespread innovation in the field.
...
> What’s notable is that the top four crawlers (Meta, Google, OpenAI and Claude) seem to prefer Commerce websites. Common Crawl’s CCBot, whose open data set is widely used, has a balanced preference for Commerce, Media & Entertainment and High Tech sectors. Its commercial equivalents Timpibot and Diffbot seem to have a high preference for Media & Entertainment, perhaps to complement what’s available through Common Crawl.
And there's one final number that isn't in the Fastly report [1] but is in the El Reg article [2]:
> The Common Crawl Project, which slurps websites to include in a free public dataset designed to prevent duplication of effort and traffic multiplication at the heart of the crawler problem, was a surprisingly-low 0.21 percent.
1: https://learn.fastly.com/rs/025-XKO-469/images/Fastly-Threat...
2: https://www.theregister.com/2025/08/21/ai_crawler_traffic/