It also presumes that dealing with automated traffic is a solved problem, which, with the volume of LLM scraping going on, is simply not true for more hobbyist setups.
FWIW, I have not seen a reputable report quantifying web scraping traffic over the past 3 years.
(Wikipedia being a notable exception... but I would guess Wikipedia sees a far larger increase than anything else.)
A lot of it is coming through compromised residential endpoint botnets.
More importantly, Wikipedia almost certainly represents the ceiling of traffic increase. But luckily, we don't have to work with such coarse estimation, because according to Cloudflare, the total increase in combined search and AI bot traffic over the last year (May 2024 - May 2025) has been just... 18% [2].
The way you hear people talk about it, though, you'd think that servers are now receiving DDoS levels of traffic or something. For the life of me I have not been able to find a single verifiable case of this. Which, if you think about it, makes sense... It's hard to generate that sort of traffic; that's one of the reasons people pay for botnets. You don't bring a site to its knees merely by accidentally "not making your scraper efficient". So the only other possible explanation would be a much larger number of scrapers simultaneously but independently hitting sites. But this also doesn't check out. There aren't thousands of different AI scrapers out there that in aggregate are producing huge traffic spikes [2]. Again, the total combined increase is 18%.
The more you look into this accepted idea that we are in some sort of AI scraping traffic apocalypse, the less anything makes sense. You then look at this Anubis "AI scraping mitigator" and... I dunno. The author contends that one of its tricks is that it not only uses JavaScript, but "modern JavaScript like ES6 modules," and that this is one of the ways it detects/prevents AI scrapers [3]. No one is rolling their own JS engine for a scraper such that they are blocked by their inability to keep up with the latest ECMAScript spec. You just use an existing JS engine, all of which support all these features. It would actually be a challenge to find an old JS engine these days.
The entire thing seems to be built on the misconception that the "common" way to build a scraper is something curl-esque. That idea is based entirely on the Google scraper, which itself doesn't even work that way anymore, and only ever did because it was written in the 90s. Everyone who rolls their own scraper these days just uses Puppeteer. It is completely unrealistic to build a scraper that doesn't run JavaScript and wait for the page to "settle down", because so many pages, even blogs, are just entirely client-side-rendered SPAs. If I were to write a quick and dirty scraper today, I would trivially bypass Anubis' protections... by doing literally nothing, without even realizing Anubis exists. Just by using standard scraping practices with Puppeteer. Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Message's link preview generator.
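To make "standard scraping practices with Puppeteer" concrete, here is a minimal sketch of the kind of scraper I mean (illustrative only, not any particular project's code; assumes the puppeteer npm package is installed). Because it drives a real headless Chromium, every script on the page, ES6 modules included, executes exactly as it would in any visitor's browser, which is why a "does it run modern JavaScript?" check tells you nothing:

    // Illustrative sketch: fetch a page the way a browser-based scraper does.
    import puppeteer from 'puppeteer';

    async function scrape(url: string): Promise<string> {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      // Run the page's JavaScript and wait for network activity to settle,
      // so client-side-rendered SPAs (and any JS challenge) finish executing.
      await page.goto(url, { waitUntil: 'networkidle2' });
      const html = await page.content();
      await browser.close();
      return html;
    }

    scrape('https://example.com').then((html) => console.log(html.length));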
I'm looking into this further. I think this entire thing may have started due to some confusion, but I want to see if I can actually confirm that before speculating any more.
1. https://www.techspot.com/news/107407-wikipedia-servers-strug... (notice the clickbait title vs. the actual contents)
2. https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...
3. https://codeberg.org/forgejo/discussions/issues/319#issuecom...
4. https://github.com/TecharoHQ/anubis/issues/964#issuecomment-...
A better analogy would be "Robots.txt is a note saying your backdoor might be unlocked".
I don't even think it's a note saying your back door is unlocked? As I and others shared in a sibling comment thread, we have worked at places that implemented robots.txt in order to keep bots out of nearly-infinite tarpits of links that lead to nearly-identical pages.
The issue is a debate over what the expectations are for content posted on the public internet. One viewpoint is that it should be totally machine-operable and programmatic, that if you want it to be private you should gate it behind authentication, and that the semantic web is an important concept whose violation is a breach of protocol. The other is that it's your content, no one has a right to it, and you should be able to license its use any way you want. There is a trade-off between the implications of the two.
Otherwise you can whitelist a specific crawler in robots.txt
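A minimal sketch of that (crawler name hypothetical; an empty Disallow means that agent may fetch everything):

    # allow only the named crawler; block everyone else
    User-agent: GoodBot
    Disallow:

    User-agent: *
    Disallow: /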
This has nothing to do with keeping my webserver from crashing, and has more to do with crawlers using content to train AI.
Anything I actually want to keep as a legacy, I’ll store with permanent.org
Just say you won't honor it and move on.
When it was introduced, the web was largely a collaborative project within the academic realm. An approach based on the honor system worked for the most part.
These days the web is adversarial through and through. A robots.txt file seems like an anachronistic, almost quaint museum piece, reminding us of what once was, while we plunge headfirst into tech feudalism.
The horrors of the 1990s internet are quaint by comparison to the society-level problems we have now.
I know that ignoring a robots.txt file doesn't carry the same legal consequences as trespassing on physical land, but it's still going against the expressed wishes of the site owner.
Sure, you can argue that the site owner should restrict access using other gates, just as you might argue a land owner should put up a fence.
But isn't this a weird version of Chesterton's Fence, where a person decides that they will trespass beyond the fenced area because they can see no reason why the area should be fenced?
It's a bit like going to a clothing optional beach with a big camera and taking a bunch of photos. Is what you're doing legal? In most countries, yes. Are you an asshole for doing it? Also yes.
The people who wouldn't do it don't need the sign; the people who want to do it will do it anyway.
If you don't want crawling, there are other ways to prevent / slow down crawling than asking nicely.
You’re welcome to ride if you obey the rules of carriage.
Don’t make me tap the sign.
yeah, the fact that it is actually useful for blocking crawlers is kind of a misleading thing. it's called "robots.txt", it's there to help the robots, not to block them. you use it to help a robot crawl your site more efficiently, and tell them what not to bother looking at so they don't waste their time.
people seem to have forgotten really quickly that making your website as accessible as possible to crawlers was actually considered a good thing, and there was a whole industry around optimizing websites for search engine crawlers.
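the classic shape of that is just a hint file, something like this (paths and sitemap URL purely illustrative):

    # skip the stuff that isn't worth indexing, and here's a map of what is
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Sitemap: https://example.com/sitemap.xml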
for the rest of the net, ROBOTS.TXT is still often used for limiting the blast radius of search engines and bot crawl-delays and other "we know you're going to download this, please respect these provisions" type situations, as a sort of gentlemen's agreement. the site operator won't blackhole your net-ranges if you abide their terms. that's a reasonably useful thing to have.
robots.txt is helping you identify which parts of the website the author believes are of interest for search indexing or AI training or whatever.
fetching robots.txt and behaving in a conforming manner can open doors for you. If I spot a bot like that in my logs, I might whitelist them, and feed them a different robots.txt.
ROBOTS.TXT is a suicide note - https://news.ycombinator.com/item?id=13376870 - Jan 2017 (30 comments)
Robots.txt is a suicide note - https://news.ycombinator.com/item?id=2531219 - May 2011 (91 comments)
> What this situation does, in fact, is cause many more problems than it solves - catastrophic failures on a website are ensured total destruction with the addition of ROBOTS.TXT.
Of course an archival pedant [1] will tell you it's a bad idea (because it makes their archival process less effective)—but this is one of those "maybe you should think for yourself and not just implement what some rando says on the internet" moments.
If you're using version control, running backups, and not treating your production env like a home computer (i.e., you're aware of the ephemeral nature of a disk on a VPS), you're fine.
[1] Archivists are great (and should be supported), but when you turn it into a crusade, you get foolish, generalized takes like this wiki.
————
We have a faceted search that creates billions of unique URLs by combinations of the facets. As such, we block all crawlers from it in robots.txt, which saves us AND them from a bunch of pointless indexing load. But a stealth bot has been crawling all these URLs for weeks. Thus wasting a shitload of our resources AND a shitload of their resources too. Whoever it is, they thought they were being so clever by ignoring our robots.txt. Instead they have been wasting money for weeks. Our block was there for a reason.
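For reference, a rule of roughly this shape is all that work takes (paths illustrative, not our real ones; the * wildcard is a de facto extension that the major crawlers honor):

    # keep compliant crawlers out of the faceted-search URL space
    User-agent: *
    Disallow: /search
    Disallow: /*?*facet=
    Disallow: /*?*sort=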
I guess another angle on this is putting trust in people to comply with ROBOTS.txt. There is no guarantee, so we should probably design with the assumption that our sites will be crawled however people want.
Also, I'm curious about your use case.
>We have a faceted search that creates billions of unique URLs by combinations of the facets.
Are we talking about a search that has filters like (to use ecommerce as an example) brand, price range, color, etc., and all these combinations make up a URL (hence billions)? How does a crawler discover these? Are they just designed to detect all these filters and try all the combinations? That doesn't really jibe with my understanding of crawlers, but otherwise IDK how it would be generating billions of unique URLs. I guess maybe they could also be included in sitemaps, but I doubt that.
In the past, we've used this behavior as a signal to identify and block bad bots. These days, they will try again from 2000 separate residential IPs before they give up. But for a long time, egregious duplicate page views of these faceted pages (against the advice of robots.txt) made detecting certain bad bots much easier.
But that doesn't mean there aren't bad players that ignore robots.txt, send random user agent strings, or connect from IPs all over the world to avoid being blocked.
LLMs have changed the landscape a bit, mostly because far more players want to grab everything, or have automated tools that search your information in response to specific requests. But that doesn't mean well-behaved players no longer exist.
See https://news.ycombinator.com/item?id=43476337 for a random example of a discussion about this.
My personal position is that robots.txt is useless when faced with companies who have no sense of shame about abusing the resources of others. And if it is useless, there isn't much of a point in having it. Just make sure that nothing public facing is going to be too expensive for your server. But that's like saying that the solution to thieves is to not carry money around. Yes, it is a reasonable precaution. But it doesn't leave me feeling any better about the thieves.
To me, robots.txt is a friendly way to say, "Hey bots, this is what I allow. Stay in these lanes including crawl-delay and I won't block you." Step outside and I can put you on an exercise wheel. I know very few support crawl-delay but that is not my problem. Blocking bots or making them waste a lot of cycles or get dummy data or wildly reordering packets or adding random packet loss or slowing them to 2KB/s is more fun for me than playing Doom.
[1] - https://www.youtube.com/watch?v=B4zwh26kP8o [video][2 mins]
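Spelled out, those lanes amount to a few lines like the following (values illustrative; as said, only a few crawlers actually honor Crawl-delay):

    # pace yourself and stay out of the expensive endpoints
    User-agent: *
    Crawl-delay: 10
    Disallow: /search/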
Counter-point: I have a blog I don't want to appear on search engines because it has private stuff on it. 25 years ago I added two lines to its robots.txt file, and I've never seen it show up on any search engine since.
I'm not pretending nobody has indexed my blog and kept a copy of the results. I'm just saying the blog I started in college doesn't show up when you search for my name on Google, which is all I care about.
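Presumably those two lines were the classic blanket disallow, which compliant crawlers have honored for decades:

    User-agent: *
    Disallow: /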
1. You (bot) are wasting your bandwidth, CPU, storage on a literally unbounded set of pages
2. This may or may not cause resource problems for the owner of the site (e.g. Suppose they use Algolia to power search and you search for 10,000,000 different search terms... and Algolia charges them by volume of searches.)
The author of this angry rant really seems specifically ticked at some perceived 'bad actor' who is using robots.txt as an attempt to "block people from getting at stuff" but it's super misguided in that it ignores an entire purpose of robots.txt that is not even necessarily adversarial to the "robot."
This whole thing could have been a single sentence: "Robots.txt has a few competing vague interpretations and is voluntary; not all bots obey it, so if you're fully relying on it to prevent a site from being archived, that won't work."
That has been one of the biggest uses -- improve SEO by preventing web crawlers from getting lost/confused in a maze of irrelevant content.