The data transfer, on the other hand, could be substantial and costly. Is it known whether these crawlers respect caching at all? Do they send If-Modified-Since/If-None-Match or anything like that?
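To make the question concrete, "respecting caching" would mean saving the ETag/Last-Modified from the previous fetch and sending them back as conditional headers, so an unchanged page costs almost nothing to re-check. A minimal sketch of that (the function and its parameters are just illustrative):

import requests

def fetch_if_changed(url, etag=None, last_modified=None):
    # Send conditional headers saved from the previous fetch of this URL.
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        # Not modified: the server sends no body, so almost no transfer cost.
        return None, etag, last_modified
    return resp.text, resp.headers.get("ETag"), resp.headers.get("Last-Modified")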
e.g. they'll take an entire sentence the user said and put it in quotes for no reason.
Thankfully search engines started ignoring quotes years ago, so it balances out...
I run a very small browser game (~120 weekly users currently), and until I put its Wiki (utterly uninteresting to anyone who doesn't already play the game) behind a login-wall, the bots were causing massive amounts of spurious traffic. Due to some of the Wiki's data coming live from the game through external data feeds, the deluge of bots actually managed to crash the game several times, necessitating a restart of the MariaDB process.
Anubis's creator says the same thing:
> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.
To stop malicious bots like this, Cloudflare is a great solution if you don't mind using it (you can enable a basic browser check for all users and all pages, or write custom rules to only serve a check to certain users or on certain pages). If you're not a fan of Cloudflare, Anubis works well enough for now if you don't mind the branding.
Here's the Cloudflare rule I currently use (the vast majority of bot traffic originates from these countries):
ip.src.continent in {"AF" "SA"} or
ip.src.country in {"CN" "HK" "SG"} or
ip.src.country in {"AE" "AO" "AR" "AZ" "BD" "BR" "CL" "CO" "DZ" "EC" "EG" "ET" "ID" "IL" "IN" "IQ" "JM" "JO" "KE" "KZ" "LB" "MA" "MX" "NP" "OM" "PE" "PK" "PS" "PY" "SA" "TN" "TR" "TT" "UA" "UY" "UZ" "VE" "VN" "ZA"} or
ip.src.asnum in {28573 45899 55836}
All this effort to block AI scrapers would not be necessary if they respected rate limits, knew how to back off when a server starts responding too slowly, or cached frequently visited sites. Instead, some of these companies will do everything, including routing through residential ISPs, to ensure they can piledrive the website of some poor dude who's just really into lawnmowers, or the git repo of some open source developer who just wants to share their work.
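For reference, backing off is not complicated; here's a minimal sketch of a fetcher that slows down on 429/5xx responses or slow replies (the thresholds and delays are made-up, not anyone's actual policy):

import time
import requests

def polite_get(url, max_retries=5):
    delay = 1.0
    for _ in range(max_retries):
        start = time.monotonic()
        resp = requests.get(url, timeout=30)
        elapsed = time.monotonic() - start
        if resp.status_code == 429 or resp.status_code >= 500:
            # Honour Retry-After if the server sent one, otherwise double the delay.
            retry_after = resp.headers.get("Retry-After", "")
            delay = float(retry_after) if retry_after.isdigit() else delay * 2
        elif elapsed > 5:
            # The server answered, but slowly: slow down anyway.
            delay *= 2
        else:
            return resp
        time.sleep(delay)
    return None  # give up instead of hammering the server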
Very few people would actually be against AI crawlers if they showed just the tiniest amount of respect, but they don't. I think Drew DeVault said it best: "Please stop externalizing your costs directly into my face".
> "I don't know what this actually gives people, but our industry takes great pride in doing this"
> "unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance that can produce output that superficially resembles the output of human employees"
> "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming."
<3 <3
I just set a rate limit in Cloudflare because no legitimate symbol server user will ever be excessive.
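The mechanism behind that kind of rate limit is essentially a per-client token bucket. Cloudflare's version is configured in their dashboard rather than written as code, so this is only an illustrative sketch with made-up numbers:

import time
from collections import defaultdict

RATE = 1.0    # tokens refilled per second
BURST = 30.0  # bucket capacity (maximum burst size)

buckets = defaultdict(lambda: (BURST, time.monotonic()))

def allow(client_ip):
    tokens, last = buckets[client_ip]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)
    if tokens < 1.0:
        buckets[client_ip] = (tokens, now)
        return False  # over the limit -> respond with 429
    buckets[client_ip] = (tokens - 1.0, now)
    return True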
Web crawlers have been around for years, but many of the current ones are more indiscriminate and less well behaved.
Is the reason these large companies don't care because they are large enough to hide behind a bunch of lawyers?
Maybe some civil lawsuit about terms of service? You'd have to prove that the scraper agreed to the terms of service. Perhaps in the future all CAPTCHAs come with a TOS click-through agreement? Or perhaps every free site will have a login wall?
AI crawlers were downloading whole pages and executing all the JavaScript tens of millions of times a day, hurting performance, filling logs, skewing analytics, and costing too much money in Google Maps loads.
Really disappointing.
I wish there were a better way to solve this.
AI is going to damage society not in fancy sci-fi ways but by centralizing profit made at the expense of everyone else on the internet, who are then forced to erect boundaries to protect themselves, worsening the experience for the rest of the public, who also have to pay higher electricity bills, because keeping humans warm is not as profitable as a machine that directly converts electricity into stock price rises.
bgwalter•3h ago
Any bot that answers daily political questions like Grok has many web accesses per prompt.
bgwalter•2h ago
I'm not sure how many websites are searched and discarded per query. Since it's the remote, proprietary LLM that initiates the search, I would hesitate to call them agents. Maybe "fetcher" is the best term.
danaris•1h ago
They're going out and scraping everything, so that when they're asked a question, they can pull a plausible answer from their dataset and summarize the page they found it on.
Even the ones that actively go out and search/scrape in response to queries aren't just scraping a single site. At best, they're scraping some subset of the entire internet that they have tagged as being somehow related to the query. So even if what they present to the user is a summary of a single webpage, that is rarely going to be the product of a single request to that single webpage. That request is going to be just one of many, most of which are entirely fruitless for that specific query: purely extra load for their servers, with no gain whatsoever.
snowwrestler•2h ago
The reason is that these user-driven requests look like the rest of web traffic because they reflect user interest. So those requests will mostly hit content that is already popular, and therefore well-cached.
Corpus-building crawlers do not reflect current user interest and try to hit every URL available. As a result, they hit URLs that are mostly uncached. That is a much heavier load.
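A toy way to see the difference (made-up numbers, nothing measured): replay popularity-skewed "user" traffic and a flat crawl against the same small LRU cache and compare hit rates.

import random
from collections import OrderedDict

URLS = list(range(100_000))
CACHE_SIZE = 5_000

def hit_rate(request_stream):
    cache = OrderedDict()
    hits = total = 0
    for url in request_stream:
        total += 1
        if url in cache:
            hits += 1
            cache.move_to_end(url)
        else:
            cache[url] = None
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)  # evict the least recently used URL
    return hits / total

# Users mostly re-request popular pages; a corpus crawler walks every URL once.
user_traffic = (int(random.paretovariate(1.2)) % len(URLS) for _ in range(200_000))
crawl_traffic = iter(URLS)

print("user-style hit rate:", hit_rate(user_traffic))   # typically high
print("crawler hit rate:   ", hit_rate(crawl_traffic))  # essentially zero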