E.g., they'll take an entire sentence the user said and put it in quotes for no reason.
Thankfully, search engines started ignoring quotes years ago, so it balances out...
I run a very small browser game (~120 weekly users currently), and until I put its Wiki (utterly uninteresting to anyone who doesn't already play the game) behind a login wall, the bots were causing massive amounts of spurious traffic. Because some of the Wiki's data is pulled live from the game through external data feeds, the deluge of bots actually managed to crash the game several times, each time necessitating a restart of the MariaDB process.
Anubis's creator says the same thing:
> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.
To stop malicious bots like this, Cloudflare is a great solution if you don't mind using it (you can enable a basic browser check for all users and all pages, or write custom rules to only serve a check to certain users or on certain pages). If you're not a fan of Cloudflare, Anubis works well enough for now if you don't mind the branding.
Here's the Cloudflare rule I currently use (the vast majority of bot traffic originates from these countries):
ip.src.continent in {"AF" "SA"} or
ip.src.country in {"CN" "HK" "SG"} or
ip.src.country in {"AE" "AO" "AR" "AZ" "BD" "BR" "CL" "CO" "DZ" "EC" "EG" "ET" "ID" "IL" "IN" "IQ" "JM" "JO" "KE" "KZ" "LB" "MA" "MX" "NP" "OM" "PE" "PK" "PS" "PY" "SA" "TN" "TR" "TT" "UA" "UY" "UZ" "VE" "VN" "ZA"} or
ip.src.asnum in {28573 45899 55836}
> "I don't know what this actually gives people, but our industry takes great pride in doing this"
> "unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance that can produce output that superficially resembles the output of human employees"
> "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming."
<3 <3
I just set a rate limit in Cloudflare, because no legitimate symbol server user will ever be excessive.
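For anyone wanting to do the same: a Cloudflare rate limiting rule is just a match expression plus a counting threshold, set in the dashboard. A rough conceptual sketch (the path and all the numbers here are hypothetical; tune them to your own traffic):

expression:  http.request.uri.path contains "/symbols/"
count by:    IP address, over a 60-second period
threshold:   100 requests per period
action:      block for 10 minutes

A legitimate user fetches a handful of symbol files per debugging session; only scrapers blow past a threshold like that.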
Is the reason these large companies don't care that they're large enough to hide behind a bunch of lawyers?
bgwalter•1h ago
Any bot that, like Grok, answers daily political questions makes many web accesses per prompt.
bgwalter•1h ago
I'm not sure how many websites are searched and discarded per query. Since it's the remote, proprietary LLM that initiates the search, I would hesitate to call them agents. Maybe "fetcher" is the best term.
danaris•27m ago
They're going out and scraping everything, so that when they're asked a question, they can pull a plausible answer from their dataset and summarize the page they found it on.
Even the ones that actively go out and search/scrape in response to queries aren't just scraping a single site. At best, they're scraping some subset of the entire internet that they have tagged as somehow related to the query. So even if what they present to the user is a summary of a single webpage, that summary is rarely going to be the product of a single request to that single webpage. That request is going to be just one of many, most of which are entirely fruitless for that specific query: purely extra load on the scraped sites' servers, with no gain whatsoever.
snowwrestler•1h ago
The reason is that user-triggered requests look like ordinary web traffic because they reflect user interest. Those requests mostly hit content that is already popular, and therefore already well cached.
Corpus-building crawlers do not reflect current user interest; they try to hit every URL available. As a result, they mostly hit URLs that are uncached, which is a much heavier load.
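You can see the asymmetry with a toy model: run heavy-tailed user traffic and exhaustive crawler traffic through the same LRU cache and compare hit rates. A minimal sketch (every size and distribution parameter below is invented for illustration):

import random
from collections import OrderedDict

N_URLS = 100_000      # distinct pages on the site
CACHE_SIZE = 1_000    # cache holds 1% of the pages
N_REQUESTS = 50_000

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def hit(self, key):
        if key in self.store:
            self.store.move_to_end(key)       # refresh recency
            return True
        self.store[key] = None                # cache on miss
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict least recently used
        return False

def hit_rate(requests):
    cache = LRUCache(CACHE_SIZE)
    return sum(cache.hit(url) for url in requests) / len(requests)

# Users: heavy-tailed popularity -- most requests go to a few hot pages.
users = [min(int(random.paretovariate(1.2)), N_URLS) for _ in range(N_REQUESTS)]

# Crawler: walks every URL exactly once -- no reuse, no locality.
crawler = list(range(N_REQUESTS))

print(f"user traffic hit rate:    {hit_rate(users):.0%}")    # near 100%
print(f"crawler traffic hit rate: {hit_rate(crawler):.0%}")  # 0%

Real caches and real traffic are messier, but that's the point: the crawler's hit rate is structurally zero, so every one of its requests lands on the origin.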