I'd sacrifice two CPU cores for this just to make their life awful.
I would make a list of words from each word class, and a list of sentence structures where each item is a word class. Pick a pseudo-random sentence; for each word class in the sentence, pick a pseudo-random word; output; repeat. That should be pretty simple and fast.
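A rough sketch of that approach (the word lists, templates, and per-request seeding are invented for illustration):

import random

# Tiny made-up vocabulary and sentence templates; a real deployment would use
# much larger lists so the output doesn't repeat obviously.
WORDS = {
    "adj":  ["shiny", "recursive", "damp", "obsolete"],
    "noun": ["teapot", "crawler", "archive", "lighthouse"],
    "verb": ["devours", "indexes", "ignores", "polishes"],
}
TEMPLATES = [
    ["adj", "noun", "verb", "adj", "noun"],
    ["noun", "verb", "noun"],
    ["adj", "noun", "verb"],
]

def babble(rng: random.Random, sentences: int = 5) -> str:
    out = []
    for _ in range(sentences):
        template = rng.choice(TEMPLATES)
        words = [rng.choice(WORDS[cls]) for cls in template]
        out.append(" ".join(words).capitalize() + ".")
    return " ".join(out)

# Seeding per URL keeps each fake page stable across repeated requests.
print(babble(random.Random("/some/request/path"), sentences=3))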
I think the most important thing, though, is to add delays when serving these requests. The purpose is to slow the scrapers down, not to induce extra demand on your garbage well.
> I came to the conclusion that running this can be risky for your website. The main risk is that despite correctly using robots.txt, nofollow, and noindex rules, there's still a chance that Googlebot or other search engines' scrapers will scrape the wrong endpoint and determine you're spamming.
RewriteEngine On
# Block requests that reference .php anywhere (path, query, or encoded)
RewriteCond %{REQUEST_URI} (\.php|%2ephp|%2e%70%68%70) [NC,OR]
RewriteCond %{QUERY_STRING} \.php [NC,OR]
RewriteCond %{THE_REQUEST} \.php [NC]
RewriteRule .* - [F,L]
Notes: there's no PHP on my servers, so if someone asks for it, they are one of the "bad boys" IMHO. Your mileage may differ.

# Nothing to hack around here, I’m just a teapot:
location ~* \.(?:php|aspx?|jsp|dll|sql|bak)$ {
return 418;
}
error_page 418 /418.html;
No hard block; instead, reply to bots with the funny HTTP 418 code (https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...). That makes filtering logs easier.

Live example: https://FreeSolitaire.win/wp-login.php (NB: /wp-login.php is the WordPress login URL, and it’s commonly blindly requested by bots searching for weak WordPress installs.)
> You have an image on your error page, which some crappy bots will download over and over again.
Most bots won’t download subresources (almost none of them do, actually). The HTML page itself is lean (475 bytes); the image is an Easter egg for humans ;-) Moreover, I use a caching CDN (Cloudflare).
It'd be better if the scraper were left waiting for a packet that'll never arrive (until it times out, obviously).
The LB will see the unanswered requests and think your webserver is failing.
Ideal would be to respond at the webserver and let the LB drop the response.
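A bare-bones sketch of the "leave them hanging" idea (this assumes you run it directly on a port with no health-checking load balancer in front, per the caveat above; the port and header names are arbitrary):

import socket
import threading
import time

def tarpit(conn: socket.socket) -> None:
    try:
        conn.sendall(b"HTTP/1.1 200 OK\r\n")
        while True:                          # the header block never ends
            time.sleep(10)
            conn.sendall(b"X-Please-Hold: 1\r\n")
    except OSError:
        pass                                 # client finally gave up
    finally:
        conn.close()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", 8080))
srv.listen(64)
while True:
    conn, _ = srv.accept()
    threading.Thread(target=tarpit, args=(conn,), daemon=True).start()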
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
https://maurycyz.com/misc/the_cost_of_trash/#:~:text=throw%2...
Otherwise you can also chain compression methods, like "Content-Encoding: gzip, gzip".
With toxic AI scrapers like Perplexity moving more and more to headless web browsers to bypass bot blocks, I think a brotli bomb (100GB of \0 can be compressed to about 78KiB with Brotli) would be quite effective.
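A minimal sketch of pre-building such a bomb, using only Python's standard zlib (so it's a gzip stream rather than Brotli; swap in a Brotli encoder for the ratios mentioned above — the file name and sizes here are arbitrary):

import zlib

CHUNK = b"\0" * (1024 * 1024)     # 1 MiB of zeros per iteration
TOTAL_MIB = 10 * 1024             # ~10 GiB uncompressed; scale to taste

# wbits = 16 + MAX_WBITS makes zlib emit a gzip container.
comp = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
with open("bomb.gz", "wb") as f:
    for _ in range(TOTAL_MIB):
        f.write(comp.compress(CHUNK))
    f.write(comp.flush())

# Serve the file with "Content-Encoding: gzip" so clients that honour the
# header inflate it on receipt.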
I have a web crawler with both a scraping byte limit and a timeout, so zip bombs don't bother me much.
https://github.com/rumca-js/crawler-buddy
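A sketch of that byte-limit-plus-timeout defence (not taken from the linked code; the limits and helper name are illustrative):

import requests

MAX_BYTES = 2 * 1024 * 1024          # stop after 2 MiB of (decompressed) body

def fetch_capped(url: str) -> bytes:
    body = b""
    # stream=True lets us stop reading as soon as the cap is hit; the timeout
    # covers connect and read stalls.
    with requests.get(url, stream=True, timeout=10) as resp:
        for chunk in resp.iter_content(chunk_size=65536):
            body += chunk
            if len(body) > MAX_BYTES:
                break                # a zip bomb only wastes its own bandwidth
    return body[:MAX_BYTES]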
I think garbage blabber would be more effective.
AI scrapers will plagiarise your work and bring you zero traffic.
So to continue your analogy, I made my part of the beach accessible for visitors to enjoy, but certain people think they can carry it away for their own purposes ...
The line is between "I am technically able to do this" and "I am engaging with this system in good faith".
Public parks are just there and I can technically drive up and dump rubbish there and if they didn't want me to they should have installed a gate and sold tickets.
Many scrapers these days are sort of equivalent in that analogy to people starting entire fleets of waste disposal vehicles that all drive to parks to unload, putting strain on park operations and making the parks a less tenable service in general.
This is where the line should be, always. But in practice this criterion is applied very selectively here on HN and elsewhere.
After all: What is ad blocking, other than direct subversion of the site owner's clear intention to make money from the viewer's attention?
Applying your criterion here gives a very simple conclusion: If you don't want to watch the ads, don't visit the site.
Right?
Does anyone have a counterargument?
Not that two wrongs make a right, and it's definitely a bit of an argument of convenience for people who find adverts annoying. But I think most people are less opposed to the idea of advertising as popularly imagined (i.e. paper-newspaper-style, where you just see an advert) to support their favourite blog than they are to the current web advertising model, where just by viewing the advert you have an unspecified amount of information instantly stolen and sent off to a bunch of shady companies who process it and sell it on, with no way to veto it before loading the website and having the damage done.
To stretch the park analogy it might be that the park sells a licence to a company to make some cash from advertising to its visitors, which it kind of expects to be things like adverts on the benches and so on. That company then starts photographing people from the bushes, recording conversations and putting Airtags in visitors' pockets to boost the profits it makes itself. Visitors then start wearing masks, stop talking and wear clothes with zipped pockets. You can say the visitors are wrong to violate the implicit park usage agreement that they submit to the surveillance to fund the park (and advertising company), or you can say that the company is wrong to expand the original license to advertise into an invasion of privacy without even telling the visitors what they were going to do before they entered, or, indeed, during or after.
These scrapers drown peoples' servers in requests, taking up literally all the resources and driving up cost.
About 10-15 years ago, the scourge I was fighting was social media monitoring services, companies paid by big brands to watch sentiment across forums and other online communities. I was running a very popular and completely free (and ad-free) discussion forum in my spare time, and their scraping was irritating for two reasons. First, they were monetising my community when I wasn’t. Second, their crawlers would hit the servers as hard as they could, creating real load issues. I kept having to beg our hosting sponsor for more capacity.
Once I figured out what was happening, I blocked their user agent. Within a week they were scraping with a generic one. I blocked their IP range; a week later they were back on a different range. So I built a filter that would pseudo-randomly[0] inject company names[1] into forum posts. Then any time I re-identified[2] their bot, I enabled that filter for their requests.
The scraping stopped within two days and never came back.
--
[0] Random but deterministic based on post ID, so the injected text stayed consistent.
[1] I collated a list of around 100 major consumer brands, plus every company name the monitoring services proudly listed as clients on their own websites.
[2] This was back around 2009 or so, so things weren't nearly as sophisticated as they are today, both in terms of bots and anti-bot strategies. One of the most effective tools I remember deploying back then was analysis of all HTTP headers. Bots would spoof a browser UA, but almost none would get the full header set right: things like Accept-Encoding or Accept-Language were either absent, or static strings that didn't exactly match what a real browser would ever send.
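A minimal sketch of the mechanism in footnote [0] (the brand list, injection rate, and function name are stand-ins, not the original forum code):

import random

BRANDS = ["Coca-Cola", "Nike", "Samsung"]    # stand-ins for the real client list

def inject_brands(post_id: int, text: str, rate: float = 0.05) -> str:
    rng = random.Random(post_id)             # deterministic: same post, same output
    out = []
    for word in text.split():
        out.append(word)
        if rng.random() < rate:
            out.append(rng.choice(BRANDS))
    return " ".join(out)

Apply something like this only to requests matching the bot fingerprint, so real visitors never see the doctored posts.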
I suppose it could have an impact if 30% of all, say, Coca Cola mentions on the web came from that site, but then it would have to be a very big site. I don't think the bot company would notice, let alone care, if it was 0.01% of the mentions.
I remember that years ago (2008?) I worked at a company where every mention of it was manually reviewed by someone from the PR department. I imagine the tools are even better now.
A separate issue is that the discussion is often very low quality (forums died for multiple reasons, and Reddit is dying too: it's astroturf galore now).
They would have received multiple complaints about it from customers, performed an investigation, and ultimately performed a manual excision of the junk data from their system, both the raw scrapes and anywhere it had been ingested and processed. This was probably a simple operation, but it might not have been if their architecture didn’t account for this vulnerability.
Config snippet for anyone interested:
# Block clients whose User-Agent claims Chrome but that send no
# Accept-Language header (real browsers virtually always send one).
set $block "";
if ($http_user_agent ~* "Chrome/\d{2,3}\.\d+\.\d{2,}\.\d{2,}") {
    set $block 1;
}
if ($http_accept_language = "") {
    set $block "${block}1";
}
if ($block = "11") {
    return 403;
}

For those adversaries, you need to work out a careful balance between deterrence, solving problems (e.g. resource abuse), and your desire to “win”. In extreme cases your best strategy is for your filter to “work” but be broken in hard-to-detect ways. For example, showing all but the most valuable content. Or spiking the data with just enough rubbish to diminish its value. Or having the content indexes return delayed/stale/incomplete data.
And whatever you do, don’t use incrementing integers. Ask me how I know.
Beyond that, look for how the bots are finding new URLs to probe, and don’t give them access to those lists/indexes. In particular, don’t forget about site maps. I use Cloudflare rules to restrict my site map to known bots only.
They discovered those URLs simply by parsing pages that contain like buttons. Those do have rel="nofollow" on them, and the URL pattern is disallowed in robots.txt, but I'd be surprised if that stopped someone who uses thousands of IPs to proxy their requests. I don't have a site map.
If, instead, you only act on a percentage of requests, you can add noise in an insidious way without signaling that you caught them. It will make their job of troubleshooting and crafting the next iteration much harder. Also, making the response less predictable is a good idea: throw different HTTP error codes, respond with somewhat inaccurate content, etc.
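A sketch of that "only degrade a fraction of requests, deterministically" idea (the thresholds, behaviours, and hashing scheme are invented for illustration):

import hashlib
import random

def degrade(client_ip: str, path: str, sample_rate: float = 0.2):
    # Deterministic per (ip, path) so a retried request looks consistent,
    # yet only a fraction of traffic ever gets the treatment.
    h = hashlib.sha256(f"{client_ip}:{path}".encode()).digest()
    if h[0] / 255 > sample_rate:
        return None                          # serve the real response
    # Map the selected requests onto one of several subtle failure modes.
    return random.Random(h).choice(["slow", "wrong_status", "stale_content"])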
I like how this kind of response is very difficult for them to detect when I turn it on, and as a bonus, it pollutes their data. They stopped trying a few days after that.
It’s also interesting that merchants (presumably) don’t have a mechanism to flag transactions as having a greater-than-zero chance of being suspect, or to waive any dispute rights on them.
As a merchant, it would be nice if you could demand the bank verify certain transactions with their customer. If I was a customer, I would want to know that someone tried to use my card numbers to donate to some death metal training school in the Netherlands.
I do wonder whether these people sold their list of "verified" credit card numbers to any criminal enterprises before they realized the data was poisoned. That would be potentially awkward for them.
That means you need to poison the data when you detect a bot.
When you get paid big bucks to make the world worse for everyone, it's really easy to forget the "little details".
It is completely different if I am hitting it looking for WordPress vulnerabilities or scraping content every minute for LLM training material.
The tech people are all turning against scraping, independent artists are now clamoring for brutal IP crackdowns and Disney-style copyright maximalism (which I never would've predicted just 5 years ago; that crowd used to be staunchly against such things), people everywhere want more attestation and elimination of anonymity now that it's effectively free to make a swarm of convincingly-human misinformation agents, etc.
It's making people worse.
It would be useful to ban the IP for a few hours so the bot cools down for a bit and moves on to the next domain.
The default ban for traffic detected by your crowdsec instance is 4 hours, so that concern isn't very relevant in that case.
The decisions from the Central API from other users can be quite a bit longer (I see some at ~6 days), but you also don't have to use those if you're worried about that scenario.
So would the natural strategy then be to flag some vulnerability of interest? Either one typically requiring more manual effort (waste their time), or one that is easily automated so as to trap the bot in a honeypot, i.e. "you got in, what next? oh, upload all your kit and show how you work? sure" (see: The Cuckoo's Egg).
This is done very efficiently. If you return anything unexpected, they’ll just drop you and move on.
.htaccess diverts suspicious paths (e.g., /.git, /wp-login) to decoy.php and forces decoy.zip downloads (10GB), so scanners hitting common “secret” files never touch real content and get stuck downloading a huge dummy archive.
decoy.php mimics whatever sensitive file was requested by endless streaming of fake config/log/SQL data, keeping bots busy while revealing nothing.
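The same decoy idea, sketched as a language-agnostic generator (the commenter's version is a decoy.php; the names and timings in this Python shape are illustrative):

import itertools
import random
import time

def fake_sql_dump():
    # Yield an endless, plausible-looking SQL dump, trickled out slowly.
    rng = random.Random()
    for i in itertools.count(1):
        email = f"user{i}@example.com"
        pw_hash = "".join(rng.choices("0123456789abcdef", k=32))
        yield f"INSERT INTO users VALUES ({i}, '{email}', '{pw_hash}');\n"
        time.sleep(0.2)    # keep the bot hanging on without burning bandwidth

Hook a generator like this up to a chunked HTTP response and the scanner sees an endless dump that never contains anything real.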
Now I target only the most aggressive bots with zipbombs and the rest get a 403. My new spam strategy seems to work, but I don't know if I should post it on HN again...
I don't know why people would assume these are AI/LLM scrapers seeking PHP source code on random servers(!) short of it being related to this brainless "AI is stealing all the data" nonsense that has infected the minds of many people here.
What you have here is quite close to a honeypot; sadly, I don't see an easy way to counter-abuse such bots. If the attack doesn't go according to their script, they move on.
As for battles of efficiency, generating 4 kB of bullshit PHP is harder than running a regex.