And there are so many such idiots that they're overwhelming their servers?
Something doesn't math here.
As an SRE, I'd say the only legitimate concern here is bandwidth cost. But QoS tuning should solve that too.
Supposedly technical people crying out for a journalist to help them is super lame. Everything about this looks super lame.
Every bot is doing something on behalf of a human. Now that LLMs can churn out half-assed bot scripts, every "look, I installed Arch Linux and ohmyzsh" script kiddie has bots too.
Bots aren't going anywhere.
"Use the web the way it was over 10 years ago plox" isn't going to do it.
The scrapers try hard to make themselves look like valid browsers, sending requests via residential IP addresses (400,000+ IPs at last count).
I reached out to journalists because, despite strong technical measures, the abuse will not go away on its own.
Bender•1w ago
On a separate note, have tcpdump captures been done on these excessive connections? Minus the IP, what do their SYN packets look like? Minus the IP, what do the corresponding log entries look like in the web server? Are they using HTTP/1.1 or HTTP/2.0? Are they missing any headers you'd expect from a real person, such as Sec-Fetch-Mode (cors, no-cors, navigate) or Accept-Language?
Is there someone at OpenStreetMap that can answer these questions?
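As an illustration of the kind of header check being asked about here — a minimal sketch in Python, assuming a tab-separated access log that already records the relevant request headers (the field names, delimiter, and HTTP/1.x heuristic are assumptions, not OSM's actual logging setup):

```python
# Count requests that are missing browser-typical headers or still speak HTTP/1.x.
# A flagged request is only "suspicious", not proof of a bot.
import csv
import sys
from collections import Counter

EXPECTED = ("accept_language", "sec_fetch_mode", "sec_fetch_site")  # hypothetical field names

def looks_suspicious(row: dict) -> bool:
    missing = [h for h in EXPECTED if not row.get(h)]
    old_protocol = row.get("protocol", "").startswith("HTTP/1.")
    return bool(missing) or old_protocol

counts = Counter()
with open(sys.argv[1], newline="") as log:
    for row in csv.DictReader(log, delimiter="\t"):
        counts["suspicious" if looks_suspicious(row) else "ok"] += 1

print(dict(counts))
```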
Firefishy•1w ago
Technically we are able to block and restrict the scrapers after the initial request from an IP. We've seen 400,000 IPs in the last 24 hours, and each IP only makes a few requests. Most are not very good at faking browsers, but they are getting better (HTTP/1.1 vs HTTP/2, obviously faked headers, etc.).
The problem has been going on for over a year now. It isn't going away. We need journalists and others to help us push back.
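A rough way to see why per-IP blocking stops working at that scale — a sketch that tallies requests per client IP, assuming a plain access log whose first whitespace-separated field is the IP (the log path and the 5-request threshold are assumptions):

```python
# Measure the "huge number of IPs, only a few requests each" pattern.
import sys
from collections import Counter

per_ip = Counter()
with open(sys.argv[1]) as log:
    for line in log:
        if not line.strip():
            continue
        per_ip[line.split(maxsplit=1)[0]] += 1

low_volume = sum(1 for hits in per_ip.values() if hits <= 5)
print(f"{len(per_ip)} distinct IPs, {sum(per_ip.values())} requests, "
      f"{low_volume} IPs made 5 or fewer requests")
# When most of the traffic sits below any sane per-IP threshold, rate limiting
# by IP barely helps; blocking has to key on request fingerprints instead.
```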
Bender•1w ago
I think your only hope is to find subtle differences between them and real, legitimate users; change how your site works so that bots have to authenticate unless they come from a whitelisted IP/CIDR; or put your site behind something that spots the bots for you. Beyond that, all anyone can do is beef up their infrastructure to handle much more than the bots can dish out.
Have you tried silly-simple things like hidden JavaScript puzzles the browser has to solve?
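One shape such a puzzle can take is a small proof-of-work challenge: the server hands out a signed nonce, client-side JavaScript grinds out an answer, and the server verifies it before serving anything heavy. A standard-library sketch of the server side — the difficulty, TTL, and secret handling are illustrative assumptions, not a hardened design:

```python
# Issue and verify a salted proof-of-work challenge.
import hashlib
import hmac
import os
import secrets
import time

SECRET = os.environ.get("CHALLENGE_SECRET", "change-me").encode()
DIFFICULTY = 4       # required number of leading hex zeros in the solution hash
TTL_SECONDS = 300    # challenge expires after five minutes

def issue_challenge() -> dict:
    """Return a challenge for the client-side script to solve."""
    nonce = secrets.token_hex(16)
    issued = str(int(time.time()))
    sig = hmac.new(SECRET, f"{nonce}:{issued}".encode(), hashlib.sha256).hexdigest()
    return {"nonce": nonce, "issued": issued, "sig": sig, "difficulty": DIFFICULTY}

def verify(nonce: str, issued: str, sig: str, answer: str) -> bool:
    """Check the signature, the expiry, and the proof-of-work answer."""
    expected = hmac.new(SECRET, f"{nonce}:{issued}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        return False
    try:
        issued_at = int(issued)
    except ValueError:
        return False
    if time.time() - issued_at > TTL_SECONDS:
        return False
    digest = hashlib.sha256(f"{nonce}:{answer}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)
```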
Kodiack•4d ago
It's been getting worse over the past year, with the past few weeks in particular seeing a massive change literally overnight. I had to aggressively tune my WAF rules just to get things under control. With Cloudflare I'm now issuing browser challenges to any browser that looks even slightly suspicious, and the pass rate is currently below 0.5%. For my users' sake, a successful browser challenge stays "valid" for over a month, but this still feels like another thing that'll eventually be bypassed.
I'd be keen to know if you've found any other effective ways of mitigating these most recent aggressive scraping requests. Even a simple "yes" or "no" would be appreciated; I think it's fair to be apprehensive about sharing some specific details publicly since even a lot of folks here on HN seem to think it's their right to scrape content with orders of magnitude higher throughput than all users combined.
I really don't know how this is sustainable long-term. It's eaten up quite a lot of my personal time and effort just for the sake of a hobby that I otherwise greatly enjoy.
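On the month-long challenge validity: independent of any particular CDN, the general shape is a signed clearance token remembered after a passed challenge so real users aren't re-challenged constantly. A sketch, where the 35-day lifetime, names, and secret handling are assumptions rather than Cloudflare's actual mechanism:

```python
# Mint and check a long-lived "challenge passed" token.
import hashlib
import hmac
import os
import time

SECRET = os.environ.get("CLEARANCE_SECRET", "change-me").encode()
CLEARANCE_DAYS = 35  # roughly the "valid for over a month" window mentioned above

def mint_clearance(client_key: str) -> str:
    """Issue a token after a passed challenge; client_key might be a session id."""
    expires = int(time.time()) + CLEARANCE_DAYS * 86400
    payload = f"{client_key}:{expires}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def clearance_valid(token: str, client_key: str) -> bool:
    """Skip the challenge while the token is fresh and correctly signed."""
    try:
        key, expires, sig = token.rsplit(":", 2)
        expires_at = int(expires)
    except ValueError:
        return False
    expected = hmac.new(SECRET, f"{key}:{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig) and key == client_key and time.time() < expires_at
```

The trade-off is the one described above: the longer the clearance lives, the less friction for real users and the bigger the window for a scraper that manages to pass a challenge once.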