Google indexes in-country, as do a few other search engines.
Would recommend.
I looked at all the IP ranges delegated by APNIC, along with every local ISP that I could find, and unioned this with
https://lite.ip2location.com/australia-ip-address-ranges
So far I've not had any complaints, and I think I have most of them.
At some point in the future, I'll start including https://github.com/ebrasha/cidr-ip-ranges-by-country
Source: stopping attacks that involve thousands of IPs at my work.
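For anyone who wants to reproduce this: the APNIC delegated stats file is trivial to parse. A rough sketch in Python (the URL and pipe-delimited format are the published ones, but double-check before relying on them; note the ipv4 "value" field is an address count, not a prefix length):

    # Sketch: pull all AU IPv4 ranges from the APNIC delegated stats file
    # and expand each start/count pair into CIDR blocks.
    import ipaddress
    import urllib.request

    URL = "https://ftp.apnic.net/stats/apnic/delegated-apnic-latest"

    def au_ipv4_cidrs():
        with urllib.request.urlopen(URL) as resp:
            for line in resp.read().decode().splitlines():
                parts = line.split("|")
                # format: registry|cc|type|start|value|date|status
                if len(parts) >= 7 and parts[1] == "AU" and parts[2] == "ipv4":
                    first = ipaddress.IPv4Address(parts[3])
                    last = first + int(parts[4]) - 1   # value = address count
                    yield from ipaddress.summarize_address_range(first, last)

    for net in au_ipv4_cidrs():
        print(net)

Union the output with the other sources, deduplicate, and you have roughly the list I use.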
Are you really? How likely do you think it is for a legit customer/user to be on the same IP as a residential proxy? Sure, residential IPs get reused, but you can handle that by making the block last 6-8 hours, or a day or two.
My single-layer thought process:
If they're knowingly running a residential proxy then they'll likely know "the cost of doing business". If they're unknowingly running a residential proxy then blocking them might be a good way for them to find out they're unknowingly running a residential proxy and get their systems deloused.
Would it make sense to have a class of ISPs that didn't peer with these "bad" network participants?
Not sure what my point is here tbh. The internet sucks and I don't have a solution
It should be illegal, at least for companies that still charge me while I’m abroad and don’t offer me any other way of canceling service or getting support.
Say you whitelist an address/range and some systems detect "bad things". Now what? Do you remove that address/range from the whitelist? Do you distribute the removal to your peers? Do you communicate the removal to the owner of the unwhitelisted address/range? How does the owner communicate back that they've dealt with the issue? What if the owner of the range is a hosting provider that doesn't proactively control the content hosted, yet has robust anti-abuse mechanisms in place? And so on.
Whitelist-only is a huge can of worms, and whitelists work best with trusted partners you can maintain out-of-band communication with. Blacklists similarly work best with trusted partners, but for determining which addresses/ranges are more trouble than they are worth. And somewhere in the middle are grey-zone addresses, e.g. ranges assigned to ISPs with CGNATs: you just cannot reliably label an individual address, or even a range of addresses, as strictly troublesome or strictly trustworthy by default.
Implement blacklists for known bad actors, e.g. the whole of China and Russia, maybe even cloud providers. Implement whitelists for ranges you explicitly trust to have robust anti-abuse mechanisms, e.g. corporations with strictly internal hosts.
- Blacklisted IPs (Google Cloud, AWS, etc.): always blocked
- Untrusted IPs (residential IPs): given some leeway, but quickly hit 429s if they started querying too much
- Whitelisted IPs (IPv4 addresses legitimately shared by many people, i.e. anything behind a CGNAT; for example, my current data plan tells me my IP is from 5 states over)
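A rough sketch of that three-tier policy, for illustration only (the tier names, thresholds, and in-memory counters are all invented):

    # Sketch: tiered request policy. Blacklisted sources are refused
    # outright, untrusted IPs get a small budget before 429, and
    # whitelisted/CGNAT ranges get a much larger one, since many users
    # can share one address. All numbers are made up.
    import time
    from collections import defaultdict

    LIMITS = {"untrusted": 60, "whitelist": 6000}   # requests per hour
    seen = defaultdict(list)                        # ip -> request timestamps

    def decide(ip: str, tier: str) -> int:
        """Return an HTTP status: 200 allow, 403 block, 429 throttle."""
        if tier == "blacklist":
            return 403
        now = time.time()
        window = [t for t in seen[ip] if now - t < 3600]
        window.append(now)
        seen[ip] = window
        return 429 if len(window) > LIMITS[tier] else 200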
You can probably guess what happens next. Most scrapers were thrown out, but the largest ones just got a modem device farm and ate the cost. They successfully prevented most users from scraping locally, but were quickly beaten by companies profiting from scraping.
I think this was one of many bad decisions Pokémon Go made. Some casual players dropped because they didn't want to play without a map, while the hardcore players started paying for scraping, which hammered their servers even more.
The known good list is IPs and ranges I know are good. The known bad list is specific bad actors. The data center networks list is updated periodically based on a list of ASNs belonging to data centers.
There are a lot of problems with using ASNs, even for well-known data center operators. First, they update often. Second, they often include massive subnets like /13(!), which can apparently overlap with routes announced by other networks, causing false positives. Third, I had been merging networks (to avoid overlaps causing problems in nginx) with something like https://github.com/projectdiscovery/mapcidr, but found that it also created larger overlaps that introduced false positives from adjacent networks where some legitimate users apparently are. Lastly, I had seen suspicious traffic from data center operators like Cato Networks Ltd and Zscaler, which are enterprise security products that route clients through their clouds. Blocking those resulted in some angry users in places I didn't expect...
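If you only want the merge step to never widen coverage, the Python stdlib can do that part; a minimal sketch (the input filename is hypothetical, and v4/v6 have to be collapsed separately):

    # Sketch: merge CIDRs without over-aggregating. collapse_addresses
    # returns the minimal set of networks covering exactly the input, so
    # it can't swallow adjacent space the way lossy aggregation can.
    import ipaddress

    with open("datacenter-cidrs.txt") as f:          # hypothetical input file
        nets = [ipaddress.ip_network(l.strip()) for l in f if l.strip()]

    for version in (4, 6):                           # versions can't be mixed
        same = [n for n in nets if n.version == version]
        for net in ipaddress.collapse_addresses(same):
            print(net)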
And none of that accounts for the residential ISPs that bots use to appear like legitimate users: https://www.trendmicro.com/vinfo/us/security/news/vulnerabil....
It's not like we can capitalize on commerce in China anyway, so I think it's a fairly pragmatic approach.
If it works for my health insurance company, essentially all streaming services (including not even being able to cancel service from abroad), and many banks, it’ll work for you as well.
Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?
I was in the UK. I wanted to buy a movie ticket there. Fuck me, because I have an Austrian IP address, since modern mobile backends pass your traffic through your home mobile operator. So I tried to use a VPN. Fuck me, VPN endpoints are blocked too.
I wanted to buy a Belgian train ticket, still from home. Cloudflare: fuck me, because I'm too suspicious as a foreigner. It broke their whole API access, which their own site used.
I wanted to order something while I was in America at my friend's place. Fuck me, of course. Not just my IP was problematic, but my phone number too. And of course my bank card… and I just wanted to order a pizza.
The most annoying is when your fucking app is restricted to your stupid country, and I'm supposed to use it because it's a public transport app. Lovely.
And of course, there was that time when I moved to another country… pointless country restrictions everywhere… they really helped.
I remember the times when the saying was that the checkout process should be as frictionless as possible. That sentiment is long gone.
The blocks don't stay in place forever, just a few months.
I say that because I can't count how many times Google has taken me to a foreign site that either doesn't even ship to the US, or doesn't say one way or the other and treats me like a crazy person for asking.
The only way of communicating with such companies is chargebacks through my bank (which always at least has a phone number reachable from abroad), so I'd make sure to account for those.
Visa/Mastercard chargeback rules largely apply worldwide (with some regional exceptions, but much less than many banks would make you believe).
In my experience running rather low-traffic sites (thousands of hits a day), doing just that brought every single annoyance from thousands per day to zero.
Yes, people -can- easily get around it via various listed methods, but don't seem to actually do that unless you're a high value target.
Capitalism is a means to an end, and allowable business practices are a two-way street between corporations and consumers, mediated by regulatory bodies and consumer protection agencies, at least in most functioning democracies.
There are some that do not provide services in most countries, but Netflix, Disney, and Paramount are pretty much global operations.
HBO and Peacock might not be available in Europe, but I am guessing they are in Canada.
Funny to see how narrow a perspective some people have…
In several European countries, there is no HBO, since Sky has some kind of exclusive contract for their content there, and that's where I was accordingly unable to unsubscribe from a US HBO plan.
Netflix doesn't have this issue but I've seen services that seem to make it tough. Though sometimes that's just a phone call away.
Though OTOH, whining about this while knowing about VPNs, and then complaining on behalf of the theoretical persona who doesn't know about VPNs but has subscriptions to cancel and is allergic to phone calls or calling their bank... like, sure, they exist, but are we talking about any significant number of people here?
How so? By blocking my IP, they did not let me unsubscribe.
Instead of being able to access at least my account (if not the streaming service itself, which I get – copyright and all), I'd just see a full screen notice along the lines of "we are not available in your market, stay tuned".
Traffic should be "privatized" as much as possible between IPv6 addresses (because you still have "scanners" sweeping the whole internet all the time... "the nice guys scanning the whole internet for your protection"... never to sell any scan data, of course).
Public IP services are done for: it's going to be hell whatever you do.
The right answer seems to be significantly big "security and availability teams" plus open and super simple internet standards. Yep, the JavaScript internet has to go away, and the private app protocols have to as well. No more WHATWG cartel web engine, or the worst of all: closed network protocols for "apps".
And the most important: hardcore protocol simplicity that still does a good enough job. It's common sense, but the planned-obsolescence and kludgy-bloat lovers won't let you...
Re: China, their cloud services seem to stretch to Singapore and beyond. I had to blacklist all of Alibaba Cloud and Tencent, and the ASNs stretched well beyond PRC borders.
It wouldn't surprise me if this is related somehow. Maybe these are Indian corporations using a Seychelles offshore entity to do their scanning, because then they can offset the costs against their tax or something. Cyprus may have similar reasons. ISTR that Cyprus was revealed to be important in providing a storefront to Russia and Putin-related companies and oligarchs.[2]
So Seychelles may be India-related bots, and Cyprus Russia-related bots.
[1] https://taxjustice.net/faq/what-is-transfer-pricing/#:~:text...
[2] Yup. My memory originated in the "Panama Papers" leaks https://www.icij.org/investigations/cyprus-confidential/cypr...
So the Seychelles traffic is likely really disguised Chinese traffic.
[1] https://mybroadband.co.za/news/internet/350973-man-connected...
The explanation is that easy??
Soon: chineseplayer.io
It won't be all Chinese companies or people doing the scraping. It's well known that a lot of countries don't mind such traffic as long as it doesn't target themselves or, for the West, some allies.
Laws aren't the same everywhere, so companies can get away with behavior in one place that would seem almost criminal in another.
And what better place to put your scrapers than somewhere with no copyright.
Russia was the same, but since 2012 or so they changed their laws and a lot of traffic dropped off. Companies moved to small islands or small nation states (favoring them with their tax payouts; they don't mind if you bring money) or the few remaining places, like China, that don't care about copyright.
It's pretty hard to really get rid of such traffic. You can block stuff, but mostly that just changes the response your server gives; the flood is still knocking at the door.
I'd hope someday ISPs or the like get more creative, but maybe they don't have enough access. It's hard to do this without the right (creepy) level of access into the traffic, or without accidentally censoring the whole thing.
I am getting downvoted for saying I ban the whole of Singapore and China? Oh lord... OK. Please, all the downvoters, list your public-facing websites. I do not care if people from China cannot access my website. They are not the target audience, and they are free to use VPNs if they so wish, or Tor, or whatever works for them; I have not banned those yet. This is for my OWN PERSONAL SHITTY WEBSITE, inb4 you want to moderate the fuck out of what I can and cannot do on my own server(s). Frankly, fuck off, or be a hero and die a martyr. :D
HostPapa in the US seems to be becoming the new main issue (via what appears to be an "IP colocation service"... yes, you read that right).
In fact, I bet it would choke on a small amount of traffic from here considering it has a shitty vCPU with 512 MB RAM.
The internet has become a hostile place for any public server, and with the advent of ML tools, bots will make up far more than the current ~50% of all traffic. Captchas and bot detection are a losing strategy as bot behavior becomes more human-like.
Governments will inevitably enact privacy-infringing regulation to deal with this problem, but for sites that don't want to adopt such nonsense, allowlists are the only viable option.
I've been experimenting with a system where allowed users can create short-lived tokens via some out-of-band mechanism, which they can use on specific sites. A frontend gatekeeper then verifies the token, and if valid, opens up the required public ports specifically for the client's IP address, and redirects it to the service. The beauty of this system is that the service itself remains blocked at the network level from the world, and only allowed IP addresses are given access. The only publicly open port is the gatekeeper, which only accepts valid tokens, and can run from a separate machine or network. It also doesn't involve complex VPN or tunneling solutions, just a standard firewall.
This should work well for small personal sites, where initial connection latency isn't a concern, but obviously wouldn't scale well at larger scales without some rethinking. For my use case, it's good enough.
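A minimal sketch of such a gatekeeper, under heavy assumptions: HMAC-signed expiring tokens, an existing nftables set (here called allowed_clients) that the service's firewall rule matches on, and an invented redirect target; none of these names come from the parent comment.

    # Sketch: validate a short-lived HMAC token, then punch the client's IP
    # into an nftables set so the firewall (not the app) gates access.
    # Table/set names, port, and secret are all placeholders.
    import hashlib, hmac, subprocess, time
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    SECRET = b"change-me"            # shared with users out-of-band

    def token_valid(token: str) -> bool:
        try:
            expiry, sig = token.split(".")
            good = hmac.new(SECRET, expiry.encode(), hashlib.sha256).hexdigest()
            return hmac.compare_digest(sig, good) and time.time() < int(expiry)
        except ValueError:
            return False

    class Gate(BaseHTTPRequestHandler):
        def do_GET(self):
            qs = parse_qs(urlparse(self.path).query)
            ip = self.client_address[0]
            if token_valid(qs.get("token", [""])[0]):
                # Open the firewall for this IP, then hand off to the service
                subprocess.run(["nft", "add", "element", "inet", "filter",
                                "allowed_clients", "{", ip, "}"], check=False)
                self.send_response(302)
                self.send_header("Location", "https://service.example/")
                self.end_headers()
            else:
                self.send_response(403)
                self.end_headers()

    HTTPServer(("", 8080), Gate).serve_forever()

Token generation (the out-of-band part) is the mirror image: pick an expiry timestamp, HMAC it with the same secret, and hand out "expiry.signature".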
CloudFront is fairly good at marking whether someone is accessing from a data centre or a residential/commercial endpoint. It's not 100% accurate, and really bad actors can still use infected residential machines to proxy traffic, but this fix was simple and reduced the problem to a negligible level.
/128: single application
/64: single computer
/56: entire building
/48: entire (digital) neighborhood
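Which suggests IPv6 blocking should key on prefixes, not individual addresses. A tiny sketch, with the prefix length following the rough table above:

    # Sketch: key abuse counters/blocks by IPv6 prefix instead of the exact
    # address; a bot farm on one /64 then costs it the whole /64.
    import ipaddress

    def block_key(addr: str, prefixlen: int = 64) -> str:
        ip = ipaddress.ip_address(addr)
        if ip.version == 6:
            # Zero the host bits: one key per /64 (use 56 for "buildings")
            return str(ipaddress.ip_network(f"{ip}/{prefixlen}", strict=False))
        return str(ip)                    # IPv4: keep per-address keys

    print(block_key("2001:db8:abcd:12:aaaa::1"))   # -> 2001:db8:abcd:12::/64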
IP blocking is useless if your sources are hundreds of thousands of people worldwide just playing a "free" game on their phone that, once in a while on wifi, fetches some webpages in the background for the game publisher's scraping-as-a-service side revenue deal.
I can't believe the entitlement.
And no, I do not use those paid services, even though it would make it much easier.
If you feel like you need to do anything at all, I would suggest treating it like any other denial-of-service vulnerability: Fix your server or your application. I can handle 100k clients on a single box, which equates to north of 8 billion daily impressions, and so I am happy to ignore bots and identify them offline in a way that doesn't reveal my methodologies any further than I absolutely have to.
That's traffic I want to block, and that's behaviour that I want to punish / discourage. If a set of users get caught up in that, even when they've just been given recycled IP addresses, then there's more chance of bringing the shitty 'scraping as a service' behaviour to light, and hopefully disinfecting it.
(opinion coming from someone definitely NOT hosting public information that must be accessible by the common populace - that's an issue requiring more nuance, but luckily has public funding behind it to develop nuanced solutions - and can just block China and Russia if it's serving a common populace outside of China and Russia).
We have no Chinese users/customers, so in theory this does not affect business at all. Also, Russia is sanctioned and our Russian userbase does not actually live in Russia, so blocking Russia did not affect users at all.
I did a quick search and found a few databases but none of them looks like the obvious winner.
If your site is behind cloudflare, blocking/challenging by country is a built-in feature.
None of these are my main traffic drivers, just the main resource hogs, and the main reason my site turns slow (usually an AI crawler, Microsoft, or Facebook ignoring any common sense).
China and co. are only a very small portion of my malicious traffic, gladly. It's usually US companies, disrespecting my robots.txt and DNS rate limits, who cause me the most problems.
43.131.0.0/18 43.129.32.0/20 101.32.0.0/20 101.32.102.0/23 101.32.104.0/21 101.32.112.0/23 101.32.112.0/24 101.32.114.0/23 101.32.116.0/23 101.32.118.0/23 101.32.120.0/23 101.32.122.0/23 101.32.124.0/23 101.32.126.0/23 101.32.128.0/23 101.32.130.0/23 101.32.13.0/24 101.32.132.0/22 101.32.132.0/24 101.32.136.0/21 101.32.140.0/24 101.32.144.0/20 101.32.160.0/20 101.32.16.0/20 101.32.17.0/24 101.32.176.0/20 101.32.192.0/20 101.32.208.0/20 101.32.224.0/22 101.32.228.0/22 101.32.232.0/22 101.32.236.0/23 101.32.238.0/23 101.32.240.0/20 101.32.32.0/20 101.32.48.0/20 101.32.64.0/20 101.32.78.0/23 101.32.80.0/20 101.32.84.0/24 101.32.85.0/24 101.32.86.0/24 101.32.87.0/24 101.32.88.0/24 101.32.89.0/24 101.32.90.0/24 101.32.91.0/24 101.32.94.0/23 101.32.96.0/20 101.33.0.0/23 101.33.100.0/22 101.33.10.0/23 101.33.10.0/24 101.33.104.0/21 101.33.11.0/24 101.33.112.0/22 101.33.116.0/22 101.33.120.0/21 101.33.128.0/22 101.33.132.0/22 101.33.136.0/22 101.33.140.0/22 101.33.14.0/24 101.33.144.0/22 101.33.148.0/22 101.33.15.0/24 101.33.152.0/22 101.33.156.0/22 101.33.160.0/22 101.33.164.0/22 101.33.168.0/22 101.33.17.0/24 101.33.172.0/22 101.33.176.0/22 101.33.180.0/22 101.33.18.0/23 101.33.184.0/22 101.33.188.0/22 101.33.24.0/24 101.33.25.0/24 101.33.26.0/23 101.33.30.0/23 101.33.32.0/21 101.33.40.0/24 101.33.4.0/23 101.33.41.0/24 101.33.42.0/23 101.33.44.0/22 101.33.48.0/22 101.33.52.0/22 101.33.56.0/22 101.33.60.0/22 101.33.64.0/19 101.33.64.0/23 101.33.96.0/22 103.52.216.0/22 103.52.216.0/23 103.52.218.0/23 103.7.28.0/24 103.7.29.0/24 103.7.30.0/24 103.7.31.0/24 43.130.0.0/18 43.130.64.0/18 43.130.128.0/19 43.130.160.0/19 43.132.192.0/18 43.133.64.0/19 43.134.128.0/18 43.135.0.0/18 43.135.64.0/18 43.135.192.0/19 43.153.0.0/18 43.153.192.0/18 43.154.64.0/18 43.154.128.0/18 43.154.192.0/18 43.155.0.0/18 43.155.128.0/18 43.156.192.0/18 43.157.0.0/18 43.157.64.0/18 43.157.128.0/18 43.159.128.0/19 43.163.64.0/18 43.164.192.0/18 43.165.128.0/18 43.166.128.0/18 43.166.224.0/19 49.51.132.0/23 49.51.140.0/23 49.51.166.0/23 119.28.64.0/19 119.28.128.0/20 129.226.160.0/19 150.109.32.0/19 150.109.96.0/19 170.106.32.0/19 170.106.176.0/20
Here's a useful tool/site:
You can feed it an IP address to get an AS ("Autonomous System"), then ask it for all prefixes associated with that AS.
I fed it the first IP address from that list (43.131.0.0) and it showed me the same Tencent-owned AS132203, and it gives back all the prefixes they have here:
https://bgp.tools/as/132203#prefixes
(Looks like roguebloodrage might have missed at least the 1.12.x.x and 1.201.x.x prefixes?)
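If you'd rather script it than click around, RIPEstat's public API exposes the same AS-to-prefixes lookup (a real endpoint as far as I know, but check their usage policy):

    # Sketch: fetch announced prefixes for an ASN via RIPEstat, e.g.
    # Tencent's AS132203 from the comment above.
    import json
    import urllib.request

    def announced_prefixes(asn: int) -> list[str]:
        url = ("https://stat.ripe.net/data/announced-prefixes/data.json"
               f"?resource=AS{asn}")
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        return [p["prefix"] for p in data["data"]["prefixes"]]

    for prefix in announced_prefixes(132203):
        print(prefix)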
I started searching about how to do that after reading a RachelByTheBay post where she wrote:
Enough bad behavior from a host -> filter the host.
Enough bad hosts in a netblock -> filter the netblock.
Enough bad netblocks in an AS -> filter the AS. Think of it as an "AS death penalty", if you like.
(from the last part of https://rachelbythebay.com/w/2025/06/29/feedback/ )
e.g. chuck 'Tencent' into the text box and execute.
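A minimal sketch of Rachel's escalation ladder (thresholds invented; mapping an IP to its netblock and ASN is left to whatever data source you use, e.g. bgp.tools above):

    # Sketch: escalating filters from host to netblock to AS (the "AS death
    # penalty"). Thresholds are made up; persistence and decay are omitted.
    from collections import Counter

    HOST_LIMIT, NET_LIMIT, AS_LIMIT = 100, 1_000, 10_000
    hosts, nets, ases = Counter(), Counter(), Counter()

    def record_bad_behavior(ip: str, netblock: str, asn: int) -> list:
        """Count one bad event; return the levels that just tripped."""
        tripped = []
        for counter, key, limit in ((hosts, ip, HOST_LIMIT),
                                    (nets, netblock, NET_LIMIT),
                                    (ases, asn, AS_LIMIT)):
            counter[key] += 1
            if counter[key] == limit:
                tripped.append(key)
        return tripped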
You can blunt-instrument 403 geoblock entire countries if you want, or any user agent, or any netblock or ASN. It's entirely up to you; it's your own server and nobody will be legitimately mad at you.
You can rate limit IPs to x responses per day or per hour or per week, whatever you like.
This whole AI scraper panic is so incredibly overblown.
I’m currently working on a sniffer that tracks all inbound TCP connections and UDP/ICMP traffic and can trigger firewall rule addition/removal based on traffic attributes (such as firewalling or rate limiting all traffic from certain ASNs or countries) without actually having to be a reverse proxy in the HTTP flow. That way your in-kernel tables don’t need to be huge and they can just dynamically be adjusted from userspace in response to actual observed traffic.
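A rough sketch of that idea is already possible with off-the-shelf pieces (scapy for capture, nftables for enforcement; the set name and threshold are invented, and a real version would track far more state):

    # Sketch: watch TCP SYNs off the wire and firewall sources exceeding a
    # rate, without sitting in the HTTP path. Needs root; pip install scapy.
    import subprocess
    import time
    from collections import defaultdict
    from scapy.all import IP, TCP, sniff

    THRESHOLD = 200                       # SYNs per minute before blocking
    syns = defaultdict(list)

    def on_packet(pkt):
        if IP in pkt and TCP in pkt and pkt[TCP].flags & 0x02:   # SYN bit
            src, now = pkt[IP].src, time.time()
            recent = [t for t in syns[src] if now - t < 60]
            recent.append(now)
            syns[src] = recent
            if len(recent) == THRESHOLD:
                subprocess.run(["nft", "add", "element", "inet", "filter",
                                "banned", "{", src, "}"], check=False)

    sniff(filter="tcp[tcpflags] & tcp-syn != 0", prn=on_packet, store=False)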
The problem is that it's eating into people's costs. And if you're not concerned with money, I'm just asking: can you send me $50.00 USD?
Blocking IPs is much cheaper for the blocker.
# Note: default_type/alias aren't valid inside "if", and nginx won't serve
# a character device like /dev/zero, so shunt bad bots to a tarpit location
# that trickles a big junk file at 1 KB/s instead:
if ($http_user_agent ~* "BadBot") {
    rewrite ^ /tarpit last;
}

location = /tarpit {
    internal;                              # only reachable via the rewrite
    limit_rate 1k;
    default_type application/octet-stream;
    alias /srv/tarpit/junk.bin;            # any large junk file
}
Then turn the tables on them and make the Great Firewall do your job! Just choose a random snippet about illegal Chinese occupation of Tibet or human rights abuses of Uyghur people each time you generate a page and insert it as a breaker between paragraphs. This should get you blocked in no time :)
> Chuck 'Tencent' into the text box and execute.
I've been playing cat and mouse trying to block them for the past week and here are a couple of observations/ideas, in case this is helpful to someone:
* As mentioned above, the bulk of the traffic comes from a large number of IPs, each issuing only a few requests a day, and they pretend to be real UAs.
* Most of them don't bother sending the referrer URL, but not all (some bots from Huawei Cloud do, but they currently don't generate much traffic).
* The first thing I tried was to throttle bandwidth for URLs that contain id= (which on a cgit instance generate the bulk of the bot traffic). So I set the bandwidth to 1Kb/s and figured surely most of the bots wouldn't be willing to wait 10-20s to download the page. Surprise: they didn't care. They just waited and kept coming back.
* BTW, they also used keep-alive connections when offered. So another thing I did was disable keep-alive for the /cgit/ locations; without that, enough bots would routinely hog all the available connections.
* My current solution is to deny requests for all URLs containing id= unless they also contain the `notbot` parameter in the query string (which I suggest legitimate users add, via the custom error message on the 403). I also currently only do this if the referrer is absent, but I may have to change that if the bots adapt. Overall, this helped with the load and freed up connections for legitimate users, but the bots didn't go away. They still request, get a 403, and keep coming back.
My conclusion from this experience is that you really have only two options: either do something ad hoc and very specific to your site (like the notbot query parameter) that whoever runs the bots won't bother adapting to, or employ someone with enough resources (like Cloudflare) to fight them for you. Using some "standard" solution (rate limiting, Anubis, etc.) is not going to work; they have enough resources to eat the cost and/or adapt.
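For illustration, the notbot trick is only a few lines at the application layer too; a sketch as WSGI middleware (the real thing presumably lives in the web server config, and the parameter name is just this commenter's example):

    # Sketch: deny id= requests that carry no referrer, unless the magic
    # "notbot" query parameter is present.
    from urllib.parse import parse_qs

    def notbot_gate(app):
        def middleware(environ, start_response):
            qs = parse_qs(environ.get("QUERY_STRING", ""))
            referrer = environ.get("HTTP_REFERER")
            if "id" in qs and not referrer and "notbot" not in qs:
                start_response("403 Forbidden",
                               [("Content-Type", "text/plain")])
                return [b"Humans: re-request with &notbot in the URL.\n"]
            return app(environ, start_response)
        return middleware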
There’s a great talk on this: Defense by numbers: Making Problems for Script Kiddies and Scanner Monkeys https://www.youtube.com/watch?v=H9Kxas65f7A
What I'd really love to see, but probably never will, is companies joining forces to share data or support open projects like Common Crawl. That would raise the floor for everyone. But, you know… capitalism, so instead we all reinvent the wheel in our own silos.
An open project that classifies and records this would need a fair bit of ongoing protection, ironically.
I have a firewall that logs every incoming connection to every port. If I get a connection to a port that has nothing behind it, then I consider the IP address that sent the connection to be malicious, and I block the IP address from connecting to any actual service ports.
This works for me, but I run very few things to serve very few people, so there's minimal collateral damage when 'overblocking' happens - the most common thing is that I lock myself out of my VPN (lolfacepalm).
I occasionally look at the database of IP addresses and do some pivot tabling to find the most common networks and have identified a number of cough security companies that do incessant scanning of the IPv4 internet among other networks that give me the wrong vibes.
[0]: Uninvited Activity: https://github.com/UninvitedActivity/UninvitedActivity
P.S. If there aren't any Chinese or Russian IP addresses / networks in my lists, then I probably block them outright prior to the logging.
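A rough sketch of that "touch a dead port, get banned from everything" loop, assuming iptables LOG rules writing to /var/log/kern.log and an ipset named banned that the INPUT chain drops from; every name here is an assumption:

    # Sketch: tail the kernel log for connection attempts and ban sources
    # that hit ports with nothing behind them.
    import re
    import subprocess
    import time

    SERVICE_PORTS = {22, 80, 443}       # ports with real services behind them
    PAT = re.compile(r"SRC=(\d+\.\d+\.\d+\.\d+).*?DPT=(\d+)")

    with open("/var/log/kern.log") as log:
        log.seek(0, 2)                  # start tailing from the end
        while True:
            line = log.readline()
            if not line:
                time.sleep(1)
                continue
            m = PAT.search(line)
            if m and int(m.group(2)) not in SERVICE_PORTS:
                subprocess.run(["ipset", "add", "banned", m.group(1),
                                "-exist"])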
The internet was a big level-playing field, but for the past half century corporations and state actors managed to keep control and profit to themselves while giving the illusion that us peasants could still benefit from it and had a shot at freedom. Now that computing power is so vast and cheap, it has become an arms race and the cyberpunk dystopia has become apparent.
This is like saying "all the 'sugar-sweetened beverages are bad for you' people will sooner or later realize it is imperative to drink liquids". It is perfectly congruent to believe trustless systems are important and that the way the blockchain works is more harmful than beneficial.
Additionally, the claim is that cryptocurrencies are used like that. Blockchains by themselves have a different set of issues and criticisms.
I don't know if it's because they operate in the service of capital rather than China, as here, but use of those methods in the former case seems to get more of a pass here.
The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.
In that regard, reading my logs has sometimes led me to interesting articles about cybersecurity. Also, log flooding may result in your journaling service truncating the log, so you miss something important.
If this is actually impacting perceived QoS then I think a gitea bug report would be justified. Clearly there's been some kind of a performance regression.
Just looking at the logs seems to be an infohazard for many people. I don't see why you'd want to inspect the septic tanks of the internet unless absolutely necessary.
The bonus is my actual customers get the same benefits and don't notice any material loss from my content _not_ being scraped. How you see this as me being secretly taken advantage of is completely beyond me.