Google indexes in-country, as do a few other search engines.
Would recommend.
I looked at all the IP ranges delegated by APNIC, along with every local ISP that I could find, and unioned this with
https://lite.ip2location.com/australia-ip-address-ranges
So far I've not had any complaints, and I think I have most of them.
At some point in the future, I'll start including https://github.com/ebrasha/cidr-ip-ranges-by-country
Source: stopping attacks that involve thousands of IPs at my work.
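For anyone who wants to reproduce this: the APNIC delegated stats file is trivial to parse. A rough sketch in Python (the URL and pipe-delimited format are the published ones, but double-check before relying on them; note the ipv4 "value" field is an address count, not a prefix length):

    # Sketch: pull all AU IPv4 ranges from the APNIC delegated stats file
    # and expand each start/count pair into CIDR blocks.
    import ipaddress
    import urllib.request

    URL = "https://ftp.apnic.net/stats/apnic/delegated-apnic-latest"

    def au_ipv4_cidrs():
        with urllib.request.urlopen(URL) as resp:
            for line in resp.read().decode().splitlines():
                parts = line.split("|")
                # format: registry|cc|type|start|value|date|status
                if len(parts) >= 7 and parts[1] == "AU" and parts[2] == "ipv4":
                    first = ipaddress.IPv4Address(parts[3])
                    last = first + int(parts[4]) - 1   # value = address count
                    yield from ipaddress.summarize_address_range(first, last)

    for net in au_ipv4_cidrs():
        print(net)

Union the output with the other sources, deduplicate, and you have roughly the list I use.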
Are you really? How likely do you think it is for a legit customer/user to be on the same IP as a residential proxy? Sure, residential IPs get reused, but you can handle that by making the block last 6-8 hours, or a day or two.
My single-layer thought process:
If they're knowingly running a residential proxy then they'll likely know "the cost of doing business". If they're unknowingly running a residential proxy then blocking them might be a good way for them to find out they're unknowingly running a residential proxy and get their systems deloused.
Would it make sense to have a class of ISPs that didn't peer with these "bad" network participants?
Not sure what my point is here tbh. The internet sucks and I don't have a solution
It should be illegal, at least for companies that still charge me while I’m abroad and don’t offer me any other way of canceling service or getting support.
Say you whitelist an address/range and some systems detect "bad things". Now what? Do you remove that address/range from the whitelist? Do you distribute the removal to your peers? Do you communicate the removal to the owner of the unwhitelisted address/range? How does the owner communicate back that they've dealt with the issue? What if the owner of the range is a hosting provider that doesn't proactively control the content hosted, yet has robust anti-abuse mechanisms in place? And so on.
Whitelist-only is a huge can of worms, and whitelists work best with trusted partners you can maintain out-of-band communication with. Blacklists similarly work best with trusted partners, but for determining which addresses/ranges are more trouble than they are worth. And somewhere in the middle are grey-zone addresses, e.g. ranges assigned to ISPs with CGNATs: you just cannot reliably label an individual address, or even a range of addresses, as strictly troublesome or strictly trustworthy by default.
Implement blacklists for known bad actors, e.g. the whole of China and Russia, maybe even cloud providers. Implement whitelists for ranges you explicitly trust to have robust anti-abuse mechanisms, e.g. corporations with strictly internal hosts.
- Blacklisted IPs (Google Cloud, AWS, etc.): always blocked
- Untrusted IPs (residential IPs): given some leeway, but quickly hit 429s if they started querying too much
- Whitelisted IPs (IPv4 addresses legitimately shared by many people, i.e. anything behind a CGNAT; for example, my current data plan tells me my IP is from 5 states over)
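A rough sketch of that three-tier policy, for illustration only (the tier names, thresholds, and in-memory counters are all invented):

    # Sketch: tiered request policy. Blacklisted sources are refused
    # outright, untrusted IPs get a small budget before 429, and
    # whitelisted/CGNAT ranges get a much larger one, since many users
    # can share one address. All numbers are made up.
    import time
    from collections import defaultdict

    LIMITS = {"untrusted": 60, "whitelist": 6000}   # requests per hour
    seen = defaultdict(list)                        # ip -> request timestamps

    def decide(ip: str, tier: str) -> int:
        """Return an HTTP status: 200 allow, 403 block, 429 throttle."""
        if tier == "blacklist":
            return 403
        now = time.time()
        window = [t for t in seen[ip] if now - t < 3600]
        window.append(now)
        seen[ip] = window
        return 429 if len(window) > LIMITS[tier] else 200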
You can probably guess what happens next. Most scrapers were thrown out, but the largest ones just got a modem device farm and ate the cost. They successfully prevented most users from scraping locally, but were quickly beaten by companies profiting from scraping.
I think this was one of many bad decisions Pokémon Go made. Some casual players dropped because they didn't want to play without a map, while the hardcore players started paying for scraping, which hammered their servers even more.
The known good list is IPs and ranges I know are good. The known bad list is specific bad actors. The data center networks list is updated periodically based on a list of ASNs belonging to data centers.
There are a lot of problems with using ASNs, even for well-known data center operators. First, they update often. Second, they often include massive subnets like /13(!), which can apparently overlap with routes announced by other networks, causing false positives. Third, I had been merging networks (to avoid overlaps causing problems in nginx) with something like https://github.com/projectdiscovery/mapcidr, but found that it also created larger overlaps that introduced false positives from adjacent networks where some legitimate users apparently are. Lastly, I had seen suspicious traffic from data center operators like Cato Networks Ltd and Zscaler, which are enterprise security products that route clients through their clouds. Blocking those resulted in some angry users in places I didn't expect...
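If you only want the merge step to never widen coverage, the Python stdlib can do that part; a minimal sketch (the input filename is hypothetical, and v4/v6 have to be collapsed separately):

    # Sketch: merge CIDRs without over-aggregating. collapse_addresses
    # returns the minimal set of networks covering exactly the input, so
    # it can't swallow adjacent space the way lossy aggregation can.
    import ipaddress

    with open("datacenter-cidrs.txt") as f:          # hypothetical input file
        nets = [ipaddress.ip_network(l.strip()) for l in f if l.strip()]

    for version in (4, 6):                           # versions can't be mixed
        same = [n for n in nets if n.version == version]
        for net in ipaddress.collapse_addresses(same):
            print(net)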
And none of that accounts for the residential ISPs that bots use to appear like legitimate users: https://www.trendmicro.com/vinfo/us/security/news/vulnerabil....
It's not like we can capitalize on commerce in China anyway, so I think it's a fairly pragmatic approach.
If it works for my health insurance company, essentially all streaming services (including not even being able to cancel service from abroad), and many banks, it’ll work for you as well.
Surely bad actors wouldn’t use VPNs or botnets, and your customers never travel abroad?
I was in the UK. I wanted to buy a movie ticket there. Fuck me, because I have an Austrian IP address, since modern mobile backends pass your traffic through your home mobile operator. So I tried to use a VPN. Fuck me, VPN endpoints are blocked too.
I wanted to buy a Belgian train ticket, still from home. Cloudflare: fuck me, because I'm too suspicious as a foreigner. It broke their whole API access, which their own site used.
I wanted to order something while I was in America at my friend's place. Fuck me, of course. Not just my IP was problematic, but my phone number too. And of course my bank card… and I just wanted to order a pizza.
The most annoying is when your fucking app is restricted to your stupid country, and I'm supposed to use it because it's a public transport app. Lovely.
And of course, there was that time when I moved to another country… pointless country restrictions everywhere… they really helped.
I remember the times when the saying was that the checkout process should be as frictionless as possible. That sentiment is long gone.
The blocks don't stay in place forever, just a few months.
I say that because I can't count how many times Google has taken me to a foreign site that either doesn't even ship to the US, or doesn't say one way or the other and treats me like a crazy person for asking.
The only way of communicating with such companies is chargebacks through my bank (which always at least has a phone number reachable from abroad), so I'd make sure to account for those.
Visa/Mastercard chargeback rules largely apply worldwide (with some regional exceptions, but much less than many banks would make you believe).
In my experience running rather low-traffic sites (thousands of hits a day), doing just that brought every single annoyance from thousands per day to zero.
Yes, people -can- easily get around it via various listed methods, but don't seem to actually do that unless you're a high value target.
Capitalism is a means to an end, and allowable business practices are a two-way street between corporations and consumers, mediated by regulatory bodies and consumer protection agencies, at least in most functioning democracies.
There are some that do not provide services in most countries, but Netflix, Disney, and Paramount are pretty much global operations.
HBO and Peacock might not be available in Europe, but I am guessing they are in Canada.
Funny to see how narrow a perspective some people have…
In several European countries, there is no HBO, since Sky has some kind of exclusive contract for their content there, and that's where I was accordingly unable to unsubscribe from a US HBO plan.
Netflix doesn't have this issue but I've seen services that seem to make it tough. Though sometimes that's just a phone call away.
Though OTOH, whining about this while knowing about VPNs, and then complaining on behalf of the theoretical persona who doesn't know about VPNs but has subscriptions to cancel and is allergic to phone calls or calling their bank... like, sure, they exist, but are we talking about any significant number of people here?
How so? By blocking my IP, they did not let me unsubscribe.
Instead of being able to access at least my account (if not the streaming service itself, which I get – copyright and all), I'd just see a full screen notice along the lines of "we are not available in your market, stay tuned".
Traffic should be "privatized" as much as possible between IPv6 addresses (because you still have "scanners" sweeping the whole internet all the time... "the nice guys scanning the whole internet for your protection"... never to sell any scan data, of course).
Public IP services are done for: it's going to be hell whatever you do.
The right answer seems to be significantly big "security and availability teams" plus open and super simple internet standards. Yep, the JavaScript internet has to go away, and the private app protocols have to as well. No more WHATWG cartel web engine, or the worst of all: closed network protocols for "apps".
And the most important: hardcore protocol simplicity that still does a good enough job. It's common sense, but the planned-obsolescence and kludgy-bloat lovers won't let you...
Re: China, their cloud services seem to stretch to Singapore and beyond. I had to blacklist all of Alibaba Cloud and Tencent, and the ASNs stretched well beyond PRC borders.
It wouldn't surprise me if this is related somehow. Maybe these are Indian corporations using a Seychelles offshore entity to do their scanning, because then they can offset the costs against their tax or something. Cyprus may have similar reasons. ISTR that Cyprus was revealed to be important in providing a storefront to Russia and Putin-related companies and oligarchs.[2]
So Seychelles may be India-related bots, and Cyprus Russia-related bots.
[1] https://taxjustice.net/faq/what-is-transfer-pricing/#:~:text...
[2] Yup. My memory originated in the "Panama Papers" leaks https://www.icij.org/investigations/cyprus-confidential/cypr...
So the Seychelles traffic is likely really disguised Chinese traffic.
[1] https://mybroadband.co.za/news/internet/350973-man-connected...
The explanation is that easy??
Soon: chineseplayer.io
It won't be all Chinese companies or people doing the scraping. It's well known that a lot of countries don't mind such traffic as long as it doesn't target themselves or, for the West, some allies.
Laws aren't the same everywhere, so companies can get away with behavior in one place that would seem almost criminal in another.
And what better place to put your scrapers than somewhere with no copyright.
Russia was the same, but since 2012 or so they changed their laws and a lot of traffic dropped off. Companies moved to small islands or small nation states (favoring them with their tax payouts; they don't mind if you bring money) or the few remaining places, like China, that don't care about copyright.
It's pretty hard to really get rid of such traffic. You can block stuff, but mostly that just changes the response your server gives; the flood is still knocking at the door.
I'd hope someday ISPs or the like get more creative, but maybe they don't have enough access. It's hard to do this without the right (creepy) level of access into the traffic, or without accidentally censoring the whole thing.
I am getting downvoted for saying I ban the whole of Singapore and China? Oh lord... OK. Please, all the downvoters, list your public-facing websites. I do not care if people from China cannot access my website. They are not the target audience, and they are free to use VPNs if they so wish, or Tor, or whatever works for them; I have not banned those yet. This is for my OWN PERSONAL SHITTY WEBSITE, inb4 you want to moderate the fuck out of what I can and cannot do on my own server(s). Frankly, fuck off, or be a hero and die a martyr. :D
HostPapa in the US seems to be becoming the new main issue (via what appears to be an "IP colocation service"... yes, you read that right).
In fact, I bet it would choke on a small amount of traffic from here considering it has a shitty vCPU with 512 MB RAM.
The internet has become a hostile place for any public server, and with the advent of ML tools, bots will make up far more than the current ~50% of all traffic. Captchas and bot detection are a losing strategy as bot behavior becomes more human-like.
Governments will inevitably enact privacy-infringing regulation to deal with this problem, but for sites that don't want to adopt such nonsense, allowlists are the only viable option.
I've been experimenting with a system where allowed users can create short-lived tokens via some out-of-band mechanism, which they can use on specific sites. A frontend gatekeeper then verifies the token, and if valid, opens up the required public ports specifically for the client's IP address, and redirects it to the service. The beauty of this system is that the service itself remains blocked at the network level from the world, and only allowed IP addresses are given access. The only publicly open port is the gatekeeper, which only accepts valid tokens, and can run from a separate machine or network. It also doesn't involve complex VPN or tunneling solutions, just a standard firewall.
This should work well for small personal sites, where initial connection latency isn't a concern, but obviously wouldn't scale well at larger scales without some rethinking. For my use case, it's good enough.
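A minimal sketch of such a gatekeeper, under heavy assumptions: HMAC-signed expiring tokens, an existing nftables set (here called allowed_clients) that the service's firewall rule matches on, and an invented redirect target; none of these names come from the parent comment.

    # Sketch: validate a short-lived HMAC token, then punch the client's IP
    # into an nftables set so the firewall (not the app) gates access.
    # Table/set names, port, and secret are all placeholders.
    import hashlib, hmac, subprocess, time
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    SECRET = b"change-me"            # shared with users out-of-band

    def token_valid(token: str) -> bool:
        try:
            expiry, sig = token.split(".")
            good = hmac.new(SECRET, expiry.encode(), hashlib.sha256).hexdigest()
            return hmac.compare_digest(sig, good) and time.time() < int(expiry)
        except ValueError:
            return False

    class Gate(BaseHTTPRequestHandler):
        def do_GET(self):
            qs = parse_qs(urlparse(self.path).query)
            ip = self.client_address[0]
            if token_valid(qs.get("token", [""])[0]):
                # Open the firewall for this IP, then hand off to the service
                subprocess.run(["nft", "add", "element", "inet", "filter",
                                "allowed_clients", "{", ip, "}"], check=False)
                self.send_response(302)
                self.send_header("Location", "https://service.example/")
                self.end_headers()
            else:
                self.send_response(403)
                self.end_headers()

    HTTPServer(("", 8080), Gate).serve_forever()

Token generation (the out-of-band part) is the mirror image: pick an expiry timestamp, HMAC it with the same secret, and hand out "expiry.signature".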
CloudFront is fairly good at marking whether someone is accessing from a data centre or a residential/commercial endpoint. It's not 100% accurate, and really bad actors can still use infected residential machines to proxy traffic, but this fix was simple and reduced the problem to a negligible level.
/128: single application
/64: single computer
/56: entire building
/48: entire (digital) neighborhood
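Which suggests IPv6 blocking should key on prefixes, not individual addresses. A tiny sketch, with the prefix length following the rough table above:

    # Sketch: key abuse counters/blocks by IPv6 prefix instead of the exact
    # address; a bot farm on one /64 then costs it the whole /64.
    import ipaddress

    def block_key(addr: str, prefixlen: int = 64) -> str:
        ip = ipaddress.ip_address(addr)
        if ip.version == 6:
            # Zero the host bits: one key per /64 (use 56 for "buildings")
            return str(ipaddress.ip_network(f"{ip}/{prefixlen}", strict=False))
        return str(ip)                    # IPv4: keep per-address keys

    print(block_key("2001:db8:abcd:12:aaaa::1"))   # -> 2001:db8:abcd:12::/64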
IP blocking is useless if your sources are hundreds of thousands of people worldwide just playing a "free" game on their phone that, once in a while on wifi, fetches some webpages in the background for the game publisher's scraping-as-a-service side revenue deal.
I can't believe the entitlement.
And no, I do not use those paid services, even though it would make it much easier.
If you feel like you need to do anything at all, I would suggest treating it like any other denial-of-service vulnerability: Fix your server or your application. I can handle 100k clients on a single box, which equates to north of 8 billion daily impressions, and so I am happy to ignore bots and identify them offline in a way that doesn't reveal my methodologies any further than I absolutely have to.
That's traffic I want to block, and that's behaviour that I want to punish / discourage. If a set of users get caught up in that, even when they've just been given recycled IP addresses, then there's more chance of bringing the shitty 'scraping as a service' behaviour to light, and hopefully disinfecting it.
(opinion coming from someone definitely NOT hosting public information that must be accessible by the common populace - that's an issue requiring more nuance, but luckily has public funding behind it to develop nuanced solutions - and can just block China and Russia if it's serving a common populace outside of China and Russia).
We have no Chinese users/customers, so in theory this does not affect business at all. Also, Russia is sanctioned and our Russian userbase does not actually live in Russia, so blocking Russia did not affect users at all.
I did a quick search and found a few databases but none of them looks like the obvious winner.
If your site is behind cloudflare, blocking/challenging by country is a built-in feature.
None of these are my main traffic drivers, just the main resource hogs, and the main reason my site turns slow (usually an AI crawler, Microsoft, or Facebook ignoring any common sense).
China and co. are only a very small portion of my malicious traffic, gladly. It's usually US companies, disrespecting my robots.txt and DNS rate limits, who cause me the most problems.
43.131.0.0/18 43.129.32.0/20 101.32.0.0/20 101.32.102.0/23 101.32.104.0/21 101.32.112.0/23 101.32.112.0/24 101.32.114.0/23 101.32.116.0/23 101.32.118.0/23 101.32.120.0/23 101.32.122.0/23 101.32.124.0/23 101.32.126.0/23 101.32.128.0/23 101.32.130.0/23 101.32.13.0/24 101.32.132.0/22 101.32.132.0/24 101.32.136.0/21 101.32.140.0/24 101.32.144.0/20 101.32.160.0/20 101.32.16.0/20 101.32.17.0/24 101.32.176.0/20 101.32.192.0/20 101.32.208.0/20 101.32.224.0/22 101.32.228.0/22 101.32.232.0/22 101.32.236.0/23 101.32.238.0/23 101.32.240.0/20 101.32.32.0/20 101.32.48.0/20 101.32.64.0/20 101.32.78.0/23 101.32.80.0/20 101.32.84.0/24 101.32.85.0/24 101.32.86.0/24 101.32.87.0/24 101.32.88.0/24 101.32.89.0/24 101.32.90.0/24 101.32.91.0/24 101.32.94.0/23 101.32.96.0/20 101.33.0.0/23 101.33.100.0/22 101.33.10.0/23 101.33.10.0/24 101.33.104.0/21 101.33.11.0/24 101.33.112.0/22 101.33.116.0/22 101.33.120.0/21 101.33.128.0/22 101.33.132.0/22 101.33.136.0/22 101.33.140.0/22 101.33.14.0/24 101.33.144.0/22 101.33.148.0/22 101.33.15.0/24 101.33.152.0/22 101.33.156.0/22 101.33.160.0/22 101.33.164.0/22 101.33.168.0/22 101.33.17.0/24 101.33.172.0/22 101.33.176.0/22 101.33.180.0/22 101.33.18.0/23 101.33.184.0/22 101.33.188.0/22 101.33.24.0/24 101.33.25.0/24 101.33.26.0/23 101.33.30.0/23 101.33.32.0/21 101.33.40.0/24 101.33.4.0/23 101.33.41.0/24 101.33.42.0/23 101.33.44.0/22 101.33.48.0/22 101.33.52.0/22 101.33.56.0/22 101.33.60.0/22 101.33.64.0/19 101.33.64.0/23 101.33.96.0/22 103.52.216.0/22 103.52.216.0/23 103.52.218.0/23 103.7.28.0/24 103.7.29.0/24 103.7.30.0/24 103.7.31.0/24 43.130.0.0/18 43.130.64.0/18 43.130.128.0/19 43.130.160.0/19 43.132.192.0/18 43.133.64.0/19 43.134.128.0/18 43.135.0.0/18 43.135.64.0/18 43.135.192.0/19 43.153.0.0/18 43.153.192.0/18 43.154.64.0/18 43.154.128.0/18 43.154.192.0/18 43.155.0.0/18 43.155.128.0/18 43.156.192.0/18 43.157.0.0/18 43.157.64.0/18 43.157.128.0/18 43.159.128.0/19 43.163.64.0/18 43.164.192.0/18 43.165.128.0/18 43.166.128.0/18 43.166.224.0/19 49.51.132.0/23 49.51.140.0/23 49.51.166.0/23 119.28.64.0/19 119.28.128.0/20 129.226.160.0/19 150.109.32.0/19 150.109.96.0/19 170.106.32.0/19 170.106.176.0/20
Here's a useful tool/site:
You can feed it an IP address to get an AS ("Autonomous System"), then ask it for all prefixes associated with that AS.
I fed it the first IP address from that list (43.131.0.0) and it showed me the same Tencent-owned AS132203, and it gives back all the prefixes they have here:
https://bgp.tools/as/132203#prefixes
(Looks like roguebloodrage might have missed at least the 1.12.x.x and 1.201.x.x prefixes?)
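If you'd rather script it than click around, RIPEstat's public API exposes the same AS-to-prefixes lookup (a real endpoint as far as I know, but check their usage policy):

    # Sketch: fetch announced prefixes for an ASN via RIPEstat, e.g.
    # Tencent's AS132203 from the comment above.
    import json
    import urllib.request

    def announced_prefixes(asn: int) -> list[str]:
        url = ("https://stat.ripe.net/data/announced-prefixes/data.json"
               f"?resource=AS{asn}")
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        return [p["prefix"] for p in data["data"]["prefixes"]]

    for prefix in announced_prefixes(132203):
        print(prefix)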
I started searching about how to do that after reading a RachelByTheBay post where she wrote:
Enough bad behavior from a host -> filter the host.
Enough bad hosts in a netblock -> filter the netblock.
Enough bad netblocks in an AS -> filter the AS. Think of it as an "AS death penalty", if you like.
(from the last part of https://rachelbythebay.com/w/2025/06/29/feedback/ )
e.g. chuck 'Tencent' into the text box and execute.
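A minimal sketch of Rachel's escalation ladder (thresholds invented; mapping an IP to its netblock and ASN is left to whatever data source you use, e.g. bgp.tools above):

    # Sketch: escalating filters from host to netblock to AS (the "AS death
    # penalty"). Thresholds are made up; persistence and decay are omitted.
    from collections import Counter

    HOST_LIMIT, NET_LIMIT, AS_LIMIT = 100, 1_000, 10_000
    hosts, nets, ases = Counter(), Counter(), Counter()

    def record_bad_behavior(ip: str, netblock: str, asn: int) -> list:
        """Count one bad event; return the levels that just tripped."""
        tripped = []
        for counter, key, limit in ((hosts, ip, HOST_LIMIT),
                                    (nets, netblock, NET_LIMIT),
                                    (ases, asn, AS_LIMIT)):
            counter[key] += 1
            if counter[key] == limit:
                tripped.append(key)
        return tripped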
You can blunt-instrument 403 geoblock entire countries if you want, or any user agent, or any netblock or ASN. It's entirely up to you; it's your own server and nobody will be legitimately mad at you.
You can rate limit IPs to x responses per day or per hour or per week, whatever you like.
This whole AI scraper panic is so incredibly overblown.
I’m currently working on a sniffer that tracks all inbound TCP connections and UDP/ICMP traffic and can trigger firewall rule addition/removal based on traffic attributes (such as firewalling or rate limiting all traffic from certain ASNs or countries) without actually having to be a reverse proxy in the HTTP flow. That way your in-kernel tables don’t need to be huge and they can just dynamically be adjusted from userspace in response to actual observed traffic.
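A rough sketch of that idea is already possible with off-the-shelf pieces (scapy for capture, nftables for enforcement; the set name and threshold are invented, and a real version would track far more state):

    # Sketch: watch TCP SYNs off the wire and firewall sources exceeding a
    # rate, without sitting in the HTTP path. Needs root; pip install scapy.
    import subprocess
    import time
    from collections import defaultdict
    from scapy.all import IP, TCP, sniff

    THRESHOLD = 200                       # SYNs per minute before blocking
    syns = defaultdict(list)

    def on_packet(pkt):
        if IP in pkt and TCP in pkt and pkt[TCP].flags & 0x02:   # SYN bit
            src, now = pkt[IP].src, time.time()
            recent = [t for t in syns[src] if now - t < 60]
            recent.append(now)
            syns[src] = recent
            if len(recent) == THRESHOLD:
                subprocess.run(["nft", "add", "element", "inet", "filter",
                                "banned", "{", src, "}"], check=False)

    sniff(filter="tcp[tcpflags] & tcp-syn != 0", prn=on_packet, store=False)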
The problem is that it's eating into people's costs. And if you're not concerned with money, I'm just asking: can you send me $50.00 USD?
Blocking IPs is much cheaper for the blocker.
# Note: default_type/alias aren't valid inside "if", and nginx won't serve
# a character device like /dev/zero, so shunt bad bots to a tarpit location
# that trickles a big junk file at 1 KB/s instead:
if ($http_user_agent ~* "BadBot") {
    rewrite ^ /tarpit last;
}

location = /tarpit {
    internal;                              # only reachable via the rewrite
    limit_rate 1k;
    default_type application/octet-stream;
    alias /srv/tarpit/junk.bin;            # any large junk file
}
Then turn the tables on them and make the Great Firewall do your job! Just choose a random snippet about illegal Chinese occupation of Tibet or human rights abuses of Uyghur people each time you generate a page and insert it as a breaker between paragraphs. This should get you blocked in no time :)
> Chuck 'Tencent' into the text box and execute.
I've been playing cat and mouse trying to block them for the past week and here are a couple of observations/ideas, in case this is helpful to someone:
* As mentioned above, the bulk of the traffic comes from a large number of IPs, each issuing only a few requests a day, and they pretend to be real UAs.
* Most of them don't bother sending the referrer URL, but not all (some bots from Huawei Cloud do, but they currently don't generate much traffic).
* The first thing I tried was to throttle bandwidth for URLs that contain id= (which on a cgit instance generate the bulk of the bot traffic). So I set the bandwidth to 1Kb/s and figured surely most of the bots wouldn't be willing to wait 10-20s to download the page. Surprise: they didn't care. They just waited and kept coming back.
* BTW, they also used keep-alive connections when offered. So another thing I did was disable keep-alive for the /cgit/ locations; without that, enough bots would routinely hog all the available connections.
* My current solution is to deny requests for all URLs containing id= unless they also contain the `notbot` parameter in the query string (which I suggest legitimate users add, via the custom error message on the 403). I also currently only do this if the referrer is absent, but I may have to change that if the bots adapt. Overall, this helped with the load and freed up connections for legitimate users, but the bots didn't go away. They still request, get a 403, and keep coming back.
My conclusion from this experience is that you really have only two options: either do something ad hoc and very specific to your site (like the notbot query parameter) that whoever runs the bots won't bother adapting to, or employ someone with enough resources (like Cloudflare) to fight them for you. Using some "standard" solution (rate limiting, Anubis, etc.) is not going to work; they have enough resources to eat the cost and/or adapt.
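For illustration, the notbot trick is only a few lines at the application layer too; a sketch as WSGI middleware (the real thing presumably lives in the web server config, and the parameter name is just this commenter's example):

    # Sketch: deny id= requests that carry no referrer, unless the magic
    # "notbot" query parameter is present.
    from urllib.parse import parse_qs

    def notbot_gate(app):
        def middleware(environ, start_response):
            qs = parse_qs(environ.get("QUERY_STRING", ""))
            referrer = environ.get("HTTP_REFERER")
            if "id" in qs and not referrer and "notbot" not in qs:
                start_response("403 Forbidden",
                               [("Content-Type", "text/plain")])
                return [b"Humans: re-request with &notbot in the URL.\n"]
            return app(environ, start_response)
        return middleware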
There’s a great talk on this: Defense by numbers: Making Problems for Script Kiddies and Scanner Monkeys https://www.youtube.com/watch?v=H9Kxas65f7A
What I'd really love to see, but probably never will, is companies joining forces to share data or support open projects like Common Crawl. That would raise the floor for everyone. But, you know… capitalism, so instead we all reinvent the wheel in our own silos.
An open project that classifies and records this would need a fair bit of ongoing protection, ironically.
I have a firewall that logs every incoming connection to every port. If I get a connection to a port that has nothing behind it, then I consider the IP address that sent the connection to be malicious, and I block the IP address from connecting to any actual service ports.
This works for me, but I run very few things to serve very few people, so there's minimal collateral damage when 'overblocking' happens - the most common thing is that I lock myself out of my VPN (lolfacepalm).
I occasionally look at the database of IP addresses and do some pivot tabling to find the most common networks and have identified a number of cough security companies that do incessant scanning of the IPv4 internet among other networks that give me the wrong vibes.
[0]: Uninvited Activity: https://github.com/UninvitedActivity/UninvitedActivity
P.S. If there aren't any Chinese or Russian IP addresses / networks in my lists, then I probably block them outright prior to the logging.
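A rough sketch of that "touch a dead port, get banned from everything" loop, assuming iptables LOG rules writing to /var/log/kern.log and an ipset named banned that the INPUT chain drops from; every name here is an assumption:

    # Sketch: tail the kernel log for connection attempts and ban sources
    # that hit ports with nothing behind them.
    import re
    import subprocess
    import time

    SERVICE_PORTS = {22, 80, 443}       # ports with real services behind them
    PAT = re.compile(r"SRC=(\d+\.\d+\.\d+\.\d+).*?DPT=(\d+)")

    with open("/var/log/kern.log") as log:
        log.seek(0, 2)                  # start tailing from the end
        while True:
            line = log.readline()
            if not line:
                time.sleep(1)
                continue
            m = PAT.search(line)
            if m and int(m.group(2)) not in SERVICE_PORTS:
                subprocess.run(["ipset", "add", "banned", m.group(1),
                                "-exist"])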
The internet was a big level-playing field, but for the past half century corporations and state actors managed to keep control and profit to themselves while giving the illusion that us peasants could still benefit from it and had a shot at freedom. Now that computing power is so vast and cheap, it has become an arms race and the cyberpunk dystopia has become apparent.
This is like saying "all the 'sugar-sweetened beverages are bad for you' people will sooner or later realize it is imperative to drink liquids". It is perfectly congruent to believe trustless systems are important and that the way the blockchain works is more harmful than beneficial.
Additionally, the claim is that cryptocurrencies are used like that. Blockchains by themselves have a different set of issues and criticisms.
I don't know if it's because they operate in the service of capital rather than China, as here, but use of those methods in the former case seems to get more of a pass here.
The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.
In that regard, reading my logs has sometimes led me to interesting articles about cybersecurity. Also, log flooding may result in your journaling service truncating the log, so you miss something important.
If this is actually impacting perceived QoS then I think a gitea bug report would be justified. Clearly there's been some kind of a performance regression.
Just looking at the logs seems to be an infohazard for many people. I don't see why you'd want to inspect the septic tanks of the internet unless absolutely necessary.
The bonus is my actual customers get the same benefits and don't notice any material loss from my content _not_ being scraped. How you see this as me being secretly taken advantage of is completely beyond me.