Guarding My Git Forge Against AI Scrapers

https://vulpinecitrus.info/blog/guarding-git-forge-ai-scrapers/

182•todsacerdoti•1mo ago

Comments

pabs3•1mo ago

> the difference in power usage caused by scraping costs us ~60 euros a year

dirkc•1mo ago

I'm not 100% against AI, but I do cheer loudly when I see things like this!

I'm also left wondering about what other things you could do? For example - I have several friends that built their own programming languages, I wonder what the impact would be if you translate lots of repositories to your own language and host it for bots to scrape? Could you introduce sufficient bias in a LLM to make an esoteric programming language popular?

zwnow•1mo ago

> Could you introduce sufficient bias in a LLM to make an esoteric programming language popular?

Wasn't there a study a while back showing that a small sample of data is good enough to poison an LLM? So I'd say it for sure is possible.

hurturue•1mo ago

Russia already does that - poisons the net for future LLM pretraining data.

it's called "LLM grooming"

https://thebulletin.org/2025/03/russian-networks-flood-the-i...

brabel•1mo ago

This article shows no evidence for anything it claims. None. All of that while claiming we can’t believe almost anything we read online… well you’re god damn right.

> undermining democracy around the globe is arguably Russia’s foremost foreign policy objective.

Right, because Russia is such a cartoonish villain it has no interest in pursuing its own development and good relations with any other country, all it cares about is annoying the democratic countries with propaganda about their own messed up politics.

When did it become acceptable for journalists to make bold, generalizing claims against whole nations without a single piece of direct, falsifiable evidence for what they claim and, worse, to make claims like this that can be easily dismissed as obviously false by quickly looking at the policies and their diplomatic interactions with other countries?!

nutjob2•1mo ago

> Right, because Russia is such a cartoonish villain it has no interest in pursuing its own development and good relations with any other country, all it cares about is annoying the democratic countries with propaganda about their own messed up politics.

That's actually pretty much spot on.

brabel•1mo ago

When you start believing that there are only good and bad, black and white, them vs us, you know for sure you’ve been brainwashed. Goes to both sides.

hurturue•1mo ago

so between 0 (good) and 100 (bad), what would be your gray score "badness/evilness" value for the following: Russia, US, China, EU

yes, i know, it's not a linear axis, it's multi-dimensional perspective thing. so do a PCA/projection and spit one number, according to your values/beliefs

tkfoss•1mo ago

95,95,95,{depends on the country, from 30 to 100}

nutjob2•1mo ago

For someone who complains about unsupported claims, you seem to make a lot of them.

The fact that you think this is something to do with "both sides" instead of a simple question of facts really gives you away.

brabel•1mo ago

What?? I am just saying that if you think the world is made of black and white villains vs heroes, you are buying into the propaganda from one side or another. This is not a bold claim, this is basic logic from anyone mature enough to know that no country, and no person, is just simply either good or bad. They do bad things in order to accomplish what they believe to be good things. The US dropped two atomic bombs on Japan, a horrifically evil act, but it did so in order to defeat what it believed to be an even bigger evil. Russia invaded Ukraine, a violent, barbaric act that caused the deaths of at least a million on both sides, but it did so because it, like the US, believed it was doing what's right to ensure its country's independence in the longer term since, as they'd been saying for decades, Ukraine must never be allowed to join a hostile military alliance as that would compromise forever Russia's own ability to defend itself from invasion from western powers, when Operation Barbarossa is still very, very alive in their minds to this very day. It doesn't matter if you agree with either the US or Russia on whether they were actually right, what matters is that they themselves thought they were right, given their own circumstances, and people love to ignore that and judge them by their own perceptions of what they should think. This is a sign of immaturity: you probably judge people around you in your life like that as well, by what you see from the outside without any idea what's going on inside their heads.

mopsi•1mo ago

Ukraine was not joining NATO. Nor do foreign policy professionals, both in Russia and abroad, consider NATO a threat to Russia. Nor does the war have anything to do with Barbarossa or many other historical comparisons; Russian propaganda generally avoids drawing comparisons to Barbarossa because Ukraine was at the forefront of the invasion and the historical parallels between the invaders would be too obvious. The sudden and devastating attack and siege of Kyiv in 1941 was a major traumatic early-war event that occupies a similar place in Russian mass consciousness as Pearl Harbor does in American consciousness. Instead, Russian propaganda frames the war as a continuation of a civilizational mission: the reclamation of "historic Russian lands" and the reunification of the Russian people.

This is at odds with the propaganda for foreign audiences that presents the war as a modern conflict with NATO, but thankfully, people like you who talk about listening to Russia don't actually know what's going on there and flat out refuse to listen to what Russians are saying.

For instance, the commander of the 2014 invasion of Donbas is a prominent public figure, a mentor and ideologue, who used to host lengthy livestreams in which he discussed how and why the war happened. Have you watched any of his long talks about the restoration of the Russian imperial province of Novorossiya through war against Ukraine, or do you prefer to pretend that none of this exists?

Not to mention the entire pre-Putin generation of Russian politicians and diplomats, who are very active on Twitter and readily explain how NATO is beneficial to Russia by imposing extensive standards on its members along Russia's western border.

Putin's own former senior advisor recently got so pissed about dumbasses placing blame on NATO that he published a video on his personal Youtube channel explaining why the entire narrative is a malicious misrepresentation of the facts and bullshit from the start. According to him, Putin held secret staff meetings (which the advisor attended) about the invasion of Ukraine as early as 2005, which predates the common excuses for the war by many years.

But no. Instead of listening to Russians, you just repeat hollow Russian war propaganda that echoes across the internet without any real people behind it, believing that you have some insight that others lack.

brabel•1mo ago

> Ukraine was not joining NATO

Oh my god, you're just ignorant.

NATO itself said they would join the alliance back in 2008.

> At the 2008 Bucharest summit, NATO declined to offer Ukraine a Membership Action Plan, but said that Ukraine would eventually join the alliance.

Source: https://en.wikipedia.org/wiki/Ukraine%E2%80%93NATO_relations

Also mentioned even in NATO's own website:

https://www.nato.int/en/what-we-do/partnerships-and-cooperat...

You can read the full statement from 2008 here: https://www.nato.int/en/about-us/official-texts-and-resource...

This was the trigger for the Russian invasion of Georgia, which was also mentioned in the above statement.

Do you think NATO is spreading Russian propaganda on its website??

mopsi•1mo ago

> NATO itself said they would join the alliance back in 2008.

No. That was merely a polite statement without any timetable or actionable steps, made after the allies had decided at the 2008 summit not to invite Ukraine and Georgia into NATO, leaving both countries exposed to Russian pressure and eventual military aggression.

Nowadays, this is widely considered a severe mistake. For example, Anders Fogh Rasmussen, the former NATO secretary-general, has on many occasions referred to the exclusion of Ukraine and Georgia as mistakes that emboldened Putin to invade them.

Perhaps you should drop him an email to explain that he doesn't know what he's talking about and that Ukraine was "actually" on the path to NATO membership. So much ignorance in the world, ain't there.

frogperson•1mo ago

Can you point me to any examples of russia doing something good or helping anyone except billionaires? No? Then their reputation is well deserved.

nightpool•1mo ago

They link multiple sources, including a Sunshine Foundation report summarizing other research into the area, and a NewsGuard report where they tested claims from the Pravda network directly against leading LLM chatbots: https://static1.squarespace.com/static/6612cbdfd9a9ce56ef931... https://www.newsguardtech.com/special-reports/generative-ai-...

brabel•1mo ago

I think this research seems a little bit suspicious. First of all, it focuses almost entirely on the brawl the NewsGuard is having with one guy[1]. Notice how that's where most of the "fake news" they mention come from. Secondly, asking the LLM a "leading question" is a very well known way to get biased answers and they did that to an extreme extent in this piece. Read this article to understand how you can get LLMs to say almost anything that supports whatever you want it to support[2]. That unfortunately weakens what may have been good points they had: that LLMs seem to just trust any news websites equally regardless of their "accuracy". I would like to point out that American news websites are not known for their accuracy either and the proliferation of fact-checking websites that routinely debunk their half lies proves that.

I do agree with you that there are many news sites spreading misinformation, but I think that most of it is not coming from governments... and while governments are also doing this, most, I would think, do it with good intentions (they do believe the information is true and barely verify that when it favours their preconceived points of view). When propaganda spreads information you like, you tend to call it just news.

The way current Western media is currently dismissing anything at all that comes from Russian sources as lies and propaganda, however, is way overblown in my opinion. That's causing a huge blind spot in the public discourse which just makes the fake news sources seem even more attractive, since they seem to be whistleblowers fighting against a campaign of silence from the mainstream media, which is not completely incorrect.

[1] https://www.newsguardtech.com/special-reports/john-mark-doug...

[2] https://medium.com/@amithnmbr/why-its-important-to-know-how-...

ekropotin•1mo ago

As a Russian, I have to say that Putin is indeed way too focused on geopolitics instead of internal state of affairs.

hashar•1mo ago

I do not understand why the scrapers do not do it in a smarter way: clone the repositories and fetch from there on a daily or so basis. I have witnessed one going through every single blame and log link across all branches and redoing it every few hours! It sounds like they did not even try to optimize their scrapers.

FieryMechanic•1mo ago

The way most scrapers work (I've written plenty of them) is that you just basically get the page and all the links and just drill down.
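A minimal sketch of that "get the page, collect the links, drill down" loop, to show why a forge gets hit so hard. Everything here is an illustrative assumption: `crawl`, `LinkCollector`, and the `fetch` callable are hypothetical names, and the example runs against an in-memory dict instead of the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, limit=1000):
    """Breadth-first crawl. `fetch` is a hypothetical callable that
    returns the HTML for a URL, or None if the page is unavailable."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        html = fetch(url)
        order.append(url)
        if html is None:
            continue
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Toy "forge" held in memory; a real run would issue HTTP requests.
site = {
    "https://forge.example/repo":
        '<a href="/repo/log">log</a><a href="/repo/blame/a.py">blame</a>',
    "https://forge.example/repo/log":
        '<a href="/repo/commit/1">c1</a><a href="/repo">home</a>',
}
visited = crawl("https://forge.example/repo", site.get)
```

Nothing in the loop knows that the log, blame, and commit pages are all views of the same repository, so every one of them becomes a separate request.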

tigranbs•1mo ago

And obviously, you need things fast, so you parallelize a bunch!

FieryMechanic•1mo ago

I was collecting UK bank account sort code numbers (buying a database at the time cost a huge amount of money). I had spent a bunch of time using asyncio to speed up scraping and wondered why it was going so slow: I had left Fiddler profiling in the background.

conartist6•1mo ago

So the easiest strategy to hamper them if you know you're serving a page to an AI bot is simply to take all the hyperlinks off the page...?

That doesn't even sound all that bad if you happen to catch a human. You could even tell them pretty explicitly with a banner that they were browsing the site in no-links mode for AI bots. Put one link to an FAQ page in the banner since that at least is easily cached
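A rough sketch of what such a "no-links mode" could look like as a response filter, assuming the page is available as an HTML string. `LinkStripper` and `strip_links` are made-up names, and the filter drops comments and doctype declarations for brevity.

```python
from html.parser import HTMLParser


class LinkStripper(HTMLParser):
    """Re-emits a page while dropping <a> tags and keeping their text."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "a":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_links(html):
    """Return `html` with every hyperlink replaced by its plain text."""
    parser = LinkStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = '<p>See <a href="/repo/blame/a.py">blame</a> &amp; <b>log</b></p>'
no_links = strip_links(page)
```

The requested content stays intact; a link-following crawler simply has nowhere to go next.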

FieryMechanic•1mo ago

When I used to build these scrapers for people, I would usually pretend to be a browser. This normally meant changing the UA and making the headers look like a real browser. Obviously this would fail against more advanced bot detection techniques.

Failing that I would use Chrome / Phantom JS or similar to browse the page in a real headless browser.

conartist6•1mo ago

I guess my point is since it's a subtle interference that leaves the explicitly requested code/content fully intact you could just do it as a blanket measure for all non-authenticated users. The real benefit is that you don't need to hide that you're doing it or why...

conartist6•1mo ago

You could add a feature kind of like "unlocked article sharing" where you can generate a token that lives in a cache so that if I'm logged in and I want to send you a link to a public page and I want the links to display for you, then I'd send you a sharing link that included a token good for, say, 50 page views with full hyperlink rendering. After that it just degrades to a page without hyperlinks again and you need someone with an account to generate you a new token (or to make an account yourself).

Surely someone would write a scraper to get around this, but it couldn't be a completely-plain https scraper, which in theory should help a lot.
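To make the sharing-token idea concrete, here is a small sketch of a view-limited token store. It assumes an in-process dict stands in for the cache; the `ShareTokens` class and its method names are invented for illustration.

```python
import secrets


class ShareTokens:
    """Tracks how many full-render page views each sharing token has left."""

    def __init__(self):
        self._remaining = {}

    def issue(self, views=50):
        """Create a token good for `views` page views with links rendered."""
        token = secrets.token_urlsafe(16)
        self._remaining[token] = views
        return token

    def consume(self, token):
        """Spend one view. Returns True if links should be rendered."""
        left = self._remaining.get(token, 0)
        if left <= 0:
            # Unknown or exhausted token: degrade to the no-links page.
            self._remaining.pop(token, None)
            return False
        self._remaining[token] = left - 1
        return True

    def remaining(self, token):
        """Views left on a token, e.g. for a status indicator."""
        return self._remaining.get(token, 0)


tokens = ShareTokens()
shared = tokens.issue(views=2)
results = [tokens.consume(shared) for _ in range(3)]
```

A real deployment would presumably want expiry times and a shared cache, which this sketch leaves out.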

conartist6•1mo ago

I would build a little stoplight status dot into the page header. Red if you're fully untrusted. Yellow if you're semi-trusted by a token, and it shows you the status of the token, e.g. the number of requests remaining on it. Green if you're logged in or on a trusted subnet or something. The status widget would link to all the relevant docs about the trust system. No attempt would be made to hide the workings of the trust system.

ACCount37•1mo ago

Because that kind of optimization takes effort. And a lot of it.

Recognize that a website is a Git repo web interface. Invoke elaborate Git-specific logic. Get the repo link, git clone it, process cloned data, mark for re-indexing, and then keep re-indexing the site itself but only for things that aren't included in the repo itself - like issues and pull request messages.

The scrapers that are designed with effort usually aren't the ones webmasters end up complaining about. The ones that go for quantity over quality are the worst offenders. AI inference-time data intake with no caching whatsoever is the second worst offender.

immibis•1mo ago

Because they don't have any reason to give any shits. 90% of their collected data is probably completely useless, but they don't have any incentive to stop collecting useless data, since their compute and bandwidth is completely free (someone else pays for it).

They don't even use the Wikipedia dumps. They're extremely stupid.

Actually there's not even any evidence they have anything to do with AI. They could be one of the many organisations trying to shut down the free exchange of knowledge, without collecting anything.

dspillett•1mo ago

> I do not understand why the scrapers do not do it in a smarter way

If you mean scrapers in terms of the bots, it is because they are basically scraping web content via HTTP(S) generally, without specific optimisations using other protocols at all. Depending on the use case intended for the model being trained, your content might not matter at all, but it is easier just to collect it and let it be useless than to optimise it away⁰. For models where your code in git repos is going to be significant for the end use, the web scraping generally proves to be sufficient so any push to write specific optimisations for bots for git repos would come from academic interest rather than an actual need.

If you mean scrapers in terms of the people using them, they are largely akin to “script kiddies” just running someone else's scraper to populate their model.

If you mean scrapers in terms of the people writing them, then the fact that just web scraping is sufficient, as mentioned above, is likely the significant factor.

> why the scrapers do not do it in a smarter way

A lot of the behaviours seen are easier to reason about if you stop considering scrapers (the people using scraper bots) to be intelligent, respectful, caring people who might give a damn about the network as a whole, or who might care about doing things optimally. Things make more sense if you consider them to be in the same bucket as spammers, who are out for a quick lazy gain for themselves and don't care, or even have the foresight to realise, how much it might inconvenience¹ anyone else.

----

[0] the fact this load might be inconvenient to you is immaterial to the scraper

[1] The ones that do realise that they might cause an inconvenience usually take the view that it is only a small one, and how can the inconvenience that they alone are imposing really be that significant? They don't take the extra step of considering how many people like them are out there thinking the same. Or they think if other people are doing it, what is the harm in just one more? Or they just take the view “why should I care if getting what I want inconveniences anyone else?”.

captn3m0•1mo ago

I switched to rgit instead of running Gitea.

ArcHound•1mo ago

Seems like you're cooking up a solid bot detection solution. I'd recommend adding JA3/JA4+ into the mix, I had good results against dumb scrapers.

Also, have you considered Captchas for first contact/rate-limit?

If you have smart scrapers, then good luck. I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN. They also have a lot of IPs and overall well-made headless browsers with JS support. Then it's a battle of JS quirks where the official implementation differs from the headless one.
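For readers who haven't met JA3: as I understand the published description, it is the MD5 of five TLS ClientHello fields written as decimal numbers, with GREASE values skipped. The sketch below assumes those fields have already been parsed out of the handshake by something else; the function names are mine, and JA4+ uses a different format that this does not cover.

```python
import hashlib


def is_grease(value):
    """GREASE placeholders look like 0x0a0a, 0x1a1a, ..., 0xfafa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input: fields joined by ',', values joined by '-'."""

    def join(values):
        return "-".join(str(v) for v in values if not is_grease(v))

    return ",".join([
        str(version),
        join(ciphers),
        join(extensions),
        join(curves),
        join(point_formats),
    ])


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string, used as the client fingerprint."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Hypothetical ClientHello values, with one GREASE cipher mixed in.
fingerprint_input = ja3_string(769, [0x0A0A, 47, 53], [0, 10, 11], [23, 24], [0])
fingerprint = ja3_hash(769, [0x0A0A, 47, 53], [0, 10, 11], [23, 24], [0])
```

Dumb scrapers built on the same HTTP library tend to share a fingerprint regardless of the User-Agent they claim, which is what makes this useful as one signal among several.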

mappu•1mo ago

Gitea has a builtin defense against this, `REQUIRE_SIGNIN_VIEW=expensive`, that completely stopped AI traffic issues for me and cut my VPS's bandwidth usage by 95%.

01HNNWZ0MV43FF•1mo ago

Neat https://docs.gitea.com/administration/config-cheat-sheet#ser...

> Enable this to force users to log in to view any page or to use API. It could be set to "expensive" to block anonymous users accessing some pages which consume a lot of resources, for example: block anonymous AI crawlers from accessing repo code pages. The "expensive" mode is experimental and subject to change.

Forgejo doesn't seem to have copied that feature yet

greenavocado•1mo ago

Are you the only user of your web-facing Gitea? If so, put it behind Wireguard VPN, and basically never worry about bandwidth and security again.

fragmede•1mo ago

So much this. Wireguard is so easy to do and no, the whole world doesn't need access to my shit, just me and a couple of close friends.

jauntywundrkind•1mo ago

This is the surest way to make sure you remain the only user of your stuff.

I highly encourage folks to put stuff out there! Put your stuff on the internet! Even if you don't need it, even if you don't think you'll necessarily benefit: leave the door open to possibility!

wiether•1mo ago

I don't understand the purpose of this parameter value?

I have `REQUIRE_SIGNIN_VIEW=true` and I see nothing but my own traffic on Gitea's logs.

Is it because I'm using a subdomain that doesn't imply there's a Gitea instance behind?

mappu•1mo ago

Crawlers will find everything on the internet eventually regardless of subdomain (e.g. from crt.sh logs, or Google finds them from 8.8.8.8 queries).

REQUIRE_SIGNIN_VIEW=true means signin is required for all pages - that's great and definitely stops AI bots. The signin page is very cheap for Gitea to render. However, it is a barrier for the regular human visitors to your site.

'expensive' is a middle-ground that lets normal visitors browse and explore repos, view README, and download release binaries. Signin is only required for "expensive" pageloads, such as viewing file content at specific commits or browsing git history.

wiether•1mo ago

Thanks for the clarification!

From Gitea's doc I was under the impression that it went further than "true", so I didn't understand why it was needed, because "true" was enough for me to not be bothered by bots.

But in your case you want a middle-ground, which is provided by "expensive"!

nextaccountic•1mo ago

oh.. that's why 8.8.8.8 is free

sodimel•1mo ago

I, too, am selfhosting some projects on an old computer. And the fact that you can "hear internet" (with the fans going on) is really cool (unless you're trying to sleep while being scraped).

xyzal•1mo ago

Does anyone have an idea how to generate, say, insecure code, en masse? I think it should be the next frontier. Not feed them random bytestream, but toxic waste.

moooo99•1mo ago

Ironically, probably the fastest way to create insecure code is by asking AI chatbots to code

tpxl•1mo ago

Create a few insecure implementations, parse them into an AST, then turn them back into code (basically compile/decompile) except rename the variables and reorder stuff where you can without affecting the result.
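The parse, rename, and regenerate round trip on its own is only a few lines with Python's standard `ast` module (`ast.unparse` needs Python 3.9+). This sketch handles just the renaming step on a harmless function; `RenameIdentifiers` and `rename_identifiers` are illustrative names, and statement reordering is not attempted.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Renames variables and function arguments according to a mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        self.generic_visit(node)  # also visit any annotation
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node


def rename_identifiers(source, mapping):
    """Parse `source`, rename identifiers, and turn the AST back into code."""
    tree = ast.parse(source)
    tree = RenameIdentifiers(mapping).visit(tree)
    return ast.unparse(tree)


original = (
    "def total(items, rate):\n"
    "    result = sum(items) * rate\n"
    "    return result\n"
)
variant = rename_identifiers(
    original, {"items": "xs", "rate": "k", "result": "acc"}
)
```

The variant is textually different but behaves the same, which is the property the comment above relies on.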

hurturue•1mo ago

in general the consensus on HN is that the web should be free, scraping public content should be allowed, and net neutrality is desired.

do we want to change that? do we want to require scrapers to pay for network usage, like the ISPs were demanding from Netflix? is net neutrality a bad thing after all?

WhyOhWhyQ•1mo ago

If net neutrality is a trojan horse for 'Sam Altman and the Anthropic guy own everything I do' then I voice my support for a different path.

dns_snek•1mo ago

Net neutrality has nothing to do with how content publishers treat visitors, it's about ISPs who try to interfere based on the content of the traffic instead of just providing "dumb pipes" (infrastructure) like they're supposed to.

I can't speak for everyone, but the web should be free and scraping should be allowed insofar that it promotes dissemination of knowledge and data in a sustainable way that benefits our society and generations to come. You're doing the thing where you're trying to pervert the original intent behind those beliefs.

I see this as a clear example of the paradox of tolerance.

pelotron•1mo ago

Just as private businesses are allowed "no shirt, no shoes, no service" policies, my website should be allowed a "no heartbeat, no qualia, no HTTP 200".

komali2•1mo ago

I'm completely happy for everything to be free. Free as in freedom, especially! Agpl3, creative commons, let's do it!

But for some reason corporations don't want that, I guess they want to be allowed to just take from the commons and give nothing in return :/

wrxd•1mo ago

The general consensus here is also that a DDOS attack is bad. I haven't seen objections against respectful scraping. You can say many things about AI scrapers but I wouldn't call them respectful at all.

charcircuit•1mo ago

Yet HN does it when linking to poorly optimized sites. I doubt people running forges would complain about AI scrapers if their sites were optimized for serving the static content that is being requested.

BenjiWiebe•1mo ago

Do people truly dislike an organic DDoS?

So much real human traffic that it brings their site down?

I mean yes it's a problem, but it's a good problem.

voidUpdate•1mo ago

If my website got hugged to death, I would be very happy. If my website got scraped to hell and back by people putting it into the plagiarism machine so that it can regurgitate my content without giving me any attribution, I would be very displeased

microtherion•1mo ago

a) There are too damn many of them.

b) They have a complete lack of respect for robots.txt

I'm starting to think that aggressive scrapers are part of an ongoing business tactic against the decentralized web. Gmail makes self hosted mail servers jump through arduous and poorly documented hoops, and now self hosted services are being DDOSed by hordes of scrapers…

johneth•1mo ago

I think, for many, the web should be free for humans.

When scraping was mainly used to build things like search indexes which are ultimately mutually beneficial to both the website owner and the search engine, and the scrapers were not abusive, nobody really had a problem.

But for generative AI training and access, with scrapers that DDoS everything in sight, and which ultimately cause visits to the websites to fall significantly and merely return a mangled copy of its content back to the user, scraping is a bad thing. It also doesn't help that the generative AI companies haven't paid most people for their training data.

evgpbfhnr•1mo ago

I had the same problem on our home server.. I just stopped the git forge due to lack of time.

For what it's worth, most requests kept coming in for ~4 days after -everything- returned plain 404 errors. millions. And there's still some now weeks later...

FabCH•1mo ago

If you don't need global access, I have found that Geoblocking is the best first step. Especially if you are in a small country with a small footprint and you can get away with blocking the rest of the world. But even if you live in the US, excluding Russia, India, Iran and a few others will cut your traffic by a double-digit percentage.

In the article, quite a few listed sources of traffic would simply be completely unable to access the server if the author could get away with a geoblock.
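As a sketch of the allowlist variant, the check itself is simple once you have a list of networks for the countries you want to let in. The CIDR ranges below are documentation placeholders, not real country allocations; in practice they would come from a GeoIP database, and `is_allowed` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges (RFC 5737 / RFC 3849 documentation blocks).
# A real list would be generated from a GeoIP database.
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_allowed(remote_addr):
    """Return True if the client address falls in an allowed network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # malformed address: treat as blocked
    return any(addr in net for net in ALLOWED_NETWORKS)


checks = {
    "192.0.2.17": is_allowed("192.0.2.17"),
    "203.0.113.5": is_allowed("203.0.113.5"),
    "2001:db8::1": is_allowed("2001:db8::1"),
}
```

In most setups this would live in the firewall or reverse proxy, not in application code; the logic is the same either way.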

komali2•1mo ago

Reminds me of when 4chan banned Russia entirely to stop DDOSes. I can't find it but there was a funny post from Hiro saying something like "couldn't figure out how to stop the ddos. Banned Russia. Ddos ended. So Russia is banned. /Shrug"

ralferoo•1mo ago

Similarly, for my e-mail server, I manually add spammers into my exim local_sender_blacklist a single domain at a time. About a month into doing this, I just gave up and added *@*.ru and that instantly cut out around 80% of the spam e-mail.

It's funny observing their tactics though. On the whole, spammers have moved from bare domain to various prefixes like @outreach.domain, @msg.domain, @chat.domain, @mail.domain, @contact.domain and most recently @email.domain.

It's also interesting watching the common parts before the @. Most recently I've seen a lot of marketing@, before that chat@ and about a month after I blocked that chat1@. I mostly block *@domain though, so I'm less aware of these trends.

ThatPlayer•1mo ago

We've had a similar discussion at my work. E-commerce that only ships to North America. So blocking anyone outside of that is an option.

Or I might try and put up Anubis only for them.

FabCH•1mo ago

Be slightly careful with commerce websites, because GeoIP databases are not perfect in my experience.

I got accidentally locked out from my server when I connected over Starlink that IP-maps to the US even though I was physically in Greece.

As practical advice, I would use a blocklist for commerce websites, and an allowlist for infra/personal.

ThatPlayer•1mo ago

That's a good point! I'll probably start with a blocklist.

dotancohen•1mo ago

There is a small OTC medical device that is about $60 in the US, quadruple the price in my country. I tried to order one to be sent to a US family member's house, who was coming the following month to visit. However I could not order because I was not in the US.

In the end I found another online store, paid $74, and got the device. So the better store lost the sale due to blocking non-US orders.

I don't know how much of a corner case this is.

lsaferite•1mo ago

Just keep in mind, that could block legit users who are outside the country. One case being someone traveling and wanting to buy something to deliver home. Another case being a non-resident wanting to buy something to send to family in the service zone.

I'm not saying don't block, just saying be aware of the unintended blocks and weigh them.

fragmede•1mo ago

Also consider tourists outside of their home country. If, e.g., I'm in Indonesia when Black Friday hits and I'm trying to buy things back home and the site is blocked; shit. I mean, personally I can just use my house as a VPN exit node thanks to Tailscale, but most people aren't technical enough to do that.

DANmode•1mo ago

Great comment - thank you.

krupan•1mo ago

This makes me a little sad. There's an ideal built into the Internet, that it has no borders, that individuals around the world can connect directly. Blocking an entire geographic region because of a few bad actors kills that. I see why it's done, but it's unfortunate

FabCH•1mo ago

I know what you mean.

But the numbers don't lie. In my case, I locked down to a fairly small group of European countries and the server went down from about 1500 bot scans per day down to 0.

The tradeoff is just too big to ignore.

BobaFloutist•1mo ago

It's not because of a few bad actors, it's because of a hostile or incompetent government.

Every country has (at the very least) a few bad actors, it's a small handful of countries that actively protect their bad actors from any sort of accountability or identification.

victorbjorklund•1mo ago

To be fair most of my bad traffic is from the US.

BobaFloutist•1mo ago

I mean if that's the case, the conversation obviously changes.

halJordan•1mo ago

You can't make the argument that it's a small group of bad actors. It's quite a massive group of unrelentingly malicious actors

01HNNWZ0MV43FF•1mo ago

Massive in terms of money and power, small in terms of souls

tkfoss•1mo ago

I read it as small compared to total population affected by the block

anon7000•1mo ago

But that’s not the case either. A large attack or scrape generates far more traffic than legitimate users.

dspillett•1mo ago

> VNPT and Bunny Communications are home/mobile ISPs. i cannot ascertain for sure that their IPs are from domestic users, but it seems worrisome that these are among the top scraping sources once you remove the most obviously malicious actors.

This will be in part people on home connections tinkering with LLMs at home, blindly running some scraper instead of (or as well as) using the common pre-scraped data-sets and their own data. A chunk of it will be from people who have been compromised (perhaps by installing/updating a browser add-in or “free” VPN client that has become (or always was) nefarious) and their home connection is being farmed out by VPN providers selling “domestic IP” services that people running scrapers are buying.

ArcHound•1mo ago

Disagree on the method:

I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN.

No client compromise required, it's a networking abuse that gives you a good reputation if you use mobile data.

But yes, selling botnets made of compromised devices is also a thing.

Nextgrid•1mo ago

SIM cards are one of the ways the big boys do it. It gives you a nice CGNAT to hide behind and essentially can’t be blocked without blocking a nontrivial chunk of the country. Although more and more fixed-line ISPs are moving to CGNAT too so you can get that advantage there as well.

simonw•1mo ago

I have trouble imagining any home LLM tinkerer who tries to run a naive scraper against the rest of the internet as part of their experiments.

Much more likely are those companies that pay people (or trick people) into running proxies on their home networks to help with giant scraping projects that want to rotate through thousands of "real" IPs.

st3fan•1mo ago

Correct. These are called "residential proxies".

wrxd•1mo ago

I wonder if this is going to push more and more services to be hidden from the public internet.

My personal services are only accessible from my own LAN or via a VPN. If I wanted to share it with a few friends I would use something like Tailscale and invite them to my tailnet. If the number of people grows I would put everything behind a login-wall.

This of course doesn't cover services I genuinely might want to be exposed to the public. In that case the fight with the bots is on, assuming I decide I want to bother at all

klaussilveira•1mo ago

I wish there was a public database of corporate ASNs and IPs, so we wouldn't have to rely on Cloudflare or any third-party service to detect that an IP is not from a household.

wrxd•1mo ago

Scrapers use residential VPNs so such a database would help only up to a certain point

eddyg•1mo ago

Just search for "residential proxies" and you'll see why this wouldn't help.

ronsor•1mo ago

There is... It's literally available in every RIR database through WHOIS.

immibis•1mo ago

There is one. It's called the RIRs.

qudat•1mo ago

This is a great reason why letting websites have direct access to git is not a great idea. I started creating static versions of my projects with great success: https://git.erock.io

drzaiusx11•1mo ago

Do solutions like gitea not have prebuilt indexes of the git file contents? I know GitHub does this to some extent, especially for main repo pages. Seems wild that the default of a web forge would be to hit the actual git server on every http GET request.

danudey•1mo ago

The author discusses his efforts in trying caching; in most use cases, it makes no sense to pre-cache every possible piece of content (because real users don't need to load that much of the repository that fast), and in the case of bot scrapers it doesn't help to cache because they're only fetching each file once.

drzaiusx11•1mo ago

I'd argue every git-backed loadable page in a web forge should be "that fast", at least in this particular use-case.

Hitting the backing git implementation directly within the request/response loop seems like a good way to burn CPU cycles and create unnecessary disk reads from .git folders, possibly killing your drives prematurely. Just stick a memcache in front and call it a day, no?

In the age of cheap and reliable SSDs (approaching memory read speeds), you should just be batch rendering file pages from git commit hooks. Leverage external workers for rendering the largely static content. Web hosted git code is more often read than written in these scenarios, so why hit the underlying git implementation or DB directly at all? Do that for POSTs, sure but that's not what we're talking about (I think?)
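A minimal sketch of the "cache in front of git" idea, assuming rendered pages are keyed by (repository, commit hash, path). A commit hash names immutable content, so such entries never need invalidating; only the branch-to-commit lookup changes on push. `RenderCache` and `render_file` are hypothetical names for illustration, not Gitea's or Forgejo's API, and `render_file` stands in for whatever actually reads the blob and produces HTML.

```python
import html
from collections import OrderedDict


class RenderCache:
    """Small LRU cache for rendered pages, keyed by (repo, commit, path)."""

    def __init__(self, render, max_entries=1024):
        self._render = render          # callable(repo, commit, path) -> str
        self._max = max_entries
        self._pages = OrderedDict()    # key -> rendered HTML, oldest first
        self.misses = 0                # how often we had to hit the backend

    def get(self, repo, commit, path):
        key = (repo, commit, path)
        if key in self._pages:
            # Cache hit: mark as most recently used and return.
            self._pages.move_to_end(key)
            return self._pages[key]
        # Cache miss: render once, store, and evict the oldest entry if full.
        self.misses += 1
        page = self._render(repo, commit, path)
        self._pages[key] = page
        if len(self._pages) > self._max:
            self._pages.popitem(last=False)
        return page


def render_file(repo, commit, path):
    # Hypothetical renderer: a real forge would read the blob from git here.
    body = f"{repo}@{commit}:{path}"
    return f"<pre>{html.escape(body)}</pre>"


cache = RenderCache(render_file, max_entries=2)
cache.get("myrepo", "abc123", "README.md")   # rendered by the backend
cache.get("myrepo", "abc123", "README.md")   # served from the cache
```

As noted upthread, this only pays off for pages requested more than once; a scraper that walks every commit and fetches each page a single time will miss the cache on nearly every request.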

lsaferite•1mo ago

Why not render the markdown as HTML in this scenario?

qudat•1mo ago

Markdown is readable as-is; I didn’t see the need to add more complexity here.

zoobab•1mo ago

Use stagit, static pages served with a simple nginx is blazing fast and should resist any scrapers.

toastal•1mo ago

Darcs by its nature can just be hosted by an HTTP server too, without needing a special tool. I use H2O with a small mruby script to throttle IPs.
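For readers unfamiliar with per-IP throttling, here is a minimal token-bucket sketch in Python. It is not the mruby script mentioned above and does not use any H2O API; the `Throttle` class, its rate and burst values, and the injectable `clock` are all assumptions made for illustration.

```python
import time


class Throttle:
    """Per-IP token bucket: `rate` tokens refill per second, up to `burst`."""

    def __init__(self, rate=1.0, burst=5, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock      # injectable so the logic can be tested
        self.buckets = {}       # ip -> (tokens left, time of last update)

    def allow(self, ip):
        now = self.clock()
        # Unknown IPs start with a full bucket.
        tokens, last = self.buckets.get(ip, (float(self.burst), now))
        # Refill for the time elapsed since the last request, capped at burst.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True         # serve the request
        self.buckets[ip] = (tokens, now)
        return False            # caller would answer with e.g. HTTP 429


throttle = Throttle(rate=1.0, burst=5)
if not throttle.allow("192.0.2.10"):
    print("too many requests")
```

The `buckets` dictionary grows without bound in this sketch; a real deployment would expire idle entries. Per-IP limits also do little against scrapers that rotate through many residential addresses, as other comments in this thread point out.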

frogperson•1mo ago

Could this be solved with an EULA and some language that non-human readers will be billed at $1 per page? Make all users agree to it. They either pay up or they are breaching contract.

Is this viable?

grayhatter•1mo ago

> Is this viable?

for many reasons

kstrauser•1mo ago

Most of my scraper traffic came from China and Brazil. How am I going to enforce that?

hamdingers•1mo ago

Say you have identified a non-human reader, you have a (probably fake) user agent and an IP address. How do you imagine you'll extract a dollar from that?

krupan•1mo ago

In case you didn't read to the end:

"This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see."

stevetron•1mo ago

I was setting up a small system to do web site serving. Mostly just experimental to try out some code. Like learning how to use nginx as a reverse proxy. And learning how to use dynamic DNS services, since I am on dynamic DNS at home. Early on, I discovered lots of traffic, and lots of hard drive activity. The HD activity was from logging. It seemed I was under incessant polling from China. Strange: it's a new dynamic URL. I eventually got this down to almost nothing by setting up the firewall to reject traffic from China. That was, of course, before AI scrapers. I don't know what it would do now.

kstrauser•1mo ago

Anubis cut the accesses on my little personal Forgejo instance with nothing particularly interesting on it from about 600K hits per day to about 1000.

That’s the kind of result that ensures we’ll be seeing anime girls all over the web in the near future.

PeterStuer•1mo ago

"Self-hosting anything that is deemed "content" openly on the web in 2025 is a battle of attrition between you and forces who are able to buy tens of thousands of proxies to ruin your service for data they can resell."

I do wonder though. Content scrapers that truly value data would stand to benefit from deploying heuristics that value being as efficient as possible in the info per query space. Wastefulness of the described type loads not just your servers but also their own processing pipeline.

But there is a different class of player that gains more from nuisance maximization: dominant anti-bot/DDoS service providers, especially those with ambitions of becoming the ultimate internet middleman. Their cost for creating this nuisance is near 0, as they have 0 interest in doing anything with the responses. They just want to annoy you until you cave and install their "free" service; then they can turn around and ask interested parties to pay for access to your data.

Bender•1mo ago

Do git clients support HTTP/2.0 yet? Or could they use SSH? I ask because I block most of the bots by requiring HTTP/2.0 even on my silliest of throw-away sites. I agree their caching method is good and should be done when much of the content is cachable. Blocking specific IP's is a never-ending game of whack-a-mole. I do block some data-centers ASN's as I do not expect real people to come from them even though they could. It's an acceptable trade-off for my junk. There are many things people can learn from capturing TCP SYN packets for a day and comparing to access logs sorting out bots vs legit people. There are quite a few headers that a browser will send that most bots do not. Many bots also lack sending a valid TCP MSS and TCP WINDOW.

Anyway, test some scrapers and bots here [1] and let me know if they get through. A successful response will show "Can your bot see this? If so you win 10 bot points." and a figlet banner. Read-only SFTP login is "mirror" and no pw.

[Edit] - I should add that I require bots to tell me they speak English optionally in addition to other languages but not a couple that are blocked, e.g. en,de-DE,de good, de-DE,de will fail, because. Not suggesting anyone do this.

[1] - https://mirror.newsdump.org/bot_test.txt
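The Accept-Language rule described in the edit above could be sketched roughly as follows. This is a guess at the logic from the two examples given (`en,de-DE,de` passes, `de-DE,de` fails), not the poster's actual configuration. The function names are hypothetical, and the blocked languages are left as a parameter because the comment does not name them.

```python
def parse_accept_language(header):
    """Return lowercase language tags from an Accept-Language header.

    Quality values (";q=0.8") are ignored in this sketch.
    """
    tags = []
    for part in header.split(","):
        tag = part.split(";", 1)[0].strip().lower()
        if tag:
            tags.append(tag)
    return tags


def passes_language_check(header, blocked=frozenset()):
    """True if the client lists English and none of the blocked languages.

    `blocked` holds primary language subtags (assumption: the real list is
    not given in the thread, so it defaults to empty).
    """
    tags = parse_accept_language(header or "")
    # Compare on the primary subtag, so "en-US" counts as "en".
    primaries = {tag.split("-", 1)[0] for tag in tags}
    if primaries & set(blocked):
        return False
    return "en" in primaries


print(passes_language_check("en,de-DE,de"))   # True
print(passes_language_check("de-DE,de"))      # False
```

A missing header fails the check, which matches the observation that many bots omit headers a browser would send. It also locks out real users whose browsers do not list English, which is presumably why the poster does not recommend it.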

cortesoft•1mo ago

> I do block some data-centers ASN's as I do not expect real people to come from them even though they could.

My company runs our VPN from our datacenter (although we have our own IP block, which hopefully doesn’t get blocked)

Bender•1mo ago

It's of course optional to block whatever one finds appropriate for their use case. My hobby stuff is not revenue generating so I have more options at my disposal.

Those with revenue generating systems should capture TCP SYN traffic for a while, monitor access logs and give it that college try to correlate bots vs legit users with traffic characteristics. Sometimes generalizations can be derived from the correlation, and some of those generalizations can be permitted or denied. There really isn't a one size fits all solution, but hopefully my example can give ideas for additional directions to go. Git repos are probably the hardest to protect since I presume many of the git libraries and tools are using older protocols and may look a lot like bots. If one could get people to clone/commit with SSH there are additional protections that can be utilized at that layer.

[Edit] Other options lie outside of one's network, such as either sending pull requests or making feature requests to the maintainers of the git libraries so that their HTTP requests look a lot more like a real browser's and stand out from 99% of the bots. The vast majority of bots use really old libraries.

GoblinSlayer•1mo ago

>Iocaine has served 38.16GB of garbage

And what is the effect?

I opened https://iocaine.madhouse-project.org/ and it gave me the generated maze, thinking I'm an AI :)

>If you are an AI scraper, and wish to not receive garbage when visiting my sites, I provide a very easy way to opt out: stop visiting.

nitwit005•1mo ago

I got the 418 I'm a teapot response.

oconnore•1mo ago

The only disappointing aspect of the Iocaine maze is that it is not a literal maze. There should be a narrow, treacherous path through the interconnected web of content that lets you finally escape after many false starts.

jepj57•1mo ago

What about a copyright on websites stating anyone using your site for training would be giving the owner of the site an eternal non-revocable license to the model, and must provide a copy of the model upon request? At least then there would be SOME benefit.

adastra22•1mo ago

Contract law doesn’t work that way.

craftkiller•1mo ago

On my forge, I mirror some large repos that I use for CI jobs so I'm not putting unfair load on the upstream project's repos. Those are the only repos large enough to cause problems with the asshole AI scrapers. My solution was to put the web interface for those repos behind oauth2-proxy (while leaving the direct git access open to not impact my CI jobs). It made my CPU usage drop 80% instantly, while still leaving my (significantly smaller) personal projects fully open for anyone to browse unimpeded.

yunnpp•1mo ago

Thanks for putting that together. Not my daily cup but it seems like a good reference for server setup.

reactordev•1mo ago

I host all my stuff behind a vpn. No one but authorized users can get access.

benlivengood•1mo ago

It would be nice if there was a common crawler offering deltas on top of base checkpoints of the entire crawl; I am guessing most AI companies would prefer not having to mess with their own scrapers. Google could probably make a mint selling access.

ccgreg•1mo ago

commoncrawl.org

Our public web dataset goes back to 2008, and is widely used by academia and startups.

pdimitar•1mo ago

I always wanted to ask:

- How often is that updated?

- How current is it at any point in time?

- Does it have historical / temporal access i.e. be able to check the history of a page a la The Internet Archive?

ccgreg•1mo ago

- monthly

- it's a historical archive, the concept of "current" is hard to turn into a metric

- not only is our archive historical, it is included in the Internet Archive's wayback machine.

overfeed•1mo ago

For private instances, you can get down to 0 scrapers by firewalling http/s ports from the Internet and using Wireguard. I knew it was time to batten down the hatches when fail2ban became the top process by bytes written in iotop (between ssh log in attempts and nginx logs).

The cost of the open, artisanal web has shot up due to greed and incompetence; the crawlers are poorly written.

frozenseven•1mo ago

[flagged]

bavent•1mo ago

So? Does that somehow invalidate this article?

frozenseven•1mo ago

[flagged]

Artoooooor•1mo ago

Why the hell don't these bots just do a git clone and analyse the source code locally? Much less impact on the server, and they would be able to perform the same analysis on all repositories, regardless of what a particular git forge offers.

grayhatter•1mo ago

what makes you think the webscrapers care what pages they request?

lgeek•1mo ago

> Worryingly, VNPT and Bunny Communications are home/mobile ISPs

VNPT is a residential / mobile ISP, but they also run datacentres (e.g. [1]) and offer VPS, dedicated server rentals, etc. Most companies would use separate ASes for residential vs hosting use, but I guess they don't, which would make them very attractive to someone deploying crawlers.

And Bunny Communications (AS5065) is a pretty obvious 'residential' VPN / proxy provider trying to trick IP geolocation / reputation providers. Just look at the website [2], it's very low effort. They have a page literally called 'Sample page' up and the 'Blog' is all placeholder text, e.g. 'The Art of Drawing Readers In: Your attractive post title goes here'.

Another hint is that some of their upstreams are server-hosting companies rather than transit providers that a consumer ISP would use [3].

[1] https://vnpt.vn/doanh-nghiep/tu-van/vnpt-idc-data-center-gia...

[2] https://bunnycommunications.com/

[3] https://bgp.tools/as/5065#upstreams

cookiengineer•1mo ago

I don't think that the author's proposed cat and mouse game is giving you any chance, because it requires a lot of maintenance and architectural changes. And the proposed changes and tools all run in userspace, so there's still the DDoS problem.

I have the same problem, but I decided to maintain ASN lists of known spammers [1] and combine that with my eBPF based firewall that just drops their connections before they reach the kernel [2].

So my websites, wikis and other things are protected by the same firewall architecture, for which I can deploy a unified "blockmap" so to speak. Probably gonna open source the dashboard for maintaining that over the holidays, too, as I'm trying to make everything combinable in the plug and play for Go backends sense similar to my markdown editor UI [3].

I also open sourced my LPM hashset map library, which allows processing large quantities of prefixes, because it's way faster than LPM tries (read as: takes less than 100ms to process all RIR and WHOIS data compared to around an hour with LPM tries) [4].

[1] https://github.com/cookiengineer/antispam

[2] https://github.com/tholian-network/firewall

[3] https://github.com/cookiengineer/golocron

[4] https://github.com/cookiengineer/golpm
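One common way to do longest-prefix matching with hash sets rather than a trie is to keep one set of network addresses per prefix length and probe from the longest length down. The Python sketch below illustrates that general technique for IPv4 only; it may differ from what golpm actually does, and the `PrefixSet` class and the example prefixes are hypothetical.

```python
import ipaddress


class PrefixSet:
    """IPv4 blocklist: one hash set of network integers per prefix length."""

    def __init__(self):
        self._by_length = {}   # prefix length -> set of network addresses (int)

    def add(self, cidr):
        net = ipaddress.ip_network(cidr, strict=False)
        if net.version != 4:
            raise ValueError("this sketch handles IPv4 only")
        self._by_length.setdefault(net.prefixlen, set()).add(
            int(net.network_address)
        )

    def lookup(self, address):
        """Return the longest matching prefix as a string, or None."""
        addr = int(ipaddress.IPv4Address(address))
        # Probe longest prefixes first so the most specific match wins.
        for length in sorted(self._by_length, reverse=True):
            mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
            network = addr & mask
            if network in self._by_length[length]:
                return f"{ipaddress.IPv4Address(network)}/{length}"
        return None


blocklist = PrefixSet()
blocklist.add("198.51.100.0/24")   # documentation range, used as a stand-in
blocklist.add("198.51.0.0/16")
print(blocklist.lookup("198.51.100.7"))   # 198.51.100.0/24
print(blocklist.lookup("198.51.7.1"))     # 198.51.0.0/16
print(blocklist.lookup("192.0.2.1"))      # None
```

Each lookup costs at most 33 set probes for IPv4, independent of how many prefixes are loaded, and inserting a prefix is a single set insertion.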
