LWN is currently under the heaviest scraper attack seen yet

https://social.kernel.org/notice/B2JlhcxNTfI8oDVoyO

200•luu•3w ago

Comments

zahlman•3w ago

Is it still ongoing? The thread appears to be over 24 hours old and as a quick test I had no issue loading the main page (which is as snappy and responsive as expected from a low-bandwidth site like LWN).

jzb•3w ago

Not at the moment. It’s subsided for now.

blibble•3w ago

the perverse incentive is if you ddos the website such that it shuts down, no other "AI" parasites can get the valuable data

big tech incentivised to ddos... what a world they've built

ronsor•3w ago

This sounds like a conspiracy theory.

MBCook•3w ago

I don’t think they’re saying that’s actually happening here, just that it could happen and is accidentally incentivized.

pwdisswordfishy•3w ago

If it's a conspiracy, it would be one where the Minimum Viable Conspirator Count is 1 (inclusive of one's own self).

In that case, by that rubric literally anything that you conspire with yourself to accomplish (buying next week's groceries, making a turkey sandwich...) would also be a conspiracy.

amlib•3w ago

The dead internet theory also sounded unhinged and conspiracy theory-ish a decade or so ago... yet here we are.

phkahler•3w ago

Its called pulling up the ladder behind you, or building a moat!

NitpickLawyer•3w ago

Umm... what data? That's a very old newsletter-like site. Everything that's public on it has been long scraped and parsed by whoever needed it. There's 0 valuable data there for "parasites" to parasite off of.

I also don't get the comments on the linked social site. IIUC the users posting there are somehow involved with kernel work, right? So they should know a thing or two about technical stuff? How / why are they so convinced that the big bad AI baddies are scraping them, and not some miss-configured thing that someone or another built? Is this their first time? Again, there's nothing there that hasn't been indexed dozens of times already. And... sorry to say it, but neither newsletters nor the 1-3 comments on each article are exactly "prime data" for any kind of training.

These people have gone full tinfoil hat and spewing hate isn't doing them any favours.

MBCook•3w ago

I don’t think they were talking about LWN specifically but just in general.

homebrewer•3w ago

Because it started in 2022 and hasn't subsided since? This is just the latest iteration of "AI" scrapers destroying the site, and the worst one yet.

https://lwn.net/Articles/1008897

Your nonsense about LWN being a "newsletter" and having "zero valuable data" isn't doing you any favors. It is the prime source of information about Linux kernel development, and Linux development in general.

"AI" cancer scraping the same thing over and over and over again is not news for anybody even with a cursory interest in this subject. They've been doing it for years.

NitpickLawyer•3w ago

> LWN.net is a reader-supported news site

I mean...

Again, the site is so old that anything worth while is already in cc or any number of crawls. I am not saying they weren't scraped. I'm saying they likely weren't scraped by the bad AI people. And certainly not by AI companies trying to limit others from accessing that data (as the person who I replied to stated).

MBCook•3w ago

Why is it each of your comments seems to include a dig attacking LWN?

spinningslate•3w ago

I’m going to presume good faith rather than trolling. Some questions for you:

1. Coding assistants have emerged as as one of the primary commercial opportunities for AI models. As GP pointed out, LWN is the primary discussion for kernel development. If you were gathering training data for a model, and coding assistance is one of your goals, and you know of a primary sources of open source development expertise, would you:

  (a) ignore it because it’s in a quaint old format, or

  (b) slurp up as much as you can?

2. If you’d previously slurped it up, and are now collating data for a new training run, and you know it’s an active mailing list that will have new content since you last crawled it, would you:

  (a) carefully and respectfully leave it be, because you still get benefit from the previous content even though there’s now more and it’s up to date, or

  (b) hoover up every last drop because anything you can do to get an edge over your competitors means you get your brief moment of glory in the benchmarks when you release?

NitpickLawyer•3w ago

I train coding models with RLVR because that's what works. There's ~0.000x good signal in mailing lists that isn't in old mailing lists. (and, since I can't reply to the other person, I mean old as in established, it is in no way a dig to lwn).

You seem to be missing my point. There is 0 incentives for AI training companies to behave like this. All that data is already in the common crawls that every lab uses. This is likely from other sources. Yet they always blame big bad AI...

dzaima•3w ago

Old scrapes can't have data about new things though; have to continously re-scan to not be stuck with ancient info.

some scrapers might skip out on already-scraped sources, but easy to imagine that some/many just would not bother (you don't know if it's updated until you've checked, after all). And to some extend you do have to re-scrape, if just to find links to the new stuff.

gulugawa•3w ago

I've had luck blocking scrapers by overwriting JavaScript methods

" a.getElementsByTagName = function (...args) {//Clear page content}"

One can also hide components inside Shadow DOM to make it harder to scrape.

However, these methods will interfere with automated testing tools such as Playwright and Selenium. Also, search engine indexing is likely to be affected.

bogwog•3w ago

This is a fun idea, especially if you make those functions procedurally generate garbage to get them stuck

TurdF3rguson•3w ago

You think you've had luck. The truth is you have no idea of knowing if this ever had any effect at all.

chrisjj•3w ago

So which is it? DDOS attack or "AI" scrapers?

fabian2k•3w ago

Sufficiently aggressive and inconsiderate scraping is indistinguishable from a DDOS attack.

chrisjj•3w ago

No scraper seeks to deny the service it needs.

And no responsible site operator unable to distinguish should claim DDOS.

antod•3w ago

No well behaved scraper at least.

chrisjj•3w ago

No any scraper seeks.

Y-bar•3w ago

A sufficiently stupid and egregious AI scraper is indistinguishable from a DDOS attack.

Edit: Fabian2k was ten seconds ahead. Damn!

TurdF3rguson•3w ago

Scrapers because DDOS implies that it's malicious rather than accidental and there's no reason to think that.

chrisjj•3w ago

Right, so probably the site should not be claiming "It is a DDOS attack".

jacquesm•3w ago

AI allows companies to resell open source code as if they wrote it themselves doing an end run around all license terms. This is a major problem.

Of course they're not going to stop at just code. They need all the rest of it as well.

zipy124•3w ago

From the creators of easy money laundering (crypto bros), we now bring you easy money laundering 2: intellectual property laundering, coming to a theatre near you soon!

gruez•3w ago

>From the creators of easy money laundering (crypto bros),

Is there even any evidence that "crypto bros" and "AI bros" are even the same set of people other than being vaguely "tech" and hated by HN? At best you have someone like Altman who founded openai and had a crypto project (worldcoin), but the latter was approximately used by nobody. What about everyone else? Did Ilya Sutskever have a shitcoin a few years ago? Maybe Changpeng Zhao has an AI lab?

themafia•3w ago

> and had a crypto project (worldcoin)

That was a biometric surveillance project disguised as a crypto project.

> Is there even any evidence that "crypto bros" and "AI bros" are even the same set of people

No, the "AI" people are far worse. I always had a choice to /not/ use crypto. The "AI" people want to hamfistedly shove their flawed investment into every product under the sun.

prussia•3w ago

I think the worst of the crypto people are probably the worst of the AI people too. Power/money-hungry grifters naturally move on to the most profitable grift when the old one peters out.

pkaeding•3w ago

Okay, but "fad-use-of-GPU bros" doesn't roll off the tongue as well.

redwall_hp•3w ago

Elmo Musk was pumping and dumping dogecoin for years before pivoting to starting an LLM Ponzi.

tjons•3w ago

It's never/rarely the leaders. It's always the followers. This site has a high bar for engagement given the depth of subject matter, but if you check another social medium I promise you'll see this pattern. All over.

palmotea•3w ago

> AI allows companies to resell open source code as if they wrote it themselves doing an end run around all license terms. This is a major problem.

Has it been adjudicated that AI use actually allows that? That's definitely what the AI bros want (and will loudly assert), but that doesn't mean it's true.

Sharlin•3w ago

I don't think so. Because LLMs aren't legal persons (yet?!), they can neither have copyright to anything nor violate someone else's copyright. IANAL but the most reasonable legal interpretation is likely that any IP violations are actually committed by whoever it was who asked an LLM to "rewrite" something in a way that obviously counts as a derived work rather than a cleanroom implementation.

jacquesm•3w ago

Correct.

jacquesm•3w ago

You are misinterpreting my use of the word 'allow'. Think of it as 'enables' or 'makes it possible'. It is fairly obvious that AI does not grant permission as well as that there was no reference to any legal proceedings.

redwall_hp•3w ago

We're just seeing that the copyright emperor has no clothes: companies want to infringe upon others' rights through a copyright laundering machine, but of course will insist that the laundered code they use is their property, protected by the holy copyright cudgel.

kimixa•3w ago

I worked on an extremely niche project revolving around an old DOS game. Code I worked on is often pretty much the only reference for some things.

It's trivially easy to get claude to scrape that and regurgitate it under any requested licence (some variable names changes, but exactly the same structure - though it got one of the lookup tables wrong, which is one of the few things you could argue aren't copyrighted there).

It'll even cheerfully tell you it's fetching the repository while "thinking". And it's clearly already in the training data - you can get it to detail specifics even disallowing that.

If I referenced copywritten code we didn't have the license for (as is the case for copyleft licenses if you don't follow the restrictions) while employed as a software engineer I'd be fired pretty quick from any corporation. And rightfully so.

People seem to have a strange idea with AI that "copyleft" code is free game to unilaterally re-license. Try doing that with leaked Microsoft code - you're breaking copyright just as much there, but a lot of people seem to perceive it very differently - and not just because of risk of enforcement but in moralizing about it too.

sigseg1v•3w ago

The overwhelming majority of devs do not concern themselves with nor are even familiar with the concept of software licenses, let alone how to abide by them. I argue that it's not that they think it's "free game to [...] re-license" so much so as they think it's just code and they can use it without the idea of a licence ever even crossing their mind.

Source: find literally anything on GitHub using dependencies that are MIT licensed and being distributed without following the terms that state you must also redistribute the licence for each

jacquesm•3w ago

It goes much further than that. They use LLMs to create code for them and then they claim ownership to the code.

I think that is one of the main reasons there is so much pushback against this, a lot of people are now addicted to their stream of washed code and want to claim ownership over what is essentially a derived work. The key then becomes 'if a work could not have been written by the author that claims it does that claim survive'. I think it should not but there is plenty of disagreement on this.

sanbor•3w ago

Isn’t similar to looking up/copying code from stack overflow, Google or books? Use it as reference to write the code and claim ownership. My little understanding is that the whole copyright free ride for LLMS is because it is similar to the process of humans using content under copyright as reference to create something new and claim ownership.

jacquesm•2w ago

No, it is not similar at all. There is the pesky little thing called 'copyright' which allows you to set terms under which you license your code. Stackoverflow, google and books all come with strings attached.

Your idea of how humans use content under copyright is mistaken.

sanbor•2w ago

Why it’s not similar at all?

I want to make an ajax request using jQuery. I look up an example in StackOverflow. I use a very similar code to the example given in the post and by not giving any attribution I just claim ownership.

Same with Spring in action books or looking up Java class references. Many times I look something up and use it as reference just tweaking the examples given.

Millions of programmers have done this.

LLMS in principle use the training data to generate an answer to the prompt, similar to the process I described.

blakesterz•3w ago

  "It is a DDOS attack involving tens of thousands of addresses"

It is amazing just how distributed some of these things are. Even on the small sites that I help host we see these types of attacks from very large numbers of diverse IPs. I'd love to know how these are being run.

smitty1e•3w ago

Call it a "Distributed Intelligence Logic Denial Of Service" (DILDOS) attack both to name it distinctly and characterize the source.

random1234user•3w ago

Might as well call it "Artificial Intelligence Distributed Intelligence Logic Denial Of Service" (AIDILDOS) sounds about right.

PaulDavisThe1st•3w ago

another reference point: we've had well over 1M unique IP addresses hit git.ardour.org as part of stupid as hell git scraping effort. 1M !!!

wongarsu•3w ago

There are plenty of providers selling "residential proxies", distributing your crawler traffic through thousands of residential IPs. BrightData is probably the biggest, but its a big and growing market.

And if you don't care about the "residential" part you can get proxies with data center IPs for much cheaper from the same providers. But those are easily blocked

quectophoton•3w ago

And how do you get those residential IP addresses?

Well, you just need people to install your browser extension. Or your proprietary web browser. Or your mobile app. Or your nice MCP. Maybe get them to add your PPA repository so they automatically install your sneakily-overriden package the next time they upgrade their system.

Anything goes as long as your software has access to outgoing TCP port 443, which almost nobody blocks, so even if it's being run from within a Docker container or a VM it probably doesn't affect you.

TurdF3rguson•3w ago

Bright Data specifically offers a sdk that app developers can use monetize free games. A lot of free games and VPN apps are using it. Check out how they market it, it's wild... - https://bright-sdk.com/

giantrobot•3w ago

In the most charitable case it's some "AI" companies with an X/Y problem. They want training data so they vibe code some naive scraper (requests is all you need!) and don't ever think to ask if maybe there's some sort of common repository of web crawls, a CommonCrawl if you will.

They don't really need to scrape training data as CommonCrawl or other content archives would be fine for training data. They don't think/know to ask what they really want: training data.

In the least charitable interpretation it's anti-social assholes that have no concept or care about negative externalities that write awful naive scrapers.

tedivm•3w ago

I solved this problem for my blog by simply not being interesting.

fancyfredbot•3w ago

If you can bore an LLM that's exciting.

chuckadams•3w ago

Bore-a-Bot, the new service from Confuse-a-Cat.

antod•3w ago

That sounds like Elon merging xAI and The Boring Company

sandworm101•3w ago

I would rather setup a "shadow" site designed only for LLMs. I would stuff it with ao much insanity that Grok would not be able to leave. How about a billion blog post where every use of "American" is replaced with "Canadian". By the time im done, grok will be spouting conspiracy theories about the decline of the strategic bacon reserve.

antod•3w ago

> By the time im done, grok will be spouting conspiracy theories about the decline of the strategic bacon reserve.

grok will blame the zionists rather than the freemasons for that one.

naiv•3w ago

TIL about Git Brag because of your blog. It is interesting.

fancyfredbot•3w ago

Who are these agressive scrapers run by?

It is difficult to figure out the incentives here. Why would anyone want to pull data from LWN (or any other site) at a rate which would cause a DDOS like attack?

If I run a big data hungry AI lab consuming training data at 100Gb/s it's much much easier to scrape 10,000 sites at 10Mb/s than DDOS a smaller number of sites with more traffic. Of course the big labs want this data but why would they risk the reputational damage of overloading popular sites in order to pull it in an hour instead of a day or two?

kylehotchkiss•3w ago

china (alibaba and tencent)

fancyfredbot•3w ago

I'm not at all sure alibaba or tencent would actually want to DDOS LWN or any other popular website.

They may face less reputational damage than say Google or OpenAI would but I expect LWN has Chinese readers who would look dimly on this sort of thing. Some of those readers probably work for Alibaba and Tencent.

I'm not necessarily saying they wouldn't do it if there was some incentive to do so but I don't see the upside for them.

philipkglass•3w ago

I don't think that most of them are from big-name companies. I run a personal web site that has been periodically overwhelmed by scrapers, prompting me to update my robots.txt with more disallows.

The only big AI company I recognized by name was OpenAI's GPTBot. Most of them are from small companies that I'm only hearing of for the first time when I look at their user agents in the Apache logs. Probably the shadiest organizations aren't even identifying their requests with a unique user agent.

As for why a lot of dumb bots are interested in my web pages now, when they're already available through Common Crawl, I don't know.

iamnothere•3w ago

Maybe someone is putting out public “scraper lists” that small companies or even individuals can use to find potentially useful targets, perhaps with some common scraper tool they are using? That could explain it? I am also mystified by this.

bjackman•3w ago

LWN includes archives of a bunch of mailing lists so that might be a factor. There are a LOT of web on that domain.

mikkupikku•3w ago

NSA, trying to force everybody onto their Cloudflare reservation.

velox_neb•3w ago

I bet some guy just told Claude Code to archive all of LWN for him on a whim.

tux3•3w ago

Some guy doesn't show up with 10k residential IPs. This is deliberate and organized.

kleene_op•3w ago

LLMs just do be paperclipping

slicerdicer2•3w ago

There are multiple israeli companies who will provide you with millions of residential proxies at a per gb usage rate and a very easy API. You can set this up in minutes with claude code.

fancyfredbot•3w ago

These IP providers aren't cheap (cost per GB seems to be $4 but there are bulk discounts). The cost to grab all of LWN isn't prohibitively high for an individual but it's enough that most people probably wouldn't do it on a whim.

I suppose it only needs one person though. So it's probably a pretty plausible explanation.

chrisjj•3w ago

Can Claude Code even do that? Rather than provide code to do that.

dannyobrien•3w ago

I've been asking this for a while, especially as a lot of the early blame went on the big, visible US companies like OpenAI and Anthropic. While their incentives are different from search engines (as someone said early on in this onslaught, "a search engine needs your site to stay up; an AI company doesn't"), that's quite a subtle incentive difference. Just avoiding the blocks that inevitably spring up when you misbehave is a incentive the other way -- and probably the biggest reason robots.txt obedience, delays between accesses, back-off algorithms etc are widespread. We have a culture that conveys all of these approaches, and reciprocality has its part, but I suspect that's part of the encouragement to adopt them. It could that they're just too much of a hurry not to follow the rules, or it could be others hiding behind those bot-names (or others). Unsure.

Anyway, I think the (currently small[1]) but growing problem is going to be individuals using AI agents to access web-pages. I think this falls under the category of the traffic that people are concerned about, even though it's under an individual users' control, and those users are ultimately accessing that information (though perhaps without seeing the ads that pay of it). AI agents are frequently zooming off and collecting hundreds of citations for an individual user, in the time that a user-agent under manual control of a human would click on a few links. Even if those links aren't all accessed, that's going to change the pattern of organic browsing for websites.

Another challenge is that with tools like Claude Cowork, users are increasingly going to be able to create their own, one-off, crawlers. I've had a couple of occasions when I've ended up crafting a crawler to answer a question, and I've had to intervene and explicitly tell Claude to "be polite", before it would build in time-delays and the like (I got temporarily blocked by NASA because I hadn't noticed Claude was hammering a 404 page).

The Web was always designed to be readable by humans and machines, so I don't see a fundamental problem now that end-users have more capability to work with machines to learn what they need. But even if we track down and sucessfully discourage bad actors, we need to work out how to adapt to the changing patterns of how good actors, empowered by better access to computation, can browse the web.

[1] - https://radar.cloudflare.com/ai-insights#ai-bot-crawler-traf...

dannyobrien•3w ago

(and if anyone from Anthropic or OpenAI is reading this: teach your models to be polite when they write crawlers! It's actually an interesting alignment issue that they don't consider the externalities of their actions right now!)

pstuart•3w ago

Hell, they should at least be caching those requests rather than hitting the endpoint on every single AI request that needs the info.

ks2048•3w ago

Perhaps incompetence instead of malice - a misconfigured or buggy scraper, etc.

overfeed•3w ago

> If I run a big data hungry AI lab consuming training data at 100Gb/s it's much much easier to...

You are incorrectly assuming competency, thoughtful engineering and/or some modicum of care for negative externalities. The scraper may have been whipped up by AI, and shipped an hour later after a quick 15-minute test against en.wikipedia.org.

Whoever the perpetrator is, they are hiding behind "residential IP providers" so there's no reputational risks. Further, AI companies already have a reputation for engaging in distasteful practices, but popular wisdom claims that they make up for the awfulness with utility, so even if it turns out to be a big org like OpenAI or Anthropic, people will shrug their shoulders and move on.

fancyfredbot•3w ago

Yes I agree it's more likely incompetence than malice. That's another reason I don't think it's a lab. Even if you don't like the big labs you can probably admit they are reasonably smart/competent.

Residential IP providers definitely don't remove reputational risk. There are many ways people can find out what you are doing. The main one being that your employees might decide to tell on you.

The IP providers are a great way of getting around cloud flare etc. They are also reasonably expensive! I find it very plausible that these IP providers are involved but I still don't understand who is paying them.

jacobgkau•3w ago

This is just an anecdote, but having been dealing with similar problems on one of my websites for the past year or so, I was experiencing a huge number of hits from different residential IP addresses (mostly Latin American) at the same time once every 5-10 minutes (which started crashing my site regularly). Digging through my server's logs and watching them in real-time, I noticed one or two Huawei IP's making requests at the same time as the dozens or hundreds of residential IP's. Blocking the Huawei IP's seemed to mysteriously cut back the residential IP requests, at least for a short amount of time (i.e. a couple of hours).

This isn't to say every attack that looks similar is being done by Huawei (which I can't say for certain, anyway). But to me, it does look an awful lot like even large organizations you'd think would be competent can stoop to these levels. I don't have an answer for you as to why.

danaris•3w ago

Sufficiently advanced incompetence is indistinguishable from malice.

Whether or not they had malice in their hearts when they implemented the scraper bots, their impact is still very much malicious.

pancsta•3w ago

“Residential IP” sounds expensive though…

overfeed•3w ago

Scraper proxying is another way to monetize botnets, and I suspect those "providers" are not expensive.

alephnerd•3w ago

> If I run a big data hungry AI lab consuming training data at 100Gb/s it's much much easier to scrape 10,000 sites at 10Mb/s than DDOS a smaller number of sites with more traffic

A little over a decade ago (f*ck I'm old now [0]), I had a similar conversation with an ML Researcher@Nvidia. Their response was "even if we are overtraining, it's a good problem to have because we can reduce our false negative rate".

Everyone continues to have an incentive to optimize for TP and FP at the expense of FN.

[0] - https://m.youtube.com/watch?v=BGrfhsxxmdE

phil21•3w ago

I'd guess some sort of middle management local maxima. Someone set some metric of X pages per day scraped, or Y bits per month - whatever. CEO gets what he wants.

Then that got passed down to the engineers and those engineers got ridden until they turned the dial to 11. Some VP then gets to go to the quarterly review with a "we beat our data ingestion metrics by 15%!".

So any engineer that pushes back basically gets told too bad, do it anyways.

debo_•3w ago

Why is it in these invented HN scenarios that the engineers just happen to have absolutely no agency?

oblio•3w ago

Because: who would refuse more money?

phil21•3w ago

Because I've personally seen it. Engineer says this is silly, it will blow up in the long run - told to implement it anyways. Not much to lose for the engineer to simply do it. Substitute engineer for any line level employee in any industry and it works just as well.

I've also run into these local maxima stupidities dozens or more time in my career where it was obvious someone was gaming a performance metric at the expense of the bigger picture - which required escalation to someone who could see said bigger picture to get fixed. Happens all the time as a customer where some sales rep or sales manager wants to game short-term numbers at the expense of long-term relationships. Smaller companies you can usually get it fixed pretty quickly, larger companies tend to do more doubling down.

It usually starts with generally well-intentioned goal setting but devolves into someone optimizing a number on a spreadsheet without care (or perhaps knowledge) of the damage it can cause.

Hell, for the most extreme example look at Dieselgate. Those things don't start from some evil henchman at the top saying "lets cheat and game the metrics" - it often starts with someone setting impossible to achieve goals unknowingly in service of "setting the bar high for the organization", and by the time the backpressure filters up through the org it's oftentimes too late to fix the damage.

fancyfredbot•3w ago

I don't think this evil boss and downtrodden engineer situation can explain what we're seeing.

Your theoretical engineers would figure out pretty quickly that crashing a server slows you down and the only way to keep the boss happy is to avoid the DDOS.

delfinom•3w ago

As someone that runs the infrastructure for a large OSS project. Mostly Chinese AI firms. All the big name brand AI firms play reasonably nice and respect robots.txt.

The Chinese ones are hyper aggressive, with no rate limit and pure greed scraping. They'll scrape the same content hundreds of times the same day

rfmoz•3w ago

Chinese AI is doing large amounts of request in the past weeks.

tjons•3w ago

how is this showing up for you? site you host or bigger scale? I'm not surprised but rather curious.

suburban_strike•3w ago

The Chinese are also sloppy. They will run those scrapers until they get banned and not give a fuck.

In my experience, they do not bother putting in the effort to obfuscate source or evade bans in the first place. They might try again later, but this particular setup was specifically engineered for resiliency.

bediger4000•3w ago

Is this an example of that "chabuduo" we read about now and then?

suburban_strike•3w ago

When faced with evidence of operating procedure for the malicious, we forever take them at their word when they insist they're just incompetent.

The spirit of this site is so dead. Where are the hackers? Scraping is the best anyone is coming up with?

It's not scraping. They'd notice themselves getting banned everywhere for abuse of this magnitude, which is counterproductive to scraping goals. Rather than rate-limit the queries to avoid that attention, they're going out of their way to (pay to?) route traffic through a residential botnet so they can sustain it. This is not by accident, nor a byproduct of sloppy code Claude shat out. Someone wants to operate with this degree of aggressiveness, and they do not want to be detected or stopped.

This setup is as close to real-time surveillance as can be. Someone really wants to know what is being published on target sites with as minimal a refresh rate as possible and zero interference. It's not a western governmental entity or they'd just tap it.

As for who...there's only one group on the planet so obsessed with monitoring and policing everything everyone else is doing.

ofrzeta•3w ago

Recently I needed to block some scrapers to execessive load on a server, and here are some that I identified:

BOTS=( "semrushbot" "petalbot" "aliyunsecbot" "amazonbot" "claudebot" "thinkbot" "perplexitybot" "openai.com/bot" )

This was really just emergency blocking and it included more than 1500 IP addresses.

Here's Amazon's page about their bot with more information including IP addresses

https://developer.amazon.com/amazonbot

ccgreg•3w ago

One way to figure that out is to look at which companies claim to have foundation models, but no one knows what their crawler is named.

I also suspect that there are a bunch of sub-contractors involved, working for companies that don't supervise them very carefully.

bloppe•3w ago

I'm curious how they concluded this was done to scrape for AI training. If the traffic was easily distinguishable from regular users, they would be able to firewall it. If it was not, then how can they be sure it wasn't just a regular old malicious DDOS? Happens way more often than you might think. Sometimes a poorly-managed botnet can even misfire.

MBCook•3w ago

Why would anyone ever DDOS them? They’ve been around for about three decades now, I don’t know if they’ve ever had a DDOS attack before the AI crawling started.

iamnothere•3w ago

I am starting to think these are not just AI scrapers blindly seeking out data. All kinds of FOSS sites including low volume forums and blogs have been under this kind of persistent pressure for a while now. Given the cost involved in maintaining this kind of widespread constant scraping, the economics don’t seem to line up. Surely even big budget projects would adjust their scraping rates based on how many changes they see on a given site. At scale this could save a lot of money and would reduce the chance of blocking.

I haven’t heard of the same attacks facing (for instance) niche hobby communities. Does anyone know if those sites are facing the same scale of attacks?

Is there any chance that this is a deniable attack intended to disrupt the tech industry, or even the FOSS community in particular, with training data gathered as a side benefit? I’m just struggling to understand how the economics can work here.

zomiaen•3w ago

How many of these scrapers are written by AI by data-science folks who don't remotely care how often they're hitting the sites, and is data they wouldn't even think to give or ask the LLM about?

iamnothere•3w ago

But does that explain all of the various scrapers doing the same thing across the same set of sites? And again, the sheer bandwidth and CPU time involved should eventually bother the bean counters.

I did think of a couple of possibilities:

- Someone has a software package or list of sites out there that people are using instead of building their own scrapers, so everyone hits the same targets with the same pattern.

- There are a bunch of companies chasing a (real or hoped for) “scraped data” market, perhaps overseas where overhead is lower, and there’s enough excess AI funding sloshing around that they able to scrape everything mindlessly for now. If this is the case then the problem should fix itself as funding gets tighter.

TurdF3rguson•3w ago

My theory on this one is some serial wantrepreneur came up with a business plan of scraping the archive and feeding it into a LLM to identify some vague opportunity. Then they paid some Fiverr / Upwork kid in India $200 to get the data. The good news is this website and any other can mitigate these things by moving to Cloudflare and it's free.

philipwhiuk•3w ago

> I haven’t heard of the same attacks facing (for instance) niche hobby communities. Does anyone know if those sites are facing the same scale of attacks?

Yes. Fortunately if your hobby community is regional you can be fairly blunt in terms of blocks.

shantara•3w ago

>I haven’t heard of the same attacks facing (for instance) niche hobby communities. Does anyone know if those sites are facing the same scale of attacks?

They are. I participate in modding communities for very niche gaming projects. All of them experienced massive DDOS attacks from AI scrappers on their websites over the past year. They are long running non-commercial projects that don’t present any business interest to anyone to be worth expending resources purely to bring them offline. They had to temporarily put the majority of their discussion boards and development resources behind a login wall to avoid having to go down completely.

iamnothere•3w ago

Thanks. The scale of this is just mind-boggling. Unbelievably wasteful.

hackingonempty•3w ago

A couple of forums I have lurked on for years have closed up and now require a login to read.

Tanoc•3w ago

I've wondered for a while if simple interaction systems would be good enough to fend these things off without building up walls like logins. Things like Anubis do system checks, but I'm wondering if it would be even easier to do something like the oldschool Captchas where you just have a single interactive element that requires user input to redirect to another page. Like you hit a landing page and drag a slider or click and hold to go to the page proper, things that aren't as annoying as modern Captchas and are like a fun little interactive way to enter.

As I'm writing this I'm reminded of Flash based homepages. And it really makes it apparent that Flash would be perfect for impeding these LLM crawlers.

danaris•3w ago

Just as an additional anecdata point:

I run a small, niche browser game (~125 weekly unique users, down from around 1500 at its peak 15 years ago), and until I put its Wiki behind a login wall a few months ago, we were getting absolutely hammered by the bots. Not open source, not anything of particular interest to anyone beyond those already playing the game and the very select group of people who, if they found it, might actually enjoy it. (It's all text, almost-entirely-player-driven, and can be very slow at times, so people used to modern mobile games and similar dopamine factories tend to bounce off of it very quickly.)

Some of the UAs we saw included Claude and OpenAI, but there were a lot of obviously-bot requests to the Wiki that were using generic UAs and residential IPs.

If there's a concerted effort to swamp open-source projects, it's not the only thing going on. I think it's much more likely that the primary cause of this flood is people who a) think they have the right to absolutely everything on the internet, b) expect everyone they scrape from to be actively trying to hide the data from them (so, for instance, they will ignore any exposed API), and c) don't care either how many resources they use, or how much damage they do.

2OEH8eoCRo0•3w ago

When are we going to start suing these assholes? Why isn't anybody leveraging the legal system? You're all searching for technical solutions to a legal problem and fighting with one hand behind your back.

seb1204•3w ago

Is it possible to attribute the attack to a company?

2OEH8eoCRo0•3w ago

Nope. It's impossible to trace anything that happens on the internet.

Havoc•3w ago

That makes no sense.

There is no reason for AI scrappers to use tens of thousands of IPs to scrape one site over and over.

That just sounds like a classic DDOS.

TurdF3rguson•3w ago

Sure there is, scrapers do that to defeat throttling. 10,000 is less than 3 hours of scraping at 1 request per second.

Havoc•3w ago

It's not 10k requests, it's 10k IPs

Having lots of IPs is helpful for scraping, but you don't need 10k. That's a botnet

TurdF3rguson•3w ago

The way it works is this: You can sign up for a proxy rotator service that works like a regular proxy except every request you make goes through a different ip address. Is that a botnet? Yes. Is it also typically used in a scraping project? Yes.

Havoc•3w ago

Yeah I know, I've done scrapping too.

It can absolutely be that, but that requires a confluence of multiple factors - misconfigured scrapper hitting the site over and over, a big bot net like proxy setup that is way overkilled for scrapping, a setup sophisticated enough to do all that yet simultaneously stupid enough to not cope with a site is mostly text and a couple gigs at most and all that over extended timeframe without anyone realising their scrapper is stuck.

Or alternative explanation: It's a DDOS

TurdF3rguson•3w ago

Except that I think it's clear that the motive was getting the data not taking the site offline. The evidence for that is that it stopped on its own without them doing anything to mitigate it.

Also I don't know why you think this is sophisticated, it's probably 40 lines of Python code max.

Havoc•3w ago

No, DDOS do stop on their own too..

This stopping is absolutely not "evidence" that the motive was grabbing data. Honestly...

TurdF3rguson•3w ago

Ok so they spent all that money to... mildly inconvenience users temporarily? Lol.

TiredOfLife•3w ago

If you call it DDOS you can't capitalize on ai hate

CodeBytes•3w ago

It likely is AI scrapers essentially doing a DDoS. They use separate IPs (and vary the UA) to prevent blocking.

I have a site which is currently being hit (over 10k requests today) and it looks like scrapers as every URL is different. If it was a DDoS, they would target costly pages like my search not every single URL.

SQLite had the same thing: https://sqlite.org/forum/forumpost/7d3eb059f81ff694 As have a few other open source repositories. It looks like badly written crawlers trying to crawl sites as fast as possible.

sgc•3w ago

Can somebody tell me what is a normal "cost of doing business" level of bot traffic these days? I have way too much bot traffic like everybody else, but I don't know if I am an outlier or just run of the mill. I get about 100k bot hits a day, presumably because I have about 350k pages on my site.

kay_o•3w ago

Esports vertical: I get about 5-20b bot hits per day (unwanted; includes both IA, brute forcer, "security" scanners, wp-admin/ requests), 1.5m google spider (search; respectful of crawl delay), and about 50-100m human (largely mobile).

For unwanted bots I serve incorrect information -- it's online gaming match history without much text so requests flagged as unwanted bots will, instead of heavy database queries, get plausibly random numbers -- seeded by the user so they stay stable -- KDA, win/loss rates, rankings.

A few dozen million distinct pages but they are numeric stats for user profiles, match stats with little to none paragraph form of text.

sgc•3w ago

Well you must be an outlier! 100-200x bot to human traffic is a lot. AI bots likely focus on stats / technical info if they happen to be tuned to discern, and there is money in sports stats; so you are a target.

samtrack2019•3w ago

I gave up and moved to cloudfare for my static blog, it was too much of a pain to track the spammers. (i am not proud of using big$corp)

xacky•3w ago

Residential proxies need to be classed as malware and added to antivirus definitions and kicked off of app stores. Mass ban evasion needs to be cracked down on.

I Write Games in C (yes, C)

We Mourn Our Craft

SectorC: A C Compiler in 512 bytes

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

Hoot: Scheme on WebAssembly

The AI boom is causing shortages everywhere else

Stories from 25 Years of Software Development

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Al Lowe on model trains, funny deaths and working with Disney

The Waymo World Model

Reinforcement Learning from Human Feedback

Start all of your commands with a comma (2009)

Vocal Guide – belt sing without killing yourself

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Coding agents have replaced every framework I used

France's homegrown open source online office suite

Selection Rather Than Prediction

A Fresh Look at IBM 3270 Information Display System

72M Points of Interest

History and Timeline of the Proco Rat Pedal (2021)

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Learning from context is harder than we thought

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Hackers (1995) Animated Experience

Making geo joins faster with H3 indexes

Sheldon Brown's Bicycle Technical Info

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

I Write Games in C (yes, C)

We Mourn Our Craft

SectorC: A C Compiler in 512 bytes

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

Hoot: Scheme on WebAssembly

The AI boom is causing shortages everywhere else

Stories from 25 Years of Software Development

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Al Lowe on model trains, funny deaths and working with Disney

The Waymo World Model

Reinforcement Learning from Human Feedback

Start all of your commands with a comma (2009)

Vocal Guide – belt sing without killing yourself

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Coding agents have replaced every framework I used

France's homegrown open source online office suite

Selection Rather Than Prediction

A Fresh Look at IBM 3270 Information Display System

72M Points of Interest

History and Timeline of the Proco Rat Pedal (2021)

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Learning from context is harder than we thought

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Hackers (1995) Animated Experience

Making geo joins faster with H3 indexes

Sheldon Brown's Bicycle Technical Info

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

LWN is currently under the heaviest scraper attack seen yet

Comments