big tech incentivised to ddos... what a world they've built
In that case, by that rubric literally anything that you conspire with yourself to accomplish (buying next week's groceries, making a turkey sandwich...) would also be a conspiracy.
I also don't get the comments on the linked social site. IIUC the users posting there are somehow involved with kernel work, right? So they should know a thing or two about technical stuff? How / why are they so convinced that the big bad AI baddies are scraping them, and not some misconfigured thing that someone or another built? Is this their first time? Again, there's nothing there that hasn't been indexed dozens of times already. And... sorry to say it, but neither newsletters nor the 1-3 comments on each article are exactly "prime data" for any kind of training.
These people have gone full tinfoil hat and spewing hate isn't doing them any favours.
https://lwn.net/Articles/1008897
Your nonsense about LWN being a "newsletter" and having "zero valuable data" isn't doing you any favors. It is the prime source of information about Linux kernel development, and Linux development in general.
"AI" cancer scraping the same thing over and over and over again is not news for anybody even with a cursory interest in this subject. They've been doing it for years.
I mean...
Again, the site is so old that anything worthwhile is already in cc or any number of crawls. I am not saying they weren't scraped. I'm saying they likely weren't scraped by the bad AI people. And certainly not by AI companies trying to limit others from accessing that data (as the person who I replied to stated).
1. Coding assistants have emerged as one of the primary commercial opportunities for AI models. As GP pointed out, LWN is the primary discussion for kernel development. If you were gathering training data for a model, and coding assistance is one of your goals, and you know of a primary source of open source development expertise, would you:
(a) ignore it because it’s in a quaint old format, or
(b) slurp up as much as you can?
2. If you’d previously slurped it up, and are now collating data for a new training run, and you know it’s an active mailing list that will have new content since you last crawled it, would you:
(a) carefully and respectfully leave it be, because you still get benefit from the previous content even though there’s now more and it’s up to date, or
(b) hoover up every last drop because anything you can do to get an edge over your competitors means you get your brief moment of glory in the benchmarks when you release?
"a.getElementsByTagName = function (...args) {//Clear page content}"
One can also hide components inside Shadow DOM to make it harder to scrape.
However, these methods will interfere with automated testing tools such as Playwright and Selenium. Also, search engine indexing is likely to be affected.
Edit: Fabian2k was ten seconds ahead. Damn!
Of course they're not going to stop at just code. They need all the rest of it as well.
Is there even any evidence that "crypto bros" and "AI bros" are even the same set of people other than being vaguely "tech" and hated by HN? At best you have someone like Altman who founded OpenAI and had a crypto project (Worldcoin), but the latter was approximately used by nobody. What about everyone else? Did Ilya Sutskever have a shitcoin a few years ago? Maybe Changpeng Zhao has an AI lab?
That was a biometric surveillance project disguised as a crypto project.
> Is there even any evidence that "crypto bros" and "AI bros" are even the same set of people
No, the "AI" people are far worse. I always had a choice to /not/ use crypto. The "AI" people want to hamfistedly shove their flawed investment into every product under the sun.
Has it been adjudicated that AI use actually allows that? That's definitely what the AI bros want (and will loudly assert), but that doesn't mean it's true.
"It is a DDOS attack involving tens of thousands of addresses"
It is amazing just how distributed some of these things are. Even on the small sites that I help host we see these types of attacks from very large numbers of diverse IPs. I'd love to know how these are being run.
And if you don't care about the "residential" part you can get proxies with data center IPs for much cheaper from the same providers. But those are easily blocked.
They don't really need to scrape training data as CommonCrawl or other content archives would be fine for training data. They don't think/know to ask what they really want: training data.
In the least charitable interpretation, it's anti-social assholes with no concept of or care about negative externalities writing awful, naive scrapers.
It is difficult to figure out the incentives here. Why would anyone want to pull data from LWN (or any other site) at a rate which would cause a DDOS-like attack?
If I run a big data-hungry AI lab consuming training data at 100Gb/s, it's much, much easier to scrape 10,000 sites at 10Mb/s than to DDOS a smaller number of sites with more traffic. Of course the big labs want this data, but why would they risk the reputational damage of overloading popular sites in order to pull it in an hour instead of a day or two?
They may face less reputational damage than, say, Google or OpenAI would, but I expect LWN has Chinese readers who would take a dim view of this sort of thing. Some of those readers probably work for Alibaba and Tencent.
I'm not necessarily saying they wouldn't do it if there was some incentive to do so but I don't see the upside for them.
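For what it's worth, here is a back-of-the-envelope check of the bandwidth figures in this comment (100Gb/s total, 10,000 sites at 10Mb/s, an hour versus a day). It is only a rough sketch: the function names are made up for illustration, decimal units are assumed, and the 24-hour window stands in for "a day or two".

```javascript
// Rough bandwidth arithmetic for the scraping scenario described above.
// Assumptions: decimal units (1 Gb = 1000 Mb), and a 24-hour window
// standing in for "a day or two". Function names are illustrative only.
const MBIT_PER_GBIT = 1000;

// Aggregate rate when many sites are each scraped at a modest rate.
function aggregateGbps(siteCount, perSiteMbps) {
  return (siteCount * perSiteMbps) / MBIT_PER_GBIT;
}

// Per-site rate needed to move the same volume in a shorter window.
function rateForWindow(perSiteMbps, originalHours, targetHours) {
  return perSiteMbps * (originalHours / targetHours);
}

// Volume in gigabytes moved at a given rate over a number of hours.
function gigabytesTransferred(mbps, hours) {
  const megabits = mbps * hours * 3600;
  return megabits / 8 / 1000;
}

const total = aggregateGbps(10000, 10);          // Gb/s across all sites
const hourlyRate = rateForWindow(10, 24, 1);     // Mb/s to finish in one hour
const dailyVolume = gigabytesTransferred(10, 24); // GB per site per day

console.log(`aggregate: ${total} Gb/s`);
console.log(`per-site rate for a one-hour pull: ${hourlyRate} Mb/s`);
console.log(`per-site volume at 10 Mb/s over a day: ${dailyVolume} GB`);
```

So spreading the load already saturates the hypothetical 100Gb/s pipe, and squeezing one site's share into an hour means hitting it 24 times harder for no extra data.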
The only bot from a big AI company that I recognized by name was OpenAI's GPTBot. Most of them are from small companies that I'm only hearing of for the first time when I look at their user agents in the Apache logs. Probably the shadiest organizations aren't even identifying their requests with a unique user agent.
As for why a lot of dumb bots are interested in my web pages now, when they're already available through Common Crawl, I don't know.
It would be funny if someone vibe coded scraping infrastructure and tested it at a small scale, didn't see a problem, and so turned the throttle all the way up.
I haven’t heard of the same attacks facing (for instance) niche hobby communities. Does anyone know if those sites are facing the same scale of attacks?
Is there any chance that this is a deniable attack intended to disrupt the tech industry, or even the FOSS community in particular, with training data gathered as a side benefit? I’m just struggling to understand how the economics can work here.
I did think of a couple of possibilities:
- Someone has a software package or list of sites out there that people are using instead of building their own scrapers, so everyone hits the same targets with the same pattern.
- There are a bunch of companies chasing a (real or hoped for) “scraped data” market, perhaps overseas where overhead is lower, and there’s enough excess AI funding sloshing around that they are able to scrape everything mindlessly for now. If this is the case then the problem should fix itself as funding gets tighter.