LWN is currently under the heaviest scraper attack seen yet

https://social.kernel.org/notice/B2JlhcxNTfI8oDVoyO
114•luu•2h ago

Comments

zahlman•1h ago
Is it still ongoing? The thread appears to be over 24 hours old and as a quick test I had no issue loading the main page (which is as snappy and responsive as expected from a low-bandwidth site like LWN).
jzb•45m ago
Not at the moment. It’s subsided for now.
blibble•1h ago
the perverse incentive is if you ddos the website such that it shuts down, no other "AI" parasites can get the valuable data

big tech incentivised to ddos... what a world they've built

ronsor•1h ago
This sounds like a conspiracy theory.
MBCook•1h ago
I don’t think they’re saying that’s actually happening here, just that it could happen and is accidentally incentivized.
pwdisswordfishy•1h ago
If it's a conspiracy, it would be one where the Minimum Viable Conspirator Count is 1 (inclusive of one's own self).

In that case, by that rubric literally anything that you conspire with yourself to accomplish (buying next week's groceries, making a turkey sandwich...) would also be a conspiracy.

amlib•25m ago
The dead internet theory also sounded unhinged and conspiracy theory-ish a decade or so ago... yet here we are.
phkahler•1h ago
It's called pulling up the ladder behind you, or building a moat!
NitpickLawyer•1h ago
Umm... what data? That's a very old newsletter-like site. Everything that's public on it has been long scraped and parsed by whoever needed it. There's 0 valuable data there for "parasites" to parasite off of.

I also don't get the comments on the linked social site. IIUC the users posting there are somehow involved with kernel work, right? So they should know a thing or two about technical stuff? How / why are they so convinced that the big bad AI baddies are scraping them, and not some misconfigured thing that someone or another built? Is this their first time? Again, there's nothing there that hasn't been indexed dozens of times already. And... sorry to say it, but neither newsletters nor the 1-3 comments on each article are exactly "prime data" for any kind of training.

These people have gone full tinfoil hat and spewing hate isn't doing them any favours.

MBCook•1h ago
I don’t think they were talking about LWN specifically but just in general.
homebrewer•1h ago
Because it started in 2022 and hasn't subsided since? This is just the latest iteration of "AI" scrapers destroying the site, and the worst one yet.

https://lwn.net/Articles/1008897

Your nonsense about LWN being a "newsletter" and having "zero valuable data" isn't doing you any favors. It is the prime source of information about Linux kernel development, and Linux development in general.

"AI" cancer scraping the same thing over and over and over again is not news for anybody even with a cursory interest in this subject. They've been doing it for years.

NitpickLawyer•1h ago
> LWN.net is a reader-supported news site

I mean...

Again, the site is so old that anything worthwhile is already in cc or any number of crawls. I am not saying they weren't scraped. I'm saying they likely weren't scraped by the bad AI people. And certainly not by AI companies trying to limit others from accessing that data (as the person who I replied to stated).

MBCook•1h ago
Why is it each of your comments seems to include a dig attacking LWN?
spinningslate•25m ago
I’m going to presume good faith rather than trolling. Some questions for you:

1. Coding assistants have emerged as one of the primary commercial opportunities for AI models. As GP pointed out, LWN is the primary discussion venue for kernel development. If you were gathering training data for a model, and coding assistance is one of your goals, and you know of a primary source of open source development expertise, would you:

  (a) ignore it because it’s in a quaint old format, or

  (b) slurp up as much as you can?
2. If you’d previously slurped it up, and are now collating data for a new training run, and you know it’s an active mailing list that will have new content since you last crawled it, would you:

  (a) carefully and respectfully leave it be, because you still get benefit from the previous content even though there’s now more and it’s up to date, or

  (b) hoover up every last drop because anything you can do to get an edge over your competitors means you get your brief moment of glory in the benchmarks when you release?
NitpickLawyer•9m ago
I train coding models with RLVR because that's what works. There's ~0.000x good signal in mailing lists that isn't in old mailing lists. (and, since I can't reply to the other person, I mean old as in established, it is in no way a dig to lwn).

You seem to be missing my point. There are 0 incentives for AI training companies to behave like this. All that data is already in the common crawls that every lab uses. This is likely from other sources. Yet they always blame big bad AI...

gulugawa•1h ago
I've had luck blocking scrapers by overwriting JavaScript methods

" a.getElementsByTagName = function (...args) { /* Clear page content */ }"

One can also hide components inside Shadow DOM to make it harder to scrape.

However, these methods will interfere with automated testing tools such as Playwright and Selenium. Also, search engine indexing is likely to be affected.
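
A minimal sketch of the override idea above, assuming the goal is to run a callback whenever a bulk DOM query method is called. `trapMethod`, `onCall`, and `fakeDocument` are hypothetical names used only for illustration; in a browser the target would be `document`. Whether this deters any particular scraper is not established here.

```javascript
// Hypothetical helper: replace a method on `target` so that calling it
// triggers `onCall` and returns an empty array instead of real results.
// Returns a function that restores the original method.
function trapMethod(target, methodName, onCall) {
  const original = target[methodName];
  target[methodName] = function (...args) {
    onCall(methodName, args);
    return [];
  };
  return function restore() {
    target[methodName] = original;
  };
}

// Stand-in object so the sketch runs outside a browser; in a page this
// would be `document`.
const fakeDocument = {
  getElementsByTagName(tag) {
    return [`<${tag}>`];
  },
};

const calls = [];
const restore = trapMethod(fakeDocument, "getElementsByTagName", (name, args) => {
  calls.push({ name, args }); // a real page might clear its content here
});

const trapped = fakeDocument.getElementsByTagName("a"); // trapped call
restore();
const restored = fakeDocument.getElementsByTagName("a"); // original behaviour again
```

Keeping a `restore` function around is one way to limit the interference with testing tools mentioned above, since a test harness could undo the trap before running.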

bogwog•1h ago
This is a fun idea, especially if you make those functions procedurally generate garbage to get them stuck
TurdF3rguson•52m ago
You think you've had luck. The truth is you have no way of knowing if this ever had any effect at all.
chrisjj•1h ago
So which is it? DDOS attack or "AI" scrapers?
fabian2k•1h ago
Sufficiently aggressive and inconsiderate scraping is indistinguishable from a DDOS attack.
Y-bar•1h ago
A sufficiently stupid and egregious AI scraper is indistinguishable from a DDOS attack.

Edit: Fabian2k was ten seconds ahead. Damn!

TurdF3rguson•49m ago
Scrapers because DDOS implies that it's malicious rather than accidental and there's no reason to think that.
jacquesm•1h ago
AI allows companies to resell open source code as if they wrote it themselves doing an end run around all license terms. This is a major problem.

Of course they're not going to stop at just code. They need all the rest of it as well.

zipy124•1h ago
From the creators of easy money laundering (crypto bros), we now bring you easy money laundering 2: intellectual property laundering, coming to a theatre near you soon!
gruez•1h ago
>From the creators of easy money laundering (crypto bros),

Is there even any evidence that "crypto bros" and "AI bros" are even the same set of people other than being vaguely "tech" and hated by HN? At best you have someone like Altman who founded openai and had a crypto project (worldcoin), but the latter was approximately used by nobody. What about everyone else? Did Ilya Sutskever have a shitcoin a few years ago? Maybe Changpeng Zhao has an AI lab?

themafia•1h ago
> and had a crypto project (worldcoin)

That was a biometric surveillance project disguised as a crypto project.

> Is there even any evidence that "crypto bros" and "AI bros" are even the same set of people

No, the "AI" people are far worse. I always had a choice to /not/ use crypto. The "AI" people want to hamfistedly shove their flawed investment into every product under the sun.

palmotea•36m ago
> AI allows companies to resell open source code as if they wrote it themselves doing an end run around all license terms. This is a major problem.

Has it been adjudicated that AI use actually allows that? That's definitely what the AI bros want (and will loudly assert), but that doesn't mean it's true.

Sharlin•5m ago
I don't think so. Because LLMs aren't legal persons (yet?!), they can neither have copyright to anything nor violate someone else's copyright. The most reasonable legal interpretation is likely that any IP violations are actually committed by whoever it was who asked an LLM to "rewrite" something in a way that obviously counts as a derived work rather than a cleanroom implementation.
kimixa•16m ago
I worked on an extremely niche project revolving around an old DOS game. Code I worked on is often pretty much the only reference for some things.

It's trivially easy to get Claude to scrape that and regurgitate it under any requested licence (some variable name changes, but exactly the same structure - though it got one of the lookup tables wrong, which is one of the few things you could argue aren't copyrighted there).

It'll even cheerfully tell you it's fetching the repository while "thinking". And it's clearly already in the training data - you can get it to detail specifics even disallowing that.

If I referenced copyrighted code we didn't have the license for (as is the case for copyleft licenses if you don't follow the restrictions) while employed as a software engineer I'd be fired pretty quick from any corporation. And rightfully so.

People seem to have a strange idea with AI that "copyleft" code is free game to unilaterally re-license. Try doing that with leaked Microsoft code - you're breaking copyright just as much there, but a lot of people seem to perceive it very differently - and not just because of risk of enforcement but in moralizing about it too.

blakesterz•1h ago

  "It is a DDOS attack involving tens of thousands of addresses"
It is amazing just how distributed some of these things are. Even on the small sites that I help host we see these types of attacks from very large numbers of diverse IPs. I'd love to know how these are being run.
smitty1e•1h ago
Call it a "Distributed Intelligence Logic Denial Of Service" (DILDOS) attack both to name it distinctly and characterize the source.
random1234user•34m ago
Might as well call it "Artificial Intelligence Distributed Intelligence Logic Denial Of Service" (AIDILDOS) sounds about right.
PaulDavisThe1st•55m ago
another reference point: we've had well over 1M unique IP addresses hit git.ardour.org as part of a stupid as hell git scraping effort. 1M !!!
wongarsu•51m ago
There are plenty of providers selling "residential proxies", distributing your crawler traffic through thousands of residential IPs. BrightData is probably the biggest, but it's a big and growing market.

And if you don't care about the "residential" part you can get proxies with data center IPs for much cheaper from the same providers. But those are easily blocked.

giantrobot•33m ago
In the most charitable case it's some "AI" companies with an X/Y problem. They want training data so they vibe code some naive scraper (requests is all you need!) and don't ever think to ask if maybe there's some sort of common repository of web crawls, a CommonCrawl if you will.

They don't really need to scrape training data as CommonCrawl or other content archives would be fine for training data. They don't think/know to ask what they really want: training data.

In the least charitable interpretation it's anti-social assholes that have no concept or care about negative externalities that write awful naive scrapers.

tedivm•1h ago
I solved this problem for my blog by simply not being interesting.
fancyfredbot•1h ago
If you can bore an LLM that's exciting.
chuckadams•41m ago
Bore-a-Bot, the new service from the Confuse-a-Cat company.
sandworm101•10m ago
I would rather set up a "shadow" site designed only for LLMs. I would stuff it with so much insanity that Grok would not be able to leave. How about a billion blog posts where every use of "American" is replaced with "Canadian"? By the time I'm done, Grok will be spouting conspiracy theories about the decline of the strategic bacon reserve.
naiv•50m ago
TIL about Git Brag because of your blog. It is interesting.
fancyfredbot•1h ago
Who are these aggressive scrapers run by?

It is difficult to figure out the incentives here. Why would anyone want to pull data from LWN (or any other site) at a rate which would cause a DDOS like attack?

If I run a big data hungry AI lab consuming training data at 100Gb/s it's much much easier to scrape 10,000 sites at 10Mb/s than DDOS a smaller number of sites with more traffic. Of course the big labs want this data but why would they risk the reputational damage of overloading popular sites in order to pull it in an hour instead of a day or two?

kylehotchkiss•1h ago
china (alibaba and tencent)
fancyfredbot•1h ago
I'm not at all sure alibaba or tencent would actually want to DDOS LWN or any other popular website.

They may face less reputational damage than say Google or OpenAI would but I expect LWN has Chinese readers who would look dimly on this sort of thing. Some of those readers probably work for Alibaba and Tencent.

I'm not necessarily saying they wouldn't do it if there was some incentive to do so but I don't see the upside for them.

philipkglass•1h ago
I don't think that most of them are from big-name companies. I run a personal web site that has been periodically overwhelmed by scrapers, prompting me to update my robots.txt with more disallows.

The only big AI company I recognized by name was OpenAI's GPTBot. Most of them are from small companies that I'm only hearing of for the first time when I look at their user agents in the Apache logs. Probably the shadiest organizations aren't even identifying their requests with a unique user agent.

As for why a lot of dumb bots are interested in my web pages now, when they're already available through Common Crawl, I don't know.
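
As a rough sketch of the log inspection described above, the following counts requests per user agent. It assumes Apache's "combined" log format, where the user agent is the last double-quoted field on each line, and it ignores escaped quotes inside that field. The function names and the sample lines are made up for illustration.

```javascript
// Extract the user agent (last double-quoted field) from one line in
// Apache "combined" log format. Returns null if the line does not match.
function userAgentOf(line) {
  const match = /"([^"]*)"\s*$/.exec(line);
  return match ? match[1] : null;
}

// Count requests per user agent; returns [agent, count] pairs sorted by
// count, highest first.
function countUserAgents(lines) {
  const counts = new Map();
  for (const line of lines) {
    const ua = userAgentOf(line);
    if (ua === null) continue; // skip lines that are not in the expected format
    counts.set(ua, (counts.get(ua) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up sample lines, for illustration only.
const sampleLines = [
  '203.0.113.5 - - [16/Jan/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
  '203.0.113.6 - - [16/Jan/2026:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
  '198.51.100.7 - - [16/Jan/2026:10:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
];

const ranked = countUserAgents(sampleLines);
```

A count like this only surfaces bots that identify themselves; as noted above, the shadiest ones probably do not send a unique user agent at all.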

iamnothere•52m ago
Maybe someone is putting out public “scraper lists” that small companies or even individuals can use to find potentially useful targets, perhaps with some common scraper tool they are using? That could explain it? I am also mystified by this.
bjackman•1h ago
LWN includes archives of a bunch of mailing lists so that might be a factor. There are a LOT of web pages on that domain.
mikkupikku•57m ago
NSA, trying to force everybody onto their Cloudflare reservation.
velox_neb•47m ago
I bet some guy just told Claude Code to archive all of LWN for him on a whim.
tux3•11m ago
Some guy doesn't show up with 10k residential IPs. This is deliberate and organized.
dannyobrien•15m ago
I've been asking this for a while, especially as a lot of the early blame went on the big, visible US companies like OpenAI and Anthropic. While their incentives are different from search engines (as someone said early on in this onslaught, "a search engine needs your site to stay up; an AI company doesn't"), that's quite a subtle incentive difference. Just avoiding the blocks that inevitably spring up when you misbehave is an incentive the other way -- and probably the biggest reason robots.txt obedience, delays between accesses, back-off algorithms etc are widespread. We have a culture that conveys all of these approaches, and reciprocity has its part, but I suspect that's part of the encouragement to adopt them. It could be that they're just in too much of a hurry to follow the rules, or it could be others hiding behind those bot-names (or others). Unsure.

Anyway, I think the (currently small[1]) but growing problem is going to be individuals using AI agents to access web-pages. I think this falls under the category of the traffic that people are concerned about, even though it's under an individual user's control, and those users are ultimately accessing that information (though perhaps without seeing the ads that pay for it). AI agents are frequently zooming off and collecting hundreds of citations for an individual user, in the time that a user-agent under manual control of a human would click on a few links. Even if those links aren't all accessed, that's going to change the pattern of organic browsing for websites.

Another challenge is that with tools like Claude Cowork, users are increasingly going to be able to create their own, one-off, crawlers. I've had a couple of occasions when I've ended up crafting a crawler to answer a question, and I've had to intervene and explicitly tell Claude to "be polite", before it would build in time-delays and the like (I got temporarily blocked by NASA because I hadn't noticed Claude was hammering a 404 page).

The Web was always designed to be readable by humans and machines, so I don't see a fundamental problem now that end-users have more capability to work with machines to learn what they need. But even if we track down and successfully discourage bad actors, we need to work out how to adapt to the changing patterns of how good actors, empowered by better access to computation, can browse the web.

[1] - https://radar.cloudflare.com/ai-insights#ai-bot-crawler-traf...

dannyobrien•14m ago
(and if anyone from Anthropic or OpenAI is reading this: teach your models to be polite when they write crawlers! It's actually an interesting alignment issue that they don't consider the externalities of their actions right now!)
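
For concreteness, here is a minimal sketch of what "polite" could mean in a one-off crawler: a fixed gap between requests to one host, exponential back-off on consecutive errors, deference to a `Retry-After` value when the server sends one, and giving up on a URL that keeps failing (such as a 404 page). The function names and every default value are illustrative assumptions, not recommendations from any standard.

```javascript
// Decide how long a crawler should wait before its next request to a host.
// All defaults below are illustrative assumptions.
function nextDelayMs(options = {}) {
  const {
    baseDelayMs = 1000,       // minimum gap between requests to one host
    maxDelayMs = 60000,       // cap on the back-off
    consecutiveErrors = 0,    // 4xx/5xx responses received in a row
    retryAfterSeconds = null, // parsed Retry-After header, if the server sent one
  } = options;

  // Exponential back-off: double the base delay for every consecutive error.
  const backoff = Math.min(baseDelayMs * 2 ** consecutiveErrors, maxDelayMs);

  // A server-provided Retry-After wins when it asks for a longer wait.
  if (retryAfterSeconds !== null) {
    return Math.max(backoff, retryAfterSeconds * 1000);
  }
  return backoff;
}

// Stop retrying a URL entirely after this many consecutive errors, so a
// broken link is not hammered indefinitely.
function shouldGiveUp(consecutiveErrors, maxErrors = 5) {
  return consecutiveErrors >= maxErrors;
}
```

A crawler loop would call `nextDelayMs` after each response and sleep for the returned time before the next request to the same host. Checking robots.txt is a separate step not shown here.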
bloppe•1h ago
I'm curious how they concluded this was done to scrape for AI training. If the traffic was easily distinguishable from regular users, they would be able to firewall it. If it was not, then how can they be sure it wasn't just a regular old malicious DDOS? Happens way more often than you might think. Sometimes a poorly-managed botnet can even misfire.
MBCook•1h ago
Why would anyone ever DDOS them? They’ve been around for about three decades now, I don’t know if they’ve ever had a DDOS attack before the AI crawling started.
iamnothere•1h ago
I am starting to think these are not just AI scrapers blindly seeking out data. All kinds of FOSS sites including low volume forums and blogs have been under this kind of persistent pressure for a while now. Given the cost involved in maintaining this kind of widespread constant scraping, the economics don’t seem to line up. Surely even big budget projects would adjust their scraping rates based on how many changes they see on a given site. At scale this could save a lot of money and would reduce the chance of blocking.
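
The rate adjustment suggested above could look something like this sketch: revisit a page sooner when it changed since the last visit and later when it did not. The halving/doubling rule and the one-hour and thirty-day limits are arbitrary assumptions for illustration, and `nextRecrawlIntervalMs` is a hypothetical name.

```javascript
// Adjust how often a page is revisited based on whether it changed.
// Halve the interval when content changed, double it when it did not,
// then clamp the result to [minMs, maxMs].
function nextRecrawlIntervalMs(currentMs, changed, limits = {}) {
  const {
    minMs = 60 * 60 * 1000,           // never revisit more often than hourly
    maxMs = 30 * 24 * 60 * 60 * 1000, // never wait longer than thirty days
  } = limits;
  const proposed = changed ? currentMs / 2 : currentMs * 2;
  return Math.min(Math.max(proposed, minMs), maxMs);
}
```

Under a rule like this, a rarely updated forum or blog would quickly drift toward the maximum interval, which is the cost saving the comment is pointing at.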

I haven’t heard of the same attacks facing (for instance) niche hobby communities. Does anyone know if those sites are facing the same scale of attacks?

Is there any chance that this is a deniable attack intended to disrupt the tech industry, or even the FOSS community in particular, with training data gathered as a side benefit? I’m just struggling to understand how the economics can work here.

zomiaen•51m ago
How many of these scrapers are written by AI for data-science folks who don't remotely care how often they're hitting the sites, and for whom request rate is something they wouldn't even think to tell or ask the LLM about?
iamnothere•23m ago
But does that explain all of the various scrapers doing the same thing across the same set of sites? And again, the sheer bandwidth and CPU time involved should eventually bother the bean counters.

I did think of a couple of possibilities:

- Someone has a software package or list of sites out there that people are using instead of building their own scrapers, so everyone hits the same targets with the same pattern.

- There are a bunch of companies chasing a (real or hoped for) “scraped data” market, perhaps overseas where overhead is lower, and there’s enough excess AI funding sloshing around that they are able to scrape everything mindlessly for now. If this is the case then the problem should fix itself as funding gets tighter.

philipwhiuk•14m ago
> I haven’t heard of the same attacks facing (for instance) niche hobby communities. Does anyone know if those sites are facing the same scale of attacks?

Yes. Fortunately if your hobby community is regional you can be fairly blunt in terms of blocks.

2OEH8eoCRo0•29m ago
When are we going to start suing these assholes? Why isn't anybody leveraging the legal system?
