LLM Botnet: Are companies using botnets to scrape content?

3•flyriver•2d ago
I have a web site with several million pages of articles, generated by Llama, GPT, and Gemini. As you can imagine, there is a lot of scraping happening. Generally speaking, I allow the crawlers that respect robots.txt and identify themselves as bots to go wild. I figure it might get the site more exposure if it is "in" the LLMs. Otherwise, I try to block them.

Over time, especially recently, I have seen thousands of diverse IP addresses scraping the site. They use random/varying user-agents. I originally was blocking Brazil /16s since it appeared that most of the traffic was coming from there, but over the past few weeks the IPs come from everywhere. Each IP makes only a few requests, trying to stay under the radar. Right now, I have set some scripts to block and log the IPs as they come in.

I am blocking between 50 and 100 unique IP addresses per minute, and this is after I already blocked the main Chinese LLM scrapers and several /16s. Few of the IPs belong to obvious providers. Many just seem to be home users. Many are from countries that do not have the money to build LLMs. There are even wireless phone company IPs.
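For anyone wanting to check whether /16 blocks still make sense against this kind of traffic, here is a minimal sketch that groups blocked IPv4 addresses by their enclosing /16. The function name `count_by_slash16` and the sample addresses are assumptions for illustration (documentation and private ranges, not real traffic); it uses only Python's standard `ipaddress` module.

```python
import ipaddress
from collections import Counter


def count_by_slash16(ips):
    """Count IPv4 addresses per enclosing /16 network."""
    counts = Counter()
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        if addr.version != 4:
            continue  # this sketch only handles IPv4
        # strict=False lets us pass a host address and get its network back
        net = ipaddress.ip_network(f"{ip}/16", strict=False)
        counts[str(net)] += 1
    return counts


# Illustrative addresses only (documentation and private ranges).
sample = ["198.51.100.7", "198.51.200.9", "203.0.113.5", "10.1.2.3", "10.1.9.9"]
by_net = count_by_slash16(sample)

for net, n in by_net.most_common():
    print(net, n)
```

If most /16s show up with only one or two addresses each, range blocks will not catch much, which would match the "IPs come from everywhere" pattern described above.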

None of the requests are particularly malicious. They are just downloading pages.

Am I missing something? Is there a new botnet scraping the web? A quick grep through my logs shows I have blocked 15,000 requests in the past 90 minutes, but only 1,300 of them are repeats of IPs that have been added to my block list. Yesterday, I blocked 220,000 requests and only 13,000 of them were repeats.
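To put the numbers above in proportion, here is a small sketch that computes the share of blocked requests coming from IPs already on the block list. The helper name `repeat_share` is an assumption for illustration; the counts are the ones quoted in the post.

```python
def repeat_share(blocked_requests, repeat_requests):
    """Fraction of blocked requests that came from already-listed IPs."""
    return repeat_requests / blocked_requests


# Counts quoted in the post.
last_90_min = repeat_share(15_000, 1_300)
yesterday = repeat_share(220_000, 13_000)

print(f"last 90 minutes: {last_90_min:.1%} repeats")
print(f"yesterday:       {yesterday:.1%} repeats")
```

That works out to roughly 9% repeats in the last 90 minutes and roughly 6% yesterday, so more than 90% of the blocked requests came from addresses that were not yet on the list.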

Comments

Retr0id•2d ago
They're usually residential proxies, enabled by "SDKs" shipped as a means of monetizing mobile apps. Basically a legalized(ish) botnet.

If you have AI-generated content, expect an AI-generated audience.

tough•1d ago
Garbage In, Garbage Out

alp1n3-dev•13h ago
It isn't a "new" botnet, but just continued use of a rotating array of available addresses. Some are enterprise and will be upfront (as you've observed), some less so.

Different devices also enable this, as mentioned in the comments: smart home / IoT devices, cable-brand routers/APs, etc. There are also services for rotating residential proxies that are essentially breaking the ToS of the companies in charge of them, but they trade/buy new IPs constantly.

The larger the scale at which a site is indexed, the more you'll see this traffic pick up. Cloudflare has rules that can help with it, and you can get stricter with them if you know your audience / customer base, using whitelists on geo (can be circumvented ofc, but quiets the noise), user-agent, HTTP version, etc. For the broader ones, just immediately prompt a challenge if you don't want to outright drop/block them.

It's been this way for a while. LLMs have made it worse, but there was already a ton of garbage requests / scanning going on.
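As a rough illustration of the allow / challenge / block idea described above, here is a minimal sketch of rule-based triage on user-agent, geo, and HTTP version. This is not Cloudflare's rule syntax or API; the `Request` type, the `triage` function, the bot list, and the country set are all hypothetical and would need to be replaced with whatever fits your own audience.

```python
from dataclasses import dataclass


@dataclass
class Request:
    user_agent: str
    country: str       # two-letter country code, assumed to come from a GeoIP lookup
    http_version: str  # e.g. "HTTP/1.1" or "HTTP/2"


# Hypothetical values for illustration only.
DECLARED_BOTS = ("Googlebot", "bingbot")
EXPECTED_COUNTRIES = {"US", "CA"}


def triage(req):
    """Return "allow", "challenge", or "block" for a request."""
    # Crawlers that identify themselves are let through. A user-agent can be
    # spoofed, so a real setup would verify the claim some other way.
    if any(bot in req.user_agent for bot in DECLARED_BOTS):
        return "allow"
    # No user-agent at all: drop outright.
    if not req.user_agent:
        return "block"
    # Outside the expected audience or on an old protocol version: challenge
    # instead of blocking, since geo can be wrong or circumvented.
    if req.country not in EXPECTED_COUNTRIES or req.http_version == "HTTP/1.0":
        return "challenge"
    return "allow"


print(triage(Request("Mozilla/5.0", "BR", "HTTP/2")))
```

Challenging rather than blocking on the broad rules keeps real visitors from unexpected regions able to get in, at the cost of some friction.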
