AI crawlers, fetchers are blowing up websites; Meta, OpenAI are worst offenders

https://www.theregister.com/2025/08/21/ai_crawler_traffic/
65•rntn•2h ago

Comments

breakyerself•2h ago
There's so much bullshit on the internet. How do they make sure they're not training on nonsense?
bgwalter•1h ago
Much of it is not training. The LLMs fetch webpages for answering current questions, summarize or translate a page at the user's request etc.

Any bot that answers daily political questions like Grok has many web accesses per prompt.

8organicbits•1h ago
Is an AI chatbot fetching a web page to answer a prompt a 'web scraping bot'? If there is a user actively prompting the LLM, isn't it more of a user agent? My mental model, even before LLMs, was that a human being present changes a bot into a user agent. I'm curious if others agree.
bgwalter•1h ago
The Register calls them "fetchers". They still reproduce the content of the original website without the website gaining anything but additional high load.

I'm not sure how many websites are searched and discarded per query. Since it's the remote, proprietary LLM that initiates the search I would hesitate to call them agents. Maybe "fetcher" is the best term.

danaris•27m ago
But they're (generally speaking) not being asked for the contents of one specific webpage, fetching that, and summarizing it for the user.

They're going out and scraping everything, so that when they're asked a question, they can pull a plausible answer from their dataset and summarize the page they found it on.

Even the ones that actively go out and search/scrape in response to queries aren't just scraping a single site. At best, they're scraping some subset of the entire internet that they have tagged as being somehow related to the query. So even if what they present to the user is a summary of a single webpage, that is rarely going to be the product of a single request to that single webpage. That request is going to be just one of many, most of which are entirely fruitless for that specific query: purely extra load for their servers, with no gain whatsoever.

snowwrestler•1h ago
While it’s true that chatbots fetch information from websites in response to requests, the load from those requests is tiny compared to the volume of requests indexing content to build training corpuses.

The reason is that user requests are similar to other web traffic because they reflect user interest. So those requests will mostly hit content that is already popular, and therefore well-cached.

Corpus-building crawlers do not reflect current user interest and try to hit every URL available. As a result these hit URLs that are mostly uncached. That is a much heavier load.
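
A rough way to see the caching argument is a toy simulation. This is only a sketch under made-up assumptions: an LRU cache in front of the origin, user traffic skewed toward popular pages (weights proportional to 1/rank), and a crawler that requests every URL exactly once. The sizes, URL names, and the `simulate` helper are hypothetical.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache that only tracks which URLs are currently cached."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def request(self, url):
        """Return True on a cache hit, False on a miss (the origin gets hit)."""
        if url in self._entries:
            self._entries.move_to_end(url)  # mark as most recently used
            return True
        self._entries[url] = None
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return False


def hit_rate(urls, capacity):
    """Fraction of requests served from a cold-started cache."""
    cache = LRUCache(capacity)
    urls = list(urls)
    hits = sum(cache.request(u) for u in urls)
    return hits / len(urls)


def simulate(num_urls=1000, capacity=100, num_requests=10000, seed=0):
    """Compare popularity-skewed user traffic with a single full crawl."""
    rng = random.Random(seed)
    all_urls = [f"/page/{i}" for i in range(num_urls)]
    # Assumed popularity curve: page at rank r is requested ~1/r as often.
    weights = [1.0 / (rank + 1) for rank in range(num_urls)]
    user_traffic = rng.choices(all_urls, weights=weights, k=num_requests)
    crawl_traffic = all_urls  # the crawler touches every URL exactly once
    return hit_rate(user_traffic, capacity), hit_rate(crawl_traffic, capacity)


if __name__ == "__main__":
    user_rate, crawl_rate = simulate()
    print(f"user traffic hit rate:  {user_rate:.2f}")
    print(f"crawl traffic hit rate: {crawl_rate:.2f}")
```

In this model every crawler request is a miss, because no URL is ever requested twice, while the skewed user traffic is partly absorbed by the cache. Real traffic and real caches are messier than this.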

shikon7•21m ago
But surely there aren't thousands of new corpuses built every minute.
bgwalter•3m ago
Why would the Register point out Meta and OpenAI as the worst offenders? I'm sure they do not continuously build new corpuses every day. It is probably the search function, as mentioned in the top comments.
prasadjoglekar•1h ago
By paying a pretty penny for non bullshit data (Scale Ai). That along with Nvidia are the shovels in this gold rush.
danaris•40m ago
I mean...they don't. That's part of the problem with "AI answers" and such.
shinycode•1h ago
At the same time, it’s so practical to ask a question and have it open 25 pages to search and summarize the answer. Before, that’s more or less what I was trying to do by hand. Maybe not 25 websites, because with crap SEO the top 10 contains BS content, so I curated the list, but the idea is the same, no?
pm215•1h ago
Sure, but if the fetcher is generating "39,000 requests per minute" then surely something has gone wrong somewhere?
miohtama•1h ago
Even if it is generating 39k req/minute, I would expect most of the pages to already be locally cached by Meta, or served statically by their respective hosts. We have been working hard on caching websites and it has been a solved problem for the last decade or so.
andai•56m ago
They're not very good at web queries. If you expand the thinking box to see what they're searching for, about half of it is nonsense.

e.g. they'll take an entire sentence the user said and put it in quotes for no reason.

Thankfully search engines started ignoring quotes years ago, so it balances out...

rco8786•53m ago
My personal experience is that OpenAI's crawler was hitting a very, very low-traffic website I manage tens of thousands of times a minute, non-stop. I had to block it in Cloudflare.
danaris•38m ago
Same here.

I run a very small browser game (~120 weekly users currently), and until I put its Wiki (utterly uninteresting to anyone who doesn't already play the game) behind a login-wall, the bots were causing massive amounts of spurious traffic. Due to some of the Wiki's data coming live from the game through external data feeds, the deluge of bots actually managed to crash the game several times, necessitating a restart of the MariaDB process.

Leynos•14m ago
Where is caching breaking so badly that this is happening? Is OpenAI failing to use ETags or honour cache validity?
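
For reference, ETag revalidation on the client side looks roughly like this. It is a sketch of the general HTTP mechanism (`If-None-Match` answered by `304 Not Modified`), not a description of any particular crawler. `CachedPage`, `fake_origin`, and the page contents are hypothetical stand-ins so the example runs without a network.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class CachedPage:
    etag: str
    body: bytes


def conditional_headers(cached: Optional[CachedPage]) -> Dict[str, str]:
    """Send If-None-Match only when we already hold a copy with an ETag."""
    if cached is not None and cached.etag:
        return {"If-None-Match": cached.etag}
    return {}


def apply_response(cached: Optional[CachedPage], status: int,
                   headers: Dict[str, str], body: bytes) -> CachedPage:
    """304 means the cached copy is still valid; otherwise store the new body."""
    if status == 304:
        if cached is None:
            raise ValueError("got 304 without a cached copy")
        return cached
    return CachedPage(etag=headers.get("ETag", ""), body=body)


def fake_origin(request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str], bytes]:
    """Hypothetical server holding one unchanged page (stand-in for a real site)."""
    etag, body = '"v1"', b"<html>wiki page</html>"
    if request_headers.get("If-None-Match") == etag:
        return 304, {"ETag": etag}, b""  # no body is re-sent
    return 200, {"ETag": etag}, body


if __name__ == "__main__":
    first = apply_response(None, *fake_origin(conditional_headers(None)))
    status, headers, body = fake_origin(conditional_headers(first))
    print(status, len(body))
```

A revalidation still costs the server a request, but a 304 skips regenerating and transferring the body.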
Analemma_•3m ago
Their crawler is vibe-coded.
internet_points•1h ago
They mention anubis, cloudflare, robots.txt – does anyone have experiences with how much any of them help?
nromiun•1h ago
CDNs like Cloudflare are the best. Anubis is a rate limiter for small websites where you can't or won't use CDNs like Cloudflare. I have used Cloudflare on several medium-sized websites and it works really well.

Anubis's creator says the same thing:

> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.

Source: https://github.com/TecharoHQ/anubis

bakugo•20m ago
robots.txt is obviously only effective against well-behaved bots. OpenAI etc are usually well behaved, but there's at least one large network of rogue scraping bots that ignores robots.txt, fakes the user-agent (usually to some old Chrome version) and cycles through millions of different residential proxy IPs. On my own sites, this network is by far the worst offender and the "well-behaved" bots like OpenAI are barely noticeable.

To stop malicious bots like this, Cloudflare is a great solution if you don't mind using it (you can enable a basic browser check for all users and all pages, or write custom rules to only serve a check to certain users or on certain pages). If you're not a fan of Cloudflare, Anubis works well enough for now if you don't mind the branding.

Here's the cloudflare rule I currently use (vast majority of bot traffic originates from these countries):

  ip.src.continent in {"AF" "SA"} or
  ip.src.country in {"CN" "HK" "SG"} or
  ip.src.country in {"AE" "AO" "AR" "AZ" "BD" "BR" "CL" "CO" "DZ" "EC" "EG" "ET" "ID" "IL" "IN" "IQ" "JM" "JO" "KE" "KZ" "LB" "MA" "MX" "NP" "OM" "PE" "PK" "PS" "PY" "SA" "TN" "TR" "TT" "UA" "UY" "UZ" "VE" "VN" "ZA"} or
  ip.src.asnum in {28573 45899 55836}
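
On the robots.txt point above: a minimal sketch, using Python's standard `urllib.robotparser`, of why the file only constrains bots that check it under their real name. The robots.txt content, the URLs, and the Chrome-like user-agent string are made up for illustration, and `GPTBot` is used as the example crawler token.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler everywhere,
# and keep everyone else out of /wiki/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /wiki/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """What a well-behaved bot would conclude before fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/index.html"
    fake_browser = "Mozilla/5.0 (Windows NT 10.0) Chrome/100.0"
    print(is_allowed("GPTBot", page))        # honest crawler: blocked by its own rule
    print(is_allowed(fake_browser, page))    # same bot with a faked UA: only "*" applies
```

Nothing here is enforced by the server. A bot that fakes its user-agent falls under the `*` rules, and a bot that never reads the file is not affected at all, which is why the thread moves on to Cloudflare and Anubis.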
hombre_fatal•18m ago
Cloudflare's Super Bot Fight Mode completely killed the surge in bot traffic for my large forum.
rco8786•54m ago
OpenAI straight up DoSed a site I manage for my in-laws a few months ago.
muzani•26m ago
What is it about? I'm curious what kinds of things people ask that flood sites.
average_r_user•23m ago
I suppose that they just keep referring to the website in their chats, and probably they have selected the search function, so before every reply, the crawler hits the website
hereme888•51m ago
I'm absolutely pro AI-crawlers. The internet is so polluted with garbage, compliments of marketing. My AI agent should find and give me concise and precise answers.
exasperaited•30m ago
Xe Iaso is my spirit animal.

> "I don't know what this actually gives people, but our industry takes great pride in doing this"

> "unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance that can produce output that superficially resembles the output of human employees"

> "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming."

<3 <3

delfinom•15m ago
I run a symbol server, as in, PDB debug symbol server. Amazon's crawler and a few others love requesting the ever loving shit out of it for no obvious reason. Especially since the files are binaries.

I just set a rate-limit in cloudflare because no legitimate symbol server user will ever be excessive.
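
For anyone without Cloudflare, the same idea can be sketched in application code with a token bucket. This is a generic illustration, not how Cloudflare implements its rate limiting; the rate, the burst size, and keying by client IP are assumptions.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allow `rate_per_sec` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: int, now: Optional[float] = None):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """One bucket per client key (assumed here to be the client IP)."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, now=now)
            self.buckets[client_ip] = bucket
        return bucket.allow(now=now)


if __name__ == "__main__":
    limiter = PerClientLimiter(rate_per_sec=1.0, burst=2)
    for t in (0.0, 0.0, 0.0, 1.0):
        print(t, limiter.allow("203.0.113.7", now=t))
```

A per-IP limit like this does little against the residential-proxy networks described earlier in the thread, since each address stays under the threshold.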

xrd•6m ago
Isn't there a class action lawsuit coming from all this? I see a bunch of people here indicating these scrapers are costing real money to people who host even small niche sites.

Do these large companies not care because they are large enough to hide behind a bunch of lawyers?
