I also have a faceted search that some stupid crawler has spent the last month iterating through. Also mostly uncached URLs.
But from what I have read from time to time, these crawlers act orders of magnitude outside of what could be excused as merely being badly configured.
https://herman.bearblog.dev/the-great-scrape/
https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
https://lwn.net/Articles/1008897/
https://tecnobits.com/en/AI-crawlers-on-Wikipedia-platform-d...
So why are they open to the entire world?
> Loading static pages from CDN to scrape training data takes such minimal amounts of resources that it's never going to be a significant part of my costs. Are there cases where this isn't true?
Why did you bring up static pages served by a CDN, the absolute best case scenario, as your reference for how crawler spam might affect server performance?
An example is NextJS where you're strongly encouraged[0] to run a server (or use a platform like Vercel), even if what you're doing is a fairly simple static site.
Combine inconsiderate crawler (AI or otherwise) with a server-side logic that doesn't really need to be there and you have a recipe for a crash, a big hosting bill, or both.
[0] People see https://nextjs.org/docs/app/guides/static-exports#unsupporte... and go "ah shucks I better have a server component then"
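For what it's worth, opting out of the server is mostly a one-line config change if you can live without the features listed on that page. A minimal sketch, assuming a recent Next.js version that supports output: 'export':

    // next.config.js - a sketch of a fully static build; the CDN serves ./out
    /** @type {import('next').NextConfig} */
    const nextConfig = {
      output: 'export',              // emit plain HTML/CSS/JS at build time
      images: { unoptimized: true }, // the image optimizer needs a server, so turn it off
    };

    module.exports = nextConfig;

The resulting out/ directory can be dumped on any static host or CDN, which is exactly the "best case scenario" from the comment above.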
> Why did you bring up static pages served by a CDN...
This is easier said than done, but pushing the latest topic snapshot to the CDN whenever a post is made is doable.
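Something along these lines, with the render function and output path being hypothetical stand-ins for whatever the forum software actually uses:

    // Sketch: when a post is saved, re-render that topic once and publish the
    // result as a static file that the CDN origin serves.
    const fs = require('node:fs/promises');
    const path = require('node:path');

    async function publishTopicSnapshot(topicId, renderTopicHtml) {
      const html = await renderTopicHtml(topicId);   // however the forum renders a topic
      const outFile = path.join('/var/www/static/topics', `${topicId}.html`);
      await fs.writeFile(outFile, html);             // CDN origin points at this directory
      // In practice you might upload to an object store behind the CDN instead,
      // and purge the CDN cache for just that one URL.
    }

Crawlers then hit a cached snapshot instead of regenerating the page on every request.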
I think that statement is way too strong and obviously not true of businesses. It might be true of hobbyist websites where the creator is personally more interested in the server side, but it's definitely not true of professional websites.
Professional websites that have enough of a budget to care about the server side will absolutely care about the client side and will track usage. If 10% fewer people used the website, the analytics would show that and there would be a fire drill.
Where I can agree with the author is on a more nuanced point. Client-side problems are a lot harder and have a very long tail due to unique client configurations (OS, browser, extensions, physical hardware). With thousands of combinations, you end up with some wild and rare issues. It becomes hard to chase all of them down, and some you just have to ignore.
This can make it feel like websites don't care about the client side, but it really just shows that the client side is hard.
Amazon.com Inc is currently worth 2.4 trillion dollars, and the only reason is that most businesses insist on giving their customers the worst online experience possible. I wish that I could one day understand the logic, which goes like this:
1. Notice that people are on their phones all the time.
2. And notice that when people are looking to buy something they first go on the computer or on the smart phone.
3. Therefore let's make the most godawful experience on our website possible, to make sure that our potential customers hate us and don't make a purchase.
4. Customers make their purchase on Amazon instead.
5. Profit??
This is an incredibly reductive view of how Amazon came to dominate online retail. If you genuinely believe this, I would strongly urge you to research their history and understand how they became the monopoly they are today.
I assure you, it's not primarily because they care more about the end user's experience.
Amazon, on the other hand, is plagued with fake or bad products from copycat sellers. I have no idea what I am going to get when I place an order. Frankly, I'm surprised when I get the actual thing I ordered.
No, they can't, as evidenced by the fact that not everyone else in e-commerce is doing that.
"Any business can do what Amazon does for their products and their customers."
What I meant is that any business can do for their products and their customers what Amazon does. Not that any business can do everything Amazon does.
There would be little reason for online marketplaces like Amazon to grow so huge if businesses had cared enough to provide a reasonable online experience, whether 20 years ago, 10 years ago, or 5 years ago. Now we are in 2025 and most businesses offer a worse online customer experience than what good businesses were offering 20 years ago. You can't be 20 years behind the times and say that it's impossible to compete. It's very possible to make a great customer experience and make money online, even for small businesses with limited means, as evidenced by the many companies doing exactly that.
It's the same with any marketplace like Booking.com or restaurant delivery apps. They wouldn't be half as big if the businesses they serve weren't too lazy and indifferent to make a decent online experience for their customers. But here we are.
Statements like this are just staggeringly ignorant of how businesses like Amazon operate.
Businesses don't need to be as good as Amazon or deliver as fast; Amazon is just an example. But businesses do need to take their online experience seriously if they don't want to be pushed aside by Amazon and the like. And few businesses seem to do that, even though it's not hard.
Huh?
Also, at least in Spain, some delivery companies are awful. Right now I have a package delivered to a convenience store. They refuse to give it to me because I have no delivery key; the courier never sent it to me. I tried to get assistance on their website... and they ask me for the very key I want them to give me. Nice, huh?
I asked the shop for a refund. They have ghosted me in the chat, their return form doesn't work, their email addresses are no-reply, and the contact form doesn't work either. Now I need to wait until Monday to phone them.
I know the shop is legit. They're just woefully incompetent and either don't know it or think that's just the way things work.
For cheap and moderately priced products, Amazon just works. No "but I went to your house and there was nobody there" bullshit. No-questions-asked return policy.
I think you're underselling the amount of work it takes to create an experience as smooth as Amazon's.
He makes a statement in an earlier article[1] that I think sums things up nicely:
> One thing I've wound up feeling from all this is that the current web is surprisingly fragile. A significant amount of the web seems to have been held up by implicit understandings and bargains, not by technology. When LLM crawlers showed up and decided to ignore the social things that had kept those parts of the web going, things started coming down all over the place.
This social contract is, to me, built around the idea that a human will direct the operation of a computer in real time (largely by using a web browser and clicking links), but I think that this approach is an extremely inefficient use of both the computer's and the human's resources (CPU and time, respectively). The promise of technology should not be to put people behind desks staring at a screen all day, so this evolution toward automation must continue.
I do wonder what the new social contract will be: Perhaps access to the majority of servers will be gated by micropayments, but what will the “deal” be for those who don’t want to collect payments? How will they prevent abuse while keeping access free?
[1] “The current (2025) crawler plague and the fragility of the web”, https://utcc.utoronto.ca/~cks/space/blog/web/WebIsKindOfFrag...
If 1000 AWS boxes start hammering your API you might raise an eyebrow, but 1000 requests coming from residential ISPs around the world could be an organic surge in demand for your service.
Residential proxy services break this - which has been happening on some level for a long time, but the AI-training-set arms race has driven up demand and thus also supply.
It's quite easy to block all of AWS, for example, but it's less easy to figure out which residential IPs are part of a commercially-operated botnet.
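The "easy" half really is easy, since AWS publishes its address ranges; a rough sketch (assumes Node 18+ for global fetch):

    // Sketch of the easy half: AWS publishes its IP ranges, so blocking them
    // wholesale is a lookup. There is no such list for rented residential IPs,
    // which is the hard half.
    const net = require('node:net');

    async function buildAwsBlockList() {
      const res = await fetch('https://ip-ranges.amazonaws.com/ip-ranges.json');
      const { prefixes } = await res.json();
      const list = new net.BlockList();
      for (const { ip_prefix } of prefixes) {
        const [addr, bits] = ip_prefix.split('/');
        list.addSubnet(addr, Number(bits), 'ipv4');
      }
      return list; // later: if (list.check(clientIp)) reject or throttle
    }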
Is the client navigating the site faster than humanly possible? It's a bot. This seems like a simple test.
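A naive sketch of that test, keyed on the client IP with made-up thresholds; as the replies below point out, it falls apart once requests are spread across a large IP pool:

    // Naive sketch of the "faster than humanly possible" test, keyed on IP.
    const hits = new Map(); // ip -> timestamps (ms) of recent requests

    function looksLikeBot(ip, now = Date.now()) {
      const windowMs = 10_000;       // look at the last 10 seconds
      const maxHumanPageLoads = 15;  // no human clicks through pages faster than this
      const recent = (hits.get(ip) || []).filter((t) => now - t < windowMs);
      recent.push(now);
      hits.set(ip, recent);
      return recent.length > maxHumanPageLoads;
    }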
> 1000 requests coming from residential ISPs around the world could be an organic surge
But probably isn't.
Not when a single bot has a pool of millions of IPs to originate each request from.
If you think there's an easy solution here, productize it and make billions.
As we've seen, security is really hard to build in after the fact. It has to be part of your design concept from the very start, and it pervades every other decision you make. If you try to layer security on top, you will lose.
Of course you may discover that a genuinely secure system is also unusably inconvenient and you lose to someone willing to take risks, and it's all moot.
Directly from the article:
> it's not new, and it goes well beyond anti-crawler and anti-robot defenses. As covered by people like Alex Russell, it's routine for websites to ignore most real world client side concerns (also, and including on desktops). Just recently (as of August 2025), Github put out a major update that many people are finding immensely slow even on developer desktops.
The things he links to are unrelated to anti-bot measures.
The fact is, the web is an increasingly unpleasant place to visit. Users are subject to terrible UX – dark patterns, tracking, consent popups, ads everywhere, etc.
Then along come chatbots and when somebody asks about something, they are given the response on the spot without having to battle their way through all that crap to get what they want.
Of course users are going to flock to chatbots. If a site owner is worried they are losing traffic to chatbots, perhaps they should take a long, hard look at what kind of user experience they are serving up to people.
This is like streaming media all over again. Would you rather buy a legit DVD and wait for it to arrive in the post, then sit through an unskippable lecture about piracy, then sit through unstoppable trailers, then find your way through a weird, horrible DVD menu… or would you rather download it and avoid all that? The thing that alleviated piracy was not locking things down even more; it was making the legitimate route more convenient.
We need to make websites pleasant experiences again, and we can’t do that when we care about everything else more than the user experience.
No other website can compete with that.
The whole story with streaming media is not just that pay streaming became more convenient. It’s also that content creators used legal and business mechanisms to make piracy inconvenient. They shut down Napster. They send DMCA notices. They got the DMCA enacted. They got YouTube working for them by serving ads with their content and thus monetizing it.
Chat bots are just like Napster. They’re free-riding off the content others worked to create. Just like with Napster, making websites more convenient will be only part of the answer.
Copyright holders, not content creators. Though content creators are typically also copyright holders, copyright holders are not always content creators, especially in this context. To a large degree these practices are not on behalf of content creators, nor are they helping them.
The solution may be elsewhere: starting from creating content that people may actually care about.
> No other website can compete with that.
Copyright infringers uploaded music, television, and films free of charge, yet people still pay for all of that.
> The whole story with streaming media is not just that pay streaming became more convenient. It’s also that content creators used legal and business mechanisms to make piracy inconvenient.
Do you seriously think that copyright infringement ended when Napster went down? Have you never heard of the Pirate Bay or Bittorrent? They didn’t succeed at all in shutting down copyright infringement. People pay for things because it’s convenient, not because copyright infringement is no longer an option.
This is something I've been pondering, and honestly I feel like the author doesn't go far enough. I would go as far as to say that a lot of our modern society has been held up by these implicit social contracts. But nowadays we see things like gerrymandering in the US, or the overuse of Article 49.3 in France to pass laws without a parliamentary vote. Just an overall trend of people feeling constrained only by the exact letter of the law and ignoring the spirit of it.
Except it turns out these implicit understandings that you shouldn't do that existed because breaking them makes life shittier for everyone, and that's what we're experiencing now.
The author needs to open with a paragraph that establishes better context. They open with a link to another post where they talk about anti-LLM defenses but it doesn't clarify what they are talking about when they compare server problems with client-side problems.
> You're using a suspiciously old browser
> You're probably reading this page because you've attempted to access some part of my blog (Wandering Thoughts) or CSpace, the wiki thing it's part of. Unfortunately you're using a browser version that my anti-crawler precautions consider suspicious, most often because it's too old (most often this applies to versions of Chrome). Unfortunately, as of early 2025 there's a plague of high volume crawlers (apparently in part to gather data for LLM training) that use a variety of old browser user agents, especially Chrome user agents. To reduce the load on Wandering Thoughts I'm experimenting with (attempting to) block all of them, and you've run into this.
> If this is in error and you're using a current version of your browser of choice, you can contact me at my current place at the university (you should be able to work out the email address from that). If possible, please let me know what browser you're using and so on, ideally with its exact User-Agent string.
Hopefully I solved his email address riddle.
Safari can't open the page because it couldn't establish a secure connection to the server.
Add "webmasters" or "sysadmins" to the list?
// An eighth reload worked.
https://web.archive.org/web/20250823105045if_/https://utcc.u...
https://utcc.utoronto.ca/~cks/cspace-generic-ua.html
...which complains about an "HTTP User-Agent header value that is too generic or otherwise excessively suspicious. Unfortunately, as of early 2025 there's a plague of high volume crawlers (apparently in part to gather data for LLM training) that behave like this.", and I'm left thinking that the person behind this site does not care about client-side problems...
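For illustration only, a guess at the shape of that kind of filter; the cutoff and pattern are invented, not the author's actual rules:

    // Illustrative guess: reject empty or overly generic User-Agents, and
    // reject Chrome versions older than some cutoff.
    function isSuspiciousUserAgent(ua, minChromeMajor = 120) {
      if (!ua || ua.length < 10) return true;              // "too generic"
      const match = ua.match(/Chrome\/(\d+)/);
      if (match && Number(match[1]) < minChromeMajor) return true; // ancient Chrome build
      return false;
    }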
Thinking of the most extreme option (throwing proof-of-work checks at browsers), the main sites that jump to mind are sourcehut, the Linux Kernel Archives, and so on, and the admins of all of those sites have noted that the traffic they get is far outside of expectations[0]. Not whatever blogspam ended up at the top of Google search that day.
The badly designed sites are often the ones that don't care about their bandwidth anyway.
[0]: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
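For context, the proof-of-work option boils down to something like this: hand the client a nonce, make it burn CPU finding a hash with enough leading zeros, and only then serve the page. A bare-bones sketch with an arbitrary difficulty:

    // Bare-bones proof-of-work sketch: the server issues a nonce, the client
    // must find a counter whose sha256(nonce + counter) starts with N zero hex
    // digits, and only then gets the real page.
    const crypto = require('node:crypto');

    function makeChallenge(difficulty = 4) {
      return { nonce: crypto.randomBytes(16).toString('hex'), difficulty };
    }

    function solve({ nonce, difficulty }) {
      const prefix = '0'.repeat(difficulty);
      for (let counter = 0; ; counter++) {
        const hash = crypto.createHash('sha256').update(nonce + counter).digest('hex');
        if (hash.startsWith(prefix)) return counter; // a browser pays this cost once per visit
      }
    }

    function verify({ nonce, difficulty }, counter) {
      const hash = crypto.createHash('sha256').update(nonce + counter).digest('hex');
      return hash.startsWith('0'.repeat(difficulty));
    }

The cost is negligible for one human visit but adds up fast for a crawler requesting millions of pages.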
Do you think that every small personal website is serving nothing but "a 3 line LLM generated blog post"? Do you not think there are some out there that have perfectly reasonable content? Much of it not even monetized?
And yet the bots are causing this problem for everyone. They are completely indiscriminate.
So before you try to dismiss this as a non-issue, maybe consider that there's more out there being affected by this than the absolute worst case possible to imagine.