I also have a faceted search that some stupid crawler has spent the last month iterating through. Those are also mostly uncached URLs.
But from what I have read from time to time, these crawlers act orders of magnitude outside of what could be excused as merely bad configuration.
https://herman.bearblog.dev/the-great-scrape/
https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
https://lwn.net/Articles/1008897/
https://tecnobits.com/en/AI-crawlers-on-Wikipedia-platform-d...
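To make the faceted-search problem above concrete: every facet combination is effectively a unique, uncached URL, so one pragmatic mitigation is to refuse known crawler user agents on facet-parameter requests specifically while leaving normal pages alone. A minimal sketch in Flask; the facet parameter names and bot strings are assumptions for illustration, not anything taken from the thread:

```python
# Rough sketch (not the commenter's actual setup): refuse known crawler
# user agents on faceted-search URLs, which are the expensive uncached pages.
# FACET_PARAMS and BOT_MARKERS are invented for illustration.
from flask import Flask, abort, request

app = Flask(__name__)

FACET_PARAMS = {"color", "size", "brand", "sort"}   # assumed facet query params
BOT_MARKERS = ("GPTBot", "CCBot", "Bytespider")     # example crawler UA substrings

@app.before_request
def refuse_bots_on_facets():
    ua = request.headers.get("User-Agent", "")
    hits_facets = bool(FACET_PARAMS & set(request.args))
    if hits_facets and any(marker in ua for marker in BOT_MARKERS):
        abort(403)  # cheap refusal instead of rendering an uncached result page

@app.route("/search")
def search():
    return "search results"
```

A robots.txt Disallow on the facet parameters would be the polite-web version of the same thing; the reason people reach for server-side checks is precisely that the crawlers being complained about ignore it.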
> Loading static pages from CDN to scrape training data takes such minimal amounts of resources that it's never going to be a significant part of my costs. Are there cases where this isn't true?
Why did you bring up static pages served by a CDN, the absolute best case scenario, as your reference for how crawler spam might affect server performance?
I think that statement is way too strong and obviously not true of businesses. It might be true of hobbyist websites where the creator is personally more interested in the server side, but it's definitely not true of professional websites.
Professional websites that have enough of a budget to care about the server side will absolutely care about the client side and will track usage. If 10% fewer people used the website, the analytics would show that and there would be a fire drill.
Where I can agree with the author is on a more nuanced point. Client-side problems are a lot harder and have a very long tail due to unique client configurations (OS, browser, extensions, physical hardware). With thousands of combinations, you end up with some wild and rare issues. It becomes hard to chase all of them down, and some you just have to ignore.
This can make it feel like websites don't care about the client side, but really it just shows that the client side is hard.
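As a rough illustration of the "fire drill" point above, the check implied is just a week-over-week comparison on a usage metric; the threshold and figures below are made up:

```python
# Hypothetical sketch: flag a large week-over-week drop in usage.
# The 10% threshold and the example figures are assumptions, not real data.
def usage_dropped(current_week_users: int, previous_week_users: int,
                  threshold: float = 0.10) -> bool:
    """Return True if usage fell by more than `threshold` week over week."""
    if previous_week_users == 0:
        return False
    drop = (previous_week_users - current_week_users) / previous_week_users
    return drop > threshold

print(usage_dropped(88_000, 100_000))  # True: a 12% drop would trigger the alarm
print(usage_dropped(99_000, 100_000))  # False: a 1% dip would not
```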
Amazon.com Inc is currently worth 2.4 trillion dollars, and the only reason is that most businesses insist on giving their customers the worst online experience possible. I wish that I could one day understand the logic, which goes like this:
1. Notice that people are on their phones all the time.
2. Notice also that when people are looking to buy something, they first go to the computer or the smartphone.
3. Therefore let's make the most godawful experience on our website possible, to make sure that our potential customers hate us and don't make a purchase.
4. Customers make their purchase on Amazon instead.
5. Profit??
This is an incredibly reductive view of how Amazon came to dominate online retail. If you genuinely believe this, I would strongly urge you to research their history and understand how they became the monopoly they are today.
I assure you, it's not primarily because they care more about the end user's experience.
Amazon, on the other hand, is plagued with fake or bad products from copycat sellers. I have no idea what I am going to get when I place an order. Frankly, I'm surprised when I get the actual thing I ordered.
Huh?
He makes a statement in an earlier article that I think sums things up nicely:
> One thing I've wound up feeling from all this is that the current web is surprisingly fragile. A significant amount of the web seems to have been held up by implicit understandings and bargains, not by technology. When LLM crawlers showed up and decided to ignore the social things that had kept those parts of the web going, things started coming down all over the place.
This social contract is, to me, built around the idea that a human will direct the operation of a computer in real time (largely by using a web browser and clicking links), but I think that this approach is an extremely inefficient use of both the computer’s and the human’s resources (CPU and time, respectively). The promise of technology should not be to put people behind desks staring at a screen all day, so this evolution toward automation must continue.
I do wonder what the new social contract will be: Perhaps access to the majority of servers will be gated by micropayments, but what will the “deal” be for those who don’t want to collect payments? How will they prevent abuse while keeping access free?
[1] “The current (2025) crawler plague and the fragility of the web”, https://utcc.utoronto.ca/~cks/space/blog/web/WebIsKindOfFrag...
If 1000 AWS boxes start hammering your API, you might raise an eyebrow, but 1000 requests coming from residential ISPs around the world could be an organic surge in demand for your service.
Residential proxy services break this - which has been happening on some level for a long time, but the AI-training-set arms race has driven up demand and thus also supply.
It's quite easy to block all of AWS, for example, but it's less easy to figure out which residential IPs are part of a commercially-operated botnet.
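To make the asymmetry concrete: AWS publishes its address ranges as JSON, so deciding whether a client IP belongs to AWS is a simple lookup, while residential proxy exits have no comparable published list. A small sketch; the sample addresses are illustrative only:

```python
# Sketch of the "easy to block all of AWS" half of the asymmetry: AWS publishes
# its IP ranges, so a source address can be checked against them. There is no
# equivalent list for residential proxy exit nodes.
import ipaddress
import json
import urllib.request

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

def load_aws_networks():
    """Fetch and parse the published AWS IPv4 prefixes."""
    with urllib.request.urlopen(AWS_RANGES_URL) as resp:
        data = json.load(resp)
    return [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]

def is_aws(ip: str, networks) -> bool:
    """Return True if `ip` falls inside any published AWS range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

if __name__ == "__main__":
    networks = load_aws_networks()
    print(is_aws("52.95.110.1", networks))   # an address from an AWS-allocated block
    print(is_aws("203.0.113.7", networks))   # TEST-NET-3 documentation address, not AWS
```

A residential proxy, by contrast, hands the crawler an address from an ordinary consumer ISP, which is exactly why it defeats this kind of check.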
The author needs to open with a paragraph that establishes better context. They open with a link to another post about their anti-LLM defenses, but that doesn't clarify what they are comparing when they set server-side problems against client-side problems.