In principle, if you post a webpage, it's presumably going to be viewed at least a few dozen times. If it's an actually good article, it might be viewed a few hundred or even a few thousand times. If each of the 20 or so large AI labs visits it as well, does it just become N+20?
Or am I getting this wrong somehow?
Every day, Amazonbot tries to scan every single PDB directory listed, for no real reason. That alone causes 10k+ requests a day, while legitimate traffic sits at maybe 50 requests a day.
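When the load is dominated by one misbehaving agent, a blunt user-agent filter in front of the app at least keeps it away from anything expensive (well-behaved bots also honor a robots.txt Disallow, but the troublesome ones often don't). A minimal sketch as Python WSGI middleware; the BLOCKED_AGENTS list is just an example, tune it to whatever shows up in your own logs:

    # Minimal user-agent blocklist as WSGI middleware (illustrative list).
    BLOCKED_AGENTS = ("Amazonbot", "GPTBot", "CCBot")

    class BotBlocker:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(bot in ua for bot in BLOCKED_AGENTS):
                # Reject before the request reaches the (expensive) application.
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Automated crawling is not permitted here.\n"]
            return self.app(environ, start_response)

    # Usage: application = BotBlocker(application)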
I think a lot of people confuse scraping for training with on-demand scraping for "agentic use", "deep research", etc. Today I was testing the new GLM-experimental model on their demo site. It had "web search", so I enabled that and asked it for something I had recently researched myself for work. It gave me a good overall list of agentic frameworks after some Google searching and "crawling" of the ~6 sites it found.
As a second message I asked for a list of repo links, how many stars each repo has, and general repo activity. It went on and "crawled" each of the 10 repos on GitHub, couldn't read the star counts, then searched and found a site that reports them, and "crawled" that site 10 times, once per framework.
All in all, my 2-message chat session performed ~5-6 searches and 20-30 page "crawls". Imagine what they do when traffic increases. Now multiply that by every "deep research" provider (Perplexity, Google, OpenAI, Anthropic, etc.). Now think how many "vibe-coded" projects like this exist, and how many are poorly coded and re-crawl each link every time...
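Much of that duplicate load is avoidable on the agent side: even a tiny fetch cache with a short TTL would stop a session from hitting the same URL ten times. A rough sketch of the idea (hypothetical helper, not how any particular provider actually implements it):

    import time
    import urllib.request

    # Tiny per-session fetch cache with a short TTL, so one research session
    # doesn't re-crawl the same URL over and over.
    _cache: dict[str, tuple[float, bytes]] = {}
    TTL_SECONDS = 15 * 60  # assumption: 15 minutes is "fresh enough" here

    def cached_fetch(url: str) -> bytes:
        now = time.time()
        hit = _cache.get(url)
        if hit and now - hit[0] < TTL_SECONDS:
            return hit[1]  # served from cache, no extra request to the site
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read()
        _cache[url] = (now, body)
        return body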
This could also be framed as an API issue, as there are no technical limitations preventing search providers from returning relevant snapshots of the bodies of search results. Then again, there might be legal reasons for not providing that information.
I worry that such caching layers might run afoul of copyright, though :(
Though an internal caching layer would work, surely?
Another content-creator avenue might be to move to 2-tier content serving, where you serve pure HTML as the public interface and reserve "advanced" features that take many CPU cycles for authenticated / paying users. It suddenly doesn't make sense to use a huge, heavy, resource-intensive framework for things that might be crawled heavily by bots or by users running queries through LLMs.
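As a sketch of what that split could look like (a hypothetical Flask-style handler; the point is only that anonymous traffic, bots included, gets cheap pre-rendered HTML while the expensive path is gated behind login):

    from flask import Flask, render_template, send_from_directory, session

    app = Flask(__name__)
    app.secret_key = "change-me"  # placeholder; load from config in practice

    @app.route("/article/<slug>")
    def article(slug):
        if session.get("user_id"):
            # Authenticated / paying users get the full dynamic page
            # (personalization, comments, whatever burns CPU cycles).
            return render_template("article_full.html", slug=slug)
        # Everyone else, including crawlers, gets a pre-rendered static file,
        # which is nearly free to serve.
        return send_from_directory("static_articles", f"{slug}.html")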
Another idea recently discussed here covers "micropayments" for access to content. Probably not trivial to implement either, even though it sounds easy in theory. We've had an entire Web 3.0 hype cycle on this, and there are still no clear, easy solutions for micropayments... Oh well. Web 4.0 it is :)
Also, even if you do block it by law, there are guaranteed to be actors who ignore the law, for example people outside the US. As a result, those people will likely build better AI because they have access to more training data.
The first example of this is of course China. One thing about China is that there's no holy sanctity around data: whatever is made is copyable and effectively enters the public domain regardless. That causes China both to exceed the US and to be slightly less innovative at the same time.
If these laws come to pass, you bet your ass China will exceed the US in AI, like they already have with stem cell research.
I propose we just drop the charade and require us peasants to pay these megacorps 20% of our annual income at tax time.
Still, it doesn't hurt to lower taxes for peasants and raise taxes for corps. That's how you prevent wealth inequality. And middlemen are by nature just a bit corrupt, since they siphon resources simply by being in the middle.
Google is the bringer of traffic, and if you want it, you play by their rules. I don't like that the web is in that position, but here we are.
Some sites no longer get enough traffic from Google to sustain their business where they previously did. Wholesale blocking of Google's crawlers doesn't seem like a risky move for them.
It makes me wonder whether that is a trend. Will more sites go to 'Google zero'?
We didn't. Just as we didn't gift all our chocolate-making infrastructure to Hershey's and Cadbury's.
When you have egregious bot traffic, say 10k requests per minute of sustained load, it becomes a real problem for webmasters.
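At that volume the usual first line of defense is a per-client rate limit, e.g. a sliding-window counter keyed by IP. A simplified sketch (illustrative numbers; in practice you would let the CDN or reverse proxy handle this):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    LIMIT = 120  # max requests per IP per window (example value)

    _hits: dict[str, deque] = defaultdict(deque)

    def allow_request(client_ip: str) -> bool:
        now = time.time()
        window = _hits[client_ip]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()  # drop timestamps that fell out of the window
        if len(window) >= LIMIT:
            return False      # caller should answer with HTTP 429
        window.append(now)
        return True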
This is the next iteration of things like the news-snippet case. Publishers are not happy that Google crawls their content (at their expense) and then republishes it on Google's own site, serving ads around it and collecting user data, without cutting in the publisher who originally made it and who, for what little it's worth, owns the copyright.
Again it sounds like the people who are upset by this really want to publish images rather than web pages.
More like people don't want to lose money because a third party stole all of their content and then repurposed it to show to people before they ever visit the original website.
(Not to be confused with Web3.)
We never distinguished automations from people, though; that makes no sense on the internet.
Why is it such an issue that publishers and website owners want to maintain the traffic to their website so that they can continue operating as usual? Or should we all just accept every Google decision, even when those decisions result in more engagement on google.com, but 20-35% decreases in traffic to the original websites?
Also, I'm going to need a citation that the vast majority of people want and get value out of AI overviews, because that is certainly not the case in my experience.
This isn't a "google decision" people are changing the way they use the web.
Cloudflare to introduce pay-per-crawl for AI bots
It's interesting to see Matthew say the quiet part out loud here, if by "pass a law" he means getting federal legislation passed.
eddythompson80•2h ago
So I'm assuming Cloudflare is basically asking Google to split its crawler's UA, distinguish between search and AI overviews, and respect something akin to robots.txt.
Palomides•2h ago
the reality of the tech is irrelevant
beejiu•2h ago
> Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent strings; the robots.txt user-agent token is used in a control capacity.
https://developers.google.com/search/docs/crawling-indexing/...
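So the control lives entirely in robots.txt: you can disallow the Google-Extended token while leaving Googlebot alone. A quick way to sanity-check such rules with Python's standard robotparser (the policy below is just an example):

    from urllib import robotparser

    # Example robots.txt: keep normal search indexing, opt out of Google-Extended.
    rules = [
        "User-agent: Google-Extended",
        "Disallow: /",
        "",
        "User-agent: Googlebot",
        "Allow: /",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    print(rp.can_fetch("Googlebot", "https://example.com/article"))        # True
    print(rp.can_fetch("Google-Extended", "https://example.com/article"))  # False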
imilk•2h ago
> What does an AI crawler do different from a search engine (indexing) crawler?
Many people don't want the extra AI-driven bot traffic hitting their site, especially when AI chat and AI overviews in Google send so little traffic in return, and what they do send pretty much always has horrendous conversion rates (personally seen across multiple industries).
imilk•1h ago
https://developers.google.com/search/docs/crawling-indexing/...
Google-Extended is the token associated with AI crawling, but Googlebot also crawls to produce AI overviews in addition to indexing your website for Google Search.
While the number of crawlers and their overlapping responsibilities makes it difficult to know which ones you can safely block, I should also say that pure AI-company bots behave 1000x worse than Google's crawlers when it comes to flooding your site with scraping requests.
dathinab•1h ago
It really can be. The Anubis AI-crawler detection tool was created mainly because of way too many AI bot requests. To quote:
> This program is designed to help protect the small internet from the endless storm of requests that flood in from AI companies.
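For context, Anubis works by making the client solve a small proof-of-work challenge in the browser before serving the page: trivial for one human visitor, expensive at scraper scale. A very rough sketch of the idea (a simplified SHA-256 leading-zeros scheme, not Anubis's actual protocol):

    import hashlib

    DIFFICULTY = 4  # required leading hex zeros (illustrative, tune to taste)

    def verify_pow(challenge: str, nonce: str) -> bool:
        """Cheap server-side check: sha256(challenge + nonce) must start with zeros."""
        digest = hashlib.sha256((challenge + nonce).encode()).hexdigest()
        return digest.startswith("0" * DIFFICULTY)

    def solve_pow(challenge: str) -> str:
        """What the client-side script has to do: brute-force a valid nonce."""
        nonce = 0
        while not verify_pow(challenge, str(nonce)):
            nonce += 1
        return str(nonce)

    # Finding a nonce takes real work; verifying it is a single hash.
    nonce = solve_pow("session-token-123")
    assert verify_pow("session-token-123", nonce)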
dathinab•1h ago
This is a problem that needs regulatory action, not one that should be solved by a quasi-monopoly forcing it onto everyone except another quasi-monopoly that can use its own monopoly power to avoid it.
Require:
- respecting robots.txt and similar
- purpose binding/separation (of the crawler agent, but also of the retrieved data), similar to what the GDPR does
- public documentation of each agent's purpose, and stable agent identities
- no obfuscation of who is crawling what
- and actual enforcement
And sure, making something illegal doesn't prevent anyone from technically being able to do it, but now at least large companies like Google have to decide whether they want to commit a crime, and the more they obfuscate what they are doing, the more proof there is that it was done in bad faith, i.e. the higher judges can push punitive damages.
Combine that with internet gateways like CF providing technical enforcement and you might have a good solution.
But one quasi-monopoly trying to force another to "comply" with its money-making scheme (even if it's in the interest of the end user) smells a lot like a winnable case against CF w.r.t. unfair market practices, abuse of monopoly power, etc.
imilk•1h ago
There is also nothing stopping other CDN/DNS providers from spinning up a marketplace similar to what CF is looking to do now.
bediger4000•1h ago
I thought we were broadly opposed to regulatory action for a number of reasons, including anti-socialism ideology, dislike of "red tape", and belief that free markets can solve problems.