This is not some new enemy called "bots". It is the same old non-human legal persons that polluted our physical world, repeating the pattern in the digital one. Bots run by actual human persons are not the problem.
The model of the web host paying for all bandwidth was reasonably aligned with traditional usage patterns, but the wave of scraping for training data is breaking that logic.
I remember reading, maybe 10 years ago, that backend website communications (ads and demographic data sharing) had surpassed the bandwidth consumed by actual users. But even in that case, the traffic was still primarily initiated by the website hosts.
Whereas with the recent scraping frenzy, the traffic is purely client-side, not initiated by actual website users, and not particularly beneficial to the website host.
One has to wonder what percentage of web traffic now is generated by actual users, versus host backend data sharing, versus the mammoth new wave of scraping.
Otherwise everything moves behind a paywall?
Basically. Paywalls and private services. Do things that are anti-scale, because things meant for consumption at scale will inevitably draw parasites.
Is anybody tracking the IP ranges of bots or anything similar that's reliable?
It seems like they're taking the "what are you gonna do about it" approach to this.
Edit: Yes [1]
[1] https://github.com/FabrizioCafolla/openai-crawlers-ip-ranges
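For anyone who wants to act on a list like that, here is a minimal sketch of an IP check, assuming the ranges have been exported into a local text file of CIDR blocks, one per line (the repo's actual file layout may differ, and the filename below is hypothetical):

```python
# Minimal sketch: check whether a client IP falls inside published crawler
# CIDR ranges. Assumes a local file with one CIDR block per line; the exact
# layout of the linked repo may differ.
import ipaddress

def load_ranges(path):
    with open(path) as f:
        return [ipaddress.ip_network(line.strip())
                for line in f if line.strip() and not line.startswith("#")]

def is_known_crawler(client_ip, networks):
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)

ranges = load_ranges("openai-crawler-ranges.txt")  # hypothetical filename
if is_known_crawler("203.0.113.7", ranges):        # documentation-range example IP
    print("request came from a published crawler range")
```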
For instance, it includes ChatGPT-User. This is not a crawler. This is used when a ChatGPT user pastes a link in and asks ChatGPT about the contents of the page.
One of the entries is facebookexternalhit. When you share a link on Facebook, Threads, WhatsApp, etc., this is the user-agent Meta uses to fetch the OpenGraph metadata to display things like the title and thumbnail.
Skimming through the list, I see a bunch of things like this. Not every non-browser fetch is an AI crawler!
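So any blocking policy probably needs to triage by user-agent before deciding anything. A rough sketch; the groupings below are my own illustrative assumptions (including the Marginalia token), not anything the list itself asserts:

```python
# Sketch: coarse user-agent triage before blocking. The groupings are
# illustrative assumptions, not an authoritative classification.
TRAINING_CRAWLERS = {"GPTBot", "CCBot", "ClaudeBot"}         # bulk scraping for training
USER_TRIGGERED    = {"ChatGPT-User", "facebookexternalhit"}  # fetches a human initiated
SEARCH_CRAWLERS   = {"Googlebot", "bingbot", "search.marginalia.nu"}

def classify(user_agent: str) -> str:
    ua = user_agent.lower()
    if any(token.lower() in ua for token in TRAINING_CRAWLERS):
        return "block"   # little benefit to the site, pure extraction
    if any(token.lower() in ua for token in USER_TRIGGERED | SEARCH_CRAWLERS):
        return "allow"   # blocking these hurts real users or discoverability
    return "allow"       # default-open; tighten as needed

print(classify("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"))
```

The point of the default-open last line is exactly the comment above: treating every non-browser fetch as an AI crawler breaks link previews and user-initiated lookups.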
One could reasonably claim that the value of AI systems and very large training sets is not that they are an approach to AGI, but that they make it possible to find previously unseen connections.
I thought everything was going well after that, until suddenly it started getting even worse. I realized that instead of one IP hitting the site a hundred times per second, it was now hundreds of IPs hitting the site just below the throttling threshold I had set up.
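That pattern is exactly what defeats naive per-IP throttling: each address stays under the limit, so the limiter never fires. A sketch of the failure, plus one partial mitigation that is my own suggestion rather than anything from this thread (also counting per /24 subnet, which catches fleets of adjacent addresses but not residential-proxy swarms):

```python
# Sketch: sliding-window rate limiter. A single IP at 100 req/s trips the
# per-IP limit; hundreds of IPs at 80 req/s each do not, unless you also
# budget the enclosing /24 (an assumed mitigation, IPv4-only here).
import time
import ipaddress
from collections import defaultdict, deque

WINDOW = 1.0         # seconds
PER_IP_LIMIT = 90    # just under the "hundred per second" mentioned above
PER_NET_LIMIT = 200  # total budget for the whole /24

hits_by_ip = defaultdict(deque)
hits_by_net = defaultdict(deque)

def _prune(q, now):
    while q and now - q[0] > WINDOW:
        q.popleft()

def allow(ip: str) -> bool:
    now = time.monotonic()
    net = str(ipaddress.ip_network(f"{ip}/24", strict=False))
    ip_q, net_q = hits_by_ip[ip], hits_by_net[net]
    _prune(ip_q, now)
    _prune(net_q, now)
    if len(ip_q) >= PER_IP_LIMIT or len(net_q) >= PER_NET_LIMIT:
        return False
    ip_q.append(now)
    net_q.append(now)
    return True
```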
AI bots don't care about caches. That's one of the big issues.
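For context, "caring about caches" at the HTTP level means revalidating with If-None-Match and accepting a 304 with no body; a crawler that ignores this re-downloads the full page every time. A minimal sketch using only the standard library (the URL is a placeholder, and not every server emits ETags):

```python
# Sketch of a polite, cache-aware fetch: send the previously seen ETag and
# treat 304 Not Modified as "reuse the cached body".
import urllib.request
import urllib.error

def fetch(url, etag=None):
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.headers.get("ETag"), resp.read()
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return 304, etag, None  # not modified: serve from local cache
        raise

status, etag, body = fetch("https://example.com/")
status2, _, _ = fetch("https://example.com/", etag=etag)  # 304 if the server supports it
```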
There are also good crawlers that index sites, like Google or Marginalia, which give your page visibility. If you lock everything away from the web, well, it disappears from the web.
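The conventional place to express "search crawlers welcome, training crawlers not" is robots.txt, with the big caveat that it is purely advisory and only restrains bots that choose to honor it. A sketch checked with the standard-library parser; the user-agent tokens are my assumptions:

```python
# Sketch: a robots.txt that keeps search crawlers (and thus visibility)
# while disallowing training crawlers. Advisory only.
import urllib.robotparser

POLICY = """\
User-agent: Googlebot
Disallow:

User-agent: search.marginalia.nu
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(POLICY.splitlines())
print(rp.can_fetch("Googlebot", "/article"))  # True
print(rp.can_fetch("GPTBot", "/article"))     # False
```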