This is not some new enemy called "bots". It is the same old non-human legal persons that polluted our physical world, repeating the pattern in the digital one. Bots run by actual human persons are not the problem.
The model of the web host paying for all bandwidth was reasonably aligned with traditional usage patterns, but the wave of scraping for training data is breaking that logic.
I remember reading, maybe 10 years ago, that backend website communications (ads and demographic data sharing) had surpassed the bandwidth consumed by actual users. But even in that case, the traffic was still primarily tied to the website hosts themselves.
With the recent scraping frenzy, by contrast, the traffic is purely client-side, not initiated by actual website users, and not particularly beneficial to the website host.
One has to wonder what percentage of web traffic is now generated by actual users, versus host backend data sharing and the mammoth new wave of scraping.
Otherwise everything moves behind a paywall?
Basically. Paywalls and private services. Do things that are anti-scale, because things meant for consumption at scale will inevitably draw parasites.
Is anybody tracking the IP ranges of bots or anything similar that's reliable?
It seems like they're taking the "what are you gonna do about it" approach to this.
Edit: Yes [1]
[1] https://github.com/FabrizioCafolla/openai-crawlers-ip-ranges
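For anyone wanting to act on lists like that, here is a minimal sketch (not the repo's own tooling) of matching a request IP against a set of published crawler CIDR ranges with Python's ipaddress module. The two ranges below are documentation placeholders, not real crawler ranges; substitute whatever the published lists actually contain.

```python
# Minimal sketch: check a request IP against published crawler CIDR ranges.
# The two ranges below are documentation placeholders (TEST-NET), not real
# crawler ranges -- load the actual published lists instead.
import ipaddress

CRAWLER_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]

def is_known_crawler(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CRAWLER_RANGES)

print(is_known_crawler("192.0.2.77"))   # True with the placeholder ranges
print(is_known_crawler("203.0.113.9"))  # False
```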
I thought everything was going well after that, until it suddenly started getting even worse. I realized that instead of one IP hitting the site a hundred times per second, it was now hundreds of IPs hitting the site, each staying slightly below the throttling threshold I had set up.
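To make the evasion concrete, here is a minimal sketch of the kind of per-IP throttle being described (a fixed-window counter; the threshold and window sizes are assumed, not taken from the comment above). A single aggressive IP trips the limit, but a fleet of IPs each staying just under it sails through.

```python
# Minimal sketch of a per-IP fixed-window throttle (hypothetical numbers).
# One aggressive IP gets blocked, but many IPs each staying just below
# MAX_PER_WINDOW pass untouched -- the evasion described above.
import time
from collections import defaultdict

MAX_PER_WINDOW = 100      # requests allowed per IP per window (assumed)
WINDOW_SECONDS = 1.0

_state = defaultdict(lambda: (0.0, 0))   # ip -> (window_start, count)

def allow(ip: str) -> bool:
    now = time.monotonic()
    window_start, count = _state[ip]
    if now - window_start >= WINDOW_SECONDS:
        _state[ip] = (now, 1)             # start a fresh window
        return True
    if count < MAX_PER_WINDOW:
        _state[ip] = (window_start, count + 1)
        return True
    return False                          # over the per-IP threshold
```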