Obviously there are good reasons NOT to, but I'm surprised they haven't started offering it (as an "on-by-default" option, naturally) yet.
It's entirely possible that they're already doing this under the hood in cases where they can clearly identify that the content they have cached is public.
I'll need to test it out, especially with the labyrinth.
I'm split between: Yes! At last something to get CF-protected sites! And: Ugh! Now the internet is successfully centralized.
We're creating an internet that is becoming self-reinforcing for those who already have power and harder for anyone else. As crawling becomes difficult and expensive, only those with previously collected datasets get to play. I certainly understand individual sites wanting to limit access, but it seems unlikely that they're limiting access to the big players - and maybe even helping them since others won't be able to compete as well.
AFAICT they don't make any attempt to be stealthy either, so it's easy enough to block them on your own terms if you want. The requests are all branded with CF-specific headers which make it obvious what they're doing.
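For example, on your own origin you could do something like this. Rough sketch only: the "cf-" header check below is a placeholder heuristic, since I don't remember the exact branded header names off-hand.

    // Sketch of blocking the crawler on your own terms (Express middleware).
    // The header-name check is a placeholder; look at the actual CF-branded
    // headers the crawler sends before relying on this.
    import express from "express";

    const app = express();

    app.use((req, res, next) => {
      // Block anything announcing itself with a CF-branded crawler header.
      const fromCfCrawler = Object.keys(req.headers).some(
        (name) =>
          name.toLowerCase().startsWith("cf-") &&
          name.toLowerCase().includes("crawl") // placeholder heuristic
      );
      if (fromCfCrawler) {
        res.status(403).send("No crawling, thanks");
        return;
      }
      next();
    });

    app.listen(3000);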
I had the idea after buying https://mirror.forum recently (which I talked about on Discord and in the ArchiveTeam IRC channels) that I wanted to preserve/mirror forums, especially tech-related ones [think TinyCoreLinux], since Archive.org is really, really great but I would prefer to see some other efforts within this space as well.
I didn't want to scrape/crawl it myself because I felt it would come across as yet another AI scraping effort and strain the developers' resources.
And even when you do want to crawl, the issue is that you can't get past Cloudflare, sometimes for good reason.
So, to check my understanding: can I use Cloudflare Crawl to essentially crawl a forum's whole website, and does this only work for forums that use Cloudflare?
Also, what is the pricing for this? Is it just a standard Cloudflare Worker, so I would get the free 100k requests and then 1 million for a few cents (IIRC) for crawling? Considering how scalable Cloudflare is, it might even make more sense than buying a group of cheap VPSes.
One more point: I had previously been thinking that the best way would be for the maintainers of these forums to give me a periodic backup archive of the forum, since that feels like the cleanest approach. But after discussing it on Linux Discord servers and with archivers in that community (and in general), I couldn't find anyone maintaining such tech forums who would subscribe to the idea of sharing the forum's public data as a quick backup for preservation purposes. So if anyone knows of or maintains any such forums, feel free to message me here in this thread about that too.
You feel better paying someone to do the same thing?
And they can pull it off because of their reach over the internet with the free DNS.
> The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed".
You don't need any scraping countermeasures for crawlers like those.
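To make that concrete, here's a rough sketch of consuming such a response and splitting out the disallowed URLs. The endpoint URL, auth, and request/response shape are guesses on my part; the only detail taken from the quoted docs is the "status": "disallowed" field.

    // Sketch only: endpoint URL, auth, and body shape are placeholders.
    // The "status": "disallowed" value is the behaviour quoted above.
    type CrawlResult = { url: string; status: string };

    async function crawlForum(seedUrl: string, apiToken: string) {
      const resp = await fetch("https://example.invalid/crawl", { // placeholder endpoint
        method: "POST",
        headers: {
          "Authorization": `Bearer ${apiToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ url: seedUrl }), // assumed request shape
      });
      const results: CrawlResult[] = await resp.json();

      // URLs robots.txt told the crawler to skip come back flagged, not fetched.
      const disallowed = results.filter((r) => r.status === "disallowed");
      const fetched = results.filter((r) => r.status !== "disallowed");
      console.log(
        `fetched ${fetched.length} pages, skipped ${disallowed.length} per robots.txt`
      );
      return fetched;
    }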
The fact that 30%+ of the web relies on their caching, routing, and DDoS protection services is the main pull.
Their DNS is really only for data collection and to serve as a front of "goodwill".