https://developers.cloudflare.com/bots/concepts/bot/#ai-bots
> You can opt into a managed rule that will block bots that we categorize as artificial intelligence (AI) crawlers (“AI Bots”) from visiting your website. Customers may choose to do this to prevent AI-related usage of their content, such as training large language models (LLM).
> CCBot (Common Crawl)
Common Crawl is not an AI bot:
I recently went to a big local auction site on which I buy frequently and I got one of these "we detected unusual traffic from your network" messages. And "prove you're human". Which was followed by "you completed the captcha in 0.4s, your IP is banned". Really? Am I supposed to slow down my browsing now? I tried a different browser, a different OS, logging on, clearing cookies, etc. Same result when I tried a search. It took 4h after contacting their customer service to unblock it. And the explanation was "you're clicking too fast".
At some point it just becomes a farce and the hassle is not worth the content. Also, while my story doesn't involve any bots perhaps a time will come when local LLMs will be good enough that I'll be able to tell one "reorder my cat food" and it will go and do it. Why are they so determined to "stop it" (spoiler, they can't).
For anyone who says LLMs are already capable of ordering cat food, I say not so fast. First, the cat food has to be on sale/offer (sometimes combined with extras). Second, it is supposed to be healthy (no grains), and third, the taste needs to be to my cats' liking. So far I'm not going to trust an LLM with this.
Maybe sites could add "you must honor policies set in robots.txt" to something like a terms of service but I have no idea if that would have enough teeth for a crawler to give up.
The idea that Cloudflare could do the latter at the sole discretion of its leadership, though, is indicative of the level of power Cloudflare holds.
Would you say the same for ddos protection? Isn't that the same as well?
It isn't CF going around saying, that's a nice website you have there. I'm gonna put myself in between.
Do you have a source for that? https://blog.cloudflare.com/content-independence-day-no-ai-c... does say "changing the default".
https://blog.cloudflare.com/control-content-use-for-ai-train...
This is simply just the first step in them implementing a marketplace and trying to get into LLM SEO. They don't care about your site or protecting it. They are gearing up to start taking a cut in the middle between scrapers and publishers. Why wouldn't I go DIRECTLY to the publisher and make a deal? So dumb. I hate CF so much.
The only thing cloudflare knows how to do is MITM attacks.
> When you enable this feature via a pre-configured managed rule, Cloudflare can detect and block verified AI bots that comply with robots.txt and respect crawl rates, and do not hide their behavior from your website. The rule has also been expanded to include more signatures of AI bots that do not follow the rules.
We already know companies like Perplexity are masking their traffic. I'm sure there's more than meets the eye, but taking this at face value, doesn't punishing respectful and transparent bots only incentivize obfuscation?
edit: This link [0], posted in a comment elsewhere, addresses this question. tl;dr: obfuscation doesn't work.
> We leverage Cloudflare global signals to calculate our Bot Score, which for AI bots like the one above, reflects that we correctly identify and score them as a “likely bot.”
> When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint. For every fingerprint we see, we use Cloudflare’s network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint. To power our models, we compute global aggregates across many signals. Based on these signals, our models were able to appropriately flag traffic from evasive AI bots, like the example mentioned above, as bots.
[0] https://blog.cloudflare.com/declaring-your-aindependence-blo...
They're Cloudflare, and it's not like it's particularly easy to hide a bot that is scraping large chunks of the Internet from them. On top of the fact that they can fingerprint any of your sneaky usage, large companies have to work with them, so I can only assume there are channels of communication where Cloudflare can have a little talk with you about your bad behavior. I don't know how often lawyers are involved but I would expect them to be.
Sure, but we crossed that bridge over 20 years ago. It's not creating an arms race where there wasn't already one.
Which is my generic response to everyone bringing similar ideas up. "But the bots could just...", yeah, they've been doing it for 20+ years and people have been fighting it for just as long. Not a new problem, not a new set of solutions, no prospect of the arms race ending any time soon, none of this is new.
> The rule has also been expanded to include more signatures of AI bots that do not follow the rules.
The Block AI Bots rule on the Super Bot Fight Mode page does filter out most bot traffic. I was getting 10x the traffic from bots than I was from users.
It definitely doesn't rely on robots.txt or user agent. I had to write a page rule bypass just to let my own tooling work on my website after enabling it.
There is a clear moment where you land on AI bot radar. For my large forum, it was a month ago.
Overnight, "72 users are viewing General Discussion" turned into "1720 users".
40% of requests being cached turned into 3% of requests being cached.
I read the robots.txt entries as the AI bots that will not be marked as "malicious" and that will have the opportunity to be allowed by websites. The rest will be given the Cloudflare special.
It's the bots that do hide their behavior -- via residential proxy services -- that are causing most of the burden, for my site anyway. Not these large commercial AI vendors.
I don't see a way out of this happening. AI fundamentally discourages other forms of digital interaction as it grows.
Its mechanism of growing is killing other kinds of digital content. It will eventually kill the web, which is, ironically, its main source of food.
Let's instead be focused and talk about real stuff.
Consider https://learnpythonthehardway.org/ for example. It has influenced a generation of Python developers. Not just the main website, but the tons of Python code and Python-related content it inspired.
Why would anyone write these kinds of textbooks/websites/guides if AI can replace them? AI companies are effectively broadcasting you don't need the hard way anymore, you can just vibe.
Arguably though, without the existence of Learn Python the Hard Way and similar content, AI would be worse at writing Python stuff. That's what I mean by "main source of food": good content that influences a lot of people. The net-positive effects are hard to predict or even identify except for the more popular cases (such as LPTHW).
If my prediction is right, no one will notice that good content has stopped being produced. It will appear as if content is being created in generally the same way as before, but in reality, these long tail initiatives like LPTHW will have ceased before anyone can do anything about it.
Again, I don't see a way out of this scenario. Not for AI companies, not for content writers. This is going to happen. The world in which I'm wrong is the best one.
But then, how would you even train and replace those competent senior engineers that do the filtering when they retire? The whole system was predicated on having a chain of new hires that gain experience in the process.
This is based on two assumptions:
- AI will get better. Developers using the system will transfer their knowledge to it.
- Seniors in a couple of years will be different. They should be those who can engage with the AI feedback loop.
Here's why I think it won't work:
- Senior developers learn more than they can produce. There is knowledge they never transfer to what they work on. Internalized knowledge that never materializes directly into code. _But it materializes indirectly_.
- Senior developer knowledge comes from "schools", not just reading. These schools are not real physical locations. They're traditions, or ideas, that form a very long tail. These ideas, again, are not directly transferable to code or prose.
- Juniors get embarrassed. You say "stop making this nonsense", and they'll stop and reflect, because they respect seniors. They might disagree, but a pea was then placed under their mattress, and they'll think about "this nonsense" you told them to stop doing and why. That is how they get better. So far, AI has not demonstrated being able to do that.
The production of quality content is an aspect of one of those "schools of thought". You are supposed to bear the responsibility of passing the knowledge. Keeping lean codebases easy to understand is also a hallmark of many schools of thought. Working from fundamentals is another one of those ideas, etc.
# NOTICE: The collection of content and other data on this
# site through automated means, including any device, tool,
# or process designed to data mine or scrape content, is
# prohibited except (1) for the purpose of search engine indexing or
# artificial intelligence retrieval augmented generation or (2) with express
# written permission from this site’s operator.

# To request permission to license our intellectual
# property and/or other materials, please contact this
# site’s operator directly.

# BEGIN Cloudflare Managed content

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

# END Cloudflare Managed Content

User-agent: *
Disallow: /*
Allow: /$
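If you want to see how a well-behaved crawler would read a block like that, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text is a shortened copy of the managed block, `is_allowed` and the example.com URL are made up for illustration, and I left out the wildcard rules because, as far as I know, the standard-library parser only does plain prefix matching on paths. It says nothing about how Cloudflare itself evaluates or enforces these rules.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the managed block quoted above. The "Disallow: /*" and
# "Allow: /$" wildcard rules are omitted on purpose (assumption: the stdlib
# parser treats paths as plain prefixes and would misread them).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL; only the path matters for the check.
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, is_allowed(agent, "https://example.com/thread/123"))
```

The listed crawlers are refused and anything without a matching entry is allowed, which is exactly why this only works against bots that choose to ask.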
Seems CF has been gathering data and profiling these malicious agents.
This post by CF elaborates a bit further: https://blog.cloudflare.com/declaring-your-aindependence-blo...
Basically becomes a game of cat and mouse.
> Cloudflare is making the change to protect original content on the internet, Mr. Prince said. If A.I. companies freely use data from various websites without permission or payment, people will be discouraged from creating new digital content, he said
> prohibited except for the purpose of [..] artificial intelligence retrieval augmented generation
This seems to be targeted at taxing training of language models, but why an exclusion for the RAG stuff? That seems like it has a much greater immediate impact for online content creators, for whom the bots are obviating a click.
It makes sense to allow for RAG in the same way that search engines provide a snippet of an important chunk of the page.
A blog author could not complain that their blog is getting ragged when they're extremely liable to be Google/whatever searching all day and basically consuming others' content in exactly the same way that they're trying to disparage.
Whether you call it training or something else is irrelevant, it's really exploitation of human work and effort for AI shareholder returns and tech worker comp (if those who create aren't compensated). And the technocracy has not been, based on the evidence, great stewards of the power they obtain through this. Pay the humans for their work.
AI scrapers increase traffic by maybe 10x (this varies per site) but provide no real value whatsoever to anyone. If you look at various forms of "value":
* Saying "this uses AI" might make numbers go up on the stock market if you manage to persuade people it will make numbers go up (see also: the market will remain irrational longer than you can remain solvent).
* Saying "this uses AI" might fulfill some corporate mandate.
* Asking AI to solve a problem (for which you would actually use the solution) allows you to "launder" the copyright of whatever source it is paraphrasing (it's well established that LLMs fail entirely if a question isn't found within their training set). Pirating it directly provides the same value, with significantly less errors/handholding.
* Asking AI to entertain you ... well, there's the novelty factor I guess, but even if people refuse to train themselves out of that obsession, the world is still far too full of options for any individual to explore them all. Even just the question of "what kind of ways can I throw some kind of ball around" has more answers than probably anyone here knows.
What am I missing?
Because it's bundled with other products that do provide value, and that counts as "someone using it".
Because some middle manager has declared that I must add AI to my workflow ... somehow. Whatever, if they want to pay me to accomplish less than usual, that's not my problem.
Because it's a cool new toy to play around with a bit.
Because surely all these people saying "AI is useful now" aren't just lying shills, so we'd better investigate their claims again ... nope, still terminally broken.
I get that everyone wants data, but presumably the big players already scraped the web. Do they really need to do it again? Or is it bit players reproducing data that's likely already in the training set? Or is it really that valuable to have your own scraped copy of internet scale data?
I feel like I'm missing something here. My expectation is that RAG traffic is going to be orders of magnitude higher than scraping for training. Not that it would be easy to measure from the outside.
Cloudflare have some recent data about traffic from bots (https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...) which indicates that, for the time being, the overwhelming majority of the bot requests are for AI training and not for RAG.
And I'm talking about ecommerce websites, with their bot scraping every variation of each product, multiple times a day.
New data is still being added online daily (probably hourly, if not more often) by humans, and the first ones to gain access could be the "winners," particularly if their users happen to need up to date data (and the service happens to have scraped it). Just like with search engines/crawlers, there's also the big players that may respect your website, but there are also those that don't use rate-limiting or respect robots.txt.
So yeah, I too could see them doing this.
I have had odd bot behavior from some major crawlers, but never from Google. I wonder if there is a correlation to usefulness of content, or if certain sites get stuck in a software bug (or some other strange behavior).
Is Common Crawl exclusively for "AI"
CCBot was already in so many robots.txt prior to this
How is CC supposed to know or control how people use the archive contents
What if CC is relying on fair use
# To request permission to license our intellectual
# property and/or other materials, please contact this
# site's operator directly
If the operator has no intellectual property rights in the material, then do they need permission from the rights holders to license such materials for use in creating LLMs and collect licensing fees
Is it common for website terms and conditions to permit site operators to sublicense other peoples' ("users") work for use in creating LLMs for a fee
Is this fee shared with the rights holders
# To request permission to license our intellectual
# property and/or other materials, please contact this
# site's operator directly
Scrapers don't accept the terms of service.
Ironically, I've only ever scraped sites that block CCBot, otherwise I'd rather go to Common Crawl for the data.
That's incorrect. Cloudflare does in fact enforce this at a technical level. Cloudflare has been doing bot detection for years and can pretty reliably detect when bots are not following robots.txt and then block them.
"Information" is dead but content is not. Stories, empathy, community, connection, products, services. Content of this variety is exploding.
The big challenge is discoverability. Before, information arbitrage was one pathway to get your content discovered, or to skim a profit. This is over with AI. New means of discovery are necessary, largely network and community based. AI will throw you a few bones, but it will be 10% of what SEO did.
No, it most certainly does not. It was certainly trained on large swathes of human knowledge/interactions.
A model that consists of a perfect representation/compression of all this info is a zip file, not a model file.
In any case, as manifest by real world SEO, which is plummeting in traffic for informational queries, the effect is the same. This real world impact is what matters and will not be reversed, regardless of attempts at blocking.
To me it seems like there has to be so much optimization for this to happen that it is not likely. LLM answers are slow and unreliable. Even using something like Perplexity doesn’t give much value over using a regular search engine, in my experience.
Traditional search will still be highly useful for transactional, product, realtime, and action oriented queries. Also for discovering educational/entertainment content that is valued in of itself and cannot be reformulated by LLM.
1. Use the Cache-Control header to express how to cache your site correctly (https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Cac...)
2. Use a CDN service, or at least a caching reverse proxy, to serve most of the cacheable requests to reduce load on the (typically much more expensive) origin servers
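To make point 1 concrete, here is a rough sketch of a per-path `Cache-Control` policy served from Python's built-in `http.server`. The path prefixes, the lifetimes, and the `cache_control_for` helper are all assumptions for illustration; the directives themselves (`public`, `private`, `no-store`, `max-age`, `s-maxage`, `immutable`) are standard ones. A real site would tune the numbers to how often its content changes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value for a request path (hypothetical policy)."""
    if path.startswith("/static/"):
        # Versioned assets: let every cache keep them for a year.
        return "public, max-age=31536000, immutable"
    if path.startswith("/account/"):
        # Per-user pages: keep them out of shared caches entirely.
        return "private, no-store"
    # Public pages: browsers may reuse for 1 minute, shared caches (CDN or
    # reverse proxy) for 5 minutes.
    return "public, max-age=60, s-maxage=300"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Cache-Control", cache_control_for(self.path))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Local demo server; put a CDN or caching proxy in front for point 2.
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```

With headers like these, a cache in front of the origin can absorb repeated crawler hits on public pages, though it does nothing for endpoints that genuinely cannot be cached.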
OpenAI, Anthropic, Google? No, their bots are pretty well behaved.
The smaller AI companies deploying bots that don't respect any reasonable rate limits and are scraping the same static pages thousands of times an hour? Yup
Most of my site is cached in multiple different layers. But some things that I surface to unauthenticated public can't be cached while still being functional. Hammering those endpoints has taken my app down.
Additionally, even though there are multiple layers, things that are expensive to generate can still slip through the cracks. My site has millions of public-facing pages, and a batch of misses that happen at the same time on heavier pages to regenerate can back up requests, which leads to errors, and errors don't result in caches successfully being filled. So the AI traffic keeps hitting those endpoints, they keep not getting cached and keep throwing errors. And it spirals from there.
Cache is expensive at scale. So permitting big or frequent crawls by stupid crawlers either requires significant investment in cache or slows down and worsens the site for all users. For whom we, you know, built the site, not to provide training data for companies.
As others have mentioned, Google is significantly more competent than 99.9% of the others. They are very careful to not take your site down and provide, or used to provide, traffic via their search. So it was a trade, not a taking.
Not to mention I prefer not to do business with Cloudflare because I don't like companies that don't publish quota. If going over X means I need an enterprise account that starts at $10k/mo, I need to know the X. Cloudflare's business practice appears to be letting customers exceed that quota then aggressively demanding they pay or they'll be kicked off the service nearly immediately.
I struggle to think of a web related library that has spread faster than Anubis checker. It's everywhere now! https://github.com/TecharoHQ/anubis
I'm surprised we don't see more efforts to rate limit. I assume many of these are distributed crawlers, but it feels like there's got to be pools of activity spinning up on a handful of IPs, and that they would be pretty clearly time-correlated. Maybe that's not true. But it feels like the web, more than anything else, needs some open source software to add a lot more 420 Enhance Your Calm responses. https://http.dev/420
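For what that could look like in its simplest form, here is a sketch of a per-client token bucket. Everything in it (`RateLimiter`, `status_for`, the rate and burst numbers, keying on a single IP) is an assumption for illustration, not any existing library's API. It answers with 429 Too Many Requests, which is the standardized status for this; 420 is an unofficial one.

```python
import time
from typing import Dict, Optional, Tuple


class RateLimiter:
    """Per-client token bucket: `rate` requests/second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.burst = float(burst)
        # key -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def status_for(self, key: str, now: Optional[float] = None) -> int:
        """Return the HTTP status to send: 200 to proceed, 429 to throttle."""
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(key, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[key] = (tokens - 1.0, now)
            return 200
        self._buckets[key] = (tokens, now)
        return 429


if __name__ == "__main__":
    limiter = RateLimiter(rate=1.0, burst=3)
    # Five requests at the same instant from one documentation-range address.
    print([limiter.status_for("203.0.113.7", now=0.0) for _ in range(5)])
    # prints [200, 200, 200, 429, 429]
```

Keyed on a single IP this is easy to dodge with a proxy pool, so the interesting part is what you use as the key, not the bucket itself.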
- opposition to generative AI in general
- a view that AI, unlike search which also relies on crawling, offers you no benefits in return
- crawlers from the AI firms being less well-behaved than the legacy search crawlers, not obeying robots.txt, crawling more often, more aggressively, more completely, more redundantly, from more widely-distributed addresses
- companies sneaking in AI crawling underneath their existing tolerated/whitelisted user-agents (Facebook was pretty clearly doing this with "facebookexternalhit" that people would have allowed to get Facebook previews; they eventually made a new agent for their crawling activity)
- a simultaneous huge spike in obvious crawler activity with spoofed user agents: e.g. a constant random cycling between every version of Chrome or Firefox or any browser ever released; who this is or how many different actors it is and whether they're even doing crawling for AI, who knows, but it's a fair bet.
Better optimization and caching can make this all not matter so much but not everything can be cached, and plenty of small operations got by just fine without all this extra traffic, and would get by just fine without it, so can you really blame them for turning to blocking?
> If you try to rate-limit them, they'll just switch to other IPs all the time. If you try to block them by User Agent string, they'll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.
My gut is that the switch between IP addresses can't be that hard to follow. That the access pattern is pretty obvious to follow across identities.
But it would be non-trivial: it would entail crafting new systems and doing new work per request (when traffic starts to be elevated, as a first gate).
Just making the client run through some math gauntlet is an obvious win that aggressors probably can't break. But I still think there's probably some really good low-hanging fruit for identifying and rate limiting even these rather more annoying traffic patterns: the behavior itself leaves a fingerprint that can't be hidden and that can absolutely be rate limited. And I'd like to see that area explored.
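For anyone unfamiliar with the "math gauntlet" idea, here is a toy hash-based proof-of-work sketch. It is not Anubis's actual implementation, which I have not checked; the function names, the SHA-256 choice, and the difficulty value are all assumptions for illustration. The server hands out a random challenge, the client burns CPU finding a nonce, and the server verifies it with a single hash.

```python
import hashlib
import os


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: one hash, cheap to check."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a nonce, about 2**difficulty hashes on average."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce


if __name__ == "__main__":
    challenge = os.urandom(16)  # issued per visitor by the server
    nonce = solve(challenge, difficulty=16)
    print(nonce, verify(challenge, nonce, difficulty=16))
```

The asymmetry is the point: one visitor pays a fraction of a second once, while a crawler fetching millions of pages pays it over and over.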
Edit: oh heck yes, new submission with 1.7tb logs of what AI crawlers do. Now we can machine learn some better rate limiting techniques! https://news.ycombinator.com/item?id=44450352 https://huggingface.co/datasets/lee101/webfiddle-internet-ra...
The largest site I work on has 100,000s of pages, each in around 10 languages — that's already millions of pages.
It generally works fine. Yesterday it served just under 1000 RPS over the day.
AI crawlers have brought it down when a single crawler has added 100, 200 or more RPS distributed over a wide range of IPs — it's not so much the number of extra requests, though it's very disproportionate for one "user", but they can end up hitting an expensive endpoint excluded by robots.txt and protected by other rate-limiting measures, which didn't anticipate a DDoS.
It also completely put a stop to perplexity as far as I can tell.
And the robots file meant nothing, they’d still request it hundreds of thousands of times instead of caching it. Every request they’d hit it first then hit their intended url.
The goal isn’t to stop 100% of scrapers, it was to reduce server load to a level that wasn’t killing the site.
I'm not optimistic that you can effectively block your original content from ending up in training sets by simply blocking the bots. For now I just assume that anything I put online will end up being used to train some LLM
Anybody know why these web crawling/bot standards are not evolving? I believe robots.txt was invented in 1994 (thx chatgpt). People have tried with sitemaps, RSS and IndexNow, but it's like huge $$ organizations are depending on HelloWorld.bas tech to control their entire platform.
I want to spin up endpoints/mcp/etc. and let intelligent bots communicate with my services. Let them ask for access, ask for content, pay for content, etc. I want to offer solutions for bots to consume my content, instead of having to choose between full or no access.
I am all for AI, but please try to do better. Right now the internet is about to be eaten up by stupid bot farms and served into chat screens. They don't want to refer back to their source, and when they do, it's with insane error rates.
This is clearly the first step in cf building out a marketplace where they will (fail) at attempting to be the middleman in a useless market between crawlers and publishers.
Not to pick on you, but I find it quicker to open new tab and do "!w robots.txt" (for search engines supporting the bang notation) or "wiki robots.txt"<click> (for Google I guess). The answer is right there, no need to explain to LLM what I want or verify [1].
[1] Ok, Wikipedia can be wrong, but at least it is a commonly accessible source of wrong I can point people to if they call me out. Plus my predictive model of Wikipedia wrongness gives me pretty low likelihood for something like this, while for ChatGPT it is more random.
Thought of and discussed as a possibility in 1994.
Proposed as a standard in 2019.
Adopted as a standard in 2022.
Thanks, IETF.
Today I see this article about Cloudflare blocking scrapers. There are useful and legitimate cases where I ask Claude to go research something for me. I'm not sure if Cloudflare distinguishes legitimate search/research traffic by an AI client from scraping. The sites that are blocked by default will include content by small creators (unless they're on major platforms with a deal?), while the big guys who have something to sell, like Amazon, will likely be able to arrange and afford a deal to show up more in the results.
A few days ago there was also the story that Cloudflare is looking to charge AI companies to scrape content, which is cached copies of other people's content. I'm guessing it will involve paying the owners of the data at some point as well. Being able to exclude it from this purpose (sell/license content, or scrape) would be a useful lever.
Putting those two stories together:
- Is this a new form of showing up in the AISEO (Search everywhere optimization) to show up in an AI's corpus or ability to search the web, or paying licensing fees instead of advertising fees.. these could be new business models which are interesting, but trying to see where these steps may vector ahead towards, and what to think about today.
- With training data being the most valuable thing for AI companies, and this is another avenue for revenue for Cloudflare, this can look like a solution which helps with content licensing as a service.
I'd like to see where abstracting this out further ends up going
Maybe I'm missing something, is anyone else seeing it this way, or another way that's illuminating to them? Is anyone thinking about rolling their own service for whatever parts of Cloudflare they're using?
It might be too little, too late, at this juncture, and this particular solution doesn't seem too innovative. However, it is directionally 100% correct, and let's hope for massively more innovation in defending against AI parasitism.
Most amusingly, someone cited LLM generated output about this telling me how this “fact” is true when I was telling them it’s not true.
What I do care about is the theft of my identity. A person may learn from the words I write but that person doesn't end up mimicking the way I write. They are still uniquely themselves.
I'm concerned that the more I write the more my text becomes my identifier. I use a handle so I can talk more openly about some issues.
We write OSS and blog because information should be free. But that information is then being locked behind paywalls and becoming more difficult to find through search. Frankly, that's not okay.
> people are making trillions off the free labor of people like you and me
I read "No Discrimination Against Fields of Endeavor" to also include LLMs and especially the cases that we most deeply disagree with.
Either we believe in the principles of OSS or we do not. If you do not like the idea of your intellectual property being used for commercial purposes then this model is definitely not for you.
There is no shame in keeping your source code and other IP a secret. If you have strong expectations of being compensated for your work, then perhaps a different licensing and distribution model is what you are after.
> that information is then being locked behind paywalls and becoming more difficult to find through search
Sure - If you give up and delete everything. No one is forcing you to put your blog and GH repos behind a paywall.
This is what AI scrapers are doing. They’re taking your code, your artwork and your writing without any consideration for the license.
There is an active case on this, where Microsoft has been sued over GitHub copilot, and it has been slowly moving through the court system since 2022. Most of the claims have been dismissed, and the prediction market is at 11%: https://manifold.markets/JeffKaufman/will-the-github-copilot...
I'm putting my new code somewhere private anyway.
The key question is whether it is sufficiently "transformative". See Authors Guild vs Google, Kelly vs Arriba Soft, and Sony vs Universal. This is a way a judge could definitely rule, and at this point I think is the most likely outcome.
> Microsoft will forever be a pariah if they get away with this.
I doubt this. Talking to developers, it seems like the majority are pretty excited about coding assistants. Including the ones that many companies other than Microsoft (especially Anthropic) are putting out.
I've been writing open source for more than 20 years
I gave away my work for free with one condition: leave my name on it (MIT license)
the AI parasites then strip the attribution out
they are the ones violating the principles of open source
> then perhaps a different licensing and distribution model is what you are after.
I've now stopped producing open source entirely
and I suggest every developer does the same until the legal position is clarified (in our favour)
There are a lot of people developing open source software with a wide range of goals. In my case, I'm totally happy for LLMs to learn from my coding, just like they've learned from millions of other people's. I wouldn't want them to duplicate it verbatim, but (due to copyright filters + that not usually being the best way to solve a problem) they don't.
> Either we believe in the principles of OSS or we do not.
What about respecting licenses?
Seriously, don't lick the boot. We can recognize that there's complexity here. Trivializing everything only helps the abusers.
Giving credit where credit is due is not too much to ask. Other people making money off my work can be good[0]. Taking credit for it is insulting
[0] If you're not making much, who cares. But if you're a trillion dollar business you can afford to give a little back. Here's the truth, OSS only works if we get enough money and time to do the work. That's either by having a good work life balance and good pay or enough donations coming in. We've been mostly supported by the former, but that deal seems to be going away
Of course they do, to some extent. Just because it's been infeasible to track the exact "graph of influence", that's literally how humans have learned to speak and write for as long as we've had language and writing.
> I'm concerned that the more I write the more my text becomes my identifier. I use a handle so I can talk more openly about some issues.
That's a much more serious concern, in my view. But I believe that LLMs are both the problem and solution here: "Remove style entropy" is just a prompt away, these days.
Oh, I wish I could get AI to mimic the way I write! I'd pay money for it. I often want to type up an email/doc/whatever but don't because of occasional RSI issues. If I could get an AI to type it up for me while still sounding like me - that would be a big boon for my health.
I also have these issues that often keep me from typing, but FYI dictation has gotten very good.
(Dictated this)
Cloudflare isn't solving a problem, they are just inserting themselves as an intermediary to extract a profit, and making everything worse.
They have an addon [1] that helps you bypass Cloudflare challenges anonymously somehow, but it feels wrong to install a plugin to your browser from the ones who make your web experience worse
1: https://developers.cloudflare.com/waf/tools/privacy-pass/
But I did find it ironic
If you're on IPv4 you should check whether you're behind a NAT; otherwise you may have gotten an address that was previously used by a bot network.
Are you really arguing that it's legitimate to consider all IPv6 browsing traffic "suspicious"?
If anything, I'd say that IPv4 is probably harder, given that NATs can hide hundreds or thousands of users behind a single IPv4 address, some of which might be malicious.
> you may have gotten an address that was previously used by a bot network.
Great, another "credit score" to worry about...
If it’s a blacklist system, like I said I’ve not heard of any feasible solution more precise than banning huge ranges of ipv6 addresses.
A whitelist system would consider all IPv4 traffic suspicious by default too. This is not an answer to why you'd be suspicious of IPv6 in particular.
> I’ve not heard of any feasible solution more precise than banning huge ranges of ipv6 addresses.
Handling /56s or something like that is about the same as handling individual IPv4 addresses.
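To make the /56 point concrete, here is a minimal sketch of how a blocklist or rate limiter could key on the enclosing prefix for IPv6 while still keying on the single address for IPv4. The function name `block_key` and the choice of /56 as the default are assumptions for illustration, not any particular vendor's behaviour.

```python
import ipaddress


def block_key(addr: str, v6_prefix: int = 56) -> str:
    """Return the key under which an address would be counted or blocked.

    IPv4 addresses are keyed individually; IPv6 addresses are keyed by
    their enclosing prefix (a /56 by default, which is an assumption).
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return str(ip)
    # strict=False masks off the host bits instead of raising an error.
    network = ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False)
    return str(network)


if __name__ == "__main__":
    # Two addresses in the same /56 collapse to the same key.
    print(block_key("2001:db8:abcd:12ff::1"))
    print(block_key("2001:db8:abcd:1234:5678::9"))
    print(block_key("203.0.113.7"))
```

With this, the table of offenders has one entry per subscriber-sized prefix rather than one per address, which is roughly the same bookkeeping as one entry per IPv4 address.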
Based on what argument…?
https://news.ycombinator.com/item?id=42953508
https://news.ycombinator.com/item?id=13718752
https://news.ycombinator.com/item?id=23897705
...but OTOH it's their customers who want all of that and pay to get that, because the alternative is worse.
rock and a hard place.
Besides CloudFront, which still costs money, what other option is there for semi-privacy and caching for free?
Lots of nuance, but generally: pay for things you use. Servers, engineers, and research and development are not free, so someone has to pay.
Or I pay and am still the product. Just with less in-my-face ads.
Though why should it be for free?
Although bunny.net won't take ANY of my credit or debit cards
Maybe you haven't, but your users (primarily those using "suspicious" operating systems and browsers) certainly have – with their time spent solving captchas.
I get captchas daily, without using any VPN and on several different IPs (work, home, mobile). The only crime I can think of is that I'm using Firefox instead of Chrome.
It's probably because I use Firefox on Linux with an ad blocker.
For my part, I've ensured we don't use Cloudflare at work.
cloudflare are not the good guys because they give people free cdn and ddos protection lol
Meanwhile a bunch of "security" products other websites use just flat out block you if you're on a VPN. Other sites like youtube or reddit are in between where they block you unless you are logged in.
Cloudflare is the least obtrusive of the options.
(For the people not getting the joke: yes, the new system doesn't make you train any image recognition dataset, but they profile the hell out of anything they can get their hands on, just like Google's captcha, and then if you smell like a bot you're denied access. Goodbye.)
Turn on Tor and browse for a week.
Now you know what “undesirables” feel like, where “undesirables” can be from a poor country, a bad IP block, outdated browsers, etc.
It sucks.
You might want to read some threads on here about Cloudflare.
Most of the time I don't use them for their network, usually just DNS records for mail because their interface is nicer than namecheap and gives me basic stats.
To my understanding, they aren't blocking MX records behind captchas
Some nicer people here tried the educative approach and it worked much better. I learned about Bunny. And I keep forgetting I have a few in deSec but that has a limit.
I do not understand the hostility
Unfortunately I don’t think they were participating in the conversation in good faith. People can have an extreme view on _anything_…even internet / tech. They buy into a dream of 100% open source, or “open internet”, or 100% decentralized, whatever.
When this happens they may be convinced that “others” are crazy for not sharing their utopian vision. And once this point is reached, they struggle to communicate with their peers or normal people effectively. They share their strong opinions without sharing important context (how they reached those opinions), they think the topic is black and white (because they feel so strongly about the topic), or they become hostile to others that are not sharing that vision.
You are their latest victim lol. Ignore them, and carry on.
Without something being done, the data that these scrapers rely on would eventually no longer exist.
It's not an issue when somebody does "ethical" scraping with, for instance, a 250ms delay between requests and an active cache that checks specific pages (like news article links) to rescrape at 12 or 24h intervals. This type of scraping results in almost no pressure on the websites.
The issue I have seen is that the more unscrupulous parties just let their scrapers go wild, constantly rescraping again and again because the cost of scraping is extremely low. A small VM can easily push thousands of scrapes per second, let alone somebody with more dedicated resources.
Actually building an "ethical" scraper involves more time, as you need to fine-tune it per website. Unfortunately, this behavior is going to cost the more ethical scrapers a ton, as anti-scraping efforts will increase the cost on our side.
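For what it's worth, the "ethical" pattern described above (a fixed delay between requests plus a cache that only rescrapes after 12 or 24h) can be sketched roughly like this. The class name `PoliteScheduler` and its methods are hypothetical, one instance is assumed per website, and the actual fetching is left out on purpose.

```python
import time


class PoliteScheduler:
    """Decide when a scraper may hit one site again (illustrative sketch)."""

    def __init__(self, min_delay=0.25, recheck_after=12 * 3600, clock=time.monotonic):
        self.min_delay = min_delay          # seconds between requests (250ms)
        self.recheck_after = recheck_after  # seconds before a page is rescraped
        self.clock = clock                  # injectable so it can be tested
        self.last_request = None            # time of the most recent request
        self.fetched_at = {}                # url -> time it was last fetched

    def is_fresh(self, url):
        """True if the cached copy is younger than the recheck interval."""
        fetched = self.fetched_at.get(url)
        return fetched is not None and self.clock() - fetched < self.recheck_after

    def wait_time(self):
        """Seconds to sleep before the next request to this site."""
        if self.last_request is None:
            return 0.0
        elapsed = self.clock() - self.last_request
        return max(0.0, self.min_delay - elapsed)

    def record_fetch(self, url):
        """Call right after a request has been made."""
        now = self.clock()
        self.last_request = now
        self.fetched_at[url] = now
```

A crawl loop would skip any URL where `is_fresh(url)` is true, otherwise `time.sleep(scheduler.wait_time())`, fetch, and then call `record_fetch(url)`. The per-website tuning mentioned above is mostly a matter of picking `min_delay` and `recheck_after` for each site.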
I've been an active lurker in the self-hosting community and I'm definitely not alone. Nearly everyone hosting public-facing websites, particularly those whose content is rather juicy for LLMs, has been facing these issues. It costs more time and money to deal with this, when applying a simple User-Agent block would be much cheaper and trivial to do and maintain.
sigh
We’re talking a JavaScript file of strings to respond like “login failed”, “reset your password” just over and over again. Hundreds of fetches a day, often from what appears to be the same system.
If DDoS wasn’t a scary enough boogeyman to get people to install Cloudflare as a man-in-the-middle on all their website traffic, maybe the threat of AI scrapers will do the trick?
The thing about this slow slide is it’s always defensible. Someone can always say “but I don’t want my site to be scraped, and this service is free, or even better yet, I can set up my own toll booth and collect money! They’re wonderful!”
Trouble is, one day, at this rate, almost all internet traffic will be going through that same gate. And once they have literally everyone (and all their traffic)… well, internet access is an immense amount of power to wield and I can’t see a world in which it remains untainted by commercial and government interests forever.
And “forever” is what’s at stake, because it’ll be near impossible to recover from once 99% of the population is happy to use one of the 3 approved browsers on the 2 approved devices (latest version only). Feels like we’re already accepting that future at an increasing rate.
Something coming down the pike I think, for example, is that IPv4 addresses are going to get realllly expensive soon. That's going to lead to all sorts of interesting things in the Internet landscape and their applications.
I'm sure we'll probably have to spend some decades in the "approved devices and browsers only" world before a next wave comes.
* curl | bash
* Docker
* Anything that smacks of cryptocurrency or other scams
Just a standard repo for Debian and RHEL derived distros. Fully open source so everyone can use it. (apt/dnf install no-bad-actors)
Until that exists, using Cloudflare is inevitable.
It needs to be able to at least:
* provide some basic security (something to check for sql injection, etc)
* rate limiting
* User agent blocking
* IP address and ASN blocking
Make it easy to set up with sensible defaults and a way to subscribe to blocklists.
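As a rough idea of how small the core of the rate limiting and User-Agent blocking items could be, here is a sketch using a per-client token bucket. The names `TokenBucket`, `Gatekeeper` and the `examplebot` blocklist entry are all made up for illustration, and the default limits are arbitrary rather than recommended values.

```python
import time

# Hypothetical blocklist; a real one would come from a subscribed list.
BLOCKED_AGENT_SUBSTRINGS = ("examplebot",)


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Gatekeeper:
    """User-Agent blocking first, then per-client rate limiting."""

    def __init__(self, rate=4.0, burst=8, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets = {}  # client key (e.g. an IP or prefix) -> TokenBucket

    def check(self, client_key, user_agent):
        agent = user_agent.lower()
        if any(s in agent for s in BLOCKED_AGENT_SUBSTRINGS):
            return False
        bucket = self.buckets.get(client_key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.clock)
            self.buckets[client_key] = bucket
        return bucket.allow()
```

The hard part of the wishlist is not this logic but the packaging, the sensible defaults and keeping the blocklists current. This sketch also never evicts old buckets, which a real tool would need to do.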
Also: Anubis does not mine cryptocurrency. Proof of work is easy to validate on the server and economically scales poorly in the wild for abusive scrapers.
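To illustrate the asymmetry, here is a generic hashcash-style sketch. It is not Anubis's actual scheme; the challenge format, the `solve`/`verify` names and the difficulty value are assumptions. The client has to try many nonces, while the server checks a submitted nonce with a single hash.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one SHA-256 call, regardless of difficulty."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute force, about 2**difficulty hashes on average."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce


if __name__ == "__main__":
    nonce = solve("example-challenge", 16)
    print(nonce, verify("example-challenge", nonce, 16))
```

One visitor pays that cost once per challenge and barely notices. A scraper making millions of requests pays it every time it is challenged, which is where the poor scaling comes from.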
Analogy: locks for high-value items in grocery stores are annoying to customers, but other stores aren't being coerced by the locksmith to use them.
I'll still post on the increasingly fewer hobby message boards that are out there.
Content writing, product reviews (real & fake), creative writing, customer support, photography/art to name a few off the top of my head.
HN itself is routinely scraped. What makes me most uncomfortable is deanonymization via speech analysis. It's something we can already do but is hard to do at scale. This is the ultimate tool for authoritarians. There's no hidden identities because your speech is your identifier. It is without borders. It doesn't matter if your government is good, a bad acting government (or even large corporate entity) has the power to blackmail individuals in other countries.
We really are quickly headed towards a dystopia. It could result in the entire destruction of the internet or an unprecedented level of self-censorship. We already have algospeak because of platform censorship[0]. But this would be a different type of censorship. Much more invasive, much more personal. There are things worse than the dark forest.
[0] literally yesterday YouTube gave me, a person in the 25-60 age bracket, a content warning because there was a video about a person that got removed from a plane because they wore a shirt saying "End veteran suicide".
[0.1] Even as I type this I'm censored! Apple will allow me to swipe the word suicidal but not suicide! Jesus fuck guys! You don't reduce the mental health crisis by preventing people from even being able to discuss their problems, you only make it worse!
It’s Orwellian in the truest sense of the word.
this u chief?
Technology has advanced and now reading the sum total of the freely exchanged ideas has become particularly valuable. But who cares? The internet still exists and is still usable to freely exchange ideas the way it’s always been.
The value that one website provides is a minuscule amount, the value of one individual poster on Reddit is minuscule. Are we asking that each poster on Reddit be paid 1 penny (that’s probably what your posts are worth) for their individual contribution? My websites were used to train these models probably, but the value that each contributed is so small that I wouldn’t even expect a few cents for it.
The person who’s going to profit here is Cloudflare or the owners of Reddit, or any other gatekeeper site that is already profiting from other people’s contributions.
The “parasitism” here just feels like normal competition between giant companies who have special access to information.
Cloudflare is here to protecc you from all those evils. Just come under our umbrella.
What is absolutely not ok is to crawl at such an excessive speed that it makes it difficult to host small scale websites.
Truly a tragedy of the commons.
I think, on the contrary, whoever sets the prompts stands to get the benefits; the AI provider gets a flat fee, and authors get nothing except the same AI tools as anyone else. That is natural, since the users are bringing the problem to the AI, so of course they have the lion's share here.
AI is useless until applied to a specific task owned by a person or company. Within such a task there is opportunity for AI to generate value. AI does not generate its own opportunities, users do.
Because users are distributed across society benefits follow the same curve. They don't flow to the center but mainly remain at the edge. In this sense LLMs are like Linux, they serve every user in their specific way, but the contributors to the open source code don't get directly compensated.
this just pushes AI agents "underground" to adopt the behavior of a full blown stealth focused scraper which makes it harder to detect.
At Coinbase, we've been building tools to make the blockchain the ideal payment rails for use cases like this with our x402 protocol:
Ping if you're interested in joining our open source community.
We are in the Napster phase of Web content stealing.
The open web is akin to the commons, public domain and public land. So this is like putting a spy cam on a freeway billboard, detecting autonomous vehicles, and shining a spotlight at their camera to block them from seeing the ad. To what end?
Eventually these questions will need to be decided in court:
1) Do netizens have the right to anonymity? If not, then we'll have to disclose whether we're humans or artificial beings. Spying on us and blocking us on a whim because our behavior doesn't match social norms will amount to an invasion of privacy (eventually devolving into papers please).
2) Is blocking access to certain users discrimination? If not, then a state-sanctioned market of civil rights abuse will grow around toll roads (think whites-only drinking fountains).
3) Is downloading copyrighted material for learning purposes by AI or humans the same as pirating it and selling it for profit? If so, then we will repeat the everyone-is-a-criminal torrenting era of the 2000s and 2010s when "making available" was treated the same as profiting from piracy, and take abuses by HBO, the RIAA/MPAA and other organizations who shut off users' internet connections through threat of legal actions like suing for violating the DMCA (which should not have been made law in the first place).
I'm sure there are more. If we want to live in a free society, then we must be resolute in our opposition of draconian censorship practices by private industry. Gatekeeping by large, monopolistic companies like Cloudflare simply cannot be tolerated.
I hope that everyone who reads this finds alternatives to Cloudflare and tells their friends. If they insist on pursuing this attack on our civil rights for profit, then I hope we build a countermovement by organizing with the EFF and our elected officials to eventually bring Cloudflare up on antitrust charges.
Cloudflare has shown that they lack the judgement to know better. Which casts doubt on their technical merits and overall vision for how the internet operates. By pursuing this course of action, they have lost face like Google did when it removed its "don't be evil" slogan from its code of conduct so it could implement censorship and operate in China (among other ensh@ttification-related goals).
Edit: just wanted to add that I realize this may be an opt-in feature. But that's not the point - what I'm saying is that this starts a bad precedent and an unnecessary arms race, when we should be questioning whether spidering and training AI on copyrighted materials are threats in the first place.
I really hope that we can continue training AI the same way we train humans – basically for free.
Looks like cloudflare just invented the new App Store.
Sure, fidelity has gotten better but so much has been lost.
If you don’t want people reading your data, don’t put it on the web.
The concept that copyright extends to “human eyeballs only” is a silly one.
They've been trying to do this for years. Now "AI" gives a convenient excuse.