Cloudflare Introduces Default Blocking of A.I. Data Scrapers

https://www.nytimes.com/2025/07/01/technology/cloudflare-ai-data.html

429•stephendause•7mo ago

Comments

cmg•7mo ago

Archive link: https://archive.ph/ARnyu

joshdavham•7mo ago

How did you make that link?

badlibrarian•7mo ago

Did they ever fix the auto-blocking of RSS feeds?

https://news.ycombinator.com/item?id=41864632

blakesterz•7mo ago

The list of bots is pretty short right now:

https://developers.cloudflare.com/bots/concepts/bot/#ai-bots

ZiiS•7mo ago

Enough to more than halve the traffic to most sites if the blocks hold.

hennell•7mo ago

Cloudflare sees a lot of the web traffic. I assume these are the biggest bots they're seeing right now, and any new contenders would be added as they find them. Probably impossible to really block everything, but they've got the web-coverage to detect more than most.

TechDebtDevin•7mo ago

They are lying. They can't detect crawlers unless we tell them who we are.

JimDabell•7mo ago

> AI bots

> You can opt into a managed rule that will block bots that we categorize as artificial intelligence (AI) crawlers (“AI Bots”) from visiting your website. Customers may choose to do this to prevent AI-related usage of their content, such as training large language models (LLM).

> CCBot (Common Crawl)

Common Crawl is not an AI bot:

https://commoncrawl.org

johneth•7mo ago

The data it collects is used by AI companies, though.

Spivak•7mo ago

Poor ChatGPT-User, nobody understands you. Blocking a real user because of the, admittedly odd, browser they're using misses the point.

Roark66•7mo ago

This is a bit silly. Slowing down, yes, but blocking? People who *really* want that content will find a way, and this will instead hit everyone else, who will have to solve silly riddles before following every link or run crypto mining for them before being shown the content.

I recently went to a big local auction site on which I buy frequently and I got one of these "we detected unusual traffic from your network" messages. And "prove you're human". Which was followed by "you completed the captcha in 0.4s, your IP is banned". Really? Am I supposed to slow down my browsing now? I tried a different browser, a different OS, logging on, clearing cookies, etc. Same result when I tried a search. It took 4h after contacting their customer service to unblock it. And the explanation was "you're clicking too fast".

At some point it just becomes a farce and the hassle is not worth the content. Also, while my story doesn't involve any bots perhaps a time will come when local LLMs will be good enough that I'll be able to tell one "reorder my cat food" and it will go and do it. Why are they so determined to "stop it" (spoiler, they can't).

For anyone who says LLMs are already capable of ordering cat food, I say not so fast. First, the cat food has to be on sale/offer (sometimes combined with extras). Second, it is supposed to be healthy (no grains), and third, the taste needs to be to my cats' liking. So far I'm not going to trust an LLM with this.

picohernandez•7mo ago

I was chatting with my sister last weekend. As a hobby, she creates and sells wedding invitation and other designs at an online marketplace site called Zazzle. She was telling me all about how that site implemented some automatic bot detection and it sounded like it was a total disaster. Real content creators were getting wrongly flagged as bots and then getting blocked from using the site just for using the most fundamental site functionality, and to make it worse, then it was apparently impossible for them to get past the captcha challenge or whatever it showed next. She forwarded me a link to some support forum discussion about it and it was mindboggling the troubles that some of the content creators there had to go through:

https://community.zazzle.com/t5/technical-issues/bot-test-wo...

My sister said that her sales figures are way down compared to what they used to be and she didn't know if this bot flagger was disrupting real paying customers too. She said it had flagged her a couple of times, although she was luckily able to get past the bot challenge. She has pretty much given up on making and uploading new designs because of what was happening to other content creators there. She's now scared to use the site because she doesn't want to get wrongly locked out of her account.

Sol-•7mo ago

Do the major AI companies actually honor robots.txt? Even if some of their publicly known crawlers might do it, surely they have surreptitious campaigns where they do some hidden crawling, just like how they illegally pirate books, images and user data to train on.

px43•7mo ago

There's a lack of clarity, but it seems likely to me that a majority of this traffic is actually people asking questions to the AI, and the AI going out and researching for answers. When the AI tools are being used like a web browser to do research, should they still be adhering to robots.txt, or is that only intended for search indexing?

chasd00•7mo ago

My thought too, honoring robots.txt is just a convention. There's no requirement to follow robots.txt, or at least certainly no technical requirement. I don't think there's any automatic legal requirement either.

Maybe sites could add "you must honor policies set in robots.txt" to something like a terms of service but I have no idea if that would have enough teeth for a crawler to give up.

TechDebtDevin•7mo ago

Cloudflare and their customers have been desperately trying for years to kill scrapers in court. This is all meaningless, but they are probably gearing up for another legal battle to define robots.txt as a legal contract. They're going to use this marketplace they're scamming people with to do it. They will fail.

prmoustache•7mo ago

I don't think terms of service are applicable anyway. Terms of Service aren't a signed contract, as you may never see them nor know there are any. This is the case whether you visit the site interactively or fetch a page programmatically.

mschuster91•7mo ago

Cloudflare, for all I hate their role as a gatekeeper these days, actually has the leverage to force the AI companies to bend.

deepsun•7mo ago

Hard to tell, because minor crawlers mimic major companies to avoid getting banned.

btown•7mo ago

The headline is somewhat misleading: sites using Cloudflare now have an opt-in option to quickly block all AI bots, but it won't be turned on by default for sites using Cloudflare.

The idea that Cloudflare could do the latter at the sole discretion of its leadership, though, is indicative of the level of power Cloudflare holds.

bitpush•7mo ago

It is now an adversarial relationship between AI bots and websites, and Cloudflare is merely reacting to it.

Would you say the same for ddos protection? Isn't that the same as well?

TechDebtDevin•7mo ago

They aren't doing anything. They are attempting to insert themselves into the middle of a marketplace (that doesn't exist and never will) where scrapers pay for IP. They think they're going to profit off the bots, not protect your site. Don't fall for their scam.

bitpush•7mo ago

What do you mean they are trying to insert themselves? If I have a website that I host with Cloudflare, I (as the rightful website owner) have inserted Cloudflare in between.

It isn't CF going around saying, "That's a nice website you have there. I'm gonna put myself in between."

GrayShade•7mo ago

> sites using Cloudflare now have an opt-in option to quickly block all AI bots, but it won't be turned on by default for sites using Cloudflare

Do you have a source for that? https://blog.cloudflare.com/content-independence-day-no-ai-c... does say "changing the default".

mattcollins•7mo ago

"This feature is available to all customers, meaning anyone can enable this today from the Cloudflare dashboard."

https://blog.cloudflare.com/control-content-use-for-ai-train...

TechDebtDevin•7mo ago

They can't do anything other than bog down the internet. I haven't found a single CF-provided challenge I haven't been able to get past in < half a day.

This is simply just the first step in them implementing a marketplace and trying to get into LLM SEO. They don't care about your site or protecting it. They are gearing up to start taking a cut in the middle between scrapers and publishers. Why wouldn't I go DIRECTLY to the publisher and make a deal? So dumb. I hate CF so much.

The only thing cloudflare knows how to do is MITM attacks.

Marsymars•7mo ago

So what would you suggest as an alternative if I have a site where I don’t want the content used for LLM training?

fkyoureadthedoc•7mo ago

Auth? Because whatever Cloudflare is doing isn't going to stop anyone serious about scraping data.

Marsymars•7mo ago

Let’s say I’m talking about content that I don’t want behind an auth wall. Is your position simply that all such sites should abandon any efforts to not have the content used for LLM training?

mattl•7mo ago

If you find a solution that’s not auth please let me know.

Wilder7977•7mo ago

Something like https://github.com/TecharoHQ/anubis?

It's not that different from CF, but you control it fully.

fkyoureadthedoc•7mo ago

CF will stop bots that respect your robots.txt, and try and stop ones that don't. If your concern is just that you don't want your content used to train an LLM, this will stop the honest companies.

If you are concerned about load on your site because the crawlers are hammering your site, the ones that respect robots.txt should be respecting your crawl delay too. CF will be able to block the dumb ones that ignore your robots.txt and hammer you with no real strategy.

But serious scrapers will have rotating residential IPs and be loading your site from real browsers, they'll take effort to appear as actual users. Sites like Ticket Master have an endless arms race against these. Some Chinese LLM company will get your data if it's public lol.

Marsymars•7mo ago

Sure, I understand all that, but you haven't really answered my question of what an alternative is, you've just laid out why CF is better than nothing. (Or even managing robots.txt manually.)

fkyoureadthedoc•7mo ago

Because that depends on your motivations, which I don't know. If you want to prevent your content from being used to train an LLM, CF is not going to prevent that. If you want to protect your site from heavy traffic, CF will do that just like it always has. If you absolutely don't want your content used to train an LLM, you're basically out of luck if your site is public. So no, you shouldn't abandon CF, because you get other benefits from it if you need them, but don't expect that you've done anything to prevent your content from training any given LLM.

DeusExMachina•7mo ago

I would expect these features to be opt-in. Even though I agree with it, I would be pretty upset if they just turned it on automatically on my website.

postalcoder•7mo ago

  > When you enable this feature via a pre-configured managed rule, Cloudflare can detect and block verified AI bots that comply with robots.txt and respect crawl rates, and do not hide their behavior from your website. The rule has also been expanded to include more signatures of AI bots that do not follow the rules.

We already know companies like Perplexity are masking their traffic. I'm sure there's more than meets the eye, but taking this at face value, doesn't punishing respectful and transparent bots only incentivize obfuscation?

edit: This link[0], posted in a comment elsewhere, addresses this question. tldr, obfuscation doesn't work.

  > We leverage Cloudflare global signals to calculate our Bot Score, which for AI bots like the one above, reflects that we correctly identify and score them as a “likely bot.”

  > When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint. For every fingerprint we see, we use Cloudflare’s network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint. To power our models, we compute global aggregates across many signals. Based on these signals, our models were able to appropriately flag traffic from evasive AI bots, like the example mentioned above, as bots.

[0] https://blog.cloudflare.com/declaring-your-aindependence-blo...

colechristensen•7mo ago

>doesn't punishing respectful and transparent bots only incentivize obfuscation?

They're cloudflare and it's not like it's particularly easy to hide a bot that is scraping large chunks of the Internet from them. On top of the fact that they can fingerprint any of your sneaky usage, large companies have to work with them so I can only assume there are channels of communication where cloudflare can have a little talk with you about your bad behavior. I don't know how often lawyers are involved but I would expect them to be.

jerf•7mo ago

"doesn't punishing respectful and transparent bots only incentivize obfuscation?"

Sure, but we crossed that bridge over 20 years ago. It's not creating an arms race where there wasn't already one.

Which is my generic response to everyone bringing similar ideas up. "But the bots could just...", yeah, they've been doing it for 20+ years and people have been fighting it for just as long. Not a new problem, not a new set of solutions, no prospect of the arms race ending any time soon, none of this is new.

hombre_fatal•7mo ago

Next line:

> The rule has also been expanded to include more signatures of AI bots that do not follow the rules.

The Block AI Bots rule on the Super Bot Fight Mode page does filter out most bot traffic. I was getting 10x as much traffic from bots as from users.

It definitely doesn't rely on robots.txt or user agent. I had to write a page rule bypass just to let my own tooling work on my website after enabling it.

account42•7mo ago

How many of those "bots" you are filtering are actually bots and how many are regular users buttflare has misidentified as bots?

hombre_fatal•7mo ago

Pretty simple to see this if you've run a website: compare your analytics pre-bot to post-bot to post-bot-blocker.

There is a clear moment where you land on AI bot radar. For my large forum, it was a month ago.

Overnight, "72 users are viewing General Discussion" turned into "1720 users".

40% of requests being cached turned into 3% of requests being cached.

fluidcruft•7mo ago

Cloudflare already knows how to make the web hell for people they don't like.

I read the robots.txt entries as those AI bots that will be not marked as "malicious" and that will have the opportunity to be allowed by websites. The rest will be given the Cloudflare special.

dougb5•7mo ago

> Cloudflare can detect and block verified AI bots that comply with robots.txt and respect crawl rates, and do not hide their behavior from your website

It's the bots that do hide their behavior -- via residential proxy services -- that are causing most of the burden, for my site anyway. Not these large commercial AI vendors.

alganet•7mo ago

> If A.I. companies freely use data from various websites without permission or payment, people will be discouraged from creating new digital content

I don't see a way out of this happening. AI fundamentally discourages other forms of digital interaction as it grows.

Its mechanism of growing is killing other kinds of digital content. It will eventually kill the web, which is, ironically, its main source of food.

fennecfoxy•7mo ago

Additionally, ad blocker usage is apparently at 30%. So it's a redundant or more nuanced argument, really.

account42•7mo ago

Ad blockers only discourage commercialized content creation, not all of it. IMO that actually improves the quality of the content created.

spwa4•7mo ago

Yes what everyone wants to do with AI: generate entertainment and interactions with humans, including economical ones, will need to happen or AI will starve.

alganet•7mo ago

That's what is going to make it starve. Belly full, but of its own shit being tossed around humans seeking cheap copouts of doing actual work.

BrouteMinou•7mo ago

Just like cancer?

jasonthorsness•7mo ago

I turned this on and it adjusts the robots.txt automatically; not sure what else it is doing.

   # NOTICE: The collection of content and other data on this
   # site through automated means, including any device, tool,
   # or process designed to data mine or scrape content, is
   # prohibited except (1) for the purpose of search engine indexing or
   # artificial intelligence retrieval augmented generation or (2) with express
   # written permission from this site’s operator.

   # To request permission to license our intellectual
   # property and/or other materials, please contact this
   # site’s operator directly.

   # BEGIN Cloudflare Managed content

   User-agent: Amazonbot
   Disallow: /

   User-agent: Applebot-Extended
   Disallow: /

   User-agent: Bytespider
   Disallow: /

   User-agent: CCBot
   Disallow: /

   User-agent: ClaudeBot
   Disallow: /

   User-agent: Google-Extended
   Disallow: /

   User-agent: GPTBot
   Disallow: /

   User-agent: meta-externalagent
   Disallow: /

   # END Cloudflare Managed Content

   User-agent: *
   Disallow: /*
   Allow: /$
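
For what it's worth, here is a minimal sketch of how a crawler that chooses to honor robots.txt would read the managed section, using Python's standard `urllib.robotparser`. The `example.com` URL and the `Mozilla/5.0` agent string are placeholders, and only three of the managed entries are copied. I left out the trailing `User-agent: *` group because, as far as I know, this parser matches paths by plain prefix and does not implement the `*`/`$` wildcards.

```python
from urllib.robotparser import RobotFileParser

# A few entries copied from the Cloudflare-managed section above.
MANAGED_RULES = """\
User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(MANAGED_RULES)
page = "https://example.com/some/page"  # placeholder URL, not a real site

# Listed AI crawlers are disallowed; an unlisted agent has no matching group.
for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
    print(agent, parser.can_fetch(agent, page))
```

This only models the voluntary side: nothing here stops a client that never calls `can_fetch`.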

xyst•7mo ago

So this is in addition to updating the robots.txt file, which really only blocks a small number of them.

It seems CF has been gathering data and profiling these malicious agents.

This post by CF elaborates a bit further: https://blog.cloudflare.com/declaring-your-aindependence-blo...

Basically becomes a game of cat and mouse.

postalcoder•7mo ago

This is interesting. The reasoning and response don't line up.

  > Cloudflare is making the change to protect original content on the internet, Mr. Prince said. If A.I. companies freely use data from various websites without permission or payment, people will be discouraged from creating new digital content, he said

  >  prohibited except for the purpose of [..] artificial intelligence retrieval augmented generation

This seems to be targeted at taxing training of language models, but why an exclusion for the RAG stuff? That seems like it has a much greater immediate impact for online content creators, for whom the bots are obviating a click.

fennecfoxy•7mo ago

With that opinion, are you also suggesting that we ban ad blockers? Because it's better I not click & consume resources than click and not be served ads, basically just costing the host money.

It makes sense to allow for RAG in the same way that search engines provide a snippet of an important chunk of the page.

A blog author could not complain that their blog is getting ragged when they're extremely liable to be Google/whatever searching all day and basically consuming others' content in exactly the same way that they're trying to disparage.

postalcoder•7mo ago

I don't think we should ban ad blockers, but I also think it's fair to suggest that the loss of organic traffic could be affecting the incentive to create new digital content, at least as much as the fear of having your content absorbed into an LLM's training data.

Boldened15•7mo ago

IMO the backlash against LLMs is more philosophical, a lot of people don’t like them or the idea of one learning from their content. Unless your website has some unique niche information unavailable anywhere else there’s no direct personal risk. RAG would be a more direct threat if anything.

toomuchtodo•7mo ago

It's really about who is getting the value from the work of the content. If content creators of all sorts have their work consumed by LLMs, and LLM orgs charge for it and capture all the value, why should people create just to have their work vacuumed up for the robot's benefit? For exposure? You can't eat or pay rent with exposure. Humans must get paid, and LLMs (foundational models and output using RAG) cannot improve without a stream of works and data humans create.

Whether you call it training or something else is irrelevant, it's really exploitation of human work and effort for AI shareholder returns and tech worker comp (if those who create aren't compensated). And the technocracy has not been, based on the evidence, great stewards of the power they obtain through this. Pay the humans for their work.

o11c•7mo ago

It's not philosophical, it's economic.

AI scrapers increase traffic by maybe 10x (this varies per site) but provide no real value whatsoever to anyone. If you look at various forms of "value":

* Saying "this uses AI" might make numbers go up on the stock market if you manage to persuade people it will make numbers go up (see also: the market will remain irrational longer than you can remain solvent).

* Saying "this uses AI" might fulfill some corporate mandate.

* Asking AI to solve a problem (for which you would actually use the solution) allows you to "launder" the copyright of whatever source it is paraphrasing (it's well established that LLMs fail entirely if a question isn't found within their training set). Pirating it directly provides the same value, with significantly less errors/handholding.

* Asking AI to entertain you ... well, there's the novelty factor I guess, but even if people refuse to train themselves out of that obsession, the world is still far too full of options for any individual to explore them all. Even just the question of "what kind of ways can I throw some kind of ball around" has more answers than probably anyone here knows.

What am I missing?

robrenaud•7mo ago

Why are 100s of millions of people using AI if it is providing no value?

o11c•7mo ago

Because it's injected into a previously working product - even if it makes it worse - and automatically injects its ideas. That counts as "somebody using it".

Because it's bundled with other products that do provide value, and that counts as "someone using it".

Because some middle manager has declared that I must add AI to my workflow ... somehow. Whatever, if they want to pay me to accomplish less than usual, that's not my problem.

Because it's a cool new toy to play around with a bit.

Because surely all these people saying "AI is useful now" aren't just lying shills, so we'd better investigate their claims again ... nope, still terminally broken.

ijk•7mo ago

What I want to know is if the flood of scraping everyone has been complaining about is coming from people trying to scrape for training or bots doing RAG search.

I get that everyone wants data, but presumably the big players already scraped the web. Do they really need to do it again? Or is it bit players reproducing data that's likely already in the training set? Or is it really that valuable to have your own scraped copy of internet scale data?

I feel like I'm missing something here. My expectation is that RAG traffic is going to be orders of magnitude higher than scraping for training. Not that it would be easy to measure from the outside.

mattcollins•7mo ago

I wondered about this, too.

Cloudflare have some recent data about traffic from bots (https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...) which indicates that, for the time being, the overwhelming majority of the bot requests are for AI training and not for RAG.

wiether•7mo ago

You should ask Zuck, since, from what we've seen and what we were asked to act against, Meta is the main culprit in scraping every single page of websites, multiple times a day.

And I'm talking about ecommerce websites, with their bot scraping every variation of each product, multiple times a day.

progmetaldev•7mo ago

I believe it's both. We're at a place where legislation hasn't really declared what is and isn't allowed. These scrapers are acting like Googlebot or any other search engine crawler, and trying to find any kind of new content that might be of value to their users.

New data is still being added online daily (probably hourly, if not more often) by humans, and the first ones to gain access could be the "winners," particularly if their users happen to need up to date data (and the service happens to have scraped it). Just like with search engines/crawlers, there's also the big players that may respect your website, but there are also those that don't use rate-limiting or respect robots.txt.

lxgr•7mo ago

More and more people use ChatGPT for search, so blocking that doesn't seem like a successful strategy long-term.

bee_rider•7mo ago

I wonder… Google scrapes for indexing and for AI, right? I wonder if they will eventually say: ok, you can have me or not, if you don’t want to help train my AI you won’t get my searches either. That’s a tough deal but it is sort of self-consistent.

giancarlostoro•7mo ago

"Embrace, Extend, Extinguish" Google's mantra. And yes, I know about Microsoft's history with that phrase ;) But Google has done this with email, browsers (Google has web apps that run fine on Firefox but request you use Chrome), Linux (Android), and I'm sure there's others I am forgetting about.

So yeah, I too could see them doing this.

mrweasel•7mo ago

Very few people seem to be complaining that Google crashes their sites. Google also publishes its crawlers' IP ranges, but you really don't need to rate-limit Google; they know how to back off and not overload sites.

Symbiote•7mo ago

In theory — in practise I've had to limit Google on two large sites at work. I currently have them limited to 10/s for non-cached requests.
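
The comment doesn't say how the 10/s limit is enforced. One common way to express such a cap is a token bucket, sketched below under that assumption. The class name, the burst capacity of 10, and passing the clock in explicitly (so the behaviour is deterministic) are all illustrative choices, not anything from the sites in question.

```python
class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens that can accumulate
        self.tokens = capacity    # start with a full bucket
        self.last = now           # timestamp of the previous check

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical crawler limit: 10 non-cached requests per second.
bucket = TokenBucket(rate=10, capacity=10)
burst = [bucket.allow(0.0) for _ in range(11)]   # 11 requests at t = 0
later = [bucket.allow(0.5) for _ in range(6)]    # 6 more half a second later
print(burst.count(True), later.count(True))
```

In practice this would be keyed per crawler (for example by verified IP range) and applied only to requests that miss the cache.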

progmetaldev•7mo ago

Curious if the content on those sites might have high value to Google? Such as if they have data that is new or unavailable elsewhere, or if they're just standard sites, and you've just been unlucky?

I have had odd bot behavior from some major crawlers, but never from Google. I wonder if there is a correlation to usefulness of content, or if certain sites get stuck in a software bug (or some other strange behavior).

Symbiote•7mo ago

Google do value the sites, they have data unavailable elsewhere. At some point we had an automated message saying the site had too many pages and would no longer be indexed, then a human message saying that was a mistake, and our site was an exception to that rule.

But as with any contact with these large companies, our contact eventually disappeared.

1vuio0pswjnm7•7mo ago

"User-agent: CCBot disallow: /"

Is Common Crawl exclusively for "AI"

CCBot was already in so many robots.txt prior to this

How is CC supposed to know or control how people use the archive contents

What if CC is relying on fair use

   # To request permission to license our intellectual
   # property and/or other materials, please contact this
   # site's operator directly

If the operator has no intellectual property rights in the material, then do they need permission from the rights holders to license such materials for use in creating LLMs and collect licensing fees

Is it common for website terms and conditions to permit site operators to sublicense other peoples' ("users") work for use in creating LLMs for a fee

Is this fee shared with the rights holders

nemomarx•7mo ago

Read a tos and notice that you give the site operators unlimited license to reproduce or spread your works, almost on any site. it's required to host and show the content essentially

ronsor•7mo ago

   # To request permission to license our intellectual
   # property and/or other materials, please contact this
   # site's operator directly

Scrapers don't accept the terms of service.

Ironically, I've only ever scraped sites that block CCBot, otherwise I'd rather go to Common Crawl for the data.

Bender•7mo ago

For my silly hobby sites I just return status 444 (close the connection) for anything that has a case-insensitive "bot" in the UA and requests anything other than robots.txt, humans.txt, favicon.ico, etc... This would also drop search engines, but I blackhole route most of their CIDR blocks. I'm probably the only one here that would do this.
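
Roughly, the rule described above could be sketched like this in Python. It only models the decision; the 444 status itself is an nginx-specific code that closes the connection without sending a response. The function name and the exact exemption list are assumptions, since the "etc." is not spelled out.

```python
# Paths that self-identified bots may still request (assumed list).
EXEMPT_PATHS = {"/robots.txt", "/humans.txt", "/favicon.ico"}


def should_drop(user_agent: str, path: str) -> bool:
    """Return True if the request should be dropped without a response."""
    if path in EXEMPT_PATHS:
        return False
    # Case-insensitive substring match on "bot" anywhere in the User-Agent.
    return "bot" in user_agent.lower()


print(should_drop("Mozilla/5.0 (compatible; GPTBot/1.0)", "/index.html"))
print(should_drop("Mozilla/5.0 (compatible; GPTBot/1.0)", "/robots.txt"))
print(should_drop("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0", "/index.html"))
```

A crawler that sends an ordinary browser User-Agent passes straight through this check, which is why it only helps against bots that announce themselves.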

sneak•7mo ago

How does a bot scraping your silly hobby sites for any purpose harm or negatively affect you in any way?

pixl97•7mo ago

Depends if they hit a site enough to make it cost something. It's not hard for bots to flood servers.

Bender•7mo ago

Only if they push me over my bandwidth limits but they can't do that if I just drop them on the floor.

slenk•7mo ago

I thought I saw cloudflare insert noindex links?

swyx•7mo ago

what actually are the consequences of ignoring robots.txt (apart from DDOS)? have any of these cases ended up in court at all?

v5v3•7mo ago

BBC recently served a cease and desist on perplexity to stop, and delete all existing.

https://www.bbc.co.uk/news/articles/cy7ndgylzzmo

So an AI company can just be naughty till asked to stop, and then exclude that one company that has the financial resources to go legal.

lxgr•7mo ago

That's at least a more reasonable default than what I've seen at least one newspaper do, which is to block both LLM scrapers and things like ChatGPT's search feature explicitly.

cratermoon•7mo ago

I'm still not sure this is going to be very effective, as so many of the worst offenders don't identify themselves as bots, and often change their user agent. Has Cloudflare said anything about identifying the bad actors?

chasd00•7mo ago

i've mentioned this in a couple replies so maybe i'm wrong but it's up to the client to obey robots.txt. Why would they not just ignore it? Unless there's some legal consequence not complying with robots.txt then why even follow it? There's no technical enforcement of the policies in the file, it's up to the client to honor them.

kentonv•7mo ago

> There's no technical enforcement of the policies in the file, it's up to the client to honor them.

That's incorrect. Cloudflare does in fact enforce this at a technical level. Cloudflare has been doing bot detection for years and can pretty reliably detect when bots are not following robots.txt and then block them.

GrayShade•7mo ago

Yes, they have over the years, for example https://blog.cloudflare.com/residential-proxy-bot-detection-..., https://blog.cloudflare.com/cloudflare-bot-management-machin..., https://blog.cloudflare.com/introducing-bot-analytics/.

cratermoon•7mo ago

Between those measures, if they are effective and the new blocking, maybe the bigger companies will be induced to behave a little better.

lucasyvas•7mo ago

I fail to see how this won’t just result in UA string or other obfuscation.

chasd00•7mo ago

a crawler doesn't have to change anything, they can just ignore the robots.txt file. It's up to the client to read robots.txt and follow its directives but there's no technical reason why the client cannot just ignore everything in the file period.

kube-system•7mo ago

Cloudflare’s filtering is already way more sophisticated than just looking at UA string or other voluntary reporting. They’re almost certainly using fingerprinting and behavioral analytics.

gazpacho•7mo ago

From an open source project's perspective we’d want to disable this on our docs sites. We actually want those to be very discoverable by LLMs, during training or online usage.

rorylaitila•7mo ago

Unfortunately I think this is pissing into the wind. Information websites are all but dead. AI contains all published human information. If you have positioned your website as an answer to a question, it won't survive that way.

"Information" is dead but content is not. Stories, empathy, community, connection, products, services. Content of this variety is exploding.

The big challenge is discoverability. Before, information arbitrage was one pathway to get your content discovered, or to skim a profit. This is over with AI. New means of discovery are necessary, largely network and community based. AI will throw you a few bones, but it will be 10% of what SEO did.

fennecfoxy•7mo ago

>AI contains all published human information

No, it most certainly does not. It was certainly trained on large swathes of human knowledge/interactions.

A model that consists of a perfect representation/compression of all this info is a zip file, not a model file.

rorylaitila•7mo ago

AI providers have scraped, and will continue to scrape, all internet-published information, or virtually so. Since "Information" is infinite, AI cannot contain "all information" in a complete sense. But it certainly answers almost everything that matters for any existing search query that has ever been targeted by a webpage that is crawlable.

In any case, as manifest by real world SEO, which is plummeting in traffic for informational queries, the effect is the same. This real world impact is what matters and will not be reversed, regardless of attempts at blocking.

ozgrakkurt•7mo ago

You are assuming LLMs will replace search engines. Why is this the case?

To me it seems like there has to be so much optimization for this to happen that it is not likely. LLM answers are slow and unreliable. Even using something like Perplexity doesn’t give much value over using a regular search engine, in my experience.

rorylaitila•7mo ago

LLMs will not fully replace search engines, but Google and Bing are evolving to be LLM first, anyhow. So "what is a search engine" today is not what it was yesterday. Let's call the time before LLMs, traditional search. LLM first products bundle some aspect of traditional search. And traditional search is adding LLM answers.

Traditional search will still be highly useful for transactional, product, realtime, and action oriented queries. Also for discovering educational/entertainment content that is valued in of itself and cannot be reformulated by LLM.

Meekro•7mo ago

I've heard lots of people on HN complaining about bot traffic bogging down their websites, and as a website operator myself I'm honestly puzzled. If you're already using Cloudflare, some basic cache configuration should guarantee that most bot traffic hits the cache and doesn't bog down your servers. And even if you don't want to do that, bandwidth and CPU are so cheap these days that it shouldn't make a difference. Why is everyone so upset?

deepsiml•7mo ago

Not much into that kind of DevOps. What is a good basic caching in this instance?

TechDebtDevin•7mo ago

Cloudflare and other CDNs will usually automatically cache your static pages.

haiku2077•7mo ago

It comes down to:

1. Use the Cache-Control header to express how to cache your site correctly (https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Cac...)

2. Use a CDN service, or at least a caching reverse proxy, to serve most of the cacheable requests to reduce load on the (typically much more expensive) origin servers
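As a rough sketch of the two steps above: the origin can send one freshness lifetime for browsers and a different one for shared caches such as a CDN. The helper name and the numbers are assumptions for illustration; `max-age`, `s-maxage` and `stale-while-revalidate` are standard `Cache-Control` directives.

```python
def cache_control(browser_seconds: int, cdn_seconds: int, stale_seconds: int = 0) -> str:
    """Build a Cache-Control response header value.

    max-age applies to any cache, s-maxage overrides it for shared caches
    (CDNs, reverse proxies), and stale-while-revalidate lets a cache serve a
    stale copy while it refreshes in the background.
    """
    parts = ["public", f"max-age={browser_seconds}", f"s-maxage={cdn_seconds}"]
    if stale_seconds > 0:
        parts.append(f"stale-while-revalidate={stale_seconds}")
    return ", ".join(parts)


if __name__ == "__main__":
    # Hypothetical policy: browsers keep a page for a minute, the CDN for an hour.
    print("Cache-Control:", cache_control(60, 3600, stale_seconds=300))
```

With a policy like this, most repeat requests for a static page should be answered by the CDN rather than the origin, assuming the CDN honors these directives.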

mrweasel•7mo ago

Just note that many AI scrapers will go to great length to do cache busting. For some reason many of them feel like they need to get the absolute latest version and don't trust your cache.

haiku2077•7mo ago

You can use Cache Control headers to express that your own CDN should aggressively refresh a resource but always serve it to external clients from cache. It's covered in the link under "Managed Caches"

cortesoft•7mo ago

A CDN can be configured to ignore cache control headers in the requests and cache things anyway.
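A toy illustration of that idea, not any CDN's actual configuration: the lookup below never consults the request's `Cache-Control` header, and it normalizes the cache key so junk query parameters cannot force a miss. The allowed parameter names are hypothetical, and query-string cache busting is an assumption about how the scrapers do it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical: the only query parameters that actually change the response.
ALLOWED_PARAMS = {"page", "lang"}


def cache_key(url: str) -> str:
    """Reduce a URL to its path plus the allowed query parameters, sorted."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED_PARAMS)
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")


def lookup(cache: dict, url: str, request_headers: dict):
    """Return the cached body for url, or None on a miss.

    request_headers is deliberately unused: a client sending
    'Cache-Control: no-cache' still gets the cached copy.
    """
    return cache.get(cache_key(url))


if __name__ == "__main__":
    cache = {"/docs?lang=en": "<html>cached docs</html>"}
    busting_url = "https://example.com/docs?lang=en&_=1719878400"
    print(lookup(cache, busting_url, {"Cache-Control": "no-cache"}))
```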

deepsiml•7mo ago

Thank you!

conductr•7mo ago

The presumption that I’m already using Cloudflare is a start. Is this a requirement for maintaining a simple website now?

haiku2077•7mo ago

Either that or Anubis (https://anubis.techaro.lol/docs), yes.

roguecoder•7mo ago

So these companies broke the internet

haiku2077•7mo ago

Which companies?

OpenAI, Anthropic, Google? No, their bots are pretty well behaved.

The smaller AI companies deploying bots that don't respect any reasonable rate limits and are scraping the same static pages thousands of times an hour? Yup

e3bc54b2•7mo ago

Anecdote, but at least for a tiny little server hosting a single public repository, none of these companies had 'well behaved' bots. It may be possible that they learned to behave better, but I wouldn't know, since my only possible recourse was to blacklist them all AND take the repo private.

haiku2077•7mo ago

Those are the small companies spoofing their user agent as the big companies to dodge countermeasures.

noodle•7mo ago

As someone who had some outages due to AI traffic and is now using CloudFlare's tools:

Most of my site is cached in multiple different layers. But some things that I surface to unauthenticated public can't be cached while still being functional. Hammering those endpoints has taken my app down.

Additionally, even though there are multiple layers, things that are expensive to generate can still slip through the cracks. My site has millions of public-facing pages, and a batch of misses that happen at the same time on heavier pages to regenerate can back up requests, which leads to errors, and errors don't result in caches successfully being filled. So the AI traffic keeps hitting those endpoints, they keep not getting cached and keep throwing errors. And it spirals from there.
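One possible mitigation for the spiral described above is to cache failures for a short time as well, so repeated hits on a broken page stop triggering regeneration. This is only a sketch under that assumption, not what the site above actually does. The class name, TTLs and the 503 body are invented for illustration.

```python
import time


class ErrorAwareCache:
    """Cache successful renders for a long time and failures for a short time."""

    def __init__(self, ok_ttl=300.0, error_ttl=10.0, clock=time.monotonic):
        self._ok_ttl = ok_ttl
        self._error_ttl = error_ttl
        self._clock = clock
        self._entries = {}  # key -> (expires_at, status, body)

    def get_or_render(self, key, render):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1], hit[2]
        try:
            status, body = 200, render()
            ttl = self._ok_ttl
        except Exception:
            # Remember the failure briefly so the next requests do not
            # immediately re-run the expensive render.
            status, body = 503, "temporarily unavailable"
            ttl = self._error_ttl
        self._entries[key] = (now + ttl, status, body)
        return status, body


if __name__ == "__main__":
    cache = ErrorAwareCache()
    print(cache.get_or_render("/heavy-page", lambda: "<html>rendered</html>"))
```

The trade-off is that a transient error is served for up to `error_ttl` seconds even after the backend recovers.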

x0x0•7mo ago

It's not complex. I worked on a big site. We did not have the compute or i/o (most particularly db iops) to live generate the site. Massive crawls both generated cold pages / objects (cpu + iops) and yanked them into cache, dramatically worsening cache hit rates. This could easily take down the site.

Cache is expensive at scale. So permitting big or frequent crawls by stupid crawlers either requires significant investment in cache or slows down and worsens the site for all users. For whom we, you know, built the site, not to provide training data for companies.

As others have mentioned, Google is significantly more competent than 99.9% of the others. They are very careful to not take your site down and provide, or used to provide, traffic via their search. So it was a trade, not a taking.

Not to mention I prefer not to do business with Cloudflare because I don't like companies that don't publish quota. If going over X means I need an enterprise account that starts at $10k/mo, I need to know the X. Cloudflare's business practice appears to be letting customers exceed that quota then aggressively demanding they pay or they'll be kicked off the service nearly immediately.

jtolmar•7mo ago

The stories I've heard have been mostly about scraper bots finding APIs like "get all posts in date range" and then hammering that with every combo of start/end date.

jauntywundrkind•7mo ago

I too am a bit confused / mystified at the strong reaction. But I do expect a lot of badly optimized sites that just want out.

I struggle to think of a web related library that has spread faster than Anubis checker. It's everywhere now! https://github.com/TecharoHQ/anubis

I'm surprised we don't see more efforts to rate limit. I assume many of these are distributed crawlers, but it feels like there's got to be pools of activity spinning up on a handful of IPs. And that they would be time correlated together pretty clearly. Maybe that's not true. But it feels like the web, more than anything else, needs some open source software to add a lot more 420 Enhance Your Calm responses. https://http.dev/420
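For reference, a minimal token-bucket sketch of the kind of rate limiting meant here. It answers with 429 Too Many Requests, the standardized status code, rather than the unofficial 420. The class, the rate and the capacity are illustrative assumptions, and a real deployment would keep one bucket per client key.

```python
class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the last request.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def status_for(bucket: TokenBucket, now: float) -> int:
    """Return the HTTP status a server might send for a request at time `now`."""
    return 200 if bucket.allow(now) else 429


if __name__ == "__main__":
    bucket = TokenBucket(rate=1.0, capacity=2.0)
    print([status_for(bucket, t) for t in (0.0, 0.0, 0.0, 1.0)])
```

As the replies in this thread point out, keying the bucket by IP alone does not help when the traffic is spread over a very large number of addresses.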

zerocrates•7mo ago

The reaction comes from some combination of

- opposition to generative AI in general

- a view that AI, unlike search which also relies on crawling, offers you no benefits in return

- crawlers from the AI firms being less well-behaved than the legacy search crawlers, not obeying robots.txt, crawling more often, more aggressively, more completely, more redundantly, from more widely-distributed addresses

- companies sneaking in AI crawling underneath their existing tolerated/whitelisted user-agents (Facebook was pretty clearly doing this with "facebookexternalhit" that people would have allowed to get Facebook previews; they eventually made a new agent for their crawling activity)

- a simultaneous huge spike in obvious crawler activity with spoofed user agents: e.g. a constant random cycling between every version of Chrome or Firefox or any browser ever released; who this is or how many different actors it is and whether they're even doing crawling for AI, who knows, but it's a fair bet.

Better optimization and caching can make this all not matter so much but not everything can be cached, and plenty of small operations got by just fine without all this extra traffic, and would get by just fine without it, so can you really blame them for turning to blocking?

jowea•7mo ago

I'm not an expert on website hosting, but after reading some of the blog posts on Anubis, those people were truly at wit's end trying to block AI scrapers with techniques like the ones you imply.

jauntywundrkind•7mo ago

https://xeiaso.net/blog/2025/anubis/ links to https://pod.geraspora.de/posts/17342163 which says:

> If you try to rate-limit them, they'll just switch to other IPs all the time. If you try to block them by User Agent string, they'll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.

My gut is that the switch between IP addresses can't be that hard to follow. That the access pattern is pretty obvious to follow across identities.

But it would be non trivial, it would entail crafting new systems and doing new work per request (when traffic starts to be elevated, as a first gate).

Just making the client run through some math gauntlet is an obvious win that aggressors probably can't break. But I still think there's probably some really good low-hanging fruit for identifying and rate limiting even these rather more annoying traffic patterns, that the behavior itself leaves a fingerprint that can't be hidden and which can absolutely be rate limited. And I'd like to see that area explored.

Edit: oh heck yes, new submission with 1.7tb logs of what AI crawlers do. Now we can machine learn some better rate limiting techniques! https://news.ycombinator.com/item?id=44450352 https://huggingface.co/datasets/lee101/webfiddle-internet-ra...

xena•7mo ago

This isn't as helpful as you think. If it included all of the HTTP headers that the bots sent and other metadata like TLS ClientHelloInfo it would be a lot more useful.

jauntywundrkind•7mo ago

There's headers, but I hadn't noticed that they are the response headers. :facepalm:

Symbiote•7mo ago

That's a pretty big assumption.

The largest site I work on has 100,000s of pages, each in around 10 languages — that's already millions of pages.

It generally works fine. Yesterday it served just under 1000 RPS over the day.

AI crawlers have brought it down when a single crawler has added 100, 200 or more RPS distributed over a wide range of IPs — it's not so much the number of extra requests, though it's very disproportionate for one "user", but they can end up hitting an expensive endpoint excluded by robots.txt and protected by other rate-limiting measures, which didn't anticipate a DDoS.

Meekro•7mo ago

Ok, clearly I had no idea of the scale of it. 200RPS from a single bot sounds pretty bad! Do all 100,000+ pages have to be live to be useful, or could many be served from a cache that is minutes/hours/days old?

Symbiote•7mo ago

The main data for those pages is in a column store, so it can sustain many thousand RPS (at least).

The problem is we have things like

  Disallow: /the-search-page
  Disallow: /some-statistics-pages

in robots.txt, which is respected by most search engine (etc) crawlers, but completely ignored by the AI crawlers.

By chance, this morning I find a legacy site is down, because in the last 8 hours it's had 2 million hits (70/s) to a location disallowed in robots.txt. These hits have come from over 1.5 million different IP addresses, so the existing rate-limit-by-IP didn't catch it.
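A quick back-of-the-envelope check of those figures shows why a per-IP limit never triggers. The numbers are the ones quoted above, with 1.5 million taken as a lower bound on the number of addresses:

```python
hits = 2_000_000
window_seconds = 8 * 3600
distinct_ips = 1_500_000  # "over 1.5 million", so this is a lower bound

# Aggregate load on the disallowed location.
requests_per_second = hits / window_seconds

# Average number of requests any single address made in the whole window.
requests_per_ip = hits / distinct_ips

print(f"{requests_per_second:.1f} requests/s in aggregate")
print(f"at most about {requests_per_ip:.2f} requests per IP over 8 hours")
```

At roughly one or two requests per address over eight hours, each individual IP looks like an ordinary visitor.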

The User-Agents are a huge mixture of real-looking web browsers; the IPs look to come from residential, commercial and sometimes cloud ranges, so it's probably all hacked computers.

I could see Cloudflare might have data to block this better. They don't just get 1 or 2 requests from an IP, they presumably see a stream of them to different sites. They could see many different user agents being used from that IP, and other patterns, and can assign a reputation score.
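A toy version of one such signal, counting distinct User-Agent strings per IP. This is a guess at the kind of heuristic meant, not Cloudflare's actual method. The threshold and the sample addresses (documentation ranges) are made up.

```python
from collections import defaultdict


def distinct_agents_per_ip(requests):
    """Map each IP to the number of different User-Agent strings it sent."""
    seen = defaultdict(set)
    for ip, user_agent in requests:
        seen[ip].add(user_agent)
    return {ip: len(agents) for ip, agents in seen.items()}


def suspicious_ips(requests, max_agents=3):
    """Return IPs that used more than max_agents distinct User-Agents, sorted."""
    counts = distinct_agents_per_ip(requests)
    return sorted(ip for ip, n in counts.items() if n > max_agents)


if __name__ == "__main__":
    # Hypothetical log entries: (ip, user_agent).
    log = [("192.0.2.10", f"Browser/{version}") for version in range(1, 6)]
    log += [("198.51.100.7", "Browser/120")] * 4
    print(suspicious_ips(log))
```

A single site that sees one request per address cannot compute this, which is the point about a network-wide view.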

I think we will need to add a proof-of-work thing in front of these pages and probably whitelist some 'good' bots (Wikipedia, Internet Archive etc). It is annoying since this was working fine in its current form for over 5 years.
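For readers unfamiliar with the idea, here is a hashcash-style proof-of-work sketch. It is not Anubis's actual implementation; the challenge string, the difficulty and the function names are assumptions. The client has to search for a nonce, while the server verifies with a single hash.

```python
import hashlib
from itertools import count


def _digest(challenge: str, nonce: int) -> str:
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: find the first nonce whose hash starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    for nonce in count():
        if _digest(challenge, nonce).startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the client's work."""
    return _digest(challenge, nonce).startswith("0" * difficulty)


if __name__ == "__main__":
    challenge = "example-challenge"  # a real server would issue a random one per visitor
    nonce = solve(challenge, difficulty=3)
    print(nonce, verify(challenge, nonce, difficulty=3))
```

Each extra hex digit of difficulty multiplies the expected client work by 16, which is cheap for one human visit and costly across millions of crawler requests.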

yodon•7mo ago

Discussed yesterday (270+ comments)[0]

[0]https://news.ycombinator.com/item?id=44432385

dawnerd•7mo ago

I’ve been using this for a while on my Mastodon server, and after a few tweaks to make sure it wasn’t blocking legit traffic it’s been working really well. Between them, Microsoft and Meta were hitting my services more than all other traffic combined, which says a lot if you know how noisy Mastodon can be. Server load went down dramatically.

It also completely put a stop to perplexity as far as I can tell.

And the robots file meant nothing, they’d still request it hundreds of thousands of times instead of caching it. Every request they’d hit it first then hit their intended url.

TechDebtDevin•7mo ago

This does nothing, dude. Literally nothing. OpenAI or whoever are just going to hire people like me who don't get caught. Stop ruining the experience of users and allowing cf to fill the internet with more bloated JavaScript challenge pages and privacy-invading fingerprinting. Stop making cf the police of the internet. We're literally handing the internet to this company on a silver platter to do MITM attacks on our privacy and god knows what else. Fucking wild.

fluidcruft•7mo ago

They literally said it significantly reduced their server resource usage. Are you suggesting they are lying?

drowsspa•7mo ago

Why do you think you have the moral high ground here?

dawnerd•7mo ago

Well the alternative is to not have an instance at all so… what do you suggest? I’m not paying for the other services, it’s already expensive enough to run the site.

The goal isn’t to stop 100% of scrapers, it was to reduce server load to a level that wasn’t killing the site.

jowea•7mo ago

You want them to pay the server costs to serve content to AI scrapers for free? The alternative is Anubis, which is maybe equally annoying to users in a different way.

danielspace23•7mo ago

Have you considered Anubis? I know it's harder to install, but personally, I think the point of Mastodon is trying to avoid centralization where possible, and CloudFlare is one of the corporations that are keeping the internet centralized.

dawnerd•7mo ago

Haven’t heard of it, will look into it. I agree, I’d rather not have cloudflare but for what they provide for free it’s a tough offer to pass up

account42•7mo ago

Yay, looking forward to more CAPTCHAs as a regular user.

thephotonsphere•7mo ago

account wall :-(

deadbabe•7mo ago

No one else can really do this except Cloudflare.

dirkc•7mo ago

I assume they will "protect original content online" by blocking LLM clients from ingesting data as context?

I'm not optimistic that you can effectively block your original content from ending up in training sets by simply blocking the bots. For now I just assume that anything I put online will end up being used to train some LLM

NullCascade•7mo ago

How would you do the opposite of this? Optimize your content to be more likely crawled by AI bots? I know traditional Google-focused SEO is not enough because these AI bots often use other web search/indexing APIs.

TechDebtDevin•7mo ago

There are script tags you can put in your site from LLM SEO companies if you want your content to be indexed by Perplexity or OpenAI. They're kind of too new for me to recommend.

zargath•7mo ago

Sounds very basic, sadly.

Anybody know why these web crawling/bot standards are not evolving? I believe robots.txt was invented in 1994(thx chatgpt). People have tried with sitemaps, RSS and IndexNow, but it's like huge$$ organizations are depending on HelloWorld.bas tech to control their entire platform.

I want to spin up endpoints/mcp/etc. and let intelligent bots communicate with my services. Let them ask for access, ask for content, pay for content, etc. I want to offer solutions for bots to consume my content, instead of having to choose between full or no access.

I am all for AI, but please try to do better. Right now the internet is about to be eaten up by stupid bot farms and served into chat screens. They don't want to refer back to their source, and when they do it's with insane error rates.

TechDebtDevin•7mo ago

This comment seems like it comes from a Cloudflare employee.

This is clearly the first step in cf building out a marketplace where they will (fail) at attempting to be the middleman in a useless market between crawlers and publishers.

zargath•7mo ago

nah, disappointed cf customer

stereolambda•7mo ago

> I believe robots.txt was invented in 1994(thx chatgpt).

Not to pick on you, but I find it quicker to open new tab and do "!w robots.txt" (for search engines supporting the bang notation) or "wiki robots.txt"<click> (for Google I guess). The answer is right there, no need to explain to LLM what I want or verify [1].

[1] Ok, Wikipedia can be wrong, but at least it is a commonly accessible source of wrong I can point people to if they call me out. Plus my predictive model of Wikipedia wrongness gives me pretty low likelihood for something like this, while for ChatGPT it is more random.

reaperducer•7mo ago

> robots.txt was invented in 1994(thx chatgpt)

Thought of and discussed as a possibility in 1994.

Proposed as a standard in 2019.

Adopted as a standard in 2022.

Thanks, IETF.

Dylan16807•7mo ago

This phrasing is very misleading. To bullet point directly from "possibility" to "standard" implies the standardization was a turning point where it could start being used. But it was massively used long before that. The standard is a side note that's barely relevant.

reaperducer•7mo ago

It only became massively used in 2019, when Google recommended it.

Dylan16807•7mo ago

Where did you get that date?

https://serverfault.com/questions/171985/how-can-i-encourage...

Here's a 2010 discussion about Google's explicit support, and I'm sure I could find earlier.

The thing google did in 2019 was submit it as a standard, nothing to do with adoption or starting to recommend. In that very post they said "For 25 years, the Robots Exclusion Protocol (REP) has been one of the most basic and critical components of the web" "The proposed REP draft reflects over 20 years of real world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP."

reaperducer•7mo ago

> Where did you get that date?

On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under Internet Engineering Task Force.[8]

https://en.wikipedia.org/wiki/Robots.txt

Dylan16807•7mo ago

That is not when they started recommending it. It would be nice if you acknowledged the rest of my comment, I even quoted from the [8] reference.

StochasticLi•7mo ago

ehem https://github.com/Kaliiiiiiiiii-Vinyzu/patchright

j45•7mo ago

This is interesting. I'm a fan of Cloudflare, and appreciate all the free tiers they put out there for many.

Today I see this article about Cloudflare blocking scrapers. There are useful and legitimate cases where I ask Claude to go research something for me. I'm not sure if Cloudflare discerns legitimate search/research traffic from an AI client vs scraping. The sites that are blocked by default will include content by small creators (unless they are on major platforms with a deal?), while the big guys who have something to sell, like an Amazon, will likely be able to facilitate and afford a deal to show up more in the results.

As of a few days ago, Cloudflare is also looking to charge AI companies to scrape the content, which is cached copies of other people's content. I'm guessing it will involve paying the owners of the data at some point as well. Being able to exclude it from this purpose (sell/license content, or scrape) would be a useful lever.

Putting those two stories together:

- Is this a new form of showing up in the AISEO (Search everywhere optimization) to show up in an AI's corpus or ability to search the web, or paying licensing fees instead of advertising fees.. these could be new business models which are interesting, but trying to see where these steps may vector ahead towards, and what to think about today.

- With training data being the most valuable thing for AI companies, and this is another avenue for revenue for Cloudflare, this can look like a solution which helps with content licensing as a service.

I'd like to see where abstracting this out further ends up going

Maybe I'm missing something, is anyone else seeing it this way, or another way that's illuminating to them? Is anyone thinking about rolling their own service for whatever parts of Cloudflare they're using?

ec109685•7mo ago

It seems like search access is more valuable these days since reasoning requires realtime access to site data.

j45•7mo ago

Owning demand for X is what a lot of startups are about.

I think Sparktoro had some information that the majority of people are not doing search on AI, and it's still web 90%+ or something like that.

ssijak•7mo ago

I don't want this by default. I want my website to end up in AI chatbots. For SEO.

abalashov•7mo ago

Few people realise that virtually everything we do online has, until this point, been free training to make OpenAI, Anthropic, etc. richer while cutting humans--the ones who produced the value--out of the loop.

It might be too little, too late, at this juncture, and this particular solution doesn't seem too innovative. However, it is directionally 100% correct, and let's hope for massively more innovation in defending against AI parasitism.

k__•7mo ago

Is anyone suing to make the models and their weights open source?

jefftk•7mo ago

I write online (comments here, open source software, blogging, etc) because I have ideas I want to share. Whether it's "I did a thing and here's how" or "we should change policy in this specific way" or "does anyone know how to X" I'm happy for this to go into training models just like I'm happy for it to go into humans reading.

dolebirchwood•7mo ago

Thank you for having this attitude. I have never attempted any blogging because I always figured no one is actually going to read it. With LLMs, however, I know they will. I actually see this as a motivation to blog, as we are in a position to shape this emerging knowledge base. I don't find it discouraging that others may be profiting off our freely published work, just as I myself have benefited tremendously from open source and the freely published works of others.

arkmm•7mo ago

This is an interesting take, thanks for sharing. I wonder how someone should adjust their blogging if they believe their primary audience will be LLMs.

lawlessone•7mo ago

SEO -> LLMEO

trollbridge•7mo ago

There’s a few instances of things I stated (about historical topics or very narrow topics in sociology) that were incorrect. LLMs scraped these off of web forums or other places, and now these bogus “facts” are permanently embedded into LLM models, because nobody else ever really talked about the specific topic.

Most amusingly, someone cited LLM generated output about this telling me how this “fact” is true when I was telling them it’s not true.

godelski•7mo ago

Tbh, that content I'm mostly fine with. My only real issue is that people are making trillions off the free labor of people like you and me, giving less time to create that OSS and blogs. But this isn't new to AI, it is just scaled.

What I do care about is the theft of my identity. A person may learn from the words I write but that person doesn't end up mimicking the way I write. They are still uniquely themselves.

I'm concerned that the more I write the more my text becomes my identifier. I use a handle so I can talk more openly about some issues.

We write OSS and blog because information should be free. But that information is then being locked behind paywalls and becoming more difficult to be found through search. Frankly, that's not okay

bob1029•7mo ago

> OSS

> people are making trillions off the free labor of people like you and me

I read "No Discrimination Against Fields of Endeavor" to also include LLMs and especially the cases that we most deeply disagree with.

Either we believe in the principles of OSS or we do not. If you do not like the idea of your intellectual property being used for commercial purposes then this model is definitely not for you.

There is no shame in keeping your source code and other IP a secret. If you have strong expectations of being compensated for your work, then perhaps a different licensing and distribution model is what you are after.

> that information is then being locked behind paywalls and becoming more difficult to be found through search

Sure - If you give up and delete everything. No one is forcing you to put your blog and GH repos behind a paywall.

mattl•7mo ago

Open source software typically has a license. People not following the license isn’t tolerated.

This is what AI scrapers are doing. They’re taking your code, your artwork and your writing without any consideration for the license.

jefftk•7mo ago

Whether training on code is fair use is still an open legal question, and it may well be fair use. The way a license works is by saying "you have my permission to use this code as long as you follow these conditions", but if no license is required then the conditions are irrelevant.

There is an active case on this, where Microsoft has been sued over GitHub copilot, and it has been slowly moving through the court system since 2022. Most of the claims have been dismissed, and the prediction market is at 11%: https://manifold.markets/JeffKaufman/will-the-github-copilot...

mattl•7mo ago

I can't see how it can be fair use. Just follow the license, it's not that difficult. Microsoft will forever be a pariah if they get away with this.

I'm putting my new code somewhere private anyway.

jefftk•7mo ago

> I can't see how it can be fair use.

The key question is whether it is sufficiently "transformative". See Authors Guild vs Google, Kelly vs Arriba Soft, and Sony vs Universal. This is a way a judge could definitely rule, and at this point I think is the most likely outcome.

> Microsoft will forever be a pariah if they get away with this.

I doubt this. Talking to developers, it seems like the majority are pretty excited about coding assistants. Including the ones that many companies other than Microsoft (especially Anthropic) are putting out.

mattl•7mo ago

Yeah, this is just sad to hear.

godelski•7mo ago

> The way a license works is

Let's actually look at the MIT license, a very permissive license

  > Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to ***use***, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

  > The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

So, you can use it but need to cite the usage. It's not that hard. Fair use if you just acknowledge usage.

Is it really that difficult to acknowledge that you didn't do everything on your own? People aren't asking for money. It's just basic acknowledgement.

Forget the courts for a second, just ask yourself what is the right thing to do. Ethically.

jefftk•7mo ago

> Forget the courts for a second, just ask yourself what is the right thing to do

Forgetting the courts, whether reading the source code and learning from it is intended to count as "use" is not clear to me, and I would have guessed no. Using a tool and examining a tool are pretty different.

godelski•7mo ago

Context matters, right?

Human reading code? Ambiguous. But I think you're using it. Running code? Not ambiguous.

Machine processing code? I don't think that's ambiguous. It's using the code. A person is using the code to make their machine better.

This really isn't that hard.

Let's think about it this way. How do you use a book?

I think you need to be careful that you're not justifying the answer you want and instead are looking for what the right answer is. I'm saying this because you quoted me saying "what is right" and you just didn't address it. To quote Feynman (<- look, I cited my work. I fulfilled the MIT license obligations!)

  > The first principle is that you must not fool yourself, and you are the easiest person to fool.

jefftk•7mo ago

> How do you use a book?

I think that's a great example, actually! Imagine a book with a license saying that you could only read it if your goal was to promote the specific ideology, and quoting was only allowed to support that ideology. Reading it and then quoting it to debunk that ideology would be both legal and ethical.

With books, once you buy the copy you are free to read it, lend it, or resell it: a license can give you additional rights, but not restrict you further, nor should it.

> To quote Feynman (<- look, I cited my work. I fulfilled the MIT license obligations!)

Huh? Feynman didn't say that under MIT license.

blibble•7mo ago

> Either we believe in the principles of OSS or we do not. If you do not like the idea of your intellectual property being used for commercial purposes then this model is definitely not for you.

I've been writing open source for more than 20 years

I gave away my work for free with one condition: leave my name on it (MIT license)

the AI parasites then strip the attribution out

they are the ones violating the principles of open source

> then perhaps a different licensing and distribution model is what you are after.

I've now stopped producing open source entirely

and I suggest every developer does the same until the legal position is clarified (in our favour)

jefftk•7mo ago

> I suggest every developer does the same until the legal position is clarified (in our favour)

There are a lot of people developing open source software with a wide range of goals. In my case, I'm totally happy for LLMs to learn from my coding, just like they've learned from millions of other people's. I wouldn't want them to duplicate it verbatim, but (due to copyright filters + that not usually being the best way to solve a problem) they don't.

godelski•7mo ago

  > Either we believe in the principles of OSS or we do not.

What about respecting licenses?

Seriously, don't lick the boot. We can recognize that there's complexity here. Trivializing everything only helps the abusers.

Giving credit where credit is due is not too much to ask. Other people making money off my work can be good[0]. Taking credit for it is insulting

[0] If you're not making much, who cares. But if you're a trillion dollar business you can afford to give a little back. Here's the truth, OSS only works if we get enough money and time to do the work. That's either by having a good work life balance and good pay or enough donations coming in. We've been mostly supported by the former, but that deal seems to be going away

dotnet00•7mo ago

I think this may be too much of a "literal" interpretation of OSS without really considering the social contract many OSS supporters believe in, wherein users of OSS will act in good faith and might eventually reciprocate for the benefits they're getting, e.g. the way companies have slowly accepted paying their own employees to contribute to projects openly, releasing their own open source code, respecting the spirit of OSS licenses, sponsoring the developers of the thing they use, etc.

I think it's entirely fair that even staunch supporters of OSS get turned off when AI companies scrape their work to ingest into a black box regurgitator and then turn around and tell the world how their AI will make trillions of dollars by taking away the jobs of those obsolete OSS developers, showing no intention of ever giving back to the community.

lxgr•7mo ago

> What I do care about is the theft of my identity. A person may learn from the words I write but that person doesn't end up mimicking the way I write. They are still uniquely themselves.

Of course they do, to some extent. Just because it's been infeasible to track the exact "graph of influence", that's literally how humans have learned to speak and write for as long as we've had language and writing.

> I'm concerned that the more I write the more my text becomes my identifier. I use a handle so I can talk more openly about some issues.

That's a much more serious concern, in my view. But I believe that LLMs are both the problem and solution here: "Remove style entropy" is just a prompt away, these days.

BeetleB•7mo ago

> A person may learn from the words I write but that person doesn't end up mimicking the way I write.

Oh, I wish I could get AI to mimic the way I write! I'd pay money for it. I often want to type up an email/doc/whatever but don't because of occasional RSI issues. If I could get an AI to type it up for me while still sounding like me - that would be a big boon for my health.

jefftk•7mo ago

> but don't because of occasional RSI issues

I also have these issues, which often keep me from typing, but FYI dictation has gotten very good.

(Dictated this)

BeetleB•7mo ago

Oh yeah, I use dictation and then clean it up with GPT. It's awesome. But I speak very differently from how I write. So I'd like to dictate it, and then have it rewrite it in my writing style.

andy99•7mo ago

It's Cloudflare and parasites like them that will make the internet un-free. It's already happening, I'm either blocked or back to 1998 load times because of "checking your browser". They are destroying the internet and will make it so only people who do approved things on approved browsers (meaning let advertising companies monetize their online activity) will get real access.

Cloudflare isn't solving a problem, they are just inserting themselves as an intermediary to extract a profit, and making everything worse.

lillecarl•7mo ago

I use Firefox with adblocking and some anti-fingerprinting measures and I rarely hit their challenges. Your IP reputation must be bad.

They have an addon [1] that helps you bypass Cloudflare challenges anonymously somehow, but it feels wrong to install a plugin to your browser from the ones who make your web experience worse

1: https://developers.cloudflare.com/waf/tools/privacy-pass/

godelski•7mo ago

I'm in a pretty similar boat except I frequently hit challenges. Especially if I use a VPN (which is more trustworthy than my ISP). Ironically, I'm using Cloudflare for DoH

lxgr•7mo ago

I'd be surprised if Cloudflare were actually correlating DoH requests to HTTP requests following them, so I don't think that's a signal they are likely to use.

godelski•7mo ago

Probably not. In fact, it's probably a good sign that they are being accurate about that traffic being encrypted.

But I did find it ironic

chrismorgan•7mo ago

> Your IP reputation must be bad.

And for an extremely large number of honest users, they cannot realistically avoid this.

I live in India. Mobile data and fibre are all through tainted CGNAT, and I encounter Cloudflare challenges all the time. The two fibre providers I know about use CGNAT, and I expect others do too. I did (with difficulty!) ask my ISP about getting a static IP address (having in mind maybe ditching my small VPS in favour of hosting from home), but they said ₹500/month, which is way above market rate for leasing IPv4 addresses, more than I pay for my entire VPS in fact, so it definitely doesn’t make things cheaper. And I’m sceptical that it’d have good reputation with Cloudflare even then. It’ll probably still be in a blacklisted range.

bombela•7mo ago

Why don't your ISPs just use IPv6?

henrixd•7mo ago

I'm having lots of problems with fingerprinting protection on Librewolf and ungoogled-chromium. I use uBlock Origin and JShelter extensions on both. I'm always getting "your browser is out of date" despite always having the newest versions.

Some sites like Stackexchange will work after just reloading the page. And the rest of the sites usually work when I remove Javascript protection and Fingerprint detection from JShelter. Still not all of them. So, they maybe/probably want to reliably fingerprint my browser to let me continue.

If I use crappy fingerprint protection, I'm not having problems, but if I actually randomize some values then sites won't work. JShelter deterministically randomizes some values using the session identifier and eTLD+1 domain as a key to avoid breaking site functionality, but apparently Cloudflare is being really picky. Tor Browser is not having these problems, but it uses a different strategy to protect itself from fingerprinting: it doesn't randomize values but tries to have unified values across different users, making identification impossible.

MichaelZuo•7mo ago

If you’re on ipv6, I think they have to for ipv6 addresses… there’s just way too many bots and way too many addresses to feasibly do anything more precise.

If you’re on ipv4 you should check whether you’re behind a NAT, otherwise you may have gotten an address that was previously used by a bot network.

lxgr•7mo ago

> I think they have to for ipv6 addresses… there’s just way too many bots and way too many addresses

Are you really arguing that it's legitimate to consider all IPv6 browsing traffic "suspicious"?

If anything, I'd say that IPv4 is probably harder, given that NATs can hide hundreds or thousands of users behind a single IPv4 address, some of which might be malicious.

> you may have gotten an address that was previously used by a bot network.

Great, another "credit score" to worry about...

MichaelZuo•7mo ago

For a whitelist system, then by definition yes?

If it’s a blacklist system, like I said I’ve not heard of any feasible solution more precise than banning huge ranges of ipv6 addresses.

Dylan16807•7mo ago

> For a whitelist system, then by definition yes?

A whitelist system would consider all IPv4 traffic suspicious by default too. This is not an answer to why you'd be suspicious of IPv6 in particular.

> I’ve not heard of any feasible solution more precise than banning huge ranges of ipv6 addresses.

Handling /56s or something like that is about the same as handling individual IPv4 addresses.
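To make the /56 point concrete, here is a minimal sketch (not anything Cloudflare is known to do) of collapsing client addresses into reputation or rate-limit keys: an IPv4 address stays as-is, while an IPv6 address is truncated to its enclosing prefix. The function name `rate_limit_key` and the default prefix length are assumptions chosen for illustration.

```python
import ipaddress


def rate_limit_key(addr: str, v6_prefix: int = 56) -> str:
    """Return the key under which a client address would be tracked.

    Hypothetical helper: IPv4 addresses are tracked individually,
    IPv6 addresses are grouped by their enclosing /v6_prefix network.
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return str(ip)
    # strict=False lets us pass a host address and get its network back.
    network = ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False)
    return str(network)
```

With this, every address a single subscriber can rotate through inside their delegated prefix maps to one key, so a blocklist or counter keyed this way is roughly as coarse as one keyed on individual IPv4 addresses. Passing `v6_prefix=64` or `48` gives the narrower or broader grouping.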

MichaelZuo•7mo ago

> A whitelist system would consider all IPv4 traffic suspicious by default too.

Based on what argument…?

Dylan16807•7mo ago

The definition of whitelisting. The argument you brought up.

MichaelZuo•7mo ago

No…? Someone can clearly implement a whitelist system that applies only to ipv6… but that makes no judgement on ipv4.

Dylan16807•7mo ago

Let's back up a step. You said by definition a whitelist system would consider every IPv6 suspicious (until it's put on the list, presumably). What is that definition?

If "applies only to IPv6" is an optional decision someone could make, then it's not part of the definition of a whitelist system for IPs, right?

MichaelZuo•7mo ago

What are you talking about?

The prior comment was responding directly to your comment, not any comment preceding that.

Of course it’s no longer by definition if you expand the scope beyond an ipv6 whitelist as there are an infinite number of possible whitelists.

Dylan16807•7mo ago

> What are you talking about?

The first comment with the word "whitelist". Before I entered the conversation. This comment: https://news.ycombinator.com/item?id=44449821

lxgr was challenging the idea that you would treat all IPv6 traffic as suspicious.

You justified it by saying that "by definition" "a whitelist system" would do that.

I want your definition of "a whitelist system". Not one of the infinite possible definitions, the one you were using right then while you wrote that comment.

> if you expand the scope beyond an ipv6 whitelist

Your comment before that was talking about IP filtering in general, both v4 and v6!

And then lxgr's comment was about both v4 and v6.

So when you said "a whitelist system" I assumed you were talking about IP whitelists in general.

If you weren't, if you jumped specifically to "IPv6 whitelist", you didn't answer the question they were asking. What is the justification to treat all IPv6 as suspicious? Why are we using the definition of 'IPv6 whitelist' in the first place?

MichaelZuo•7mo ago

None of this even makes sense.

Why does your opinion on how a comment should be interpreted, matter more than anyone else’s opinion in the first place?

Dylan16807•7mo ago

I didn't say that. Huh?

I'm inviting you to tell me how to interpret it. In fact I'm nearly begging you to explain your original comment more. I'm not telling anyone how to interpret it.

I have criticisms for what was said, but that comes after (attempted) interpretation and builds on top of it. I'm not telling anyone how to interpret any post I didn't make.

Edit: In particular, my previous comment has "I assumed" to explain my previous posts, and it has an "If" about what you meant. Neither one of those is telling anyone how to interpret you.

MichaelZuo•7mo ago

This is even closer to gibberish… what are you even trying to say?

Dylan16807•7mo ago

You don't understand a word I'm saying, and you have missed/declined every single time I asked you to explain the first comment I responded to.

Let's just mutually give up on this conversation.

MichaelZuo•7mo ago

Okay then.

trollbridge•7mo ago

I try to build things to be INET6 ready, and just treat /64s like a single host. Eventually this will probably have to be broadened to /56s or /48s.

slenk•7mo ago

How is Cloudflare a parasite? I can use Cloudflare, and get their AI protection, for free. I have dozens of domains I have used with Cloudflare at one point and I haven't paid them a dime.

fsflover•7mo ago

They put themselves as a middle man for almost the whole Internet, collect huge usage data about everyone and block anybody who doesn't use mainstream tools:

https://news.ycombinator.com/item?id=42953508

https://news.ycombinator.com/item?id=13718752

https://news.ycombinator.com/item?id=23897705

https://news.ycombinator.com/item?id=41864632

https://news.ycombinator.com/item?id=42577076

baq•7mo ago

valid.

...but OTOH it's their customers who want all of that and pay to get that, because the alternative is worse.

rock and a hard place.

slenk•7mo ago

Right - do I want them getting some info from me, or do I want my IP address exposed?

Besides CloudFront, which still costs money, what other option is there for semi-privacy and caching for free?

mattl•7mo ago

bunny.net has some options

slenk•7mo ago

I will have to check them out I guess

aorth•7mo ago

As the old adage goes: If you're not paying for it, you're the product.

Lots of nuance, but generally: pay for things you use. Servers, engineers, and research and development are not free, so someone has to pay.

qualeed•7mo ago

Lots of services don't even let me pay if I wanted to, so I am forced to be the product. (Donating typically does not un-productify myself).

Or I pay and am still the product. Just with less in-my-face ads.

aorth•7mo ago

> Or I pay and am still the product. Just with less in-my-face ads.

Yes, this is enshittification. You pay for Amazon something or other, and they STILL show you ads. Horrible.

slenk•7mo ago

fwiw, I have been convinced to look at other options

matt-p•7mo ago

CloudFront is pretty much free for your first TB. Fastly has a free plan.

Though why should it be for free?

slenk•7mo ago

Multiple people have brought that up. I pay for everything else, why not one more.

Although bunny.net won't take ANY of my credit or debit cards

MisterTea•7mo ago

I want to know if there is a way to design an alternative that isn't controlled by a single entity which allows gatekeeping.

AnthonyMouse•7mo ago

You can add another one as a result of this article: The data you need to train AI and the data you need to build a search engine are the same data. So now they're inhibiting every new search engine that wants to compete with Google.

1oooqooq•7mo ago

they always had. this post is about turning the false positives "up to 11" with impunity

1oooqooq•7mo ago

https://infosec.exchange/@k3ym0/114762301792775770

fsflover•7mo ago

https://en.wikipedia.org/wiki/Cloudflare#Controversies

lxgr•7mo ago

> I have dozens of domains I have used with Cloudflare at one point and I haven't paid them a dime.

Maybe you haven't, but your users (primarily those using "suspicious" operating systems and browsers) certainly have – with their time spent solving captchas.

sealeck•7mo ago

But Cloudflare have removed CAPTCHAs

lxgr•7mo ago

Not sure if you're joking, but if you're not: Congratulations on using a very "normal/safe" OS/browser/IP.

I get captchas daily, without using any VPN and on several different IPs (work, home, mobile). The only crime I can think of is that I'm using Firefox instead of Chrome.

Symbiote•7mo ago

Since a few days ago, I've been getting Captchas hourly or more.

It's probably because I use Firefox on Linux with an ad blocker.

For my part, I've ensured we don't use Cloudflare at work.

kelvinjps10•7mo ago

I use Firefox on Linux with an ad blocker and Cloudflare works fine

DanOpcode•7mo ago

It must depend on something else. Firefox & Linux have always worked fine for me, I cannot remember when I last got restricted by a Cloudflare captcha.

dotnet00•7mo ago

Using Linux is rare among the general public, but very normal among the kind of person who may find themselves working at Cloudflare or at a potential cloudflare partner/customer.

I don't really buy the argument that they're pushing more captchas to you just because of using Firefox on Linux with an ad blocker.

fsflover•7mo ago

https://news.ycombinator.com/item?id=42577076

sealeck•7mo ago

https://blog.cloudflare.com/end-cloudflare-captcha/

Sebguer•7mo ago

Managed challenges are just CAPTCHA by another name.

lxgr•7mo ago

It's not much consolation to me if I'm one of the 25% still being challenged.

The world really has more than enough heuristic fraud detection systems that most people aren't even aware exist, but make life miserable for those that somehow don't fit the typical user profile. In fact, the lower the false positive rate gets, the more painful these systems usually become for each false positive.

I'm so tired of it. Sure, use ML (or "AI") to classify good and evil users for initial triaging all day long, but make sure you have a non-painful fallback.

sitzkrieg•7mo ago

my residential ip of years (which is not shared or cgnat) was recently flagged by cloudflare for who knows why. if you are asking, you haven't seen when cloudflare thinks you are something else.

cloudflare are not the good guys because they give people free cdn and ddos protection lol

SchemaLoad•7mo ago

I use a VPN and firefox and I get some extra captchas but not enough to be annoying. And you don't have to do anything more than tap the checkbox.

Meanwhile a bunch of "security" products other websites use just flat out block you if you're on a VPN. Other sites like youtube or reddit are in between where they block you unless you are logged in.

Cloudflare is the least obtrusive of the options.

lxgr•7mo ago

No, the least obtrusive option is the one you don't even notice because it actually works (or offers a non-painful secondary flow when it doesn't).

const_cast•7mo ago

Really? Because I'm on Debian, with Firefox, with a VPN active 24/7 and I almost never get Captchas. I do get those "checking your browser" pages often but they just stick around for maybe half a second then redirect.

1oooqooq•7mo ago

you forgot /s

(for the people not getting the joke: yes, the new system doesn't make you train any image recognition dataset, but they profile the hell out of anything they can get their hands on, just like google captcha, and then if you smell like a bot you're denied access. goodbye)

bombcar•7mo ago

Download Brave.

Turn on Tor and browse for a week.

Now you know what “undesirables” feel like, where “undesirables” can be from a poor country, a bad IP block, outdated browsers, etc.

It sucks.

slenk•7mo ago

I already said in another post I am looking at Bunny, but they also don't seem to want to take my money. I've tried 3 cards. I am willing to pay for a good service, but I will be honest, I don't know many of cloudflare's competitors

SchemaLoad•7mo ago

It's kind of an impossible problem though. They either save some tracking cookie to link your sessions between websites, or they have to re captcha check you on every website.

v5v3•7mo ago

Why download Brave and then use Tor?

Just use the Tor browser

bombcar•7mo ago

Some large percentage of people fail when directed to the Tor browser; I don't know why.

fsflover•7mo ago

This is not a good reason to suggest Brave instead of Tor Browser.

chaoskitty•7mo ago

Serious question: You put Cloudflare between all your domains and all your visitors without looking in to how this would affect your site's reachability? If so, that's interesting, considering that many people in this community are negatively affected by Cloudflare because they're using Linux and/or some less than mainstream browser.

You might want to read some threads on here about Cloudflare.

slenk•7mo ago

Where did I say all.

Most of the time I don't use them for their network, usually just DNS records for mail because their interface is nicer than namecheap and gives me basic stats.

To my understanding, they aren't blocking MX records behind captchas

djfivyvusn•7mo ago

So you're not using the parasite and that's your claim why it's not a parasite?

slenk•7mo ago

Dude, stop putting words in my mouth. I never said they weren't bad.

Some nicer people here tried the educative approach and it worked much better. I learned about Bunny. And I keep forgetting I have a few in deSec but that has a limit.

I do not understand the hostility

lurking_swe•7mo ago

> I do not understand the hostility

Unfortunately I don’t think they were participating in the conversation in good faith. People can have an extreme view on _anything_…even internet / tech. They buy into a dream of 100% open source, or “open internet”, or 100% decentralized, whatever.

When this happens they may be convinced that “others” are crazy for not sharing their utopian vision. And once this point is reached, they struggle to communicate with their peers or normal people effectively. They share their strong opinions without sharing important context (how they reached those opinions), they think the topic is black and white (because they feel so strongly about the topic), or they become hostile to others that are not sharing that vision.

You are their latest victim lol. Ignore them, and carry on.

ranger_danger•7mo ago

One of my favorite quotes: "As a rule, strong feelings about issues do not emerge from deep understanding." -Sloman and Fernbach

Learning how to spot this, and ignore such-minded people who argue in bad faith, has made me a lot happier and more chill in general.

amy_petrik•7mo ago

>How is Cloudflare a parasite?

>I never said they weren't bad.

>I don't understand the hostility.

It's known the community here doesn't like Cloudflare, and anyone who's been on the customer end of Cloudflare would tend to agree. In that context, if you truly are blind to seeing this, when you said, "how is Cloudflare a parasite" to a group not liking of cloudflare... ... it may land as saying something like "How is Hitler a bad guy?", which I hope is self-evident is saying he's a good guy contextually, of course you could troll it out and devil's advocate yourself that you were merely asking an innocent question.

slenk•7mo ago

I thought Cloudflare overall was neutral - meaning as many haters as lovers. I know the CEO frequents here as well.

When I asked how Cloudflare is a "parasite" I was being genuine. I knew it was a problem for some users, but I don't think I realized how prevalent it was

djfivyvusn•7mo ago

I never said you did?

You said one response up that they weren't parasites by asking how they were parasites and then proceeded to claim you have no experience with their parasitic services.

I'm just pointing out your anecdote wasn't valid.

chaoskitty•7mo ago

You're right that you didn't say all. What you did write implied you use them for "AI protection", although you didn't say you did do that.

So if I wrote, "You would put" instead of "You put", then what? Would you be comfortable using their "AI protection" simply because it's free?

slenk•7mo ago

AI protection isn't a selling point for me. What I have said is I use them for DNS records, primarily for mx and txt records

shlomo_z•7mo ago

Did you read his comment? He explained the issue he has with Cloudflare...

hnanon12341•7mo ago

Yeah but they are a dictator, OpenAI et al are the parasites.

brendyn•7mo ago

A parasite leeches off its host to the host's harm. Maybe it's not a good analogy, but I'm in China, and it's painful after paying money for a VPN to bypass censorship to find myself routinely blocked by CDNs because they decided I'm not human. I'm honestly feeling more oppressed by these middlemen than the government sometimes. For example, maybe I can't log in to a game due to being blocked by the login API, and the game company just responds by telling me to run an antivirus scanner and try again, since they are not personally developing that system and lack awareness of it. Such people with genuine need for VPNs and privacy tools are the sacrifice for this system.

rockskon•7mo ago

LLM scrapers have dramatically been increasing the cost of hosting various small websites.

Without something being done, the data that these scrapers rely on would eventually no longer exist.

benjiro•7mo ago

I think the correct term is, that unrestricted LLM scrapers have dramatically been increasing the cost of hosting various small websites.

It's not an issue when somebody does "ethical" scraping with, for instance, a 250ms delay between requests and an active cache that checks specific pages (like news article links) to rescrape at 12 or 24h intervals. This type of scraping results in almost no pressure on the websites.
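As a rough sketch of what that could look like (the class name `PoliteFetcher`, its parameters, and the injected `fetch` callable are all assumptions for illustration, not any particular crawler's design): enforce a minimum gap between requests and serve from a local cache until an entry is older than the rescrape interval.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class PoliteFetcher:
    """Hypothetical wrapper: min delay between requests plus a timed cache."""

    def __init__(
        self,
        fetch: Callable[[str], str],        # does the real HTTP request
        min_delay: float = 0.25,            # 250 ms between requests
        max_age: float = 12 * 3600,         # rescrape after 12 hours
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._min_delay = min_delay
        self._max_age = max_age
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._cache.get(url)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]  # still fresh: no request goes out at all
        if self._last_request is not None:
            wait = self._min_delay - (now - self._last_request)
            if wait > 0:
                self._sleep(wait)  # throttle before hitting the site again
        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = (self._last_request, body)
        return body
```

The `fetch` argument would be whatever actually performs the download (for example a small function around `urllib.request.urlopen`). A real crawler would also want per-host delays and to honor HTTP caching headers, which this sketch leaves out.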

The issue that I have seen is that the more unscrupulous parties just let their scrapers go wild, constantly rescraping again and again because the cost of scraping is extremely low. A small VM can easily push 1000s of scrapes per second, let alone somebody with more dedicated resources.

Actually building an "ethical" scraper involves more time, as you need to fine tune it per website. Unfortunately, this behavior is going to cost the more ethical scraper a ton, as anti-scraping efforts will increase the cost on our side.

Tmpod•7mo ago

The biggest issue for me is clearly masquerading their User-Agent strings. Regardless of whether they are slow and respectful crawlers, they should clearly identify themselves, provide a documentation URL and obey robots.txt. Without that, I have to play a frankly tiring game of cat and mouse, wasting my time and the time of my users (they have to put up with some form of captcha or PoW thing).
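For reference, the crawler side of this is not much work. A minimal sketch using Python's standard `urllib.robotparser`, where the `USER_AGENT` string, its documentation URL, and the `allowed` helper are made-up examples:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identity: product token, version, and a documentation URL.
USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/crawler-docs)"


def allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a URL against an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler would send `USER_AGENT` in the `User-Agent` header of every request and call `allowed(...)` before each fetch. Site operators can then allow, throttle, or block it with a single rule instead of guessing.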

I've been an active lurker in the self-hosting community and I'm definitely not alone. Nearly everyone hosting public facing websites, particularly those whose form is rather juicy for LLMs, have been facing these issues. It costs more time and money to deal with this, when applying a simple User-Agent block would be much cheaper and trivial to do and maintain.

sigh

trollbridge•7mo ago

I use Cloudflare and edge caching, so it doesn’t really affect me, but the amount of LLM scraping of various static assets for apps I host is ridiculous.

We’re talking a JavaScript file of strings to respond like “login failed”, “reset your password” just over and over again. Hundreds of fetches a day, often from what appears to be the same system.

SchemaLoad•7mo ago

Turn on the Cloudflare tarpit. When it detects LLM scrapers it starts generating infinite AI slop pages to feed the scrapers. Ruining their dataset and keeping them off your actual site.

dceddia•7mo ago

Yep this terrifies me, 100%. We’re slowly losing the open internet and the frog is being boiled slowly enough that people are very happy to defend the rising temperature.

If DDoS wasn’t a scary enough boogeyman to get people to install Cloudflare as a man-in-the-middle on all their website traffic, maybe the threat of AI scrapers will do the trick?

The thing about this slow slide is it’s always defensible. Someone can always say “but I don’t want my site to be scraped, and this service is free, or even better yet, I can set up my own toll booth and collect money! They’re wonderful!”

Trouble is, one day, at this rate, almost all internet traffic will be going through that same gate. And once they have literally everyone (and all their traffic)… well, internet access is an immense amount of power to wield and I can’t see a world in which it remains untainted by commercial and government interests forever.

And “forever” is what’s at stake, because it’ll be near impossible to recover from once 99% of the population is happy to use one of the 3 approved browsers on the 2 approved devices (latest version only). Feels like we’re already accepting that future at an increasing rate.

RiverCrochet•7mo ago

The Internet is not the first global network. Before the Internet, you had the global telephone network. It, too, strangulated end users, but eventually became stagnant, overpriced, and irrelevant. Super long-term, the current Internet is not immune from this. Internet standards are getting about as complicated and quirky as the old Bell stuff that was trying to make miles of buried copper the future, and if regulatory/commercial forces freeze this stuff in place, it's going to lead to stagnation eventually.

Something coming down the pike I think, for example, is that IPv4 addresses are going to get realllly expensive soon. That's going to lead to all sorts of interesting things in the Internet landscape and their applications.

I'm sure we'll probably have to spend some decades in the "approved devices and browsers only" world before a next wave comes.

mattl•7mo ago

We need a reasonable alternative to some of what Cloudflare does that can be easily installed as a package on Linux distributions without any of the following to install it.

* curl | bash

* Docker

* Anything that smacks of cryptocurrency or other scams

Just a standard repo for Debian and RHEL derived distros. Fully open source so everyone can use it. (apt/dnf install no-bad-actors)

Until that exists, using Cloudflare is inevitable.

It needs to be able to at least:

* provide some basic security (something to check for sql injection, etc)

* rate limiting

* User agent blocking

* IP address and ASN blocking

Make it easy to set up with sensible defaults and a way to subscribe to blocklists.
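To give a sense of scale, the core of the last three items fits in a few dozen lines. This is only a sketch under stated assumptions: the `RequestFilter` class and its parameters are hypothetical, the blocklists are passed in by the caller, and ASN blocking is left out because it needs an external IP-to-ASN database.

```python
import ipaddress
import time
from typing import Callable, Dict, Iterable, Tuple


class RequestFilter:
    """Hypothetical filter: UA blocklist, network blocklist, token bucket."""

    def __init__(
        self,
        blocked_agents: Iterable[str],
        blocked_networks: Iterable[str],
        rate: float = 5.0,   # tokens refilled per second, per client
        burst: int = 10,     # bucket capacity
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.blocked_agents = [a.lower() for a in blocked_agents]
        self.blocked_networks = [ipaddress.ip_network(n) for n in blocked_networks]
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets: Dict[str, Tuple[float, float]] = {}  # ip -> (tokens, last)

    def allow(self, client_ip: str, user_agent: str) -> bool:
        ua = user_agent.lower()
        if any(agent in ua for agent in self.blocked_agents):
            return False  # user agent blocking (substring match)
        ip = ipaddress.ip_address(client_ip)
        if any(ip in net for net in self.blocked_networks):
            return False  # IP range blocking
        now = self.clock()
        tokens, last = self.buckets.get(client_ip, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[client_ip] = (tokens, now)
            return False  # rate limited
        self.buckets[client_ip] = (tokens - 1, now)
        return True
```

The hard parts are everything around this: packaging, sensible defaults, keeping the blocklists current, and not locking out real users, which is where a maintained project would earn its keep.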

saint_yossarian•7mo ago

I remember using mod_security with Apache long ago for some of this, looks like it's still around and now also supports Nginx and IIS: https://modsecurity.org/

mattl•7mo ago

Thank you. This doesn't have everything I'm looking for, but apparently it has been packaged in Debian at least. I don't know why the website doesn't mention this.

1oooqooq•7mo ago

it's called not having a vibecoded app that falls to pieces on public endpoints even before nginx ratelimit can kick in

mattl•7mo ago

Nobody is talking about a vibe coded app. I want to block AI scrapers entirely.

1oooqooq•7mo ago

point is, why do you care if your site can handle the traffic?

there's no (malicious) bot detection that won't impact a portion of real users. accept that fact and just let it be.

poisoning data in ways that's obvious to the false positive user is a much better option.

mattl•7mo ago

I really doubt any legit user is using a weird user agent and an IP address in the same AS as an AI slop crawler

1oooqooq•7mo ago

You'd be surprised. Your users too, but you wouldn't know because they will not be able to tell you.

xena•7mo ago

I make this: https://anubis.techaro.lol. I have yet to add the SQL injection or IP list layers, but I can add that to the roadmap.

mattl•7mo ago

The proof of work stuff feels so cryptocurrency adjacent that I've been looking at other tools for my own thing, but I've seen Anubis on other websites and it seems to do a good job.

xena•7mo ago

There's a non proof of work challenge: https://anubis.techaro.lol/docs/admin/configuration/challeng...

Also: Anubis does not mine cryptocurrency. Proof of work is easy to validate on the server and economically scales poorly in the wild for abusive scrapers.
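For readers unfamiliar with the idea, here is a generic hashcash-style sketch of that asymmetry. It is not Anubis's actual implementation; the function names, the `challenge:nonce` encoding, and the difficulty measure are assumptions made for illustration.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash, no matter how hard the puzzle was."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute force, about 2**difficulty hashes on average."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce
```

A single visitor pays the `solve` cost once and barely notices. A scraper fetching millions of pages pays it millions of times, while the server only ever runs `verify`.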

mattl•7mo ago

Thanks for the link. I’ll have a look.

I’m glad there’s no cryptocurrency involved (was never a concern) but I worry about the optics of something so closely associated.

(I appreciate your commenting on this. I know the project recently blew up in popularity. Keep up the great work)

xena•7mo ago

If you have suggestions for JS based challenges that don't become a case of "read the source code to figure out how to make playwright lie", I'm all ears for the ideas :)

fsflover•7mo ago

This unsubstantiated anti-cryptocurrency bias on HN is quite disappointing. Did you hear about Filecoin, which allows buying and selling disk space independently of large companies? Why wouldn't an anonymous cryptocurrency like Monero help with this real problem? What would the downsides be?

v5v3•7mo ago

Primary reason people use cloudflare is to hide the ip address of their own server. So they are less likely to be hacked.

Most people are not worried about DDoS as there is no reason for anyone to DDoS them.

Until other services start offering the same, Cloudflare remains default.

brumar•7mo ago

Correction: extract monstrous profits. When I read about the revenues associated with Reddit AI deals, I can't even imagine what could possibly be deals that cover half of the internet. Cynically speaking, it's a genius-level move.

axus•7mo ago

From the server perspective Cloudflare is solving problems and not causing problems to other servers.

Analogy: locks for high-value items in grocery stores are annoying to customers, but other stores aren't being coerced by the locksmith to use them.

nickjj•7mo ago

Yep, it's really annoying.

I'm using Firefox with a normal adblocker (uBlock Origin).

I get hit with a Cloudflare captcha often and that page itself takes a few seconds before I can even click the checkbox. It's probably an extra 6-7 seconds and it happens quite a few times a day.

It's like calling into a billion dollar company and it taking 4 minutes to reach a human because you're forced through an automated system where you need to choose 9 things before you even have a chance to reach a human. Of course it rattles through a bunch of non-skippable stuff that isn't related to your issue for the first minute, like how much the company is there to offer excellent customer support and how much they value you.

It's not about the 8 seconds or 4 minutes. It's the feeling that you're getting put into really poor experiences from companies with near-unlimited resources with no control over the situation while you slowly watch everything get worse over time.

The Cloudflare situation is worse because you have no options as an end user. If a site uses it, your only option is to stop using the site and that might not be an option if they are providing you an important service you depend on.

Secondly they now have a complete profile over your browsing history for any site that has CF enabled and there's not much you can do here except stop using 20% or whatever market share of the internet they have, and also do a DNS lookup for every domain you visit from an anonymous machine to see if it's a Cloudflare IP range.

In case you didn't know, CF offers a partial CNAME / DNS feature where your primary DNS can be hosted anywhere and then you can proxy traffic from CF to your back-end on a per domain / sub-domain level. Basically you can't just check a site's DNS provider to see if they are on CF. You would have to check each domain and sub-domain to see if it resolves to a CF IP range which is documented here: https://www.cloudflare.com/ips-v4/# and https://www.cloudflare.com/ips-v6/#
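A minimal sketch of that check, assuming the two URLs above return plain text with one CIDR block per line (the helper names are made up, and the range list is passed in by the caller rather than hard-coded):

```python
import ipaddress
import socket
from typing import Callable, Iterable, List


def parse_ranges(text: str) -> list:
    """Parse a list of CIDR blocks, one per line, ignoring blank lines."""
    return [
        ipaddress.ip_network(line.strip())
        for line in text.splitlines()
        if line.strip()
    ]


def in_ranges(addr: str, ranges: Iterable) -> bool:
    """True if the address falls inside any of the given networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in ranges)


def resolve(hostname: str) -> List[str]:
    """Resolve a hostname to its A/AAAA addresses via the system resolver."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def behind_cdn(
    hostname: str,
    ranges: Iterable,
    resolver: Callable[[str], List[str]] = resolve,
) -> bool:
    """True if any resolved address is inside the published ranges."""
    nets = list(ranges)
    return any(in_ranges(addr, nets) for addr in resolver(hostname))
```

You would download both lists, concatenate them, pass the result through `parse_ranges`, and then call `behind_cdn` for each domain and sub-domain you care about. Note that the lookup itself goes through whatever resolver your machine uses, which is the reason for doing it from an anonymous machine.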

rramon•7mo ago

Isn't there a possibility that model makers retaliate by erasing them and their frameworks from memory, hurting CF adoption by devs?

cmeacham98•7mo ago

Cutting humans out of what loop? What jobs or opportunities were people posting Reddit comments or whatever getting that are now going to AI?

Larrikin•7mo ago

People who used to post gained knowledge from their profession or hobby. I don't bother posting any of that information on large sites like Reddit anymore, for various reasons, but AI scraping solidified that decision.

I'll still post on the increasingly fewer hobby message boards that are out there.

kamarg•7mo ago

> What jobs or opportunities were people posting Reddit comments or whatever getting that are now going to AI?

Content writing, product reviews (real & fake), creative writing, customer support, photography/art to name a few off the top of my head.

fkyoureadthedoc•7mo ago

Now the astroturfing is done by AI agents instead of hard working serfs in a call center, you hate to see it

godelski•7mo ago

Including your comment, including this comment.

HN itself is routinely scraped. What makes me most uncomfortable is deanonymization via speech analysis. It's something we can already do but is hard to do at scale. This is the ultimate tool for authoritarians. There's no hidden identities because your speech is your identifier. It is without borders. It doesn't matter if your government is good, a bad acting government (or even large corporate entity) has the power to blackmail individuals in other countries.

We really are quickly headed towards a dystopia. It could result in the entire destruction of the internet or an unprecedented level of self censorship. We already have algospeak because platform censorship[0]. But this would be a different type of censorship. Much more invasive, much more personal. There are things worse than the dark forest

[0] literally yesterday YouTube gave me, a person in the 25-60 age bracket, a content warning because there was a video about a person that got removed from a plane because they wore a shirt saying "End veteran suicide".

[0.1] Even as I type this I'm censored! Apple will allow me to swipe the word suicidal but not suicide! Jesus fuck guys! You don't reduce the mental health crisis by preventing people from even being able to discuss their problems, you only make it worse!

trollbridge•7mo ago

The degree to which people say “self-delete” and “unalive” is absurd these days and I now hear it in real life.

It’s Orwellian in the truest sense of the word.

baq•7mo ago

Orwell was the optimist. It’s Huxley’s vision we should be really worried about. Brave new world indeed.

Kostic•7mo ago

This would be true if not for open-weights (and even some open source) LLMs that exist today. Not everything should be done for profit.

giancarlostoro•7mo ago

There's a reason reddit started charging for API usage.

fkyoureadthedoc•7mo ago

It surely wasn't to force users into their shitty app where they can't block ads and definitely had nothing to do with their IPO. It was the AI.

giancarlostoro•7mo ago

Ah yes, it's only because of ONE singular reason they started charging for API usage. Are you okay? I'm listing one reason out of many as to why reddit started charging for API usage. After all, reddit is a for-profit website.

fkyoureadthedoc•7mo ago

> There's *A* reason

this u chief?

giancarlostoro•7mo ago

Does "a" not mean, one out of many in this context? I know English is not my first language, but I've always taken "a reason" to mean one of many.

fkyoureadthedoc•7mo ago

You said:

> there is a reason

What's in the garage? There's a car. I wouldn't expect to walk in and find many cars.

Had you said:

> this is a reason

that would convey it was one among potentially many.

Either way my comment wasn't even about you, it was about Reddit, so idk why you got riled up.

dwoldrich•7mo ago

I think the parasitism goes quite a bit further than AI. We're being digested not parasitized.

Dig1t•7mo ago

That was always the cost of free and open exchange of ideas though. The idea of the internet in the first place was to allow people to communicate in the open and publish ideas freely. There was never any stipulation that using the published ideas to make money was off limits.

Technology has advanced and now reading the sum total of the freely exchanged ideas has become particularly valuable. But who cares? The internet still exists and is still usable to freely exchange ideas the way it’s always been.

The value that one website provides is a minuscule amount, the value of one individual poster on Reddit is minuscule. Are we asking that each poster on Reddit be paid 1 penny (that’s probably what your posts are worth) for their individual contribution? My websites were used to train these models probably, but the value that each contributed is so small that I wouldn’t even expect a few cents for it.

The person who’s going to profit here is Cloudflare or the owners of Reddit, or any other gatekeeper site that is already profiting from other people’s contributions.

The “parasitism” here just feels like normal competition between giant companies who have special access to information.

risyachka•7mo ago

Maybe so, but I'll take Cloudflare over OpenAI and Meta every time.

lofaszvanitt•7mo ago

Cyberpunk aged well. "You better not be on the unprotected internet". Too many hazards out there. Rogue AIs and other shit...

Cloudflare is here to protecc you from all those evils. Just come under our umbrella.

bawolff•7mo ago

I think it's 100% ok to freely train on public internet data.

What is absolutely not ok is to crawl at such an excessive speed that it makes it difficult to host small scale websites.

Truly a tragedy of the commons.

tedd4u•7mo ago

Agree. The problem lately is that even if each single scraper is doing so “reasonably,” there are so many individuals and groups doing this that it’s still too onerous for many sites. And of course many are not “reasonable.”

SchemaLoad•7mo ago

This is the attitude that's going to kill the public internet. Because you're right, it is a free for all right now with the only way to opt out being putting content behind restricted platforms.

jowea•7mo ago

Is it even possible that Cloudflare could manage to block all AI data scraping? I think this measure is just going to make it harder and more expensive, which will stop AI scrapers from hitting every single page every single day and creating expenses for publishers, but not actually stop their data from ending up in a few datasets.

mathiaspoint•7mo ago

This has been going on even since early social media. I think most of the users actually prefer it.

nektro•7mo ago

it brings me so much joy that this is the top comment on this post

visarga•7mo ago

> everything we do online has, until this point, been free training to make OpenAI, Anthropic, etc. richer while cutting humans--the ones who produced the value--out of the loop

I think, on the contrary, whoever sets the prompts stands to get the benefits, the AI provider gets a flat fee, and authors get nothing except the same AI tools as anyone else. That is natural since the users are bringing the problem to the AI, so of course they have the lion's share here.

AI is useless until applied to a specific task owned by a person or company. Within such a task there is opportunity for AI to generate value. AI does not generate its own opportunities, users do.

Because users are distributed across society benefits follow the same curve. They don't flow to the center but mainly remain at the edge. In this sense LLMs are like Linux, they serve every user in their specific way, but the contributors to the open source code don't get directly compensated.

qskousen•7mo ago

That's a really interesting way to think about it, thank you! I've always had a kind of "gut feeling" that AI training on our data is fine with me, but without really thinking too much about why. I think this explains what I've been feeling.

az226•7mo ago

That’s the irony. Doing it now is just hampering competition and making it better for the incumbents.

jjangkke•7mo ago

So, TL;DR: it adjusts your robots.txt and relies on Cloudflare to catch bot behavior, and it doesn't actually do any sophisticated residential proxy filtering or catch the common bypass methods that work on Cloudflare Turnstile. Do I have this correct?

This just pushes AI agents "underground" to adopt the behavior of a full-blown, stealth-focused scraper, which makes them harder to detect.
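For anyone unsure why robots.txt alone doesn't stop a stealth scraper, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content and the `allowed` helper are hypothetical examples, not what Cloudflare actually generates. The point is that the rule is matched against whatever user agent the client chooses to declare, so it only binds crawlers that identify themselves honestly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow one self-identified AI crawler,
# allow everyone else. Not taken from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent, url, robots_txt=SAMPLE_ROBOTS_TXT):
    """Return True if robots.txt permits this declared user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this sample, `allowed("GPTBot", "https://example.com/post")` is refused while a client announcing itself as `Mozilla/5.0` is permitted, whatever it really is. Anything stronger has to come from behavioral detection on the server side.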

nemild•7mo ago

Think this is the future, as the AI Web takes over the human web.

At Coinbase, we've been building tools to make the blockchain the ideal payment rails for use cases like this with our x402 protocol:

https://www.x402.org/

Ping if you're interested in joining our open source community.

bgwalter•7mo ago

The destruction of the Web and IP theft need to be addressed legally. The opinion of a single judge notwithstanding, "AI" scraping already violates copyright. This needs to be made explicit in law and scrapers must get the same treatment as Western governments gave to thousands of individuals who were bankrupted or jailed for copyright infringement.

We are in the Napster phase of Web content stealing.

zackmorris•7mo ago

As usual, this is the wrong approach.

The open web is akin to the commons, public domain and public land. So this is like putting a spy cam on a freeway billboard, detecting autonomous vehicles, and shining a spotlight at their camera to block them from seeing the ad. To what end?

Eventually these questions will need to be decided in court:

1) Do netizens have the right to anonymity? If not, then we'll have to disclose whether we're humans or artificial beings. Spying on us and blocking us on a whim because our behavior doesn't match social norms will amount to an invasion of privacy (eventually devolving into papers please).

2) Is blocking access to certain users discrimination? If not, then a state-sanctioned market of civil rights abuse will grow around toll roads (think whites-only drinking fountains).

3) Is downloading copyrighted material for learning purposes by AI or humans the same as pirating it and selling it for profit? If so, then we will repeat the everyone-is-a-criminal torrenting era of the 2000s and 2010s when "making available" was treated the same as profiting from piracy, and take abuses by HBO, the RIAA/MPAA and other organizations who shut off users' internet connections through threat of legal actions like suing for violating the DMCA (which should not have been made law in the first place).

I'm sure there are more. If we want to live in a free society, then we must be resolute in our opposition of draconian censorship practices by private industry. Gatekeeping by large, monopolistic companies like Cloudflare simply cannot be tolerated.

I hope that everyone who reads this finds alternatives to Cloudflare and tells their friends. If they insist on pursuing this attack on our civil rights for profit, then I hope we build a countermovement by organizing with the EFF and our elected officials to eventually bring Cloudflare up on antitrust charges.

Cloudflare has shown that they lack the judgement to know better. Which casts doubt on their technical merits and overall vision for how the internet operates. By pursuing this course of action, they have lost face like Google did when it removed its "don't be evil" slogan from its code of conduct so it could implement censorship and operate in China (among other ensh@ttification-related goals).

Edit: just wanted to add that I realize this may be an opt-in feature. But that's not the point - what I'm saying is that this starts a bad precedent and an unnecessary arms race, when we should be questioning whether spidering and training AI on copyrighted materials are threats in the first place.

sct202•7mo ago

My data served by Cloudflare has increased to 100 GB/month compared to <20 GB/month about 2 years ago, and they're all fairly static hobby sites. Actual people traffic is down by about half in the same time frame, so I imagine a lot of this is probably cost savings for Cloudflare to reduce resource usage.

Apofis•7mo ago

Makes total sense, bandwidth on this scale is expensive.

e38383•7mo ago

Why is every second article about this claiming that it’s automatic? It needs to be turned on or at least there was no mention of automatic in the original blog post.

I really hope that we can continue training AI the same way we train humans – basically for free.

aunty_helen•7mo ago

I saw yesterday that they were going to allow websites to charge per scrape.

Looks like cloudflare just invented the new App Store.

hmate9•7mo ago

Isn’t this only useful for blogs, news sites, or forums? Why would I want an AI to know less about my product? I want it to understand it, talk about it, and ideally recommend it. Should be default off.

maximilianburke•7mo ago

Every evolution of the web, from Web 2 giving us walled gardens to Web 3 giving us, well, nothing, to what we have now is taking us further from a network of communities and personal repositories of knowledge.

Sure, fidelity has gotten better but so much has been lost.

sneak•7mo ago

This idea that you can publish data for people to download and read but not for people to download and store, or print, or think about, or train on is a doomed one.

If you don’t want people reading your data, don’t put it on the web.

The concept that copyright extends to “human eyeballs only” is a silly one.

t1001•7mo ago

With the problem being bots hammering the site en masse, it feels like the better analog is "allowing free replicator use without having someone ruin the fun by requesting ten tons of food be produced in their quarters every minute".

ChrisArchitect•7mo ago

[dupe] https://news.ycombinator.com/item?id=44432385

kristoff200512•7mo ago

AI will endlessly crawl my website, quickly exhausting the egress quota of my Supabase free plan, but Cloudflare can stop all of this.

userbinator•7mo ago

s/A.I. Data Scrapers/non-sanctioned browsers running on non-sanctioned platforms/

They've been trying to do this for years. Now "AI" gives a convenient excuse.

YPPH•7mo ago

This is great. But my concerns about Cloudflare's power remain. Today it's blocking AI crawlers, tomorrow will it be blocking all browsers that fail hardware-attestation checks?

nsoonhui•7mo ago

But how is this effective against Gemini and even OpenAI, who can instead rely on the Google and Bing crawlers respectively to crawl the content?
