You get the numbers that Cloudflare tells you, but who knows if you can trust their stats after their CEO is apparently cherry-picking data to shape their product narrative?
I wouldn't trust a single thing coming out of his mouth.
I've been monitoring bot traffic on digital platforms for over 10 years. Sure, the crawler share is growing, some even with malicious intentions, and those I detect and block.
I disagree that this pain is worth the cost of making real people spend their life on verification.
If the "bot traffic" declines, then the "bot protection business" goes down with it
Cloudflare communications are sometimes careful to refer to traffic _labeled as_ bot traffic versus actual bot traffic
Because the "business" relies on the existence of "bot traffic", there's an incentive to broaden the scope of what is labeled as "bot traffic"
The false positive rate can be high. The public should see those statistics, though in truth it may be infeasible to compile them when there's no verification and the entire system relies on heuristics
"Bot protection" can be used to gather fingerprints for marketing
It can be used to force users to use certain software, e.g., certain browsers, and to enable JavaScript, subjecting users to data collection, surveillance and ads
Originally the motivation for avoiding "bot traffic" was based on behaviour, e.g., exceeding acceptable rates of usage, making too many requests in a given time period, exceeding rate limits
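For anyone who hasn't seen the behaviour-based approach, here is a minimal sketch of a sliding-window rate limiter. The class name, the `client` key and the limits are all made up for illustration; a real deployment would have to decide what identifies a client (IP, API key, session) and where the state lives.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        # client id -> timestamps of that client's recent accepted requests
        self._hits = defaultdict(deque)

    def allow(self, client, now):
        hits = self._hits[client]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: reject, whatever browser is used
        hits.append(now)
        return True
```

Note that the decision depends only on how many requests were made in a time period, not on what software made them.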
Now it's available to exclude traffic based on criteria such as what browser someone is using. NB. This is more than "user-agent string". The company forces people to sign NDAs before telling them what it is doing to fingerprint www users
If residential proxies are the problem then why not go after the companies that provide them
The truth is that those companies are not the problem. Their customers are so-called "tech" companies
Perhaps it's these so-called "tech" companies that are the problem
Certainly the problem is not the individual www user who doesn't use an "approved" graphical, JavaScript-enabled browser who gets blocked or fingerprinted trying to make a single request
But that's who suffers from "bot protection" so that so-called "tech" companies can profit from data collection, surveillance and ads
> Now it's available to exclude traffic based on criteria such as what browser someone is using
I'm pretty sure user-agent-based bot detection predates every request-rate-based method by quite a few years.
Update: I see the problem. Here's the full tweet: https://x.com/eastdakota/status/2062212701414187452
"Thought it would be end of 2027, then early 2027, but agentic traffic growing so fast that bots have now passed human traffic online for the first time in the Internet's history."
But the quoted segment in the article was just "…bots passed human traffic online for the first time in the Internet’s history."
It looks to me like the data supports "bots passed human traffic" but does NOT support "agentic traffic", since more of that traffic is from AI crawlers building indexes than from agents that are browsing the web on behalf of their owners.
If that's the point the article is trying to make then the headline is a little more supported, though I'd still say it's too hype-y a headline.
I guess a lot of this rests on what you assume "agentic traffic" to mean.
I think you are missing the fact that the dashboard has HTML pre-selected as a filter. Once you change that to all content types, you’ll see humans account for twice as much traffic as bots.
Note this part of the article:
> The CEO ignored the all-traffic number, on his own dashboard, and instead published the HTML-only number as a fact about the whole internet.
I recall Facebook doing it years ago, I imagine they still do.
A 'pixel' is an unobtrusive (as in, not seen by the user like a banner ad is seen) asset* served on a web page that can cause the user's user agent to make an affirmative web request from you, a third party, so you know someone was at the site serving your pixel.
Typically used for:
- tracking in general, as well as more specifically:
- retargeting
- conversion
* Note: Doesn't have to be a literal pixel, but a literal transparent pixel is least likely to get blocked. Serve your pixels from the end of a parameterized path (/some/param/or/other/pixel.gif) and it's not seen as query string tracking either.
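A rough sketch of the server side of such a pixel, in case it helps. The path layout, function name and log fields are hypothetical; the only fixed part is the 1x1 transparent GIF payload.

```python
import base64
from urllib.parse import urlsplit

# A commonly used 1x1 transparent GIF, base64-encoded.
PIXEL_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def handle_pixel_request(path, referer=None):
    """Return (log_entry, headers, body) for e.g. /campaign42/checkout/pixel.gif.

    Hypothetical layout: every path segment before the file name is treated
    as a tracking parameter, so nothing shows up in the query string.
    Returns None for any path that does not end in pixel.gif.
    """
    segments = [s for s in urlsplit(path).path.split("/") if s]
    if not segments or segments[-1] != "pixel.gif":
        return None
    log_entry = {"params": segments[:-1], "referer": referer}
    headers = {
        "Content-Type": "image/gif",
        # Discourage caching so each page view triggers a fresh request.
        "Cache-Control": "no-store",
    }
    return log_entry, headers, PIXEL_GIF
```

The third party learns about the visit from the request itself (path segments, Referer, IP, cookies); the image is just the excuse for the request.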
The only thing I saw that could possibly be construed as abusive was some poorly configured RSS bots. Even when my server told the bot that the page would not change for 4 hours, the RSS bots would check every 10 minutes, meaning they are ignoring the Cache-Control header. This was entirely harmless, just slightly annoying. The RSS bots are not new. Most of the bots are not even trying to disguise themselves as humans.
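For comparison, a well-behaved poller would do something like the following. This is only a sketch: the function name and the 10-minute default are assumptions, and it looks at `max-age` only, ignoring `Expires`, `ETag` and conditional requests.

```python
import re


def next_poll_delay(cache_control, default=600):
    """Seconds to wait before re-fetching, given a Cache-Control header value."""
    if cache_control:
        match = re.search(
            r"(?:^|[\s,])max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE
        )
        if match:
            # Never poll more often than the default, even if max-age is shorter.
            return max(int(match.group(1)), default)
    return default
```

With a header of `max-age=14400` (4 hours) this waits 14400 seconds instead of hammering the feed every 600.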
I was expecting the bots to mirror a couple git repositories I exposed but they did not go deeper than the README.md. None of them. I think this is the same pattern of catastrophization that exists around AI dooming the world and I don't know why it is spreading. I guess it must work or people would not do it.
[1] - https://blawg.nochan.net/b/Internet-Crap/20260522-Maybe-AI-B...
So many people have sketchy TV boxes or whatever other sketchy IoT device that is a larp for using your network to sell bandwidth to proxy networks.
However, CF is unnecessarily making people on 5G connections from desktops do turnstiles, as it looks like a scraper using a mobile proxy. This will become more and more of an issue as more laptops have 5G modems in them. Not sure how this WAF IP fingerprinting model survives widespread CGNAT. I guess it will be an excuse to fingerprint us more intensely.
It’d make sense, as you might not want your bot to load everything a real human would (i.e., analytics, ads, unrelated files, etc.) and only focus on the content.
Also, am I the only one surprised that bot traffic is not the majority already? For my site, it’s 100 bots for every human.
It's easy to drastically underestimate the amount of bot traffic, because bots make efforts (of varying sophistication) to look human enough to evade blocking. That includes using fake user-agent strings corresponding to real browsers (often but not always with implausibly old version numbers), proxying through residential IPs, and sometimes using full headless browsers. In my own data, traffic from badly behaved browser-impersonation bots exceeds traffic from named scrapers like GPTBot by something like 10x.
The measured percentage of bot traffic is higher for HTML than for other content types because many bots will load an HTML page, and then not load the JS/CSS/image/etc resources it references. But these are the least-sophisticated and most-detectable bots.
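A toy version of that "HTML but no sub-resources" heuristic, to make it concrete. The `(client_id, path)` input, the suffix list and the threshold are assumptions for illustration, not how any particular vendor does it, and it will misfire on clients with warm caches or content blockers.

```python
from collections import defaultdict

# Hypothetical list of suffixes treated as page sub-resources.
ASSET_SUFFIXES = (".js", ".css", ".png", ".jpg", ".gif", ".svg", ".woff2")


def html_only_clients(requests, min_pages=3):
    """Return clients that fetched at least `min_pages` pages and zero assets.

    `requests` is an iterable of (client_id, path) pairs, a simplified
    stand-in for parsed access-log lines.
    """
    pages = defaultdict(int)
    assets = defaultdict(int)
    for client, path in requests:
        if path.lower().endswith(ASSET_SUFFIXES):
            assets[client] += 1
        else:
            pages[client] += 1
    return {c for c, n in pages.items() if n >= min_pages and assets[c] == 0}
```

This only catches the unsophisticated end of the spectrum; anything driving a full headless browser fetches the assets too and sails straight past it.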
I would assume a lot of people running websites tend to think in pageviews, especially when dealing with bots because images and CSS files tend to be "cheap" static content but HTML requests are often dynamically generated.
It's also a single tweet that links to the data used to "disprove" it. Would be a weird way to lie.
Mostly current valid user agents, lots of IP addresses, but the traffic patterns are not organic. I’m not clear if it’s bad AI scraping or DoS, but at some level it’s indistinguishable.
Same general idea goes for any of the algorithmically driven platforms. The algorithms are ostensibly intended to surface organically discovered things by watching how people interact with things. That they are so susceptible to distortion through bot farms should be a lot more acknowledged than it is. People trust them far more than they should.
There is also a general cost of running things concern. It isn't like it is completely free to execute on bot traffic.
If the digital platform's storefront is their business, they could afford to spend some budget on bot detection. Bots still come from data center networks, sometimes render pages incompletely, request resources in bulk, and show enough patterns to be flagged internally.
If we look at a medium website, most random crawlers will come from Amazon, Microsoft, DigitalOcean, Hetzner, OVH, and a few other DC networks — these can be blocked easily without harming real users. The rest can be detected and cleaned up, even manually.
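Roughly what the datacenter check looks like in practice. The CIDR blocks below are placeholders from the RFC 5737 documentation ranges, not real provider ranges; you would have to load the actual lists that the hosting providers publish.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real provider ranges.
DATACENTER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_datacenter_ip(ip, networks=DATACENTER_NETWORKS):
    """True if `ip` falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    # An IPv6 address is never "in" an IPv4 network, so mixed lists are safe.
    return any(addr in net for net in networks)
```

A linear scan is fine for a few hundred prefixes; with full provider lists you would probably want a prefix tree or let the firewall do the matching.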
The math is simple: 20,000 visits a day at 15 seconds each = ~83 hours a day lost watching a Cloudflare logo, just because someone doesn't want to dig into the logs. I don't buy it.
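Spelled out, using the same figures as above (the function name is just for illustration):

```python
def hours_lost_per_day(visits_per_day, seconds_per_check):
    """Total human time spent on verification screens, in hours per day."""
    return visits_per_day * seconds_per_check / 3600


print(round(hours_lost_per_day(20_000, 15), 1))  # 83.3
```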
There are also some mixed incentives. Yes, it is the ad platform that is getting abused. But it is also the ad platform that is charging people based on abusive practices.
And it isn't like this is completely made up. Just look at how Facebook killed a ton of people during the "pivot to video" programs. I don't know all of the details, as I was thankfully not in any of the involved industries, but my understanding is it is fairly well documented.
> Certainly the problem is not the individual www user who doesn't use an "approved" graphical, JavaScript-enabled browser who gets blocked or fingerprinted trying to make a single request
The alternatives to JavaScript fingerprinting are either ineffective (TLS fingerprinting and/or IP rate limits), or even worse for privacy (e.g. attestation).
> If residential proxies are the problem then why not go after the companies that provide them
The most important thing is to link to your methodology, which Cloudflare's CEO did in this case.
That is something that should not be allowed to exist. It's one of the reasons monopolies (or even majority-opolies) are bad. It's a weapon hanging on the wall, waiting to be used.
Are there network effects like what happens with Microsoft in the business computing space? With Microsoft, I'm also aware of a great amount of anti-competitive behavior, and though I haven't seen that from Cloudflare personally and haven't heard accusations of it, I also haven't paid attention.
When I learned econ 101 in high school there was a concept of a "natural monopoly" like an electricity utility, a concept that was probably mostly post-hoc rationalization of the regulatory structures that were chosen a century ago, but it at least was a coherent narrative. I can't see any coherent narrative about Cloudflare's services being a natural monopoly. So I'm left wondering if they are just way better at what they do than anybody else, and perhaps the space isn't big enough to drive a competitor to enter it?
I hope somebody on HN has a much better explanation of this than I do.
CF serves something it convinced customers they need.
Static blogs hiding behind bot protection (in some cases blocking legit users from GrapheneOS because it's difficult to fingerprint them) because someone convinced them they'll be DDoSed by bots otherwise is a loss to the Internet.
A lot of self-hosters running CF tunnels because they don't know better also contributes.
> It's more healthy to start the conversation of _why_ CF services are valuable.
Begging the question. It's what TFA is about - telling people they need CF.
> A lot of self-hosters running CF tunnels because they don't know better also contributes.
If you know better, you should contribute. I’d blame the bad actors, rather than service providers that alleviate the problem.
I am unaware of such a capability of Cloudflare.
I believe it is the site administrators who have inserted Cloudflare in between their sites and their users.
Usually it is done for rational reasons of establishing a protection against bots. But what is less rational, in my opinion, is when everyone uses the same provider for that.
Because it indirectly turns Cloudflare into a monopoly. And monopolies often converge to a state where they start to abuse their position.
What makes you think this is the cause, rather than something more straightforward like: CGNAT means more users are sharing IPs, and there's a higher chance that the IP pool gets contaminated by bad behavior? Apparently Cloudflare tries to detect CGNAT pools and give them more leeway [1], but at the same time they can't give them unlimited leeway.
[1] https://blog.cloudflare.com/detecting-cgn-to-reduce-collater...
Maybe from your IP block even. More common since you don't control that.
kordlessagain•1h ago
The fact is, Cloudflare is a man-in-the-middle. That's their focus, that's their purpose.
They will limit your local crawler from accessing pages. They will demand you use their crawler.
They will decrypt your traffic if they get a warrant. They always decrypt your traffic anyway, but they will give it to state actors if they demand it.
That's not to say anyone should break the laws, but the issue right now is that intellectual property is incompatible with what is coming with AI.
I don't hate on Cloudflare because it's a bad service. It's actually pretty good, but the fundamental problem is they make their purpose to be a single choke point of all data on the Web.
That's not right. It never was.
gonzalohm•46m ago
They don't see anything wrong with one entity controlling most of the internet traffic
gruez•12m ago
Source? According to Cloudflare, their crawling service doesn't get any special treatment from their WAF/CDN.