I love experimental data like this. So much better than the gut reactions that got spammed when Anubis was first introduced.
The idea is to scare off bots and not normal humans.
I'll implement Anubis at low difficulty for all my projects and leave a decent llms.txt referenced in my sitemap and robots.txt, so LLMs can still get relevant data for my site while keeping bad bots out. I'm getting thousands of requests from China that have really increased costs, so I'm glad the fix seems rather easy.
It's even dumber than that, because by default Anubis whitelists the curl user agent.
curl -H "User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36" "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/?id=v7.0-rc5&id2=v7.0-rc4&dt=2"
<!doctype html><html lang="en"><head><title>Making sure you're not a bot!</title><link rel="stylesheet"
vs curl "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/?id=v7.0-rc5&id2=v7.0-rc4&dt=2"
<!DOCTYPE html>
<html lang='en'>
<head>
<title>kernel/git/torvalds/linux.git - Linux kernel source tree</title>

In that case a better solution would be to take the site down altogether.
https://web.archive.org/web/20260329052632/https://gladeart....
Should maybe have prioritized differently...
I may be missing something of course
People have a right to complete anonymity, and should be able to go across the majority of the Internet just as they can go across most of the country.
That’s what you are missing.
Don’t get me wrong, I am also in favour of a single government ID, but in terms of combatting identity fraud, accessing public resources like single-payer healthcare, and making it easier for a person to prove their identity to authorities or employers.
It should not be used as a pass card for fundamental rights that normally would have zero government involvement.
Very annoying. And you can't filter them because they look like legitimate traffic.
On a page with different options (such as color, size, etc.), they'll try all the combinations, eating up all the resources.
That’s how fast the landscape is changing.
And remember: while the report might have been released in 2024, it takes time to conduct research and publish. A good chunk of its data was likely from 2023 and earlier.
> Anubis uses a Proof-of-Work scheme in the vein of Hashcash
And if you look up Hashcash on Wikipedia you get https://en.wikipedia.org/wiki/Hashcash which explains how Hashcash works in a fairly straightforward manner (unlike most math pages).
for (let nonce = 0; ; nonce++) {
  // Hash the challenge data with the current nonce candidate.
  const hashBuffer = await calculateSHA256(data + nonce);
  const hashArray = new Uint8Array(hashBuffer);
  // Accept the nonce only if the first requiredZeroBytes bytes are all zero.
  let isValid = true;
  for (let i = 0; i < requiredZeroBytes; i++) {
    if (hashArray[i] !== 0) {
      isValid = false;
      break;
    }
  }
  if (isValid) return nonce;
}
It's less proof of work than an annoyance to users, plus a feel-good for whoever added it to their site. I can't wait for it to go away. As a bonus, it's based on a misunderstanding of Hashcash: because it only tests whole zero bytes rather than comparing against a fractional target (as Bitcoin does, for example), the difficulty isn't granular enough to make sense. Only a couple of the lower settings are reasonably solvable in JavaScript, and the gap between "wait for 90 minutes" and "instantly solved" is 2 values apart.

Normal and sane people understand this intuitively. If someone goes to a mechanic because their car is broken and the mechanic says "well, if you can tell that your car is broken, then you should be able to figure out how to fix it", that mechanic would be universally hated and go out of business in months. Same for a customer complaining about a dish made for them in a restaurant, or a user pointing out a bug in a piece of software.
It also drives home that Anubis needs a time estimate, at least for sites that use it not as a "can you run JavaScript" wall but as the actual proof-of-work mechanism it purports to be.
It shows a difficulty of "8" with "794 kilohashes per second", but what does that mean? I understand the 8 must be exponential (not literally that 8 hashes are expected to find 1 solution on average). But even as a power of 2, 2^8 = 256 (which I happen to know by heart), so thousands of hashes per second would find an answer in a fraction of a second. Or if it's 8 bytes instead of bits, then you'd expect to find a solution after like 8 million hashes, which at ~800k is about ten seconds. There is no way to figure out the expected wait even if you understand all the text on the page (which most people wouldn't) and know some shortcuts for the mental math (how many people know small powers of 2 by heart?).
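For what it's worth, the expected wait is easy to estimate once you pin down what the difficulty number counts. A minimal sketch, assuming the difficulty is a count of leading zero bits or leading zero nibbles (the exact Anubis semantics are my assumption here, not something stated on the page):

```javascript
// Rough expected PoW solve time. bitsPerUnit says how many bits one
// "difficulty unit" represents: 1 for bits, 4 for hex nibbles, 8 for bytes.
function expectedSeconds(difficulty, hashesPerSecond, bitsPerUnit) {
  // On average you need 2^(difficulty * bitsPerUnit) hashes before one
  // happens to start with that many zero bits.
  const expectedHashes = Math.pow(2, difficulty * bitsPerUnit);
  return expectedHashes / hashesPerSecond;
}

const rate = 794000; // "794 kilohashes per second" from the page

// If "8" means 8 leading zero bits: 2^8 = 256 hashes, instantly solved.
console.log(expectedSeconds(8, rate, 1)); // well under a millisecond

// If "8" means 8 leading zero nibbles: 2^32 hashes, roughly 90 minutes.
console.log(expectedSeconds(8, rate, 4)); // ≈ 5400 seconds

// Whole zero bytes would mean 2^64 hashes: astronomically long.
```

Notably, the nibble reading at difficulty 8 lands right around a 90-minute wait while the bit reading is solved instantly, which illustrates how coarse the setting is.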
If a website takes that long to verify me, I'll bounce. That's it.
JA4 fingerprinting works decently for the residential proxies.
Maybe I’m a bot, I gave up waiting before the progress bar was even 1% done.
If you have a logging stack, you can easily find crawler/bot patterns, then flag candidate IP subnets for blocking.
It's definitely whackamole though. We are experimenting with blocking based on risk databases, which run between $2k and $10k a year depending on provider. These map IP ranges to booleans like is_vpn, is_tor, etc, and also contain ASN information. Slightly suspicious crawling behavior or keyword flagging combined with a hit in that DB, and you have a high confidence block.
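The subnet-flagging step is easy to sketch. A toy version, assuming common-log-format lines with the client IP as the first field (the threshold and log layout are illustrative assumptions, not a production heuristic):

```javascript
// Toy crawler-pattern detector: count requests per IPv4 /24 and flag
// subnets that exceed a threshold. Field layout and threshold are
// assumptions for illustration only.
function flagSubnets(logLines, threshold = 1000) {
  const counts = new Map();
  for (const line of logLines) {
    const ip = line.split(' ')[0]; // common log format: client IP first
    const subnet = ip.split('.').slice(0, 3).join('.') + '.0/24';
    counts.set(subnet, (counts.get(subnet) ?? 0) + 1);
  }
  return [...counts.entries()]
    .filter(([, n]) => n >= threshold)
    .map(([subnet]) => subnet);
}
```

In practice you'd join the flagged subnets against the risk database (is_vpn, is_tor, ASN) before blocking, as described above.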
All this stuff is now easy to home-roll with Claude. Before, it would have been a major PITA.
> "The idea is that at individual scales the additional load is ignorable, ..."
Three minutes, one pixel of progress bar, 2 CPUs at 100%, load average 4.3 ...
The site is not protected by Anubis, it's blocked by it.
Closed.
One mistake people make is assuming that AI capability implies humanness. If you know exactly where to look, you can start to identify differences between even improving frontier models and human cognition.
One concrete example from a forthcoming blog post of mine:
[begin]
In fact, CAPTCHAs can still be effective if you know where to look.
We ran 75 trials -- 388 total attempts -- benchmarking three frontier AI agents against reCAPTCHA v2 image challenges. We looked across two categories: static, where each tile in the image grid is an individual target, and cross-tile, where an object spans multiple tiles.
On static challenges, the agents performed respectably. Claude Sonnet 4.5 solved 47%. Gemini 2.5 Pro: 56%. GPT-5: 23%.
On cross-tile challenges: Claude scored 0%. Gemini: 2%. GPT-5: 1%.
In contrast, humans find cross-tile challenges easier than static ones. If you spot one tile that matches the target, your visual system follows the object into adjacent tiles automatically.
Agents find them nearly impossible. They evaluate each tile independently, produce perfectly rectangular selections, and fail on partial occlusion and boundary-spanning objects. They process the grid as nine separate classification problems. Humans process it as one scene.
The challenges hardest for humans -- ambiguous static grids where the target is small or unclear -- are easiest for agents. The challenges easiest for humans -- follow the object across tiles -- are hardest for agents. The difficulty curves are inverted. Not because agents are dumb, but because the two systems solve the problem with fundamentally different architectures.
Faking an output means producing the right answer. Faking a process means reverse-engineering the computational dynamics of a biological brain and reproducing them in real time. The first problem can be reduced to a machine learning classifier. The second is an unsolved scientific problem.
The standard objection is that any test can be defeated with sufficient incentive. But fraudsters weren't the ones who built the visual neural networks that defeated text CAPTCHAs -- researchers were. And they aren't solving quantum computing to undermine cryptography. The cost of spoofing an iris scan is an engineering problem. The cost of reproducing human cognition is a scientific one. These are not the same category of difficulty.
[end]
Your key finding is that humans process the grid as one visual scene — but that's a finding about sighted cognition.
Isn't this, like most things, a sensitivity specificity tradeoff?
How many real humans should be blocked from your system to keep the bots out?
What is the Blackstone ratio of accessibility?
I can't believe people are still using this as a generic anti-AI argument even though a decade ago people were insisting that there's no way AI can have the capabilities that frontier LLMs have today. Moreover it's unclear whether the gap even exists. Even if we take the claim that the grid pattern is some sort of fundamental constraint that AI models can't surpass, it doesn't seem too hard to work around by infilling the grids pattern and presenting the 9 images to LLMs as one image.
Why do you say that?
We need a better solution.
In addition to pulling responses with huge amplification (40x, at least, for posting a single Facebook post to an empty audience), it's sending us traffic with fbclids in the mix. No idea why.
They're also sending tons of masked traffic from their ASN (and EC2), with a fully deceptive UserAgent.
The weirdest part though is that it's scraping mobile-app APIs associated with the site in high volume. We see a ton of other AI-training focused crawlers do this, but was surprised to see the sudden change in behavior on facebookexternalhit ... happened in the last week or so.
Everyone is nuts these days. Got DoSed by Amazonbot this month too. They refuse to tell me what happened, citing the competitive environment.
On Safari or Orion it is merely extremely slow to load.
I definitely wouldn't use any of this on a site that you don't want delisted for cryptojacking.
Good luck banning yourself from the future.
Is the theory here that OpenAI, Anthropic, Gemini, xAI, Qwen, Z.ai etc are all either running bad scrapers via domestic proxies in Indonesia, or are buying data from companies that run those scrapers?
I want to know for sure. Who is paying for this activity? What does the marketplace for scraped data look like?
What is the point of these anti bot measures if organic HN traffic can nuke your site regardless? If this is about protecting information from being acquired by undesirable parties, then this site is currently operating in the most ideal way possible.
The information will eventually be ripped out. You cannot defeat an army with direct access to TSMC's wafer start budget and Microsoft's cloud infrastructure. I would find a different hill to die on. This is exactly like the cookie banners. No one is winning anything here. Publishing information to the public internet is a binary decision. If you need to control access, you do what Netflix and countless others have done. You can't have it both ways.
> Here is a massive log file for some activity in the Data Export tar pit:
A bit of a privacy faux pas, no? Some visitors may be legitimate.