> Anubis sits in the background and weighs the risk of incoming requests. If it asks a client to complete a challenge, no user interaction is required.
> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums. Anubis has a customizable difficulty for this proof-of-work challenge, but defaults to 5 leading zeroes.
When I go to Codeberg or any other site using it, I'm never asked to perform any kind of in-browser task. It just has my browser run some JavaScript to do that calculation, or uses a signed JWT so the result is cached and the process doesn't have to run again.
Why shouldn't an automated agent be able to deal with that just as easily, by just feeding that JavaScript to its own interpreter?
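To make that concrete, here is a minimal sketch (TypeScript, not Anubis's actual code) of the work a headless client would have to reproduce, assuming the challenge string and difficulty are read out of the served page; the field names and exact scheme are assumptions based on the quoted description:

```typescript
import { createHash } from "node:crypto";

// Hypothetical challenge parameters, standing in for whatever the served
// page embeds; these names are illustrative, not Anubis's real protocol.
const challenge = "example-challenge-string";
const difficulty = 5; // leading zero hex digits, per the quoted default

// Brute-force a nonce such that SHA-256(challenge + nonce) starts with
// `difficulty` zero hex characters: the same work a browser does in JS.
function solve(challenge: string, difficulty: number): number {
  const prefix = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const hash = createHash("sha256")
      .update(challenge + nonce)
      .digest("hex");
    if (hash.startsWith(prefix)) return nonce;
  }
}

console.log(solve(challenge, difficulty));
```

An agent that runs the served JavaScript, or an equivalent native loop like this, pays the same cost as a browser and gets the same token.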
As other commenters say, it was completely predictable from the start.
Anubis directly incentivises the adversary, at the expense of everyone else
it's what you would deploy if you want to exclude everyone else
(conspiracy theorists note that the author worked for an AI firm)
"Everyone else" actually has staggering piles of compute, utterly dwarfing the cloud, utterly dwarfing all the AI companies, dwarfing everything. It's also generally "free" on the margin. That is, if your web page takes 10 seconds to load due to an Anubis challenge, in principle you can work out what it is costing me but in practice it's below my noise floor of life expenses, pretty much rolled in to the cost of the device and my time. Whereas the AI companies will notice every increase of the Anubis challenge strength as coming straight out of their bottom line.
This is still a solid and functional approach. It was always going to be an arms race, not a magic solution, but this approach at least slants the arms race in the direction the general public can win.
(Perhaps tipping it in the direction of something CPUs can do but not GPUs would help. Something like an scrypt-based challenge instead of a SHA-256 challenge. https://en.wikipedia.org/wiki/Scrypt Or some sort of problem where you need to explore a structure in parallel but the branches have to cross-talk all the time and the RAM is comfortably more than a single GPU processing element can address. Also I think that "just check once per session" is not going to make it but there are ways you can make a user generate a couple of tokens before clicking the next link so it looks like they only have to check once per page, unless they are clicking very quickly.)
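As a rough illustration of what such a memory-hard variant might look like (a sketch only; the parameters are illustrative, and Node's built-in scrypt stands in for whatever a browser-side implementation would actually use):

```typescript
import { scryptSync } from "node:crypto";

// Illustrative scrypt parameters: memory use is roughly 128 * N * r bytes,
// so N = 2**15, r = 8 needs about 32 MB per hash attempt.
const N = 2 ** 15;
const r = 8;
const p = 1;

// Same nonce search as a SHA-256 challenge, but each attempt now forces
// ~32 MB of memory traffic, which is awkward for GPU-style parallelism.
// The difficulty would be tuned far lower than for plain SHA-256.
function solveMemoryHard(challenge: string, difficulty: number): number {
  const prefix = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const key = scryptSync(String(nonce), challenge, 32, {
      N,
      r,
      p,
      maxmem: 256 * 1024 * 1024,
    });
    if (key.toString("hex").startsWith(prefix)) return nonce;
  }
}

console.log(solveMemoryHard("example-challenge", 1)); // low difficulty: each attempt is already slow
```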
To actually make it expensive for scrapers every page would need a new challenge. And that would not be tolerated by real human users. Or the challenge solution would need to be tied to a stateful reward that only entitles a human-level amount of subsequent request usage.
I'm not sure if anubis currently does that, but it certainly could.
it's a "I don't want my server to be _overrun_ by crawlers" protection which works by
- taking advantage that many crawlers are made very badly/cheaply
- increasing the cost of crawling
That's it: simple, but good enough to shake off the dumbest crawlers and to make it worth it for AI agents to e.g. cache site crawling, so that they don't crawl your site 1,000 times a day but instead just once.
This may tread too close to DRM, though, due to the element protection scheme.
Binding a challenge-response to a specific resource doesn't sound like such a bad idea though.
The real question is whether it would really be enough to discourage indiscriminate/unrestrained scraping. The disparity between the computing resources of your average user and a GPU-accelerated bot with tons of memory is, after all, so lop-sided that such an approach may not even be sufficient. For a user to compute a hash that requires 1024 iterations of an expensive function demanding 25 MB of memory might seem like a promising scraping deterrent at first glance. On the other hand, to a company with numerous cores per processor running in separate threads and several terabytes of RAM at its disposal (multiplied by scores of server racks), it might just be a drop in the bucket. In any case, it would definitely require a modicum of tuning/testing to see if it is even viable.
I have actually implemented this very kind of hash function in the past and can attest that the implementation is fairly trivial. With just a bit of number theory and some sponge-construction tricks you can achieve a highly robust implementation in just a few dozen lines of JavaScript. Maybe when I have the time I should put something up on GitHub as a proof-of-concept for people to play with. =)
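For the curious, here is a rough sketch of that kind of construction under the 25 MB / 1024-iteration framing above (my own illustration, not the commenter's code): fill a buffer from a seed, then do data-dependent reads so the whole buffer has to stay resident.

```typescript
import { createHash } from "node:crypto";

// Hypothetical memory-hard mixer, loosely in the spirit of scrypt's ROMix:
// fill ~25 MB from the seed, then do data-dependent reads over it.
function memoryHardHash(seed: string, memBytes = 25 * 1024 * 1024, rounds = 1024): string {
  const blockSize = 32; // one SHA-256 output per block
  const blocks = Math.floor(memBytes / blockSize);
  const buf = Buffer.alloc(blocks * blockSize);

  // Sequential fill: block[0] = SHA-256(seed), block[i] = SHA-256(block[i-1]).
  let prev = createHash("sha256").update(seed).digest();
  for (let i = 0; i < blocks; i++) {
    prev.copy(buf, i * blockSize);
    prev = createHash("sha256").update(prev).digest();
  }

  // Data-dependent mixing: which block gets read next depends on the running
  // state, so the whole buffer has to stay cheaply accessible throughout.
  let state = prev;
  for (let i = 0; i < rounds; i++) {
    const idx = state.readUInt32BE(0) % blocks;
    const block = buf.subarray(idx * blockSize, (idx + 1) * blockSize);
    state = createHash("sha256").update(state).update(block).digest();
  }
  return state.toString("hex");
}

console.log(memoryHardHash("example-seed"));
```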
If they notice that they are getting rate limited or IP blocked, they will use each IP only once. This means that IP based rate limiting simply doesn't work.
The proof of work algorithm in Anubis creates an initial investment that is amortized over multiple requests. If you decide to throw the proof away, you will waste more energy, but if you don't, you can be identified and rate limited.
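A minimal sketch of the server side of that trade-off, assuming the solved proof (or the token/JWT issued for it) is used as the rate-limit key; everything here is illustrative, not Anubis's actual behavior:

```typescript
// Hypothetical in-memory rate limiter keyed by the proof-of-work token.
// Reusing the token keeps requests cheap but identifiable; discarding it
// forces the client to pay for a fresh proof on every visit.
const buckets = new Map<string, { count: number; windowStart: number }>();

function allowRequest(powToken: string, limit = 60, windowMs = 60_000): boolean {
  const now = Date.now();
  const bucket = buckets.get(powToken);
  if (!bucket || now - bucket.windowStart > windowMs) {
    buckets.set(powToken, { count: 1, windowStart: now });
    return true;
  }
  bucket.count++;
  return bucket.count <= limit; // beyond this, demand a new (costly) proof
}
```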
The automated agent can never get around this, since running the code is playing by the rules. The goal of the automated agent is to ignore the rules.
I understand why certain business models have a problem with AI crawlers, but I fail to see why sites like Codeberg have an issue.
If the problem is the cost of the traffic, then this is nothing new, and I thought we had learned how to handle that by now.
Services like Codeberg that run on donations can easily be DoS'ed by AI crawlers.
For example: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
> [...] Now it’s LLMs. If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
The Linux kernel project has also been dealing with this, AFAIK. Apparently it's not so easy to deal with, because these AI scrapers pull a lot of tricks to anonymize themselves.
> Precisely one reason comes to mind to have ROBOTS.TXT, and it is, incidentally, stupid - to prevent robots from triggering processes on the website that should not be run automatically. A dumb spider or crawler will hit every URL linked, and if a site allows users to activate a link that causes resource hogging or otherwise deletes/adds data, then a ROBOTS.TXT exclusion makes perfect sense while you fix your broken and idiotic configuration.
(And it led to outrage from people for whom requiring an account was some kind of insult.)
Now it's 2-5 sites per day, including web forums and such.
"Bruh sorry we were technically unable to produce a website without invasive dark pattern tracking stuff. Tech is haaaaard."
Honestly, I've never found a page outside my own country that I couldn't live without. Screw that s*t.
First from known networks, then from residential IPs. First with dumb http clients, now with full blown headless chrome browsers.
I've worked with a company that has had to invest in scraper traffic mitigation, so I'm not disputing that it happens in high enough volume to be problematic for content aggregators, but as for small independent non-commercial websites I'll stick with my original hypothesis unless I come across contradictory evidence.
A more memory-hard "mining" algorithm could help.
Here's the basic problem: the fully loaded cost of a server CPU core is ~1 cent/hour. The most latency you can afford to inflict on real users is a couple of seconds. That means the cost of passing a challenge the way the users pass it, with a CPU running Javascript, is about 1/1000th of a cent. And then that single proof of work will let them scrape at a minimum hundreds, but more likely thousands, of pages.
So a millionth of a cent per page. How much engineering effort is worth spending on optimizing that? Basically none, certainly not enough to offload to GPUs or ASICs.
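Spelling out that arithmetic under the comment's stated assumptions (~1 cent per core-hour, a couple of seconds per challenge, roughly a thousand pages scraped per solved challenge):

```typescript
// Back-of-the-envelope numbers from the comment above; all assumptions.
const centsPerCoreHour = 1;
const challengeSeconds = 2;
const pagesPerProof = 1000;

const centsPerChallenge = centsPerCoreHour * (challengeSeconds / 3600); // ~0.00056 cents
const centsPerPage = centsPerChallenge / pagesPerProof;                 // ~0.0000006 cents

console.log({ centsPerChallenge, centsPerPage });
```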
The more sites get protected by Anubis, the stronger the incentives are for scrapers to actually switch to GPUs etc. It wouldn't take all that much engineering work to hook the webcrypto apis up to a GPU impl (although it would still be fairly inefficient like that). If you're scraping a billion pages then the costs add up.
Now, could you construct a challenge that forced the client to keep a ton of data in memory, and then regularly be forced to prove they still have that data during the entire session? I don't think so. The problem is that for that kind of intermittent proof scenario there's no need to actually keep the data in low latency memory. It can just be stored on disk, and paged in when needed (not often). It's a very different access pattern from the cryptocurrency use case.
Besides the "got the source code for training data" , the other access scenario is just downloading to an end users "agent" Which again, the end user is running something in the background, doesn't care how long it takes, how much it costs, its not a volume or spam type problem
That, or, they could just respect robots.txt and we could put enforcement penalties for not respecting the web service's request to not be crawled. Granted, we probably need a new standard but all these AI companies are just shitting all over the web, being disrespectful of site owners because who's going to stop them? We need laws.
IMO, if digital information is posted publicly online, it's fair game to be crawled unless that crawl is unreasonably expensive or takes it down for others, because these are non-rivalrous resources that are literally already public.
> we could put enforcement penalties for not respecting the web service's request to not be crawled... We need laws.
How would that be enforceable? A central government agency watching network traffic? A means of appealing to a bureaucracy like the FCC? Setting it up so you can sue companies that do it? All of those seem like bad options to me.
I disagree. Whether or not content should be available to be crawled is dependent on the content's license, and what the site owner specifies in robots.txt (or, in the case of user submitted content, whatever the site's ToS allows)
It should be wholly possible to publish a site intended for human consumption only.
> How would that be enforceable?
Making robots.txt or something else a legal standard instead of a voluntary one. Make it easy for site owners to report violations along with logs, legal action taken against the violators.
You have just described the rationale behind DRM. If you think DRM is a net positive for society, I won't stop you, but there has been plenty published online on the anguish, pain and suffering it has wrought.
This is actually kind of why I like Anubis. Instead of trying to dictate what clients or purposes or types of users can access a site, it just changes the asymmetry of costs enough that hopefully it fixes the problem. Because like you can still scrape a site behind Anubis, it just takes a little bit more commitment, so it's easier to do it on an individual level than on a mass DoS level.
This _is_ the problem Anubis is intended to solve -- forges like Codeberg or Forgejo, where many routes perform expensive Git operations (e.g. git blame), and scrapers do not respect the robots.txt asking them not to hit those routes.
Same old problem. Corps are gonna corp.
I see a lot more private networks in our future, unfortunately.
And by putting a wall up you end up losing a large portion of the market to those that will now simply arbitrage and fill the space you leave behind.
There is simply no way to stop crawlers/scrapers, period, unless you put a meter on it or go offline.
There is no legal recourse here. If you don't want AI crawlers accessing your content, 1) put it behind a paywall, or 2) remove it from public access.
More importantly, do you want to compete with those that do not bottleneck, and lose your traffic?
This is the paradox, the length you go to protect your content only increases costs for everybody else who isn't an AI crawler.
The search era clearly proved it is possible to crawl respectfully; the AI crawlers have just decided not to. They need to be disincentivized from doing this.
- is hard to enforce
- lacks bite, i.e. it makes you more money to break it than any penalties would cost
but in general yes, a site which indicates it doesn't want to be crawled by AI bots but still gets crawled should be handled similarly to someone with a house ban from a shop forcing themselves into the shop
given how severely messed up some millennium-era cybersecurity laws are, I wonder if crawlers bypassing Anubis could be interpreted as "circumventing digital access controls/protections" or similar, especially given that it's done to make copies of copyrighted material ;=)
If you put something in the public domain, people are going to access it unless you put it behind a paywall, but you don't want to do that because it would limit access, or because people wouldn't pay for it to begin with (e.g. your blog that nobody wants to pay for).
There's no law against scraping, and we're already past the CFAA argument.
It's something people do for people. It's not "in the public domain" for companies to gobble up with machines.
> There's no law against scraping
There's no law against imposing as heavy and pernicious social and material costs on commercial scrapers as is physically possible within legal bounds, either. So what's the problem?
If you really think what you offer has value, put it behind a paywall and see how many people will consume it then; probably not a lot.
There will always be bots, they were here before anubis, they'll be there long after you block them again. Take care of yourself first. There's no need to make a bad day worse trying to sprint down a marathon.
You're doing a tremendous job. On a personal note, I'm not angry or anything, it's just the nature of the process. No hard feelings here. I root for you.
Hope everything goes way better and way sooner than you ever imagined. Good luck & godspeed!
There's a lot of love for Anubis alone, but this doesn't translate to a third of what a junior dev makes. How do we expect to keep turning out these highly beneficial open source products when people cannot pass on a small portion of their savings to those that make the tools?
I hope you figure it out, Xena. Even though I'm only on the user side of things, I've liked Anubis more than other solutions like Cloudflare.
But what's your point? That Anubis creates less value because it's made in Canada? That Xena should be paid like an entry level engineer because it's open source? That there's people who make less so be happy? (Starving kids in Africa) How would you respond if your employer made the same argument to you? Or that Xena should work two jobs because one of the above options?
I really don't understand what you're arguing.
My comparison to an entry level engineer's salary was because it is a very low bar.
https://www.levels.fyi/t/software-engineer/locations/canada
Or look at just entry level: https://www.levels.fyi/t/software-engineer/levels/entry-leve...
US for comparison: https://www.levels.fyi/t/software-engineer/locations/united-...
However, not gonna lie, I feel a lot less risk tolerant given that my husband is currently unemployed and as much as this is the worst possible place to do this, I really would love if he got a job somewhere. It would make things so much logistically simpler.
Thank you for your kind words though! Really, it made my day better.
I've also set up LiberaPay after getting many complaints about my choice of funding platforms: https://liberapay.com/Xe/.
Generally I don't feel comfortable doing self-promotion for financial things on Hacker News. I don't want to piss in the pool that we all swim in. I've also been considering making a Y Combinator application, if only because money would let me afford to be able to take a risk.
Sorry if this is overly ranty, it's really annoying to be one of those load-bearing dependency pegs that everyone complains about but relatively few people contribute to financially. It's making me wonder if making this liberally open source was a mistake.
I do think there is just a larger problem that we're just so bad at funding open source software despite their great utility and how much of our world is built off of these systems. My partner is studying economics and I always love to introduce her colleagues to OSS and it is definitely something they even have a hard time wrapping their heads around lol.
I know it doesn't pay the bills, but I am rooting for you and do want to say thanks for your work. Not just Anubis either. It's hard, but don't be afraid to ask for help. You're not being greedy and don't let people like rekabis bully you. People having it worse off doesn't make you or what you do any less valuable. The work speaks for itself, even if as a society we haven't figured out how to make that pay the bills.
Anonymous account for obvious reasons.
It does help against accidental DDoS, or just rude scrapers that assume everyone has unlimited bandwidth and money.
If a bot wants access, let it earn it—and let that work be captured, not discarded. Each request becomes compensation to the site itself. The crawler feeds the very system it scrapes. Its computational effort directly funds the site owner's wallet, joining the pool to complete its proof.
The cost becomes the contract.
That’s not much of a different ask from Anubis. It just commandeers the compute for some useful purpose.
Because if so, I don't yet see how to "smooth out" the wins. If the crawler manages to solve the very-high-difficulty puzzle for you and get you 1BTC, great, but it will be a long time between wins.
If you're proposing a new (or non-mainstream) blockchain: What makes those coins valuable?
But in principle I very much agree
Some focus on generating content that can be served to waste crawler time: crates.io/crates/iocaine/2.1.0
Some focus on generating linked pages: https://hackaday.com/2025/01/23/trap-naughty-web-crawlers-in...
Some of them play the long game and try to poison models' data: https://codeberg.org/konterfai/konterfai
There are lots more as well; those are just a few of the ones that recently made the rounds.
I suspect that combining approaches will be a tractable way to waste crawlers' time:
- Anubis-esque systems to defeat or delay easily-deterred or cut-rate crawlers,
- CloudFlare or similar for more invasive-to-real-humans crawler deterrence (perhaps only served to a fraction of traffic or traffic that crosses a suspicion threshold?),
- Junk content rings like Nepenthes as honeypots or "A/B tests" for whether a particular traffic type is an AI or not (if it keeps following nonsense-content links endlessly, it's not a human; if it gives up pretty quickly, it might be). This costs/pisses off users, but can be used as a test to better train traffic-analysis rules that trigger the other approaches on this list in response to detected likely-crawler traffic.
- Model poisoners out of sheer pettiness, if it brings you joy.
I also wonder if serving taboo traffic (e.g. legal but beyond-the-pale for most commercial applications porn/erotica) would deter some AI crawlers. There might be front-side content filters that either blacklist or de-prioritize sites whose main content appears (to the crawler) to be at some intersection of inappropriate, prohibited, and not widely-enough related to model output as to be in demand.
And the proposed remedy is to give them human-labeled data directly in the form of captchas, even more severely degrading the user experience and thus website viability?
Color me unconvinced.
It's well intentioned but just wastes electricity from good people in the end.
Anubis does nothing to impact bad crawlers, well, only the laziest ones. Against those, generating fake infinite content on the fly is much more efficient.
It doesn't quite do what it is advertised to do, as evidenced by this post; and it degrades user experience for everybody. And it also stops the website from being indexed by search engines (unless specifically configured otherwise). For example, gitlab.freedesktop.org pages have just disappeared from Google.
We need to find a better way.
Even Wikipedia has begged these damn bots to stop doing this; the data is already accessible in archives.
https://blog.cloudflare.com/introducing-pay-per-crawl/
There's also an open specification called x402:
https://www.x402.org/x402-whitepaper.pdf
I would definitely use this to charge US$100,000 per request from any AI company to crawl my site. I would exempt 'public good' crawlers like The Internet Archive though.
If AI companies valued at billions of dollars want to slurp up my contribution to the human condition, that's my price - subject to price rises only.