I realize Anubis was probably never tested on a true single-core machine. They are actually somewhat difficult to find these days outside of microcontrollers.
Also, they still might not have (though they've probably learned by now). In this article they imply that each type of CPU core (what they call a "tier" in the article) will still come in a power-of-two count, and one just happened to be 2^0. I'm not sure they were around when the AMD Athlon II X3 was hot.
>>> Today I learned this was possible. This was a total "today I learned" moment. I didn't actually think that hardware vendors shipped processors with an odd number of cores, however if you look at the core geometry of the Pixel 8 Pro, it has three tiers of processor cores. I guess every assumption that developers have about CPU design is probably wrong.
Yeah that's obviously not true, and believing it shows a marked lack of experience in the field. Of the current Xeon workstation lineup, only 3 of 14 SKUs have power-of-2 core counts. And there are consumer lines of CPUs with 6 cores and that sort of thing.
I never thought about it before, but I actually had to look up die shots to make sure they were not the same processor. And if I can trust the internet, they are not. Hell, I had to confirm that yes, the PlayStation 3 (also PPC, cue X-Files theme) only had the one core and its screwball subprocessors, like I remembered.
To me, the executive-level idea was to spare a core for system and minor peripheral tasks - almost all game consoles before that generation had single-CPU architectures with no resident operating system. Transitioning away from manually coordinated, single-threaded code to an asynchronous multithreaded programming model was a challenge in itself, and having to deal with it while an operating system forced on them by the console manufacturer took away control might have been too much for developers.
(Sega Saturn had 2x SH-2; the Dreamcast was to have WinCE on ROM, but that was cancelled. Sega being Sega.)
Another joke from the same era: having a 2-core processor means that you can now, e.g., watch a film at the same time. At the same time as what? At the same time as running Windows Vista!
2^0 = 1
So the logic might make sense in people's heads if they never encounter the 6- or 12-core CPUs that are common these days.
I have Chrome on mobile configured so that JS and cookies are disabled by default, and then I enable them per site based on my judgement. You might be surprised to learn that this normally works fine, and sites are usually better for it: they stop nagging and load faster. This makes some sense in retrospect, as this is what allows search engine crawlers to do their thing and get that SEO score going.
Anubis (and Cloudflare, for that matter) forces me to temporarily enable JS and cookies at least once anyway, completely defeating the purpose of my paranoid settings. I basically never bother to, but I do admit it is annoying. It's kind of up there with sites that have no content at all without JS (high-profile example: AWS docs). At least Cloudflare only spoils the fun every now and then. With Anubis, it's always.
It's definitely my fault, but at the same time, I don't feel this is right. Simple static pages now require allowing arbitrary code execution and statefulness. (Although I do recognize that SVGs and fonts also kind of do so anyhow, much to my further annoyance).
Making you pay time, power, bandwidth, or money to access content does not significantly impede your browsing, so long as the cost is appropriately small. For the user above reporting thirty seconds of maxed-out CPU, that's excessive for a median normal person (but we hackers are not that).
If giving your unique burned-in crypto-attested device ID is acceptable, there’s an entire standard for that, and when your device is found to misbehave, your device can be banned. Nintendo, Sony, Xbox call this a “console ban”; it’s quite effective because it’s stunningly expensive to replace a device.
If submitting proof of citizenship through whatever attestation protocol is palatable, then Anubis could simply add the digital ID web standard and let users skip the proof of work in exchange for affirming that they have a valid digital ID. But this only works if your specific identity can be banned; otherwise AI crawlers will just send a valid anonymized digital ID header.
This problem repeats in every suggested outcome: either you make it more difficult for users to access a site, or you require users to waste energy to access a site, or you require identifiable information signed by a dependable third-party authority to be presented such that a ban is possible based on it. IP addresses don’t satisfy this; Apple IDs, trusted-vendor HSM-protected device identifiers, and digital passports do satisfy this.
If you have a solution that only presents barriers to excessive use and allows abusive traffic to be revoked without depending on IP address, browser fingerprint, or paid/state credentials, then you can make billions of dollars in twelve months.
Ideas welcome! This has been a problem since bots started scraping RSS feeds and republishing them as SEO blogs, and we still don’t have a solution besides Cloudflare and/or CPU-burning interstitials.
(ps. I do have a solution for this, but it would require physical builds, be mildly unprofitable over time with no growth potential, and incite government hostility towards privacy-preserving identity systems. A billionaire philanthropist could build it in a year and completely solve this problem. Sigh.)
This might seem contradictory, but I believe it is technically possible; what I don't think is that this is how these solutions actually work currently. The idea would be to prove that I am indeed a unique visitor who is a person according to the govt, without revealing the personal info to the site or the site info to the govt, even if they collude.
Same with the whole 18+ goof. I'd actually quite like to try age-gated communities, like ±5 years of my age. I feel a lot of conflict stems from people coming from a bit too different walks of life sometimes. Could even do high-confidence location-based gating this way, which could also be cool (as well as the exact opposite of cool, because of course).
I'm not super well-versed in crypto though, so I confess this is a lot more conjecture than knowledge.
The only way I can imagine this working is:
1. You go to the government and request to have a digital ID generated.
2. The government generates a random number.
3. The government issues a request to an NGO to generate a new cryptographic object based on the random number, and receives back a retrieval number.
4. The government gives you the retrieval number, which you can use to get your digital ID from the NGO.
This way, the government only has the mapping between your identity and a random number, and the NGO only has the mapping between the random number and the generated object, with no possibility to deanonymize it because you don't present any ID to get it. Obviously, there must be no information exchange between the government and the NGO.
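To make that concrete, here is a toy sketch (invented names, nothing standard, not a real protocol) of what each party ends up storing:

```ts
import { randomUUID } from "node:crypto";

// Government side: knows who you are, but never sees the credential.
const govRecords = new Map<string, string>(); // real identity -> random number
function govRequest(realIdentity: string): string {
  const randomNumber = randomUUID();
  govRecords.set(realIdentity, randomNumber);
  return randomNumber; // forwarded to the NGO
}

// NGO side: mints the credential, but never learns your identity.
const ngoRecords = new Map<string, string>(); // retrieval number -> credential
function ngoIssue(randomNumber: string): string {
  const retrievalNumber = randomUUID();
  ngoRecords.set(retrievalNumber, `credential-for-${randomNumber}`); // placeholder object
  return retrievalNumber; // handed back to you by the government
}
```

Neither table alone links a person to a credential; that link only appears if the two databases are joined, which is exactly the information exchange that has to be forbidden.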
The construction would go basically like this:
pseudonym = VRF(secret_key + site_id)
The expectation is that you would have only one valid secret_key at any time, and it would be unknown to the government. This kind of scheme is called anonymous credential generation in the literature, I believe. It can be established that the secret_key is govt-backed, but that's it.
The site_id would be e.g. the domain cert's public key or similar (domain ownership is a moving target, so just the domain name imo is not sound).
VRF is a verifiable random function. This is the magic ZK part.
Pseudonym is what you present to the site, i.e. the identity you go by.
This way the site can verify that this pseudonym was specifically issued for it (making it site unique), and that it belongs to a govt certified identity (of which there should be only one issued at a time per person). The VRF is deterministic, guaranteeing that it's the same person every time.
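For what it's worth, a minimal sketch of the derivation, using HMAC-SHA256 as a stand-in for the VRF; a real scheme would need an actual verifiable random function (e.g. RFC 9381) plus an anonymous credential proving the key is govt-backed, so this only shows the determinism and the per-site separation:

```ts
import { createHmac } from "node:crypto";

// siteId: something stable for the site, e.g. a hash of its cert public key.
function derivePseudonym(secretKey: Buffer, siteId: string): string {
  // Deterministic: the same person always presents the same pseudonym to a
  // given site, while pseudonyms for different sites stay unlinkable to
  // anyone who doesn't know secretKey.
  return createHmac("sha256", secretKey).update(siteId).digest("hex");
}
```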
Revocation is annoying so I didn't bother thinking that through but should be fairly okay I think?
I believe this is robust to people forging arbitrary IDs, to sites colluding with each other in deanonymization, and colluding with the govt in the same. The only kickers I can think of are secret_key misuse (e.g. via duress) / theft / loss / sharing, and the trust anchor (the govt) being untrustworthy (forging invalid or duplicate identities). Would also need to handle people dying, but that would be pretty much just revocation.
I consider trust anchor issues out of scope. The remainder doesn't sound too bad to try defending for, and I think is also basically out of scope.
Potentially important edit: I'm not accounting for timing side channels here, which might be relevant during revocation or else.
Another: I didn't mention it, but in my humble opinion cryptographically attesting people is unsound. People can't calculate crypto in their heads and can't recall long arbitrary strings of hex. What is appropriate to attest (if anything) is their devices instead. But that's a layer of complication I didn't want to deal with here.
Why, though? If you're the only one who knows it, nothing prevents you from creating as many identities for the same site as you wish.
Person W is welcome to have thousands of unique IDs if they want to, so long as when site X bans identity Y, that ban is applied to all of Person W's present and future identities. Whether W has a single Y or a thousand Ys makes no difference to me. I suppose some sites will care to restrict participation to a single Y per W, but when, e.g., generally browsing a site with crawler/bot/AI shielding such as Anubis today, it's completely irrelevant to them what your Y is, so long as rate limits and bans apply to all Ys of W rather than to the presented Y alone.
It’s not difficult to solve this problem — the database schema and queries are dead simple! — it’s just exceedingly difficult to succeed if you're not a passport-issuing entity or an authorized monopoly of such.
Scratch that, this happens all the time. With a third party there's no way to revoke; with a government, you can usually handle this physically.
In the model I described, the trust anchor would be the govt, so basically a centralized model like domain certs. This resolves the issues you list off, but brings others: what if the trust anchor isn't trustworthy and starts forging identities?
The alternative to that would then be web of trust stuff. But this is why I consider this to be a separate problem. If the core protocol could be laid out and standardized at least, then layering on another that makes this choice between centralized vs web of trust could be done separately.
There are different metrics for cost, however. Based on CPU utilization and/or time, it's hard to argue that Anubis is a high price.
But if it is important to you to not run javascript for whatever reason, the price of access to a site using Anubis is rather high.
You put stuff on the public Internet, expect it to be read by everyone.
Don't like that? Put it behind a login.
How did the propaganda persuade people into accepting mass surveillance and normalising the invasion of privacy for something that was never really a problem?
I suspect the ones pushing this are the ones running these "crawlers" themselves, much like Cloudflare hosts providers of DDoS services.
What a coincidence that "identity verification" became a hot topic recently.
Don't even try to solve this nonexistent "problem", because that will only drive us further into authoritarian technocracy.
Crying “Conspiracy” in reply to a career Chicken Little is comedic. I’ve been raising warnings about identity verification looming on the horizon for perhaps fifteen years now; thanks to DejaNews for that early realization, I suppose.
> Don't even try to solve this nonexistent "problem", because that will only drive us further into authoritarian technocracy.
I would celebrate and tell all my friends if someone on this thread, on any thread, would explain how we solve this without bankrupting non-business site operators and without a third-party authority. Anubis is a band-aid at best, yet no better solution — not even an idea — is presented alongside your objections.
> You put stuff on the public Internet, expect it to be read by everyone.
My hobbyist forum can barely stay online eight hours a day due to crawler traffic. Someone scraped the entire site last year by spawning one request per page with no fork limit. It was down for a solid week after that, and now has very severe limits in place. I don't know how they can afford to stay running, but certainly "static only" isn't going to solve the CPU and bandwidth costs incurred by incompetent and redundant AI crawlers. So, by making their site public on today's infested internet, they've ended up with content that is no longer accessible.
> Don't like that? Put it behind a login.
As I noted above, one solution is payment, since free credential registration is not an obstacle to AI bots, after all. For some reason people don't like to charge money for hobbyist content if they can avoid it. I recognize why, and I'm trying my best to discover a non-monetary solution on their behalf.
> I suspect the ones pushing this are the ones running these "crawlers" themselves, much like Cloudflare hosts providers of DDoS services.
I do not and have not run crawlers or AI agents, trainers, or other such shit at any time in the past thirty years, and I will continue to abstain from the entire category, which should be quite easy as I'm a retired sysop now attending full-time accounting school and giving a finger to the entire industry to pursue work that benefits humanity. It's the same reason I bother telling HN "the anonymity sky is falling" every so often: I'd much prefer it if we didn't have to sacrifice anonymity online to defeat scraper bots.
> No, no, no, hell fucking no!!
Please find a way to turn your vehemence and passion into a productive contribution, before it’s too late for all of us. As presented, your argument is neither supported nor persuasive, and your hostility only gives opponents of anonymity more arrows in their quiver to shoot at us.
Authentication works, doesn't it?
Free credentials temporarily do, but only until someone teaches crawlers to AI-automate signups. Forum spammers already figured this out a long time ago.
A while back, someone from the Financial Times commented on HN about how confusing it was that we were all so hostile to their free-registration article paywall. The HN community view on their registration requirement suggests that free authentication does not work and has not for some time; has that viewpoint changed in recent years?
Forum spamming is a solved problem, isn't it? I mean, do you see any spam here on HN?
> The HN community view on their registration requirement suggests that free authentication does not work and has not for some time; has that viewpoint changed in recent years?
I don't think there is such a thing as a "HN community view". Are there meetings to agree on a consensus on important topics? Or were you trying to pull an appeal to authority?
Yes, I’ve reported two or three this past week or so by emailing the mods. Do you browse /new often? What do you think makes HN a spam-free place?
> I don't think there is such a thing as a "HN community view".
Noted.
Where I work, our main product is a React-based web site with a JSON back end. You might go to
http://example.com/web/item/88841
and that will load maybe 20 MB of stuff (always the same thing), and eventually, after the JS boots up, a useEffect() gets called that reads '88841' out of the URL and does a GET to
http://example.com/api/item/88841
which gets you nicely formatted JSON. On top of that, the public id(s) are sequential integers, so you could easily enumerate all the items if you thought about it a little.
We've had more than one obnoxious crawler, which we had reason to believe was targeted specifically at us, that would go to the /web/ URL and, without a cache, download all the HTML, JavaScript, and CSS, then run the JS and download the JSON for each page -- at which point they are either saving the generated HTML or looking at the DOM. If they'd spent 10 minutes playing with the browser dev tools they would have seen the /item/ request and probably could have figured out pretty quickly how to interpret the results; that would have saved them 95% of the bandwidth, 95% of the CPU, and whatever time they spent writing parsing code and managing their Rube Goldberg machine. As is, they have to figure out how to parse that HTML and turn it into something like the JSON, and I'd take 50% odds any day that they never actually did anything with the data they captured, because crawlers usually don't.
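For the curious, this is roughly all it would have taken (a sketch using the illustrative example.com URLs from above; the politeness delay is an arbitrary choice):

```ts
// Enumerate the sequential IDs against the JSON endpoint instead of
// rendering the React shell for every page.
async function crawlItems(maxId: number): Promise<void> {
  for (let id = 1; id <= maxId; id++) {
    const res = await fetch(`http://example.com/api/item/${id}`);
    if (!res.ok) continue;                         // gaps, deleted items, etc.
    const item = await res.json();                 // the nicely formatted JSON
    console.log(id, item);
    await new Promise((r) => setTimeout(r, 250));  // arbitrary politeness delay
  }
}
```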
I know because I've done more than my share of web crawling, and I have crawlers that can capture plain HTTP data, run JavaScript in a limited way, and run React apps. The last one would blast right past Anubis without any trouble, except for the rate limiting, which is not much of a problem because when I crawl, I hit fast, I hit hard, and I crawl once. [1] (There's a running gag in my pod that I can't visit the state of Delaware because of my webcrawling.)
[1] Ok, sometimes the way you avoid trouble is to hit slow and hit soft, but still hit once. It's a judgement call whether you can hit them before they know what hit them or whether you should blend in with the rest of the traffic.
I have no problem with bots scraping all my data, I have a problem with poorly-coded bots overloading my server, making it unusable for anybody else. I'm using Anubis on the web interface to an SVN server, so if the bots actually wanted the data, they could just run "svn co" instead of trying to scrape the history pages for 300k files.
> It seems like a whole lot of crap to me. Hostile webcrawlers, not to mention Google, frequently run Javascript these days.
I'm also rather unhappy that I had to deploy Anubis, but it's unfortunately the only thing that seemed to work, and the server load was getting so bad that the alternative was just disabling the SVN web interface altogether.
Incidentally, I read a short while ago that not having "Mozilla" in your user-agent will bypass Anubis, so give that a try.
My options are using a custom Chrome build, migrating to Firefox, or proxying my traffic and making edits that way (e.g. doing the Anubis PoW on the proxy and injecting the required cookie).
Not stoked about any of these, although Firefox is a lot on my mind these days, and option #3 would be a good excuse to dust off my RPi.
I refuse to be boiled slowly by Google. With MV3, it was full-fat ad blockers. With MV4 it could very well be ALL ad-blockers.
And yeah, I concede that sounds conspiratorial - as conspiratorial as Google cracking down on your ability to run the ad-blocker of your choice would've sounded a decade ago.
My God, there's two of us!
(Though … you're being privacy conscious on Chrome? Come to Firefox. Ignore the pesky "it's funded by Google" problems, nothing to see, nothing to see, the water is fiiiine.)
> You might be surprised to learn that normally, this actually works fine
I guess I have a different experience there. A huge number of sites just outright crash (e.g., the HN search). JavaScript devs, I've learned, do not handle error cases, and the exceptions tend to just propagate out and ruin the rendering. There seems to be some popular framework out there that even destroys the whole DOM just to emit the error. (I forget the text, but it's the same text, always. Always centered. Flash of page, then crash.)
I have a custom extension that fakes the cookie storage for those JS pages: it just lies and says "yeah, cookies are enabled" and then blackholes the writes. But it fails for anything that needs a real cookie … like Anubis.
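The shim itself is tiny; the idea is roughly this (a simplified sketch, not the actual extension code):

```ts
// Content script: pretend cookies work, but keep the jar empty.
(() => {
  Object.defineProperty(document, "cookie", {
    get: () => "",                         // always report an empty jar
    set: (_value: string) => { /* blackhole the write */ },
    configurable: true,                    // so real cookies can be re-enabled per site
  });
  Object.defineProperty(navigator, "cookieEnabled", { get: () => true });
})();
```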
I'm empathetic towards where Anubis is coming from, though. But the "I passed the challenge" cookie is indistinguishable from a tracker … although most people running Anubis are probably inherently trustworthy by a sort of cultural association, so long as Anubis remains non-mainstream. I think I might modify the extension to allow storing cookies for a short time frame (like 1h) in some cases, such as Anubis; that's enough to pass the challenge, while limiting the tracking exposure. I'm usually only blocked by Anubis for something like a blog post, so that should suffice.
Anubis has become an annoying denial-of-service layer in front of sites that I would otherwise use. I hope its no-script mode gets enabled by default soon.
I'm not sure what generation it is, but I bought it around a decade ago I think.
Javascripters, perhaps. Those who work on schedulers, or on kernels in general, would find this completely normal.
I always found it annoying that CPU information was widely available and precise while memory information was not - it's clamped to 0.25, 0.5, 1, 2, 4 or 8 GB. If you're running something memory-bound in the browser you have to be really conservative to avoid locking up the user's device (or ask them to manually specify how much memory to use). https://developer.mozilla.org/en-US/docs/Web/API/Device_Memo...
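For example, this is about all the API gives you to plan around (the 25% budget fraction here is an arbitrary conservative guess, not a recommendation):

```ts
// navigator.deviceMemory is bucketed (0.25-8 GiB) and missing in some browsers.
const reportedGiB: number = (navigator as any).deviceMemory ?? 1;
// The real machine may have far more than 8 GiB, and other tabs compete for it,
// so budget only a fraction of the clamped figure for memory-bound work.
const memoryBudgetBytes = Math.floor(reportedGiB * 0.25 * 1024 ** 3);
console.log(`Assuming at most ${memoryBudgetBytes} bytes are safe to use`);
```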
> ... a challenge method that requires the client to do a single round of SHA-256 hashing deeply nested into a Preact hook in order to prove that the client is running JavaScript.
Why a single round? Doing the whole proof of work challenge inside the proof of react would be even more effective, right?
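For reference, a single round on the client is about this much work (a guess at the shape of it using Web Crypto; the challenge string and response encoding are made up, not Anubis's actual format):

```ts
// One SHA-256 digest: enough to prove a JS engine ran, with no nonce search at all.
async function answerChallenge(challenge: string): Promise<string> {
  const bytes = new TextEncoder().encode(challenge);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}
```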
However, it's much worse than that. There's a bug in libuv that I think is fixed on master but may not have shipped yet, where they fall back to the cgroup cpuset if the returned value is less than 1 (presumably they meant != 0). So if you're running on a Docker or Kubernetes host with 8 cores and you try to give one container half a CPU, it will believe it has access to 8 CPUs, not 0.5. And now you're task switching constantly.
I checked the value of navigator.hardwareConcurrency on my phone and it returns 9... I guess that explains it.
It looks like setting light performance mode in device optimisations (I don't game on my phone) turns off the S24's sole Cortex-X4.
I'd immediately look into what happens for odd numbers, rounding, implicit type conversions etc. Or at least that's what I was taught when I first started programming.
Also, relying on "well, we know that X is always Y" is almost always a mistake; maybe not at first, but definitely in the future, because X will almost certainly not be Y at some point. Defensive coding would catch such issues (with at the very least an assert somewhere to ensure X is indeed Y before continuing, so that we get a nice error when that assumption proves to be wrong).
As covered in TFA, they didn't think about the situation of having an odd number of cores, apart from 1, which they covered with the min().
If it's possible to keep it strict, keep it strict. If another solution holds, even if it seems like a band-aid, it will be better than relaxing input rules on something like this.
I am not specifically an expert in the security of these types of systems, but this is the general rule for such issues if you look at security from a general stance.
As for trunc, I am not sure how expensive it is as an operation, but it seems like a good way to sanitize the input. That way you can still 'detect' and reject invalid inputs (floats). Handling floats is very different from handling integers (with things like NaN, infinity, etc.), so if you want to allow floats, that's a whole new area you need to test, rather than simply rejecting the invalid input.
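As a concrete illustration of the strict option (the clamp bounds and the leave-one-core-free choice are arbitrary examples here, not what Anubis actually does):

```ts
// Reject anything that isn't a sane integer core count instead of coercing it.
function usableCores(reported: unknown): number {
  if (typeof reported !== "number" || !Number.isInteger(reported) || reported < 1) {
    throw new Error(`unexpected core count: ${String(reported)}`);
  }
  return Math.max(1, Math.min(reported - 1, 8)); // leave one core free, cap at 8
}

// e.g. usableCores(navigator.hardwareConcurrency)
// The lenient alternative would be to apply Math.trunc(reported) before the checks.
```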
Don't do that. Also, both the software and its author are now on my blacklist.
Performance is the most important feature at some level. If an unauthenticated request from the public internet ties up your web server's CPU for more than a few hundred microseconds, you are eventually going to get screwed no matter how you slice it.
Systems that use more electricity for no reason other than to use more electricity and raise bills should be illegal, imho. That's what Anubis does, and it blows my mind that people think PoW is a good idea.
Imagine you bought a TV and there was a device in the TV that did /nothing/ but use more electricity, just to discourage people from having too many TVs in their house.
Pretending like these solutions don't exist seems to be the central prerequisite for engaging in much of the related conversation space.
Yeah, I'm imagining it and my imagination tells me it's trivial, especially next to the thick layer of web framework sludge people smear over everything.
Do you have an argument that involves a number?
ranger_danger•3d ago
Why?
What would the alternative have been?
tux3•3d ago
The first effect is great, because it's a lot more annoying to bring up a full browser environment in your scraper than just run a curl command.
But the actual proof of work only takes about 10ms on a server in native code, while it can take multiple seconds on a low-end phone. Given that the companies in question are building entire data centers to house all their GPUs, an extra 10ms per web page is not a problem for them. They're going to spend orders of magnitude more compute actually training on the content they scraped than solving the challenge.
It's mostly the inconvenience of adapting to Anubis's JS requirements that held them back for a while, but the PoW difficulty mostly slowed down real users.
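For a sense of scale, the work in question is just a loop like this (a rough sketch of an Anubis-style leading-zero-bits search; the exact encoding and difficulty rules are assumptions, not the real implementation):

```ts
import { createHash } from "node:crypto";

// Find a nonce whose SHA-256(challenge + nonce) starts with enough zero bits.
function solve(challenge: string, difficultyBits: number): number {
  for (let nonce = 0; ; nonce++) {
    const hash = createHash("sha256").update(`${challenge}${nonce}`).digest();
    if (leadingZeroBits(hash) >= difficultyBits) return nonce;
  }
}

function leadingZeroBits(hash: Buffer): number {
  let bits = 0;
  for (const byte of hash) {
    if (byte === 0) { bits += 8; continue; }
    return bits + Math.clz32(byte) - 24; // clz32 counts over 32 bits, bytes use 8
  }
  return bits;
}
```

A server core grinds through that search in native-speed milliseconds, while a low-end phone doing the same thing in JS is where the multi-second stalls come from.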
mhh__•3d ago
You can even get a curl fork that drives a browser under the hood.
MBCook•3d ago
https://github.com/TecharoHQ/anubis/pull/1038
Could someone explain how this would help stop scrapers? If you’re just running the page JS wouldn’t this run too and let you through?
ranger_danger•3d ago
> how this would help stop scrapers
I think anubis bases its purpose on some flawed assumptions:
- that most scrapers aren't headless browsers
- that they don't have access to millions of different IPs across the world from big/shady proxy companies
- that this can help with a real network-level DDoS
- that scrapers will give up if the requests become 'too expensive'
- that they aren't contributing to warming the planet
I'm sure there do exist some older bots that are not smart and don't use headless browsers, but especially with newer tech/AI crawlers/etc., I don't think this is a realistic majority assumption anymore.
jsnell•3d ago
In an adversarial engineering domain, neither the problems nor the solutions are static. If by some miracle you have a perfect solution at one point in time, the adversaries will quickly adapt, and your solution stops being perfect.
So you’ll mostly be playing the game in this shifting gray area of maybe legit, maybe abusive cases. Since you can’t perfectly classify them (if you could, they wouldn’t be in the gray area), the options are basically to either block all of them, allow all of them, or issue them a challenge that the user must pass to be allowed. The first two options tend to be unacceptable in the gray area, so issuing a challenge that the client must pass is usually the preferred option.
A good counter-abuse challenge is something that has at least one of the following properties:
1. It costs more to pass than the economic value that the adversary can extract from the service, but not so much that the legitimate users won’t be willing to pay it.
2. It proves control of a scarce resource without necessarily having to spend that resource, but at least in such a way that the same scarce resource can’t be used to pass unlimited challenges.
3. It produces additional signals that can be used to meaningfully improve the precision/recall tradeoff.
And proof of work has none of those properties. The last two fail by construction, since compute is about the most fungible resource in the world. The first doesn't work since it's impossible to balance the difficulty factor such that it imposes a cost the attacker would notice while remaining acceptable to legitimate users.
If you add 10s to the latency for your worst-case real users (already too long), it'll cost about $0.01/1k solves. That's not a deterrent to any kind of abuse.
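(Rough numbers behind that estimate, with assumed inputs: a challenge tuned to roughly 10s on a slow phone might cost a server on the order of a second of CPU, and an on-demand cloud vCPU runs around $0.04/hour.)

```ts
const serverSecondsPerSolve = 1;   // assumption: fast native/server solve time
const dollarsPerVcpuHour = 0.04;   // assumption: rough on-demand cloud price
const solves = 1000;
const dollars = (solves * serverSecondsPerSolve / 3600) * dollarsPerVcpuHour;
console.log(dollars.toFixed(3));   // ~0.011, i.e. about a cent per 1k solves
```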
So proof of work just is a really bad fit for this specific use case. The only advantage is that it is easy to implement, but that's a very short term benefit.
tptacek•3d ago
What the Anubis POW system is doing right now is exploiting the fact that there's been no need for crawlers to be anything but naive. But the cost to make them sophisticated enough to defeat the POW system is quite low, and when that happens, the POW will just be annoying legit users for no benefit.
I don't know if "mistake" is the word I'd use for it. It's not a whole lot of code! It's a reasonable first step to force crawlers to emulate a tiny fraction of a real browser. But as it evolves, it should evolve away from burning compute, because that's playing to lose.
reisse•3d ago
However the exact PoW implementation (hash) chosen by Anubis might significantly reduce this asymmetry, because the calculation speed is highly dependent on hardware.
tptacek•3d ago
Tavis Ormandy went into more detail on the math here, but it's not great!
tptacek•3d ago
(1) there's a sharp asymmetry between adversaries and legitimate users (as with password hashes and KDFs, or anti-abuse systems where the marginal adversarial request has value ~reciprocal to what a legit user gets, as with brute-forcing IDs)
(2) the POW serves as a kind of synchronization clock in a distributed system (as with blockchains)
What's case (3) here?
robocat•3d ago
If it costs them $1000 to grab a web page but they earn $1001 then they will do that again and again to earn that buck.
swiftcoder•3d ago
Unfortunately for the user on a low-end phone, the overhead can be several seconds. For the scraper it's only ever 10ms because that's running on a (relatively) powerful server CPU.
ChocolateGod•2d ago
Say a user browses 10 sites, all restricted by Anubis instances that add 5 seconds to the load time; that's 50 additional seconds the user spends waiting. A scraper with enterprise-grade server hardware? That's 5 seconds for all 10 sites.