I realize Anubis was probably never tested on a true single-core machine. They are actually somewhat difficult to find these days outside of microcontrollers.
Also, they still might not have (though they've probably learned by now). In this article they imply that each type of CPU core (what they call a "tier" in the article) will still come in a power-of-two count, and one just happened to be 2^0. I'm not sure they were around when the AMD Athlon II X3 was hot.
>>> Today I learned this was possible. This was a total "today I learned" moment. I didn't actually think that hardware vendors shipped processors with an odd number of cores, however if you look at the core geometry of the Pixel 8 Pro, it has three tiers of processor cores. I guess every assumption that developers have about CPU design is probably wrong.
Yeah that's obviously not true, and believing it shows a marked lack of experience in the field. Of the current Xeon workstation lineup, only 3 of 14 SKUs have power-of-2 core counts. And there are consumer lines of CPUs with 6 cores and that sort of thing.
I never thought about it before, but I actually had to look up die shots to make sure they were not the same processor. And if I can trust the internet, they are not. Hell, I had to confirm that yes, the PlayStation 3 (also PPC, cue X-Files theme) only had the one core and its screwball subprocessors, like I remembered.
To me, the executive-level idea was to spare a core for system and minor peripheral tasks - almost all game consoles before that generation had single-CPU architectures with no resident operating system. Transitioning away from manually coordinated, single-threaded code to an asynchronous multithreaded programming model was a challenge in itself, and having to deal with it while an operating system forced on them by the console manufacturer took away control might have been too much for developers.
(Sega Saturn had 2x SH-2; the Dreamcast was to have WinCE on ROM, but that was cancelled. Sega being Sega.)
Another joke from the same era: having a 2-core processor means that you can now, e.g., watch a film at the same time. At the same time as what? At the same time as running Windows Vista!
2^0 = 1
So the logic might make sense in people's heads if they never encounter the 6- or 12-core CPUs that are common these days.
I have Chrome on mobile configured so that JS and cookies are disabled by default, and then I enable them per site based on my judgement. You might be surprised to learn that this normally works fine, and sites are usually better for it: they stop nagging and load faster. This makes some sense in retrospect, as this is what allows search engine crawlers to do their thing and get that SEO score going.
Anubis (and Cloudflare, for that matter) forces me to temporarily enable JS and cookies at least once anyway, completely defeating the purpose of my paranoid settings. I basically never bother to, but I do admit it is annoying. It's kind of up there with sites that have no content at all without JS (high-profile example: AWS docs). At least Cloudflare only spoils the fun every now and then. With Anubis, it's always.
It's definitely my fault, but at the same time, I don't feel this is right. Simple static pages now require allowing arbitrary code execution and statefulness. (Although I do recognize that SVGs and fonts also kind of do so anyhow, much to my further annoyance).
Making you pay time, power, bandwidth, or money to access content does not significantly impede your browsing, so long as the cost is appropriately small. For the user above reporting thirty seconds of maxed-out CPU, that's excessive for a median normal person (but we hackers are not that).
If giving your unique burned-in crypto-attested device ID is acceptable, there’s an entire standard for that, and when your device is found to misbehave, your device can be banned. Nintendo, Sony, Xbox call this a “console ban”; it’s quite effective because it’s stunningly expensive to replace a device.
If submitting proof of citizenship through whatever attestation protocol is palatable, then Anubis could simply add the digital ID web standard and let users skip the proof of work in exchange for affirming that they have a valid digital ID. But this only works if your specific identity can be banned; otherwise AI crawlers will just send a valid anonymized digital ID header.
This problem repeats in every suggested outcome: either you make it more difficult for users to access a site, or you require users to waste energy to access a site, or you require identifiable information signed by a dependable third-party authority to be presented such that a ban is possible based on it. IP addresses don’t satisfy this; Apple IDs, trusted-vendor HSM-protected device identifiers, and digital passports do satisfy this.
If you have a solution that only presents barriers to excessive use and allows abusive traffic to be revoked without depending on IP address, browser fingerprint, or paid/state credentials, then you can make billions of dollars in twelve months.
Ideas welcome! This has been a problem since bots started scraping RSS feeds and republishing them as SEO blogs, and we still don’t have a solution besides Cloudflare and/or CPU-burning interstitials.
(ps. I do have a solution for this, but it would require physical builds, be mildly unprofitable over time with no growth potential, and incite government hostility towards privacy-preserving identity systems. A billionaire philanthropist could build it in a year and completely solve this problem. Sigh.)
This might seem contradictory, but I believe it is technically possible; what I don't think is that this is how these solutions actually work currently. The idea would be to prove that I am indeed a unique visitor who is a person according to the govt, without revealing the personal info to the site or the site info to the govt, even if they collude.
Same with the whole 18+ goof. I'd actually quite like to try age-gated communities, like ±5 years of my age. I feel a lot of conflict stems from people coming from a bit too different walks of life sometimes. Could even do high-confidence location-based gating this way, which could also be cool (as well as the exact opposite of cool, because of course).
I'm not super well-versed in crypto though, so I confess this is a lot more conjecture than knowledge.
The only way I can imagine this working is:
1. You go to the government and request to have a digital ID generated.
2. The government generates a random number.
3. The government issues a request to an NGO to generate a new cryptographic object based on the random number, and receives back a retrieval number.
4. The government gives you the retrieval number, which you can use to get your digital ID from the NGO.
This way, the government only has the mapping between your identity and a random number, and the NGO only has the mapping between the random number and the generated object, with no possibility to deanonymize it because you don't present any ID to get it. Obviously, there must be no information exchange between the government and the NGO.
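To make that concrete, here is a toy sketch (invented names, nothing standard, not a real protocol) of what each party ends up storing:

```ts
import { randomUUID } from "node:crypto";

// Government side: knows who you are, but never sees the credential.
const govRecords = new Map<string, string>(); // real identity -> random number
function govRequest(realIdentity: string): string {
  const randomNumber = randomUUID();
  govRecords.set(realIdentity, randomNumber);
  return randomNumber; // forwarded to the NGO
}

// NGO side: mints the credential, but never learns your identity.
const ngoRecords = new Map<string, string>(); // retrieval number -> credential
function ngoIssue(randomNumber: string): string {
  const retrievalNumber = randomUUID();
  ngoRecords.set(retrievalNumber, `credential-for-${randomNumber}`); // placeholder object
  return retrievalNumber; // handed back to you by the government
}
```

Neither table alone links a person to a credential; that link only appears if the two databases are joined, which is exactly the information exchange that has to be forbidden.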
The construction would go basically like this:
pseudonym = VRF(secret_key + site_id)
The expectation is that you would have only one valid secret_key at any time, and it would be unknown to the government. This kind of scheme is called anonymous credential generation in the literature, I believe. It can be established that the secret_key is govt-backed, but that's it.
The site_id would be e.g. the domain cert's public key or similar (domain ownership is a moving target, so just the domain name imo is not sound).
VRF is a verifiable random function. This is the magic ZK part.
Pseudonym is what you present to the site, i.e. the identity you go by.
This way the site can verify that this pseudonym was specifically issued for it (making it site unique), and that it belongs to a govt certified identity (of which there should be only one issued at a time per person). The VRF is deterministic, guaranteeing that it's the same person every time.
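For what it's worth, a minimal sketch of the derivation, using HMAC-SHA256 as a stand-in for the VRF; a real scheme would need an actual verifiable random function (e.g. RFC 9381) plus an anonymous credential proving the key is govt-backed, so this only shows the determinism and the per-site separation:

```ts
import { createHmac } from "node:crypto";

// siteId: something stable for the site, e.g. a hash of its cert public key.
function derivePseudonym(secretKey: Buffer, siteId: string): string {
  // Deterministic: the same person always presents the same pseudonym to a
  // given site, while pseudonyms for different sites stay unlinkable to
  // anyone who doesn't know secretKey.
  return createHmac("sha256", secretKey).update(siteId).digest("hex");
}
```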
Revocation is annoying so I didn't bother thinking that through but should be fairly okay I think?
I believe this is robust to people forging arbitrary IDs, to sites colluding with each other in deanonymization, and colluding with the govt in the same. The only kickers I can think of are secret_key misuse (e.g. via duress) / theft / loss / sharing, and the trust anchor (the govt) being untrustworthy (forging invalid or duplicate identities). Would also need to handle people dying, but that would be pretty much just revocation.
I consider trust anchor issues out of scope. The remainder doesn't sound too bad to try defending for, and I think is also basically out of scope.
Potentially important edit: I'm not accounting for timing side channels here, which might be relevant during revocation or else.
Another: I didn't mention it, but in my humble opinion cryptographically attesting people is unsound. People can't calculate crypto in their heads and can't recall long arbitrary strings of hex. What is appropriate to attest (if anything) is their devices instead. But that's a layer of complication I didn't want to deal with here.
Why, though? If you're the only one who knows it, nothing prevents you from creating as many identities for the same site as you wish.
Person W is welcome to have thousands of unique IDs if they want to, so long as when site X bans identity Y, that ban is applied to all of Person W's present and future identities. Whether W has a single Y or a thousand Ys makes no difference to me. I suppose some sites will care to restrict participation to a single Y per W, but when, e.g., generally browsing a site with crawler/bot/AI shielding such as Anubis today, it's completely irrelevant to them what your Y is, so long as rate limits and bans apply to all Ys of W rather than to the presented Y alone.
It’s not difficult to solve this problem — the database schema and queries are dead simple! — it’s just exceedingly difficult to succeed if you're not a passport-issuing entity or an authorized monopoly of such.
Scratch that, this happens all the time. With a third party there's no way to revoke; with a government, you can usually handle this physically.
In the model I described, the trust anchor would be the govt, so basically a centralized model like domain certs. This resolves the issues you list off, but brings others: what if the trust anchor isn't trustworthy and starts forging identities?
The alternative to that would then be web of trust stuff. But this is why I consider this to be a separate problem. If the core protocol could be laid out and standardized at least, then layering on another that makes this choice between centralized vs web of trust could be done separately.
There are different metrics for cost, however. Based on CPU utilization and/or time, it's hard to argue that Anubis is a high price.
But if it is important to you to not run javascript for whatever reason, the price of access to a site using Anubis is rather high.
You put stuff on the public Internet, expect it to be read by everyone.
Don't like that? Put it behind a login.
How did the propaganda persuade people into accepting mass surveillance and normalising the invasion of privacy for something that was never really a problem?
I suspect the ones pushing this are the ones running these "crawlers" themselves, much like Cloudflare hosts providers of DDoS services.
What a coincidence that "identity verification" became a hot topic recently.
Don't even try to solve this nonexistent "problem", because that will only drive us further into authoritarian technocracy.
Crying “Conspiracy” in reply to a career Chicken Little is comedic. I’ve been raising warnings about identity verification looming on the horizon for perhaps fifteen years now; thanks to DejaNews for that early realization, I suppose.
> Don't even try to solve this nonexistent "problem", because that will only drive us further into authoritarian technocracy.
I would celebrate and tell all my friends if someone on this thread, on any thread, would explain how we solve this without bankrupting non-business site operators and without a third-party authority. Anubis is a band-aid at best, yet no better solution — not even an idea — is presented alongside your objections.
> You put stuff on the public Internet, expect it to be read by everyone.
My hobbyist forum can barely stay online eight hours a day due to crawler traffic. Someone scraped the entire site last year by spawning one request per page with no fork limit. It was down for a solid week after that, and now has very severe limits in place. I don't know how they can afford to stay running, but certainly "static only" isn't going to solve the CPU and bandwidth costs incurred by incompetent and redundant AI crawlers. So, by making their site public on today's infested internet, they've ended up with content that is no longer accessible.
> Don't like that? Put it behind a login.
As I noted above, one solution is payment, since free credential registration is not an obstacle to AI bots, after all. For some reason people don't like to charge money for hobbyist content if they can avoid it. I recognize why, and I'm trying my best to discover a non-monetary solution on their behalf.
> I suspect the ones pushing this are the ones running these "crawlers" themselves, much like Cloudflare hosts providers of DDoS services.
I do not and have not run crawlers or AI agents, trainers, or other such shit at any time in the past thirty years, and I will continue to abstain from the entire category, which should be quite easy as I'm a retired sysop now attending full-time accounting school and giving a finger to the entire industry to pursue work that benefits humanity. It's the same reason I bother telling HN "the anonymity sky is falling" every so often: I'd much prefer it if we didn't have to sacrifice anonymity online to defeat scraper bots.
> No, no, no, hell fucking no!!
Please find a way to turn your vehemence and passion into a productive contribution, before it’s too late for all of us. As presented, your argument is neither supported nor persuasive, and your hostility only gives opponents of anonymity more arrows in their quiver to shoot at us.
Authentication works, doesn't it?
Free credentials temporarily do, but only until someone teaches crawlers to AI-automate signups. Forum spammers already figured this out a long time ago.
A while back, someone from the Financial Times commented on HN about how confusing it was that we were all so hostile to their free-registration article paywall. The HN community view on their registration requirement suggests that free authentication does not work and has not for some time; has that viewpoint changed in recent years?
Forum spamming is a solved problem, isn't it? I mean, do you see any spam here on HN?
> The HN community view on their registration requirement suggests that free authentication does not work and has not for some time; has that viewpoint changed in recent years?
I don't think there is such a thing as a "HN community view". Are there meetings to agree on a consensus on important topics? Or were you trying to pull an appeal to authority?
Yes, I’ve reported two or three this past week or so by emailing the mods. Do you browse /new often? What do you think makes HN a spam-free place?
> I don't think there is such a thing as a "HN community view".
Noted.
Where I work, our main product is a React-based web site with a JSON back end. You might go to
http://example.com/web/item/88841
and that will load maybe 20 MB of stuff (always the same thing), and eventually, after the JS boots up, a useEffect() gets called that reads '88841' out of the URL and does a GET to
http://example.com/api/item/88841
which gets you nicely formatted JSON. On top of that, the public id(s) are sequential integers, so you could easily enumerate all the items if you thought about it a little.
We've had more than one obnoxious crawler, which we had reason to believe was targeted specifically at us, that would go to the /web/ URL and, without a cache, download all the HTML, JavaScript, and CSS, then run the JS and download the JSON for each page -- at which point they are either saving the generated HTML or looking at the DOM. If they'd spent 10 minutes playing with the browser dev tools they would have seen the /item/ request and probably could have figured out pretty quickly how to interpret the results; that would have saved them 95% of the bandwidth, 95% of the CPU, and whatever time they spent writing parsing code and managing their Rube Goldberg machine. As is, they have to figure out how to parse that HTML and turn it into something like the JSON, and I'd take 50% odds any day that they never actually did anything with the data they captured, because crawlers usually don't.
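For the curious, this is roughly all it would have taken (a sketch using the illustrative example.com URLs from above; the politeness delay is an arbitrary choice):

```ts
// Enumerate the sequential IDs against the JSON endpoint instead of
// rendering the React shell for every page.
async function crawlItems(maxId: number): Promise<void> {
  for (let id = 1; id <= maxId; id++) {
    const res = await fetch(`http://example.com/api/item/${id}`);
    if (!res.ok) continue;                         // gaps, deleted items, etc.
    const item = await res.json();                 // the nicely formatted JSON
    console.log(id, item);
    await new Promise((r) => setTimeout(r, 250));  // arbitrary politeness delay
  }
}
```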
I know because I've done more than my share of web crawling, and I have crawlers that can capture plain HTTP data, run JavaScript in a limited way, and run React apps. The last one would blast right past Anubis without any trouble, except for the rate limiting, which is not much of a problem because when I crawl, I hit fast, I hit hard, and I crawl once. [1] (There's a running gag in my pod that I can't visit the state of Delaware because of my webcrawling.)
[1] Ok, sometimes the way you avoid trouble is to hit slow and hit soft, but still hit once. It's a judgement call whether you can hit them before they know what hit them or whether you should blend in with the rest of the traffic.
I have no problem with bots scraping all my data, I have a problem with poorly-coded bots overloading my server, making it unusable for anybody else. I'm using Anubis on the web interface to an SVN server, so if the bots actually wanted the data, they could just run "svn co" instead of trying to scrape the history pages for 300k files.
> It seems like a whole lot of crap to me. Hostile webcrawlers, not to mention Google, frequently run Javascript these days.
I'm also rather unhappy that I had to deploy Anubis, but it's unfortunately the only thing that seemed to work, and the server load was getting so bad that the alternative was just disabling the SVN web interface altogether.
Incidentally, I read a short while ago that not having "Mozilla" in your user-agent will bypass Anubis, so give that a try.
My options are using a custom Chrome build, migrating to Firefox, or proxying my traffic and making edits that way (e.g. doing the Anubis PoW on the proxy and injecting the required cookie).
Not stoked about any of these, although Firefox is a lot on my mind these days, and option #3 would be a good excuse to dust off my RPi.
I refuse to be boiled slowly by Google. With MV3, it was full-fat ad blockers. With MV4 it could very well be ALL ad-blockers.
And yeah, I concede that sounds conspiratorial - as conspiratorial as Google cracking down on your ability to run the ad-blocker of your choice would've sounded a decade ago.
My God, there's two of us!
(Though … you're being privacy conscious on Chrome? Come to Firefox. Ignore the pesky "it's funded by Google" problems, nothing to see, nothing to see, the water is fiiiine.)
> You might be surprised to learn that normally, this actually works fine
I guess I have a different experience there. A huge number of sites just outright crash (e.g., the HN search). JavaScript devs, I've learned, do not handle error cases, and the exceptions tend to just propagate out and ruin the rendering. There seems to be some popular framework out there that even destroys the whole DOM just to emit the error. (I forget the text, but it's the same text, always. Always centered. Flash of page, then crash.)
I have a custom extension that fakes the cookie storage for those JS pages: it just lies and says "yeah, cookies are enabled" and then blackholes the writes. But it fails for anything that needs a real cookie … like Anubis.
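The shim itself is tiny; the idea is roughly this (a simplified sketch, not the actual extension code):

```ts
// Content script: pretend cookies work, but keep the jar empty.
(() => {
  Object.defineProperty(document, "cookie", {
    get: () => "",                         // always report an empty jar
    set: (_value: string) => { /* blackhole the write */ },
    configurable: true,                    // so real cookies can be re-enabled per site
  });
  Object.defineProperty(navigator, "cookieEnabled", { get: () => true });
})();
```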
I'm empathetic towards where Anubis is coming from, though. But the "I passed the challenge" cookie is indistinguishable from a tracker … although most people running Anubis are probably inherently trustworthy by a sort of cultural association, so long as Anubis remains non-mainstream. I think I might modify the extension to allow storing cookies for a short time frame (like 1h) in some cases, such as Anubis; that's enough to pass the challenge, while limiting the tracking exposure. I'm usually only blocked by Anubis for something like a blog post, so that should suffice.
Anubis has become an annoying denial-of-service layer in front of sites that I would otherwise use. I hope its no-script mode gets enabled by default soon.
I'm not sure what generation it is, but I bought it around a decade ago I think.
Javascripters, perhaps. Those who work on schedulers, or on kernels in general, would find this completely normal.
I always found it annoying that CPU information was widely available and precise while memory information was not - it's clamped to 0.25, 0.5, 1, 2, 4 or 8 GB. If you're running something memory-bound in the browser you have to be really conservative to avoid locking up the user's device (or ask them to manually specify how much memory to use). https://developer.mozilla.org/en-US/docs/Web/API/Device_Memo...
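For example, this is about all the API gives you to plan around (the 25% budget fraction here is an arbitrary conservative guess, not a recommendation):

```ts
// navigator.deviceMemory is bucketed (0.25-8 GiB) and missing in some browsers.
const reportedGiB: number = (navigator as any).deviceMemory ?? 1;
// The real machine may have far more than 8 GiB, and other tabs compete for it,
// so budget only a fraction of the clamped figure for memory-bound work.
const memoryBudgetBytes = Math.floor(reportedGiB * 0.25 * 1024 ** 3);
console.log(`Assuming at most ${memoryBudgetBytes} bytes are safe to use`);
```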
> ... a challenge method that requires the client to do a single round of SHA-256 hashing deeply nested into a Preact hook in order to prove that the client is running JavaScript.
Why a single round? Doing the whole proof of work challenge inside the proof of react would be even more effective, right?
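For reference, a single round on the client is about this much work (a guess at the shape of it using Web Crypto; the challenge string and response encoding are made up, not Anubis's actual format):

```ts
// One SHA-256 digest: enough to prove a JS engine ran, with no nonce search at all.
async function answerChallenge(challenge: string): Promise<string> {
  const bytes = new TextEncoder().encode(challenge);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}
```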
However, it's much worse than that. There's a bug in libuv that I think is fixed on master but may not have shipped yet, where they fall back to the cgroup cpuset if the returned value is less than 1 (presumably they meant != 0). So if you're running on a Docker or Kubernetes host with 8 cores and you try to give one container half a CPU, it will believe it has access to 8 CPUs, not 0.5. And now you're task switching constantly.
I checked the value of navigator.hardwareConcurrency on my phone and it returns 9... I guess that explains it.
It looks like setting light performance mode in device optimisations (I don't game on my phone) turns off the S24's sole Cortex-X4.
I'd immediately look into what happens for odd numbers, rounding, implicit type conversions etc. Or at least that's what I was taught when I first started programming.
Also, relying on "well, we know that X is always Y" is almost always a mistake; maybe not at first, but definitely in the future, because X will almost certainly not be Y at some point. Defensive coding would catch such issues (with at the very least an assert somewhere to ensure X is indeed Y before continuing, so that we get a nice error when that assumption proves to be wrong).
As covered in TFA, they didn't think about the situation of having an odd number of cores, apart from 1, which they covered with the min().
If it's possible to keep it strict, keep it strict. If another solution holds, even if it seems like a band-aid, it will be better than relaxing input rules on something like this.
I am not specifically an expert in the security of these types of systems, but this is the general rule for such issues if you look at security from a general stance.
As for trunc, I am not sure how expensive it is as an operation, but it seems like a good way to sanitize the input. That way you can still 'detect' and reject invalid inputs (floats). Handling floats is very different from handling integers (with things like NaN, infinity, etc.), so if you want to allow floats, that's a whole new area you need to test, rather than simply rejecting the invalid input.
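As a concrete illustration of the strict option (the clamp bounds and the leave-one-core-free choice are arbitrary examples here, not what Anubis actually does):

```ts
// Reject anything that isn't a sane integer core count instead of coercing it.
function usableCores(reported: unknown): number {
  if (typeof reported !== "number" || !Number.isInteger(reported) || reported < 1) {
    throw new Error(`unexpected core count: ${String(reported)}`);
  }
  return Math.max(1, Math.min(reported - 1, 8)); // leave one core free, cap at 8
}

// e.g. usableCores(navigator.hardwareConcurrency)
// The lenient alternative would be to apply Math.trunc(reported) before the checks.
```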
Don't do that. Also, both the software and its author are now on my blacklist.
Performance is the most important feature at some level. If an unauthenticated request from the public internet ties up your web server's CPU for more than a few hundred microseconds, you are eventually going to get screwed no matter how you slice it.
Systems that use more electricity for no reason other than to use more electricity and raise bills should be illegal, imho. That's what Anubis does, and it blows my mind that people think PoW is a good idea.
Imagine you bought a TV and there was a device in the TV that did /nothing/ but use more electricity, just to discourage people from having too many TVs in their house.
Pretending like these solutions don't exist seems to be the central prerequisite for engaging in much of the related conversation space.
Yeah, I'm imagining it and my imagination tells me it's trivial, especially next to the thick layer of web framework sludge people smear over everything.
Do you have an argument that involves a number?
ranger_danger•3d ago
Why?
What would the alternative have been?
tux3•3d ago
The first effect is great, because it's a lot more annoying to bring up a full browser environment in your scraper than just run a curl command.
But the actual proof of work only takes about 10ms on a server in native code, while it can take multiple seconds on a low-end phone. Given that the companies in question are building entire data centers to house all their GPUs, an extra 10ms per web page is not a problem for them. They're going to spend orders of magnitude more compute actually training on the content they scraped than solving the challenge.
It's mostly the inconvenience of adapting to Anubis's JS requirements that held them back for a while, but the PoW difficulty mostly slowed down real users.
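For a sense of scale, the work in question is just a loop like this (a rough sketch of an Anubis-style leading-zero-bits search; the exact encoding and difficulty rules are assumptions, not the real implementation):

```ts
import { createHash } from "node:crypto";

// Find a nonce whose SHA-256(challenge + nonce) starts with enough zero bits.
function solve(challenge: string, difficultyBits: number): number {
  for (let nonce = 0; ; nonce++) {
    const hash = createHash("sha256").update(`${challenge}${nonce}`).digest();
    if (leadingZeroBits(hash) >= difficultyBits) return nonce;
  }
}

function leadingZeroBits(hash: Buffer): number {
  let bits = 0;
  for (const byte of hash) {
    if (byte === 0) { bits += 8; continue; }
    return bits + Math.clz32(byte) - 24; // clz32 counts over 32 bits, bytes use 8
  }
  return bits;
}
```

A server core grinds through that search in native-speed milliseconds, while a low-end phone doing the same thing in JS is where the multi-second stalls come from.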
mhh__•3d ago
You can even get a curl fork that drives a browser under the hood.
MBCook•3d ago
https://github.com/TecharoHQ/anubis/pull/1038
Could someone explain how this would help stop scrapers? If you’re just running the page JS wouldn’t this run too and let you through?
ranger_danger•3d ago
> how this would help stop scrapers
I think anubis bases its purpose on some flawed assumptions:
- that most scrapers aren't headless browsers
- that they don't have access to millions of different IPs across the world from big/shady proxy companies
- that this can help with a real network-level DDoS
- that scrapers will give up if the requests become 'too expensive'
- that they aren't contributing to warming the planet
I'm sure there do exist some older bots that are not smart and don't use headless browsers, but especially with newer tech/AI crawlers/etc., I don't think this is a realistic majority assumption anymore.
jsnell•3d ago
In an adversarial engineering domain, neither the problems nor the solutions are static. If by some miracle you have a perfect solution at one point in time, the adversaries will quickly adapt, and your solution stops being perfect.
So you’ll mostly be playing the game in this shifting gray area of maybe legit, maybe abusive cases. Since you can’t perfectly classify them (if you could, they wouldn’t be in the gray area), the options are basically to either block all of them, allow all of them, or issue them a challenge that the user must pass to be allowed. The first two options tend to be unacceptable in the gray area, so issuing a challenge that the client must pass is usually the preferred option.
A good counter-abuse challenge is something that has at least one of the following properties:
1. It costs more to pass than the economic value that the adversary can extract from the service, but not so much that the legitimate users won’t be willing to pay it.
2. It proves control of a scarce resource without necessarily having to spend that resource, but at least in such a way that the same scarce resource can’t be used to pass unlimited challenges.
3. It produces additional signals that can be used to meaningfully improve the precision/recall tradeoff.
And proof of work has none of those properties. The last two fail by construction, since compute is about the most fungible resource in the world. The first doesn't work since it's impossible to balance the difficulty factor such that it imposes a cost the attacker would notice while remaining acceptable to legitimate users.
If you add 10s to the latency for your worst-case real users (already too long), it'll cost about $0.01/1k solves. That's not a deterrent to any kind of abuse.
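(Rough numbers behind that estimate, with assumed inputs: a challenge tuned to roughly 10s on a slow phone might cost a server on the order of a second of CPU, and an on-demand cloud vCPU runs around $0.04/hour.)

```ts
const serverSecondsPerSolve = 1;   // assumption: fast native/server solve time
const dollarsPerVcpuHour = 0.04;   // assumption: rough on-demand cloud price
const solves = 1000;
const dollars = (solves * serverSecondsPerSolve / 3600) * dollarsPerVcpuHour;
console.log(dollars.toFixed(3));   // ~0.011, i.e. about a cent per 1k solves
```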
So proof of work just is a really bad fit for this specific use case. The only advantage is that it is easy to implement, but that's a very short term benefit.
tptacek•3d ago
What the Anubis POW system is doing right now is exploiting the fact that there's been no need for crawlers to be anything but naive. But the cost to make them sophisticated enough to defeat the POW system is quite low, and when that happens, the POW will just be annoying legit users for no benefit.
I don't know if "mistake" is the word I'd use for it. It's not a whole lot of code! It's a reasonable first step to force crawlers to emulate a tiny fraction of a real browser. But as it evolves, it should evolve away from burning compute, because that's playing to lose.
reisse•3d ago
However the exact PoW implementation (hash) chosen by Anubis might significantly reduce this asymmetry, because the calculation speed is highly dependent on hardware.
tptacek•3d ago
Tavis Ormandy went into more detail on the math here, but it's not great!
tptacek•3d ago
(1) there's a sharp asymmetry between adversaries and legitimate users (as with password hashes and KDFs, or anti-abuse systems where the marginal adversarial request has value ~reciprocal to what a legit user gets, as with brute-forcing IDs)
(2) the POW serves as a kind of synchronization clock in a distributed system (as with blockchains)
What's case (3) here?
robocat•3d ago
If it costs them $1000 to grab a web page but they earn $1001 then they will do that again and again to earn that buck.
swiftcoder•3d ago
Unfortunately for the user on a low-end phone, the overhead can be several seconds. For the scraper it's only ever 10ms because that's running on a (relatively) powerful server CPU.
ChocolateGod•2d ago
Say a user browses 10 sites, all restricted by Anubis instances that add 5 seconds to the load time; that's 50 additional seconds the user spends waiting. A scraper with enterprise-grade server hardware? That's 5 seconds for all 10 sites.