Simple limits on runtime keep crypto mining from being too big of a problem.
This is why Google developed a browser: it turns out scraping the web requires you to pretty much build a V8 engine, so why not publish it as a browser?
I reckon they made the browser to control the browser market.
The reason Servo existed (when it was still in Mozilla's care) was how deeply spaghettified Gecko's code (sans IonMonkey) was, with the plan of replacing Gecko's components with Servo's.
Firefox's automation systems are now miles better, but that's literally the combination of years of work to modularize Gecko, the partial replacement of Gecko's parts with Servo's (like Stylo: https://hacks.mozilla.org/2017/08/inside-a-super-fast-css-en...), and actively building the APIs despite the still-spaghettified mess.
If it's true that V8 was used internally for Google's scraper before they even thought about Chrome, then the decision makes obvious sense. The other factor is the bureaucracy and difficulty of getting an open source project to refactor its entire code base around your own in-house engine. Google had the money and resources to pay the best in the business to work on Chrome.
The idea is to abuse the abusers: if you suspect it's a bot, change the PoW from a GPU/machine/die fingerprint computation to something like a few ticks of Monero or whatever the crypto of choice is this week.
Sounds useless, but multiply 0.5s of that across their farm of 1e4 scraping nodes and you're onto something.
The catch is not getting caught out by impacting the 0.1% of Tor-running, anti-ad "users" out there who will try to decompile your code when their personal Chrome build fails to work. I say "users" because they will be visiting a non-free site espousing their perceived right to be there, no different from a bot to someone paying the bills.
Bots can either risk being turned into crypto miners, or risk not grabbing free data to train AIs on.
If they put in a limit, you've won. You just make your site be above that limit, and the problem is gone.
Chrome has the nice property that you can kill a render process for a tab and often it just takes that tab down, leaving everything else running fine. This, plus a warning, provides minimal user impact while ensuring resources for all.
In the past we experimented with cgroups (both versions) and other mechanisms for limiting, but found dynamic monitoring to be the most reliable.
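To give a flavor of the dynamic-monitoring approach, here's a crude sketch (not our actual setup; psutil, the process-name matching, and the thresholds are all assumptions):

    import time
    import psutil  # assumed dependency

    CPU_LIMIT = 90.0       # percent of one core a renderer may burn
    GRACE_SECONDS = 120    # how long it may stay that hot before we kill it

    hot_since = {}         # pid -> first time we saw it over the limit

    while True:
        for proc in psutil.process_iter(["name", "cmdline"]):
            try:
                cmdline = " ".join(proc.info["cmdline"] or [])
                # Chrome runs one renderer process per tab, so killing one
                # usually only takes that tab down.
                if proc.info["name"] != "chrome" or "--type=renderer" not in cmdline:
                    continue
                if proc.cpu_percent(interval=None) > CPU_LIMIT:
                    hot_since.setdefault(proc.pid, time.time())
                    if time.time() - hot_since[proc.pid] > GRACE_SECONDS:
                        proc.kill()
                        hot_since.pop(proc.pid, None)
                else:
                    hot_since.pop(proc.pid, None)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        time.sleep(5)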
I have been developing a public service, and I intend to use a simple implementation of proof-of-work in it, made to work with a single call without needing back-and-forth information from the server for each request.
It enforces a pattern in which a client must do the PoW on every request (a rough sketch of the single-call shape I mean follows after the list below).
Other difficulties, uncovered in our PoC, were:
Not all clients are equal: this punishes an old mobile phone or Raspberry Pi much more than a client that runs on a beefy server with GPUs, or clients that run on compromised hardware. I.e. real users are likely punished, while illegitimate users are often punished the least.
Not all endpoints are equal: we experimented with higher difficulties for e.g. POST/PUT/PATCH/DELETE over GET, and with different difficulties for different endpoints, attempting to match how expensive a call would be for us. That requires back-and-forth to exchange difficulties.
It discourages proper HATEOAS or REST, where a client browses through the API by following links, and encourages calls that "just include as much as possible in one query", diminishing our ability to cache, to be flexible, and to leverage good HTTP practices.
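For reference, a minimal sketch of the single-call shape described above (illustrative only: it assumes a static per-method difficulty table shared with clients, which is exactly the trade-off against exchanging difficulties per endpoint, and it omits replay protection):

    import hashlib
    import time

    # Leading-zero-bit difficulty per method; shared with clients ahead of time.
    DIFFICULTY = {"GET": 12, "POST": 18, "PUT": 18, "PATCH": 18, "DELETE": 18}
    MAX_SKEW = 300  # seconds a stamp stays acceptable

    def leading_zero_bits(digest):
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
                continue
            bits += 8 - byte.bit_length()
            break
        return bits

    def solve(method, path, stamp):
        """Client side: burn CPU until the hash meets the endpoint's difficulty."""
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{method}:{path}:{stamp}:{nonce}".encode()).digest()
            if leading_zero_bits(digest) >= DIFFICULTY[method]:
                return nonce
            nonce += 1

    def verify(method, path, stamp, nonce):
        """Server side: one hash and a clock check, no challenge round-trip."""
        if abs(time.time() - stamp) > MAX_SKEW:
            return False
        digest = hashlib.sha256(f"{method}:{path}:{stamp}:{nonce}".encode()).digest()
        return leading_zero_bits(digest) >= DIFFICULTY.get(method, 18)

    # Client usage: nonce = solve("POST", "/items", int(time.time())), then send
    # the stamp and nonce along with the request, e.g. in headers.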
It's not clear whether the author means LLM scrapers in the sense of scrapers that gather training data for foundation models, LLM scrapers that browse the web to provide up-to-date answers, or vibe coders and agents that use browsers at the behest of the programmer or the user.
But in none of those cases can I imagine compromised machines being relevant. If we are talking about compromised machines, it's irrelevant whether an LLM is involved and how; it's a distributed attack completely unrelated to LLMs.
I like to scrape websites and make alternative, personalized frontends for them. Captchas are really painful for me. Proof of work would be painful for a massive scraping operation, but I wouldn't have an issue with spending some CPU time to get the latest posts from a site which doesn't have an RSS feed or an API.
Sounds like an OK solution to a shitty problem that has a bunch of other shitty solutions.
I feel sorry for people with budget phones who now have to battle with these PoW systems and think LLM scrapers will win this one, with everyone else suffering a worse browsing experience.
Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.
If you're visiting loads of different websites, that does suck, but most people won't be affected all that much in practice.
There are alternatives, of course. Several attempts at standardising remote attestation have been made. Apple included remote attestation in Safari years ago. Basically, Apple/Google/Cloudflare give each user a limited set of "tokens" after verifying that they're a real person on real hardware (using TPMs and whatnot), and you exchange those tokens for website visits. Every user gets a load of usable tokens, but bots quickly run out and get denied access. For this approach to work, that means locking out Linux users, people with secure boot disabled, and things like outdated or rooted phones, but in return you don't get PoW walls or Cloudflare CAPTCHAs.
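A grossly simplified sketch of that token accounting (not the actual Privacy Pass / Private Access Tokens protocol, which uses blind signatures so the issuer can't link issuance to redemption; the shared key and batch size here are made up):

    import hashlib
    import hmac
    import secrets

    ISSUER_KEY = secrets.token_bytes(32)  # toy model: issuer and site share this key
    spent = set()                         # per-site single-use bookkeeping

    def issue_batch(n=20):
        """Issuer side: hand out n single-use tokens after attesting the device/person."""
        batch = []
        for _ in range(n):
            token_id = secrets.token_hex(16)
            tag = hmac.new(ISSUER_KEY, token_id.encode(), hashlib.sha256).hexdigest()
            batch.append((token_id, tag))
        return batch

    def redeem(token_id, tag):
        """Site side: accept each valid token exactly once; bots burn through theirs."""
        expected = hmac.new(ISSUER_KEY, token_id.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, tag) or token_id in spent:
            return False
        spent.add(token_id)
        return True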
In the end, LLM scrapers are why we can't have nice things. The web will only get worse now that these bots are on the loose.
This isn't very difficult to change.
> but the way Anubis works, you will only get the PoW test once.
Not if it's on multiple sites; I see the weeb girl picture (why?) so much it's embedded into my brain at this point.
As far as I know the creator of Anubis didn't anticipate such widespread use, and the anime girl image is the default. Some sites have personalized it, like sourcehut.
So you can pay the developers for the professional version where you can easily change the image. It's a great way of funding the work.
Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire. Technical details aside, if a site becomes a worthy target, a scraping operation running on billions of dollars will easily bypass any restrictions thrown at it, be that cookies, PoW, JS, wasm, etc. Being able to access multiple sites by bypassing a single method is just a bonus.
Ultimately, I don't believe this is an issue that can be solved by technical means; any such attempt will solely result in continuous UX degradation for humans in the long term. (Well, it is already happening.) But of course, expecting any sort of regulation on the manna of the 2020s is just as naive... if anything, this just fits the ideology that the WWW is obsolete, and that replacing it with synthetic garbage should be humanity's highest priority.
The reason why Anubis was created was that the author's public Gitea instance was using a ton of compute because poorly written LLM scraper bots were scraping its web interface, making the server generate a ton of diffs, blames, etc. If the AI companies work around proof-of-work blocks by not constantly scraping the same pages over and over, or by detecting that a given site is a Git host and cloning the repo instead of scraping the web interface, I think that means proof-of-work has won. It provides an incentive for the AI companies to scrape more efficiently by raising their cost to load a given page.
AFAIK, Anubis does not work alone, it works together with traditional per-IP-address rate limiting; its cookies are bound to the requesting IP address. If the scraper uses a new IP address for each request, it cannot reuse the cookies; if it uses the same IP address to be able to reuse the cookies, it will be restricted by the rate limiting.
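Something like this, presumably (not Anubis's actual code, just the binding idea: the cookie's signature covers the client IP and an expiry, so it's useless from any other address):

    import hashlib
    import hmac
    import time

    SECRET = b"server-side secret"  # placeholder

    def make_cookie(client_ip, ttl=7 * 24 * 3600):
        """Issued after the PoW challenge is solved."""
        expiry = int(time.time()) + ttl
        tag = hmac.new(SECRET, f"{client_ip}|{expiry}".encode(), hashlib.sha256).hexdigest()
        return f"{expiry}|{tag}"

    def check_cookie(cookie, client_ip):
        """Rejects expired cookies and cookies presented from a different IP."""
        try:
            expiry_str, tag = cookie.split("|", 1)
            expiry = int(expiry_str)
        except ValueError:
            return False
        if time.time() > expiry:
            return False
        expected = hmac.new(SECRET, f"{client_ip}|{expiry}".encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, tag)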
Actually I will get it zero times because I refuse to enable javashit for sites that shouldn't need it and move on to something run by someone competent.
There's lots of ways to define "shouldn't" in this case
- Shouldn't need it, but include it to track you
- Shouldn't need it, but include it to enhance the page
- Shouldn't need it, but include it to keep their costs down (for example, by loading parts of the page dynamically / per person and caching the rest of the page)
- Shouldn't need it, but include it because it helps stop the bots that are costing them more than the site could reasonably be expected to make
I get it, JS can be used in a bad way, and you don't like it. But the pillar of righteousness that you seem to envision yourself standing on is not as profound as you seem to think it is.
It pains me to say this, but I think that differentiating humans from bots on the web is a lost cause. Proof of work is just another way to burn more coal on every web request, and the LLM oligarchs will happily burn more coal if it reduces competition from upstart LLMs.
Sam Altman's goal is to turn the Internet into an unmitigated LLM training network, and to get humans to stop using traditional browsing altogether, interacting solely via the LLM device Jony Ive is making for him.
Based on the current trajectory, I think he might get his way, if only because the web is so enshittified that we eventually won't have another way to reach mainstream media other than via LLMs.
So the existence of Anubis will mean even more incentive for scraping.
Ah, but this isn't doing that. All this is doing is raising friction. Taking web pages from 0.00000001 cents per load to 0.001 cents, at scale, is a huge shift for people who just want to slurp up the world, yet for most human users the cost is lost in the noise.
All this really does is bring the costs into some sort of alignment. Right now it is too cheap to access web pages that may be expensive to generate. Maybe the page has a lot of nontrivial calculations to run. Maybe the server is just overwhelmed by the sheer size of the scraping swarm and the resulting asymmetry of a huge corporation on one side and a $5/month server on the other. A proof-of-work system doesn't change the server's costs much but now if you want to scrape the entire site you're going to have to pay. You may not have to pay the site owner, but you will have to pay.
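Rough numbers, just to make the asymmetry concrete (all figures illustrative, taken from the per-page costs above):

    # Per-page cost, converted from cents to dollars.
    per_page_old = 1e-8 / 100   # 0.00000001 cents
    per_page_new = 1e-3 / 100   # 0.001 cents

    pages_scraped = 1_000_000_000   # a crawler slurping a billion pages
    pages_human = 100 * 365         # a heavy human user: 100 pages/day for a year

    print(pages_scraped * per_page_old)  # ~$0.10   -- scraping the web is basically free today
    print(pages_scraped * per_page_new)  # ~$10,000 -- suddenly a real line item
    print(pages_human * per_page_new)    # ~$0.37/year -- lost in the noise for a person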
If you want to prevent bots from accessing a page that it really wants to access, that's another problem. But, that really is a different problem. The problem this solves is people using small amounts of resources to wholesale scrape entire sites that take a lot of resources to provide, and if implemented at scale, would pretty much solve that problem.
It's not a perfect solution, but no such thing is on the table anyhow. "Raising friction" doesn't mean that bots can't get past it. But it will mean they're going to have to be much more selective about what they do. Even the biggest server farms need to think twice about suddenly dedicating hundreds of times more resources to just doing proof-of-work.
It's an interesting economic problem... the web's relationship to search engines has been fraying slowly but surely for decades now. Widespread deployment of this sort of technology is potentially a doom scenario for them, as well as AI. Is AI the harbinger of the scrapers extracting so much from the web that the web finally finds it economically efficient to strike back and try to normalize the relationship?
If you're going to needlessly waste my CPU cycles, please at least do some mining and donate it to charity.
Seems like it would be possible to split the compute up.
FAQ: https://foldingathome.org/faq/running-foldinghome/
What if I turn off my computer? Does the client save its work (i.e. checkpoint)?
> Periodically, the core writes data to your hard disk so that if you stop the client, it can resume processing that WU from some point other than the very beginning. With the Tinker core, this happens at the end of every frame. With the Gromacs core, these checkpoints can happen almost anywhere and they are not tied to the data recorded in the results. Initially, this was set to every 1% of a WU (like 100 frames in Tinker) and then a timed checkpoint was added every 15 minutes, so that on a slow machine, you never lose more that 15 minutes work.
> Starting in the 4.x version of the client, you can set the 15 minute default to another value (3-30 minutes).
caveat: I have no idea how much data "1 frame" is.
The problem is that this is going to be all overhead. If you sit down and calmly work out the real numbers, trying to distribute computations to a whole bunch of consumer-grade devices, where you can probably only use one core for maybe two seconds at a time a few times an hour, you end up with it being cheaper to just run the computation yourself. My home gaming PC gets 16 CPU-hours per hour, or 57,600 CPU-seconds. (Maybe less if you want to deduct a hyperthreading penalty but it doesn't change the numbers that much.) Call it 15,000 people needing to run 3-ish of these 2-second computations, plus coordination costs, plus serving whatever data goes with the computation, plus infrastructure for tracking all that and presumably serving, plus if you're doing something non-trivial a quite non-trivial portion of that "2 seconds" I'm shaving off for doing work will be wasted setting it up and then throwing it away. The math just doesn't work very well. Flat-out malware trying to do this on the web never really worked out all that well, adding the constraint of doing it politely and in such small pieces doesn't work.
And that's ignoring things like you need to be able to prove-the-work for very small chunks. Basically not a practically solvable problem, barring a real stroke of genius somewhere.
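Spelling out the raw arithmetic (before any of the coordination, serving, and setup overhead mentioned above, which is what pushes the headcount toward that 15,000 figure):

    cores = 16
    pc_cpu_seconds_per_hour = cores * 3600   # 57,600 CPU-seconds/hour from one desktop

    seconds_per_chunk = 2          # what you can politely take from a visitor at a time
    chunks_per_hour = 3            # a few page views an hour
    per_visitor = seconds_per_chunk * chunks_per_hour   # 6 CPU-seconds/visitor/hour

    print(pc_cpu_seconds_per_hour / per_visitor)  # ~9,600 concurrent visitors just to match
                                                  # one gaming PC, before any overhead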
I dunno. How much work do you really need in PoW systems to make the scrapers go after easier targets? My guess is not so much that you impair a human's UX. And if you do, then you have not fine-tuned your PoW algo, or you have very determined adversaries / scrapers.
As has been stated multiple times in this thread and basically any thread involving conversation on the topic, a PoW with a negligible cost (either of time/money/pain-in-the-ass factor) will not impact end users, but will affect LLM scrapers due to the scales involved.
The problem is trying to create a PoW that actually fits that model, is economical to implement, and can't easily be gamed.
But saying "any" seems to imply that it's a theoretical impossibility ("any machine that moves will encounter friction and lose energy to heat conversion, ergo perpetual motion machines are impossible"), when in fact it's a theoretical possibility, just not yet a practical reality.
(Aside: the licenses and distribution advocated by many of the same demographic (the information-wants-to-be-free folks, JSTOR protestors, GPL zealots) that now opposes LLMs using that content.)
I'm sure GPL zealots would be happier about this situation if LLM vendors abided by the spirit of the license by releasing their models under GPL after ingesting GPL data, but we all know that isn't happening.
Would be hilarious to trap scraper bots into endless labyrinths of LLM-generated mediawiki pages, getting them to mine hashes with each progressive article.
At least then we would be making money off these rude bots.
Then it's the bots who are making money from work they need to do for the captchas.
2. There is no money in mining on the kind of hardware scrapers will run on. Power costs more than they'd earn.
Also, the ideal Monero miner is a power-efficient CPU (so probably in-order). There are no Monero ASICs by design.
A fake quote:
> Please add the reward and fees to: 187e6128f96thep00laddr3s9827a4c629b8723d07809
And if you make a fake block that changes the address, the hash changes, so the fake block is not a valid one.
This avoids the problem of people stealing from pools, and also of bad actors listening for newly mined blocks, pretending they found them, and sending a fake one.
Wouldn't it be easier to mine crypto themselves at that point? Seems like a very roundabout way to go about mining crypto.
Although, admittedly, millions of sites already ruined themselves with Cloudflare without that incentive.
You can surely expect that static content will stay static and won't run jpegoptim on an image on every hit (a dynamic CMS + a sudden visit from ByteDance = your server is DDoSed), but you can't expect everyone on this planet to set up a multi-country edge-caching architecture for a small website just in case some blog post gets a few million visits every ten minutes. That can easily take down a server even for static content.
I concur that Anubis is a poor solution, and yet here we are: even the UN is using it to weather the flood of requests.
If a service had enough concurrent clients to reliably hit useful results quickly, you could verify that most did the work by checking if a hit was found, and either let everyone in or block everyone according to that; but then you're relying on the large majority being honest for your service to work at all, and some dishonest clients would still slip through.
If someone dusted off the same tools and managed to get Altman to buy them a nice car from it, good on them :)
The bots keep coming back too, ignoring HTTP status codes, permanent redirects, and whatever else I can think of to tell them to fuck off. Robots.txt obviously doesn't help. Filtering traffic from data centers didn't help either, because soon after I did that, residential IPs started doing the same thing. I don't know if this is a Chinese ISP abusing their IP ranges or if China just has a massive botnet problem, but either way the traditional ways to get rid of these bots haven't helped.
In the end, I'm now blocking all of China and Singapore. That stops the endless flow of bullshit requests for now, though I see some familiar user agents appearing in other east Asian countries as well.
I just checked, since someone was talking about scraping in IRC earlier. Facebook is sending me about 3 requests per second. I blocked their user-agent. Someone with a Googlebot user-agent is doing the same stupid scraping pattern, and I'm not blocking it. Someone else is sending a request every 5 seconds with
One thing that's interesting on the current web is that sites are expected to make themselves scrapeable. It's supposed to be my job to organize the site in such a way that scrapers don't try to scrape every combination of commit and file path.
Scraping content for an LLM is not a hyper time sensitive thing. You don't need to scrape every page every day. Sam Altman does not need a synchronous replica of the internet to achieve his goals.
If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.
Yes, there are sites being DDoSed by scrapers for LLMs.
> If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.
This isn't about one request per week or per month. There were reports from many sites that they're being hit by scrapers that request from many different IP addresses, one request each.
Not sure why you believe massive repeated scraping isn't a problem. It's not like there is just one single actor out there, and ignoring robots.txt seems to be the norm nowadays.
It is very real and the reason why Anubis has been created in the first place. It is not plain hostility towards LLMs, it is *first and foremost* a DDoS protection against their scrapers.
https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
There are already tens of thousands of scrapers right now trying to get even more training data.
It will only get worse. We all want more training data. I want more training data. You want more training data.
We all want the most up to date data there is. So, yeah, it will only get worse as time goes on.
That's a valid reason why serving JS-based PoW systems scares LLM operators: there's a chance the code might actually be malicious.
That's not a valid reason to serve JS-based PoW systems to human users: the entire reason those proofs work against LLMs is the threat that the code is malicious.
In other words, PoW works against LLM scrapers not because of PoW, but because they could contain malicious code. Why would you threaten your users with that?
And if you can apply the threat only to LLMs, then why don't you cut the PoW garbage and start with that instead?
I know, it's because it's not so easy. So instead of wielding the Damocles sword of malware, why not standardize on some PoW algorithm that people can honestly apply without the risks?
Your users - we, browsing the web - are already threatened with this. Adding a PoW changes nothing here.
My browser already has several layers of protection in place. My browser even allows me to improve this protection with addons (ublock etc) and my OSes add even more protection to this. This is enough to allow PoW-thats-legit but block malicious code.
And if a site pulls something like that on me, then I just don't take their data. Joke is on them, soon if something is not visible to AI it will not 'exist', like it is now when you are delisted from Google.
PoW anti-scraper tools are a good first step, but why don't we just jump straight to the endgame? We're drawing closer to a point where information's value is actually fully realized -- people will stop sharing knowledge for free. It doesn't have to be that way, but it does in a world where people are pressed for economic means; knowledge becomes an obvious thing to convert to capital and attempt to extract rent on.
The simple way this happens is just a login wall -- for every website. It doesn't have to be a paid login wall of course (at first), but it's a super simple way to both legally and practically protect from scrapers.
I think high quality knowledge, source code (which is basically executable knowledge), being open in general is a miracle/luxury of functioning, economically balanced societies where people feel driven (for many possible reasons) to give back, or have time to think of more than surviving.
Don't get me wrong -- the doomer angle is almost always wrong -- every year humanity is almost always better off than we were the previous year on many important metrics, but it's getting harder to see a world where we cartwheel through another technological transformation that this time could possibly impact large percentages of the working population.
I read the GP comment as suggesting we push on the second option there rather than passively waiting for the first option.
For people that value anonymity, they'll create their own spaces. People that value openness will continue to be open.
What we're about to find out is what happens when the tide goes out and people show you what they really believe/want -- anything other than that is a form of social control, whether via browbeating or other means.
Hardly anything of what's the internet today was promised, but who are you to decide what the internet has to become now, and that people with different ideas need to confine themselves in their own ghettos?
Everyone values privacy; it's just that most give up so much of it out of social pressure.
> What we're about to find out is what happens when the tide goes out and people show you what they really believe/want -- anything other than that is a form of social control, whether via browbeating or other means
No idea of what you're talking about there
You'll find CAPTCHAs almost everywhere, outright 403s or dropped connections in a lot of places. Even Google won't serve you sometimes.
The reason you're not seeing that situation right now is that your IP address is identifiable.
What cannot possibly be anonymous is a login with a verified identity.
My focus was more on the areas outside the large walled gardens -- they might become a bunch of smaller... fenced backyards, to put it nicely.
Yeah. People over-estimate the flashy threats from AI, but to me the more significant threat is killing the open exchange of knowledge and more generally the open, trusting society by flooding it with agents which are happy to press "defect" on the prisoner's dilemma.
> being open in general is a miracle/luxury of functioning, economically balanced societies where people feel driven (for many possible reasons) to give back, or have time to think of more than surviving
"High trust society". Something that took the West a very long time to construct through social practices, was hugely beneficial for economic growth, but is vulnerable to defectors. Think of it like a rainforest: a resource which can be burned down to increase quarterly profit.
I don't think societies are open/trusting by default -- it takes work and a lot of anti-intuitive thinking, sustained over long periods of time.
> "High trust society". Something that took the West a very long time to construct through social practices, was hugely beneficial for economic growth, but is vulnerable to defectors. Think of it like a rainforest: a resource which can be burned down to increase quarterly profit.
I think the trust is downstream of the safety (and importantly "economic safety", if we can call it that). Everyone trusts more when they're not feeling threatened. People "defect" from cultures that don't work for them -- people leave the culture they like and go to another one usually because of some manifestation of danger.
Maybe the dream of knowledge being free and open was always doomed to fail; if knowledge has value, and people are encouraged to spend more of their time and energy to create it rather than other kinds of work, they will have to be compensated in order to do it increasingly well.
It's kinda sad though, if you grew up in a world where you could actually discover stuff organically and through search.
It gets annoying when you have the right to scrape something - either because the owner of the data gave you the OK or because it is openly licensed. But then the webmaster can't be bothered to relax the rate limiter for you, and nobody can give you a nice API. Now people are putting their Open Educational Resources, their open source software, even their freaking essays about openness that they want the world to read behind Anubis. It makes me shake my head.
I understand perfectly it is annoying when badly written bots hammer your site. But maybe then HTTP and those bots are the problem. Maybe we should make it easier for site owners to push their content somewhere where we can scrape it easier?
To be frank: it’s not your content, it’s theirs, and it doesn’t matter if you like it or not, they can decide what they want to do with it, you’re not entitled to it. Yes there are some cases that you personally have permission to scrape, or the license explicitly permits it, but this isn’t the norm.
The bigger issue isn’t that people don’t want their content to be read; it’s that they want it to be read and consumed by a human in most cases, and they want their server resources (network bandwidth, CPU, etc.) to be used in a manageable way. If these bots were written to be respectful, then maybe we wouldn’t be in this situation. These bots poisoned the well, and they affect respectful bots because of their actions.
But if one is found, it will pave the way for a very dangerous counter-attack: Browser vendors with need for data (i.e. Google) simply using the vast fleet of installed browsers to do this scraping for them. Chrome, Safari, Edge, sending the pages you visit to their data-centers.
I also think this is the endgame of things like Recall in windows. Steal the training data right off your PC, no need to wait for the sucker to upload it to the web first.
Maybe because while ad tech these days is no less shady than crypto mining, the concept of ads is something people understand. Most people don't really understand crypto so it gets lumped in with "hackers" and "viruses".
Alternatively, for those who do understand ad tech and crypto, crypto mining still subjectively feels (to me at least) more like you're being stolen from than ads. Same with Anubis, wasting power on PoW "feels" more acceptable to me than mining crypto. One of those quirks of the human psyche I guess.
Advertising is theft of attention which is extremely limited in supply. I'd even say it's mind rape. They forcibly insert their brands and trademarks into our minds without our consent. They deliberately ignore and circumvent any and all attempts to resist. It's all "justified" though, business interests excuse everything.
(1): Attention from any given person is fundamentally limited. Said attention has an inherent value.
(2): Running *any* website costs money, doubly so for video playback. This is not even mentioning the moderation & copyright mechanisms that a video sharing platform like YouTube has to have in order to keep copyright lawsuits away from YouTube itself.
(3): Products do not spawn in with their presence known to the general population. For the product to be successful, people have to know it exists in the first place.
Advertising is the consequence of wanting attention to be drawn to (3), and being willing to pay for said attention on a given platform (1). (2)'s costs, alongside any payouts to videographers whose videos garner that attention, can be paid for with the money in (1), by placing ads around/before the video itself.
You're allowed to not have advertising shown to you, but in exchange, the money to pay for (2) & the people who made the video have to come from somewhere.
Yes, and it belongs to us. It's not theirs to sell to the highest bidder.
> Running any website costs money, doubly so for video playback.
> Products do not spawn in with their presence known to the general population.
Not our problem. Business needs do not excuse it. Let all those so called innovators find a way to make it without an attention economy. Let them go bankrupt if they can't.
If you decide that low end devices are a worthy sacrifice then you're creating e-waste. Not to mention the energy burden.
Unsure how that would work. If the proof you generate could be used for blockchain operations, so that the website operator could be paid by using that proof as generated by the website visitor, why shouldn't the visitor keep that proof to themselves and use it instead? Then they'd get the full amount, and the website operator gets nothing. So then there is no point for it, and the visitor might as well just run a miner locally :)
Mining on behalf of the site owner negates the need for a transaction entirely.
It kinda worked, except for the fact that hackers would try to “cryptojack” random websites by hacking them and inserting Coinhive’s miner into their pages. This caused everyone to block Coinhive’s servers. (Also you wouldn’t get very much money out of it - even the cryptojackers who managed to get tens of millions of page views out of hacked websites reported they only made ~$40 from the operation)
Then again, I'm sure there's quite a bit of tweaking that could be done to make clients submit far more hashes, but that would make it much more noticeable.
When you mine a block, you’re basically bundling up a bunch of meaningful data, and then trying to append some padding data that will e.g. result in a hash that has N leading 0 bits. One of the pieces of meaningful data in the block is “who gets the reward?”
If you’re mining alone, you would put data on that block that says “me” as who gets the reward. But if you’re mining for a pool, you get a block that already says “the pool” for who gets the reward.
So then I’m guessing the pool gives you a lesser work factor to hit, some value smaller than N? You’ll basically be saying “Well, here’s a block that doesn’t have N leading zeroes, but does have M leading zeroes”, and that proves how much you’re working for the pool, and entitles you to a proportion of the winnings.
If you changed the “who gets the reward?” from “the pool” to “me”, that would change the hash. So you can’t come in after the fact, say “Look at that! N leading zeroes! Let me just swap myself in to get the reward…” because that would result in an invalid block. And if you put yourself as the reward line in advance, the pool just won’t give you credit for your “partial” answers.
Is that about right?
https://krebsonsecurity.com/2018/03/who-and-what-is-coinhive...
https://www.troyhunt.com/i-now-own-the-coinhive-domain-heres...
In theory, you could watch the transactions being broadcast, and guess (+confirm) the corresponding block, but that would require you to see all the transactions the pool owner did, and put them in the same order (the possibilities of which scale exponentially with the number of transactions). There may be some other randomness you can insert into a block too -- someone else might know this.
Edit: oops, I forgot: the block also contains the address that the fees should be sent to. So even if you "stole" your solution and broadcast it with the block, the fee is still going to the pool owner. That's a bigger deal.
Because they can't, if the website operator designs their JavaScript correctly. In detail:
If Alice goes to Bob's website, Bob tells Alice to find a hash with 20 leading zeros for a Bitcoin block that says "Send the newly printed bitcoins and transaction fees to Bob." (It will take Alice ~2^20 guesses to find such a block, so Bob picked the number 20 such that those ~2^20 guesses happen in a couple seconds for normal humans with normal web browsers on normal devices.)
Supposing the actual Bitcoin blockchain needs a hash with 50 leading zeros, one in every 2^30 Alices will mine a valid block (worth ~$300k at current Bitcoin prices).
If Alice finds a block with 50 leading zeros and then tries to change the block to say "Send the newly printed bitcoins and transaction fees to Alice," her new block will have a different hash (that is very unlikely to have 50 leading zeros), and neither the website nor the blockchain will accept it.
Sure, Alice could change the block at the beginning before starting the search. But if she finds a block with 20 leading zeros that says "Send the newly printed bitcoins and transaction fees to Alice," Bob won't accept it for access to his website. The only way Alice gets anything for a block that sends the coins to herself is from the Bitcoin blockchain, if she finds a 50-leading-zero block -- at that point, Alice is just mining Bitcoins for herself and not interacting with Bob's website at all.
(If a 1 in a billion chance to win $300k is too risky for Bob's liking, he can get a lower payout with a higher probability by using a different proof-of-work blockchain and/or a mining pool.)
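A toy version of that binding, for the curious (SHA-256 over a string standing in for a block; real Bitcoin block encoding and difficulty targets look nothing like this, the point is only that the payee sits inside the hashed data):

    import hashlib

    def leading_zero_bits(digest):
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
                continue
            bits += 8 - byte.bit_length()
            break
        return bits

    def mine(payee, difficulty_bits):
        """Find a nonce so the hash of a block paying `payee` meets the target."""
        nonce = 0
        while True:
            block = f"send reward and fees to {payee}|nonce={nonce}".encode()
            digest = hashlib.sha256(block).digest()
            if leading_zero_bits(digest) >= difficulty_bits:
                return nonce, digest
            nonce += 1

    nonce, digest = mine("Bob", 20)  # ~2^20 guesses: a couple of seconds on a normal device

    # Swapping the payee after the fact changes the hash entirely, so the found
    # nonce almost certainly no longer meets the target.
    tampered = hashlib.sha256(f"send reward and fees to Alice|nonce={nonce}".encode()).digest()
    print(leading_zero_bits(digest))    # >= 20: Bob accepts this for site access
    print(leading_zero_bits(tampered))  # almost certainly < 20: worthless to everyone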
Also virus scanners and corporate networks would hate you, because hackers are probably trying to embed whatever library you’re using into other unsuspecting sites.
...and the requisite checklist: https://trog.qgl.org/20081217/the-why-your-anti-spam-idea-wo...
This would make the entire internet a maze of AI-slop content primarily made for other bots to consume. Humans may have to resort to emailing handwritten PDFs to avoid the thoroughly enshittified web.
At the moment, ad networks don't pay for bot impressions when detected - so content farms tend to optimize for what passes for humans. All bets are off if human and bot visitors offer the same economic value via miners, or worse if it turns out that bots are more profitable due to human impatience.
Imagine an internet optimized for bot visitors, and indifferent to humans. It would be a kind of refined brainrot aimed at a brainless audience.
(A secondary thing is that AI bots have basically zero benefit for most websites, so unless you are some low-cost crappy content farm, it'll be in your interest to raise the prices to the max so the bots are simply locked out. Which will bring us back to point 1, bots ignoring the headers)
Also robots.txt is a suggestion but hashcash is enforced server side. I agree it's a tragedy people have started to completely ignore it but you can't ignore server side behavior.
1. Banner ads made more money. This stopped being true a while ago, it's why newspapers all have annoying soft paywalls now.
2. People didn't have payment rails set up for e-commerce back then. Largely fixed now, at least for adults in the US.
3. Transactions have fixed processing costs that make anything <$1 too cheap to transact. Fixed with batching (e.g. buy $5 of credit and spend it over time).
4. Having to approve each micropurchase imposes a fixed mental transaction cost that outweighs the actual cost of the individual item. Difficult to solve ethically.
With the exception of, arguably[0], Patreon, all of these hurdles proved fatal to microtransactions as a means to sell web content. Games are an exception, but they solved the problem of mental transaction costs by drowning it in intensely unethical dark patterns protected by shittons of DRM[1]. You basically have to make someone press the spend button without thinking.
The way these proof-of-work systems are currently implemented, you're effectively taking away the buy button and just charging someone the moment they hit the page. This is ethically dubious, at least as ethically dubious as 'data caps[2]' in terms of how much affordance you give the user to manage their spending: none.
Furthermore, if we use a proof-of-work system that's shared with an actual cryptocurrency, so as to actually get payment from these hashes, then we have a new problem: ASICs. Cryptocurrencies have to be secured by a globally agreed-upon hash function, and changing that global consensus to a new hash function is very difficult. And those hashes have economic value. So it makes lots of sense to go build custom hardware just to crack hashes faster and claim more of the inflation schedule and on-chain fees.
If ASICs exist for a given hash function, then proof-of-work fails at both:
- Being an antispam system, since spammers will have better hardware than legitimate users[3]
- Being a billing system, since legitimate users won't be able to mine enough crypto to pay any economically viable amount of money
If you don't insist on using proof-of-work as billing, and only as antispam, then you can invent whatever tortured mess of a hash function is incompatible with commonly available mining ASICs. And since they don't have to be globally agreed-upon, everyone can use a different, incompatible hash function.
"Don't roll your own crypto" is usually good security advice, but in this case, we're not doing security, we're doing DRM. The same fundamental constants of computing that make stopping you from copying a movie off Netflix a fool's errand also make stopping scrapers theoretically impossible. The only reason why DRM works is because of the gap between theory and practice: technically unsophisticated actors can be stopped by theoretically dubious usages of cryptography. And boy howdy are LLM scrapers unsophisticated. But using the tried-and-true solutions means they don't have to be: they can just grab off-the-shelf solutions for cracking hashes and break whatever you use.
[0] At least until Apple cracked Patreon's kneecaps and made them drop support for any billing mode Apple's shitty commerce system couldn't handle.
[1] At the very least, you can't sell microtransaction items in games without criminalizing cheat devices that had previously been perfectly legal for offline use. Half the shit you sell in a cash shop is just what used to be a GameShark code.
[2] To be clear, the units in which Internet connections are sold should be kbps, not GB/mo. Every connection already has a bandwidth limit, so what ISPs are doing when they sell you a plan with a data cap is a bait and switch. Two caps means the lower cap is actually a link utilization cap, hidden behind a math problem.
[3] A similar problem has arisen in e-mail, where spammy domains have perfect DKIM/SPF, while good senders tend to not care about e-mail bureaucracy and thus look worse to antispam systems.
Once there is ANY value exchanged, the user immediately wonders if it is worth it -- and if the payment/token/whatever is sent prior to the pageload, they have no way of knowing.
I dunno. Brands and other quality signals (imperfect as they tend to be, they still aren’t completely useless) could develop.
Long-form articles could have a back cover summary too, or an enticing intro... and some substack paid articles do that already: they let you read an intro and cut before going in the interesting details.
But for short newspapers articles it becomes harder to do based on topic. If the summary has to give out 90% of the information to not be too vague, you may then feel robbed paying for it once you realize the remaining 10% wasn't that useful.
https://blog.forth.news/a-business-model-for-21st-century-ne...
There's always value exchanged--"If you're not paying for the product, you are the product".[1] For ads we've established the fiction that everybody knowingly understands and accepts this quid pro quo. For proof of work we'd settle on a similar fiction, though perhaps browsers could add a little graphic showing CPU consumption.
[1] This is true even for personal blogs, albeit the monetary element is much more remote.
Streaming proves this. When people spend $10 per month on Netflix/Hulu/Crunchyroll they don't have to further ask "do I want to pay 7.5 cents for another episode" every 22 minutes. The math for who's getting paid how much for how many streams is entirely outside the customer's consideration, and the range is broad enough that it discourages one-and-done binging.
For individual content providers, you might need to form some sort of federated project. Media properties could organize through existing networks as an obvious framework ("all AP newspapers for $x per month") but we'd probably need new federations for online-first and less news-centric publishers.
> If ASICs exist for a given hash function, then proof-of-work fails at both:
> - Being an antispam system, since spammers will have better hardware than legitimate users[3]
> - Being a billing system, since legitimate users won't be able to mine enough crypto to pay any economically viable amount of money
Monero/XMR & Zcash break this part of the argument, along with ASIC/GPU-resistant algorithms in general (Argon2 being most well-known & recommended as a KDF).
Creating an ASIC-resistant coin is not impossible, as shown by XMR. The difficult part comes from creating & sustaining the network surrounding the coin, and those two are amongst the few that have done both. Furthermore, there's little actual need to create another coin to do so when XMR fulfills that niche.
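For illustration, a PoW check built on a memory-hard function like Argon2id rather than SHA-256 might look something like this (a sketch using the argon2-cffi package; parameters are illustrative, not tuned, and this is not how Monero's RandomX actually works):

    from argon2.low_level import Type, hash_secret_raw  # assumed dependency: argon2-cffi

    def pow_hash(challenge, nonce):
        """Memory-hard hash: the memory_cost is what blunts ASIC/GPU advantages."""
        return hash_secret_raw(
            secret=nonce.to_bytes(8, "big"),
            salt=challenge,            # server-issued challenge, at least 8 bytes
            time_cost=2,
            memory_cost=64 * 1024,     # 64 MiB per attempt
            parallelism=1,
            hash_len=32,
            type=Type.ID,
        )

    def solve(challenge, difficulty_bits):
        nonce = 0
        while True:
            digest = pow_hash(challenge, nonce)
            # Accept when the top `difficulty_bits` bits of the digest are zero.
            if int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0:
                return nonce
            nonce += 1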
------
> If you don't insist on using proof-of-work as billing, and only as antispam, then you can invent whatever tortured mess of a hash function is incompatible with commonly available mining ASICs. And since they don't have to be globally agreed-upon, everyone can use a different, incompatible hash function.
Counterpoint: People (and devs) want a pre-packaged solution that solves the mentioned antispam problem. For almost everyone, Anubis does its job as intended.
https://web.archive.org/web/20110603143708/http://www.bitcoi...
LLMs can, LOL
One of the powerful use cases is that they can catch pretty much EVERY attempt at obfuscation now. Building a marketplace and want to let the participants chat but not take the deal off-site? LLM in the loop. Protecting children? LLM in the loop.
PoW can help against basic scrapers or DDoS, but it won’t stop anyone serious. Last week I looked into a Binance CAPTCHA solver that didn’t use a browser at all, just a plain HTTP client. https://blog.castle.io/what-a-binance-captcha-solver-tells-u...
The attacker had fully reverse engineered the signal collection and solved-state flow, including obfuscated parts. They could forge all the expected telemetry.
This kind of setup is pretty standard in bot-heavy environments like ticketing or sneaker drops. Scrapers often do the same to cut costs. CAPTCHA and PoW mostly become signal collection protocols, if those signals aren’t tightly coupled to the actual runtime, they get spoofed.
And regarding PoW: if you try to make it slow enough to hurt bots, you also hurt users on low-end devices. Someone even ported PerimeterX’s PoW to CUDA to accelerate solving: https://github.com/re-jevi/PerimiterXCudaSolver/blob/main/po...
I think this is almost already the case now. Services like Cloudflare do a pretty good job of classifying primitive bots and if site operators want to block all (or at least vast majority), they can. The only reliable way through is a real browser. (Which does put a floor on resource needs for scraping)
I thought bots using (headless) browsers was an existing workaround for a number of existing issues with simpler bots, so this doesn't seem to be a big change.
1) Making LLM (and other) scrapers pay for the resources they use seems perfectly fine to me. Also, as someone that manages some level of scraping (on the order of low tens of millions of requests a month), I'm fine with this. There's a wide range of scraping that the problem is not some resource cost, but the other side not wanting to deal with setting up APIs or putting so many hurdles on access that it's easier to just bypass it.
2) This seems like it might be an opportunity for Cloudflare. Let customers opt in to requiring a proof of work when visitors already trip the Cloudflare vetting page that runs additional checks to see if you're a bad actor, and apply any revenue to a service credit towards their monthly fee (or if on a free plan, as credit to be used for trying out additional for-pay features). There might be a perverse incentive to toggle on more stringent checking from Cloudflare, but ultimately, since it's all being paid for, that's the site owner's choice on how they want to manage their site.
In line with how email is technically still federated and distributed, but practically oligopolized by a handful of big-tech players, through "fighting spam".
Unless you are using an operating system or browser that isn't the monopoly choice.
Fuck off with this idea that some clients are better than others.