A thought on JavaScript "proof of work" anti-scraper systems

https://utcc.utoronto.ca/~cks/space/blog/web/JavaScriptScraperObstacles

194•zdw•8mo ago

Comments

Animats•8mo ago

"This page is taking too long to load" on sites with anti-scraper technology. Do Not Want.

jchw•8mo ago

In general if you're coming from a "clean" IP your difficulty won't be very high. In the future if these systems coordinate with each other in some way (DHT?) then it should make it possible to drop the baseline difficulty even further.

berkes•8mo ago

That's a perfect tool for monopolists to widen their moat even more.

In line with how email is technically still federated and distributed, but practically oligopolized by a handfull of big-tech, through "fighting spam".

account42•8mo ago

> In general if you're coming from a "clean" IP your difficulty won't be very high.

Unless you are using an operating system or browser that isn't the monopoly choice.

Fuck off with this idea that some clients are better than others.

jchw•8mo ago

That has absolutely fuck all to do with IP reputation, you're mixing up unrelated concepts.

dannyw•8mo ago

This is a poor take. All the major LLM scrapers already run and execute JavaScript, Googlebot has been doing it for probably a decade.

Simple limits on runtime atop crypto mining from being too big of a problem.

TZubiri•8mo ago

"Googlebot has been doing it for probably a decade."

This is why Google developed a browser, turns out scraping the web requires one to pretty much develop a V8 engine, so why not publish it as a browser .

motoxpro•8mo ago

This is so obvious when you say it, but what an awesome insight.

rob_c•8mo ago

You don't work anywhere near the as industry then, people have been grumbling about this for the whole 10 years now

nssnsjsjsjs•8mo ago

Except it doesn't make sense. Why not just use Firefox. Or improve the JS engine of Firefox.

I reckon they made the browser to control the browser market.

baq•8mo ago

their browser is their scraper. what you see is what the scraper sees is what the ads look like.

zinekeller•8mo ago

> Why not just use Firefox.

The reason why Servo has existed (when it was still in Mozilla's care) was because on how deeply spagettified Gecko's code (sans IonMonkey) was, with the plan of replacing Gecko's components with Servo's.

Firefox's automation systems are now miles better but that's literally the combination of years of work to modularize Gecko, the partial replacement of Gecko's parts with Servo's (like Stylo: https://hacks.mozilla.org/2017/08/inside-a-super-fast-css-en...), and actively building the APIs despite the still-spagettified mess.

chrisco255•8mo ago

V8 was dramatically better than Firefox at the time. AFAIK, it was the first JS engine to take the approach of compiling repetitive JS to native assembly.

If it's true that V8 was used internally for Google's scraper before they even thought about Chrome, then it makes obvious sense why not. The other factor is the bureaucracy and difficulty of getting an open source project to refactor their entire code base around your own personal engine. Google had the money and resources to pay the best in the business to work on Chrome.

TZubiri•8mo ago

"Why develop in-house software for the core application of the biggest company in the world at the time, worth more than 100B$. Why not just repurpose rinky dink open source browser as some kind of parser, bank our 100B$ business on some volunteers and a 501c3 NFP, that will play out well in a shareholder meeting and in trials when they ask us how we safeguard our software."

glowiefedposter•8mo ago

Why didn't they do that instead of just forking WebKit?

rkangel•8mo ago

It's not quite that simple. I think that having that skillset and knowledge in house already probably led to it being feasible, but that's not why they did it. They created Chrome because it was in their best interests for rich web applications to run well.

mschuster91•8mo ago

... and the fact that even with a browser, content gated behind Macromedia Flash or ActiveX applets was / is not indexable is why Google pushed so hard to expand HTML5 capabilities.

chrisco255•8mo ago

Was it really a success though in that regard? HTML5 was great and all, but it never did replace Flash. Websites mainly just became more static. I suspect the lack of mobile integration had more to do with Flash dying than HTML5 getting better. It's a shame in some sense, because Flash was a lot of fun.

maeln•8mo ago

But it is the whole point of the article ? Big scrapers can hardly tell if the JS that takes their runtimes is a crypto miner or an anti-scrapping system, and so they will have to give up "useful" scrapping, so PoW might just work.

rob_c•8mo ago

No they point is there's really advanced PoW challenges out there to prove you're not a bot (those websites that take >3s to fingerprint you are doing this!)

The idea is to abuse the abusers and if you suspect it's a bot change the PoW from a GPU/machine/die fingerprint computation to something like a few ticks of Monero or whatever the crypto of choice is this week.

Sounds useless, but you forget 0.5s of that on their farm x1e4 scraping nodes and you're into something.

The catch is not getting caught out by impacting the 0.1% of tor running anti ad "users" out there who will try and decompile your code when their personal chrome build fails to work. I say "users" because they will be visiting a non free site espousing their perceived right to be there, no different to a bot for someone paying the bills.

jeroenhd•8mo ago

And by making bots hit that limit, scrapers don't get access to the protected pages, so the system works.

Bots can either risk being turned into crypto miners, or risk not grabbing free data to train AIs on.

account42•8mo ago

Real users also have a limit where they will close the tab.

nitwit005•8mo ago

> Simple limits on runtime atop crypto mining from being too big of a problem.

If they put in a limit, you've won. You just make your site be above that limit, and the problem is gone.

keepamovin•8mo ago

Interesting - we dealt with this issue in CloudTabs, a SaaS for BrowserBox remote browser. The way we handle it, is simply to monitor resource usage with a Python script, issue a warning to the user when their tab or all processes are running hot, then when the rules are triggered we just kill the offending processes (that use too much CPU or RAM).

Chrome has the nice property that you can kill a render process for a tab and often it just takes that tab down, leaving everything else running fine. This plus warning provides minimal user impact while ensuring resources for all.

In the past we experimented with cgroups (both versions) and other mechanisms or limiting but found dynamic monitoring to be the most reliable.

OutOfHere•8mo ago

See also http://www.hashcash.org/ which is a famous proof-of-work algorithm. The bigger benefit of proof-of-work is not that it's anti-LLM; it is that it's statelessly anti-DoS.

I have been developing a public service, and I intend to use a simple implementation of proof-of-work in it, made to work with a single call without needing back-and-forth information from the server for each request.

berkes•8mo ago

I've done that as well. The PoC worked, but the statelessness did prove a hurdle.

It enforces a pattern in which a client must do the PoW every request.

Other difficulties, uncoverd in our PoC were:

Not all clients are equal: this punishes an old mobile phone or raspberry-pi much more than a client that runs on a beefy server with GPUs or clients that run on compromised hardware. - I.e. real users are likely punished, while illegitimate users often punished the least.

Not all endpoints are equal: We experimented with higher difficulties for e.g. POST/PUT/PATCH/DELETE over GET. and with different difficulties for different endpoints: attempting to match how expensive a call would be for us. That requires back-and-forth to exchange difficulties.

It discourages proper HATEOAS or REST, where a client browses through the API by following links and encourages calls that "just include as much as possible in one query". Deminishing our ability to cache, to be flexible and to leverage good HTTP practices.

TZubiri•8mo ago

" and LLM scrapers may well have lots of CPU time available through means such as compromised machines."

It's not clear whether the author means LLM scrapers in the sense of scrapers that gather training data for Foundational Models, LLM scrapers that browse the web to provide up to date answers, or vibe coders and agents that use browsers at the bequest of the programmer or the user.

But in none of those myriad of cases can I imagine compromised machines being relevant. If we are talking about compromised machines, it's irrelevant if an LLM is involved and how, it's a distributed attack completely unrelated to LLMs.

immibis•8mo ago

You can buy access to proxies in residential networks - with credit card on the open internet - and they may or may not be someone's botnet (probably not, but you don't know that). I'm not aware of anyone selling running code on the device though. It's just an HTTP or SOCKS5 level proxy.

sznio•8mo ago

I'd really like this, since it wouldn't impact my scraping stuff.

I like to scrape websites and make alternative, personalized frontends for them. Captchas are really painful for me. Proof of work would be painful for a massive scraping operation, but I wouldn't have an issue with spending some CPU time to get the latest posts from a site which doesn't have an RSS feed or an API.

diggan•8mo ago

Yeah, to me PoW makes a lot of sense in this way too. Captchas are hard for (some) people to solve, and very annoying to fill out, but easy for vision-enabled LLMs to solve (or even use 3rd party services where you pay for N/solves, available for every major captcha service). PoW instead are hard to deal with in a distributed/spammy design, but very easy for any user to just sit and wait a second or two. And all personal scraping tooling just keeps working, just slightly slower.

Sounds like an OK solution to a shitty problem that has a bunch of other shitty solutions.

ChocolateGod•8mo ago

I'm glad after spending all this time trying to increase power efficiency people have come up with JavaScript that serves no purpose other than to increase power draw.

I feel sorry for people with budget phones who now have to battle with these PoW systems and think LLM scrapers will win this one, with everyone else suffering a worse browsing experience.

jeroenhd•8mo ago

This is rather unfortunate, but the way Anubis works, you will only get the PoW test once.

Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.

If you're visiting loads of different websites, that does suck, but most people won't be affected all that much in practice.

There are alternatives, of course. Several attempts at standardising remote attestation have been done. Apple has included remote attestation into Safari years ago. Basically, Apple/Google/Cloudflare give each user a limited set of "tokens" after verifying that they're a real person on real hardware (using TPMs and whatnot), and you exchange those tokens for website visits. Every user gets a load of usable tokens, but bots quickly run out and get denied access. For this approach to work, that means locking out Linux users, people with secure boot disabled, and things like outdated or rooted phones, but in return you don't get PoW walls or Cloudflare CAPTCHAs.

In the end, LLM scrapers are why we can't have nice things. The web will only get worse now that these bots are on the loose.

ChocolateGod•8mo ago

> Scrapers, on the other hand, keep throwing out their session cookies

This isn't very difficult to change.

> but the way Anubis works, you will only get the PoW test once.

Not if it's on multiple sites, I see the weab girl picture (why?) so much it's embedded into my brain at this point.

alpaca128•8mo ago

> I see the weab girl picture (why?)

As far as I know the creator of Anubis didn't anticipate such a widespread use and the anime girl image is the default. Some sites have personalized it, like sourcehut.

viraptor•8mo ago

> (why?)

So you can pay the developers for the professional version where you can easily change the image. It's a great way of funding the work.

shiomiru•8mo ago

> Scrapers, on the other hand, keep throwing out their session cookies (because you could easily limit their access by using cookies if they didn't). They will need to run the PoW workload every page load.

Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire. Technical details aside, if a site becomes a worthy target, a scraping operation running on billions of dollars will easily bypass any restrictions thrown at it, be that cookies, PoW, JS, wasm, etc. Being able to access multiple sites by bypassing a single method is just a bonus.

Ultimately, I don't believe this is an issue that can be solved by technical means; any such attempt will solely result in continuous UX degradation for humans in the long term. (Well, it is already happening.) But of course, expecting any sort of regulation on the manna of the 2020s is just as naive... if anything, this just fits the ideology that the WWW is obsolete, and that replacing it with synthetic garbage should be humanity's highest priority.

ndiddy•8mo ago

> Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire. Technical details aside, if a site becomes a worthy target, a scraping operation running on billions of dollars will easily bypass any restrictions thrown at it, be that cookies, PoW, JS, wasm, etc. Being able to access multiple sites by bypassing a single method is just a bonus.

The reason why Anubis was created was that the author's public Gitea instance was using a ton of compute because poorly written LLM scraper bots were scraping its web interface, making the server generate a ton of diffs, blames, etc. If the AI companies work around proof-of-work blocks by not constantly scraping the same pages over and over, or by detecting that a given site is a Git host and cloning the repo instead of scraping the web interface, I think that means proof-of-work has won. It provides an incentive for the AI companies to scrape more efficiently by raising their cost to load a given page.

cesarb•8mo ago

> Such a scheme does not work without cookies in the first place, so the optimal strategy for scrapers is to keep any (likely multiple) session cookies until they expire.

AFAIK, Anubis does not work alone, it works together with traditional per-IP-address rate limiting; its cookies are bound to the requesting IP address. If the scraper uses a new IP address for each request, it cannot reuse the cookies; if it uses the same IP address to be able to reuse the cookies, it will be restricted by the rate limiting.

immibis•8mo ago

At some point it must become cheaper to pay the people running the site for a copy of the site than to scrape it.

account42•8mo ago

> This is rather unfortunate, but the way Anubis works, you will only get the PoW test once.

Actually I will get it zero times because I refuse to enable javashit for sites that shouldn't need it and move on to something run by someone competent.

odo1242•8mo ago

Well, everything’s a tradeoff. I know a lot of small websites that had to shut down because LLM scraping was increasing their CPU and bandwidth load to the point where it was untenable to host the site.

Hrun0•8mo ago

Can you name a couple?

RHSeeger•8mo ago

> sites that shouldn't need it

There's lots of ways to define "shouldn't" in this case

- Shouldn't need it, but include it to track you

- Shouldn't need it, but include it to enhance the page

- Shouldn't need it, but include it to keep their costs down (for example, by loading parts of the page dynamically / per person and caching the rest of the page)

- Shouldn't need it, but include it because it help stop the bots that are costing them more than the site could reasonably expected to make

I get it, JS can be used in a bad way, and you don't like it. But the pillar of righteousness that you seem to envision yourself standing on it not as profound as you seem to think it is.

kokanee•8mo ago

Attestation is a compelling technical idea, but a terrible economic idea. It essentially creates an Internet that is only viewable via Google and Apple consumer products. Scamming and scraping would become more expensive, but wouldn't stop.

It pains me to say this, but I think that differentiating humans from bots on the web is a lost cause. Proof of work is just another way to burn more coal on every web request, and the LLM oligarchs will happily burn more coal if it reduces competition from upstart LLMs.

Sam Altman's goal is to turn the Internet into an unmitigated LLM training network, and to get humans to stop using traditional browsing altogether, interacting solely via the LLM device Jony Ive is making for him.

Based on the current trajectory, I think he might get his way, if only because the web is so enshittified that we eventually won't have another way to reach mainstream media other than via LLMs.

ChocolateGod•8mo ago

People are using LLMs because search results (due to SEO overload, Google's bad algorithm etc) are terrible, Anubis makes these already bad search results even worse by trying to block indexing, meaning people will want to use LLMs even more.

So the existence of Anubis will mean even more incentive for scraping.

thunderfork•8mo ago

Anubis doesn't impact well-behaved bots that set their user-agent string

jerf•8mo ago

"It pains me to say this, but I think that differentiating humans from bots on the web is a lost cause."

Ah, but this isn't doing that. All this is doing is raising friction. Taking web pages from 0.00000001 cents to load to 0.001 at scale is a huge shift for people who just want to slurp up the world, yet for most human users, the cost is lost in the noise.

All this really does is bring the costs into some sort of alignment. Right now it is too cheap to access web pages that may be expensive to generate. Maybe the page has a lot of nontrivial calculations to run. Maybe the server is just overwhelmed by the sheer size of the scraping swarm and the resulting asymmetry of a huge corporation on one side and a $5/month server on the other. A proof-of-work system doesn't change the server's costs much but now if you want to scrape the entire site you're going to have to pay. You may not have to pay the site owner, but you will have to pay.

If you want to prevent bots from accessing a page that it really wants to access, that's another problem. But, that really is a different problem. The problem this solves is people using small amounts of resources to wholesale scrape entire sites that take a lot of resources to provide, and if implemented at scale, would pretty much solve that problem.

It's not a perfect solution, but no such thing is on the table anyhow. "Raising friction" doesn't mean that bots can't get past it. But it will mean they're going to have to be much more selective about what they do. Even the biggest server farms need to think twice about suddenly dedicating hundreds of times more resources to just doing proof-of-work.

It's an interesting economic problem... the web's relationship to search engines has been fraying slowly but surely for decades now. Widespread deployment of this sort of technology is potentially a doom scenario for them, as well as AI. Is AI the harbinger of the scrapers extracting so much from the web that the web finally finds it economically efficient to strike back and try to normalize the relationship?

ChocolateGod•8mo ago

> Taking web pages from 0.00000001 cents to load to 0.001 at scale is a huge shift for people who just want to slurp up the world, yet for most human users, the cost is lost in the noise.

If you're going to needlessly waste my CPU cycles, please at least do some mining and donate it to charity.

xena•8mo ago

Anubis author here. Tell me what I'm missing to implement protein folding without having to download gigabytes of scientific data to random people's browsers and I'll implement it today.

dijksterhuis•8mo ago

Perhaps something along the lines of folding@home? https://foldingathome.org https://github.com/FoldingAtHome/fah-client-bastet

seems like it would be possible to split the compute up.

FAQ: https://foldingathome.org/faq/running-foldinghome/

What if I turn off my computer? Does the client save its work (i.e. checkpoint)?

> Periodically, the core writes data to your hard disk so that if you stop the client, it can resume processing that WU from some point other than the very beginning. With the Tinker core, this happens at the end of every frame. With the Gromacs core, these checkpoints can happen almost anywhere and they are not tied to the data recorded in the results. Initially, this was set to every 1% of a WU (like 100 frames in Tinker) and then a timed checkpoint was added every 15 minutes, so that on a slow machine, you never lose more that 15 minutes work.

> Starting in the 4.x version of the client, you can set the 15 minute default to another value (3-30 minutes).

caveat: I have no idea how much data "1 frame" is.

jerf•8mo ago

You can't do anything useful with checkpoints due to the same-site origin problem. Unless you can get browser support for some sort of proof of work that did something useful that whole line is a non-starter. No single origin involves a useful amount of work.

The problem is that this problem is going to be all overhead. If you sit down and calmly work out the real numbers, trying to distribute computations to a whole bunch of consumer-grade devices, where you can probably only use one core for maybe two seconds at a time a few times an hour, you end up with it being cheaper to just run the computation yourself. My home gaming PC gets 16 CPU-hours per hour, or 56700 CPU-seconds. (Maybe less if you want to deduct a hyperthreading penalty but it doesn't change the numbers that much.) Call it 15,000 people needing to run 3-ish of these 2-second computations, plus coordination costs, plus serving whatever data goes with the computation, plus infrastructure for tracking all that and presumably serving, plus if you're doing something non-trivial a quite non-trivial portion of that "2 seconds" I'm shaving off for doing work will be wasted setting it up and then throwing it away. The math just doesn't work very well. Flat-out malware trying to do this on the web never really worked out all that well, adding the constraint of doing it politely and in such small pieces doesn't work.

And that's ignoring things like you need to be able to prove-the-work for very small chunks. Basically not a practically solvable problem, barring a real stroke of genius somewhere.

jgalt212•8mo ago

> I feel sorry for people with budget phones who now have to battle with these PoW systems and think LLM scrapers will win this one, with everyone else suffering a worse browsing experience.

I dunno. How much work do you really need in PoW systems to make the scrapers go after easier targets? My guess is not so much that you impair a human's UX. And if you do, then you have not fine-tuned your PoW algo, or you have very determined adversaries / scrapers.

ChocolateGod•8mo ago

Any PoW that doesn't impact end users is not going to impact LLM scrapers.

MyPasswordSucks•8mo ago

"Any" is a pretty mighty word to throw around.

As has been stated multiple times in this thread and basically any thread involving conversation on the topic, a PoW with a negligible cost (either of time/money/pain-in-the-ass factor) will not impact end users, but will affect LLM scrapers due to the scales involved.

The problem is trying to create a PoW that actually fits that model, is economical to implement, and can't easily be gamed.

But saying "any" seems to imply that it's a theoretical impossibility ("any machine that moves will encounter friction and lose energy to heat conversion, ergo perpetual motion machines are impossible"), when in fact it's a theoretical possibility, just not yet a practical reality.

eric__cartman•8mo ago

My phone is a piece of junk from 8 years ago and I haven't noticed any degradation in browsing experience. A website takes like two extra seconds to load, not a big deal.

h1fra•8mo ago

Most scrapers are not able to monitor each websites performance, what will happen is that it's going to be slower for site to respond and that's it

matt3210•8mo ago

The issue isn’t the resource usage as much as the content they’re stealing for reproduction purposes.

berkes•8mo ago

That's not true for the vast amount of creative-commons, open-source and other permissive licenced content.

(Aside: the licenses and distribution advocated by many of the same demography (information wants to be free -folks, jstor protestors, GPL-zealots) that now opposes LLMs using that content. )

jsheard•8mo ago

> GPL-zealots

I'm sure GPL zealots would be happier about this situation if LLM vendors abided by the spirit of the license by releasing their models under GPL after ingesting GPL data, but we all know that isn't happening.

reginald78•8mo ago

That is one part, but they are so voracious and aggressive that they are starting to crush hosts of that content and cause things become less open. In a way, they are not only 'stealing' it for themselves but they're also erasing it for humans.

DaSHacka•8mo ago

Surprised there hasn't been a fork of Anubis that changes the artificial PoW into a simple Monero mining PoW yet.

Would be hilarious to trap scraper bots into endless labyrinths of LLM-generated mediawiki pages, getting them to mine hashes with each progressive article.

At least then we would be making money off these rude bots.

xnorswap•8mo ago

The bots could check if they've hit the jackpot themselves and keep the valid hashes for themselves and only return when they're worthless.

Then it's the bots who are making money from work they need to do for the captchas.

forty•8mo ago

We need an oblivious crypto currency mining algorithm ^^

nssnsjsjsjs•8mo ago

1. The problem is the bot needs to understand the program it is running to do that. Akin to the halting problem.

2. There is no money in mining on the kinda hardware scrapers will run on. Power costs more that they'd earn.

immibis•8mo ago

Realistically, the bot owner could notice you're running MoneroAnubis and then would specifically check for MoneroAnubis, for example with a file hash, or a comment saying /* MoneroAnubis 1.0 copyright blah blah GPL license blah */. The bot wouldn't be expected to somehow determine this by itself automatically.

Also, the ideal Monero miner is a power-efficient CPU (so probably in-order). There are no Monero ASICs by design.

nssnsjsjsjs•8mo ago

I doubt you could do this efficiently enough such that an mining-business optimised mining rig can be kept busy enough with web-scraped honey pots to be worth the think time of setting it up vs. just scrape and skip pow protected sites + dedicated crypto mining operation as 2 seperate things.

gus_massa•8mo ago

IIRC the mined block has an instruction like

fake quote > Please add the reward and fees to: 187e6128f96thep00laddr3s9827a4c629b8723d07809

And if you make a fake block that changes the address, then the fake block is not a good one.

This avoid the same problem with people stealing from pools, and also evil people listening to new mined blocks that pretend that they found it and send a fake one.

hypeatei•8mo ago

> Then it's the bots who are making money from work they need to do for the captchas.

Wouldn't it be easier to mine crypto themselves at that point? Seems like a very roundabout way to go about mining crypto.

g-b-r•8mo ago

It would be a much bigger incentive to add them with little care for the innocents impacted.

Although admittedly millions of sites already ruined themselves with cloudflare without that incentive

albrewer•8mo ago

There a company awhile back that did almost exactly this called CoinHive

kmeisthax•8mo ago

This is a good idea for honeypotting scrapers, though as per [0] I hope nobody actually tries to use it on a real website anyone would want to use.

[0] https://news.ycombinator.com/item?id=44117591

bmacho•8mo ago

Another idea: if your content is ~5kb text, then serve it to whoever asks it. If you don't have the bandwidth, try to make it smaller, static, and put it on the edge, some other people's computers.

foul•8mo ago

The fun and terrible thing about the web is that the "rockstar in temporary distress" trope can be true and can happen when you expect it the least, like, you know, when you receive a HN kiss of death.

You can surely expect that a static content will be static and will not run jpegoptim on a image at any given hit (a dynamic CMS + a sudden visit from ByteDance = your server is DDoSed), but you can't expect that any idiot/idiot hive on this planet will set up a multi-country edge caching servers architecture for a small sized website just in case some blog post will hit a few million visits every ten minutes. That can easily take down a server even for static content.

I concur that Anubis is a poor solution, and yet here we are, the UN it's using it to weather down requests.

account42•8mo ago

Popularity-based traffic spikes tend to be very temporary and should not be something smaller sites concern themselves with.

foul•8mo ago

My example was quite vapid, but you shouldn't concentrate on that use-case. Small doesn't always mean "neglectable infos", while scrapers are always stealing CPU time.

account42•8mo ago

Exactly, if you have the bandwidth to serve your proof of work scripts and check the results then you also have the bandwidth to simply serve properly optimized content.

forty•8mo ago

Can we have proof of work algorithm that compute something actually useful? Like finding large prime numbers or something like this that have distributed computation programs. This way all this power wasted is at least not completely lost.

g-b-r•8mo ago

Unfortunately useful things usually require much more computation to find a useful result, they require distributing the search, and so can't reliably verify that you performed the work (most searches will not find anything, and you can just pretend to not have found anything without doing any work).

If a service had enough concurrent clients to reliably hit useful results quickly, you could verify that most did the work by checking if a hit was found, and either let everyone in or block everyone according to that; but then you're relying on the large majority being honest for your service to work at all, and some dishonest clients would still slip through.

xena•8mo ago

Anubis author here. I looked into protein folding. The problem is that protein folding requires scientific data, which can easily get into the gigabyte range. That is more data than I want to serve to clients. Unless there's a way to solve the big data problem, a lot of "compute for good" schemes are frankly unworkable.

mvid•8mo ago

Zero knowledge proofs are basically arbitrary proof of work models. There is some interesting work being done with MPC and ZK proving, so only a small part needs to live on the client. I wonder if this would make it feasible again

rob_c•8mo ago

Seen this already, back in the day the were sites which ran bitcoin hashing on their userbase and there was uproar.

If someone dusted of the same tools and managed to get Altmann to buy them a nice car from it, good on them:)

vanschelven•8mo ago

TBH most of the talk of "aggressive scraping" has been in the 100K pages/day range (which is ~1 page/s, i.e. "neglegtable). In my mind cloud providers ridiculous egres rates are more to blame here.

jeroenhd•8mo ago

I've caught Huawei and Tencent IPs scraping the same image over and over again, with different query parameters. Sure, the image was only 260KiB and I don't use Amazon or GCP or Azure so it didn't cost me anything, but it still spammed my logs and caused a constant drain on my servers' resources.

The bots keep coming back too, ignoring HTTP status codes, permanent redirects, and what else I can think of to tell them to fuck off. Robots.txt obviously doesn't help. Filtering traffic from data centers didn't help either, because soon after I did that, residential IPs started doing the same thing. I don't know if this is a Chinese ISP abusing their IP ranges or if China just has a massive botnet problem, but either way the traditional ways to get rid of these bots hasn't helped.

In the end, I'm now blocking all of China and Singapore. That stops the endless flow of bullshit requests for now, though I see some familiar user agents appearing in other east Asian countries as well.

account42•8mo ago

So make sure the image is only available at one canonical URL with proper caching headers? No, obviously the only solution is to install crapware that worsens the experience for regular users.

account42•8mo ago

Agreed. Website operators should have a hard look at why their unoptimized crap can't manage such low request rates before contributing to the enshittification of the web by deploying crapware like anubis or buttflare.

immibis•8mo ago

I've been blocking a few scrapers from my gitea service - not because it's overloaded, more just to see what happens. They're not getting good data from <repo>/commit/<every sha256 in the repo>/<every file path in the repo> anyway. If they actually wanted the data they could run "git clone".

I just checked, since someone was talking about scraping in IRC earlier. Facebook is sending me about 3 requests per second. I blocked their user-agent. Someone with a Googlebot user-agent is doing the same stupid scraping pattern, and I'm not blocking it. Someone else is sending a request every 5 seconds with

One thing that's interesting on the current web is that sites are expected to make themselves scrapeable. It's supposed to be my job to organize the site in such a way that scrapers don't try to scrape every combination of commit and file path.

bob1029•8mo ago

I think this is not a battle that can be won in this way.

Scraping content for an LLM is not a hyper time sensitive thing. You don't need to scrape every page every day. Sam Altman does not need a synchronous replica of the internet to achieve his goals.

CGamesPlay•8mo ago

That is one view of the problem, but the one people are fixing with proof of work systems is the (unintentional) DDoS that LLM scrapers are operating against these sites. Just reducing the amount of traffic to manageable levels lets me get back to the work of doing whatever my site is supposed to be doing. I personally don't care if Sam Altman has a copy of my git server's rendition of the blame of every commit in my open source repo, because he could have just cloned my git repo and gotten the same result.

bob1029•8mo ago

I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.

heinrich5991•8mo ago

> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

Yes, there are sites being DDoSed by scrapers for LLMs.

> If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.

This isn't about one request per week or per month. There were reports from many sites that they're being hit by scrapers that request from many different IP addresses, one request each.

2000UltraDeluxe•8mo ago

25k+ hits/minute here. And that's just the scrapers that doesn't just identify themselves as a browsers.

Not sure why you believe massive repeated scraping isn't a problem. It's not like there is just one single actor out there, and ignoring robits.txt seems to be the norm nowadays.

spiffyk•8mo ago

> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

It is very real and the reason why Anubis has been created in the first place. It is not plain hostility towards LLMs, it is *first and foremost* a DDoS protection against their scrapers.

https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...

https://social.kernel.org/notice/AsgziNL6zgmdbta3lY

https://xeiaso.net/notes/2025/amazon-crawler/

xena•8mo ago

I've set up a few honeypot servers. Right now OpenAI alone accounts for 4 hours of compute for one of the honeypots in a span of 24 hours. It's not hypothetical.

lelanthran•8mo ago

> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

There are already a few dozens of thousands of scrapers right now trying to get even more training data.

It will only get worse. We all want more training data. I want more training data. You want more training data.

We all want the most up to date data there is. So, yeah, it will only get worse as time goes on.

fl0id•8mo ago

For the model it’s not. But I think many of these bots are also from tool usage or ‚research‘ or whatever they call it these days. And for that it doesnatter

nssnsjsjsjs•8mo ago

My violin is microscopic for this problem. It's actually given me ideas!

persnickety•8mo ago

> An LLM scraper is operating in a hostile environment [...] because you can't particularly tell a JavaScript proof of work system from JavaScript that does other things. [..] for people who would like to exploit your scraper's CPU to do some cryptocurrency mining, or [...] want to waste as much of your CPU as possible).

That's a valid reason to serve JS-based PoW systems scares LLM operators: there's a chance the code might actually be malicious.

That's not a valid reason to serve JS-based PoW systems to human users: the entire reason those proofs work against LLMs is the threat that the code is malicious.

In other words, PoW works against LLM scrapers not because of PoW, but because they could contain malicious code. Why would you threaten your users with that?

And if you can apply the threat only to LLMs, then why don't you cut the PoW garbage start with that instead?

I know, it's because it's not so easy. So instead of wielding the Damocles sword of malware, why not standardize on some PoW algorithm that people can honestly apply without the risks?

pjc50•8mo ago

I don't think this is "malicious" so much as it is "expensive" (in CPU cycles), which is already a problem for ad-heavy sites.

berkes•8mo ago

> Why would you threaten your users with that?

Your users - we, browsing the web - are already threatened with this. Adding a PoW changes nothing here.

My browser already has several layers of protection in place. My browser even allows me to improve this protection with addons (ublock etc) and my OSes add even more protection to this. This is enough to allow PoW-thats-legit but block malicious code.

account42•8mo ago

Not safety-conscious users who disable javascript.

berkes•8mo ago

Those aren't threatened by PoW or malicious versions thereof either.

captainmuon•8mo ago

I don't know, Sandbox escape from a browser is a big deal, a million dollars bounty kind of deal. I feel safe to put an automated browser in a container or a VM and let it run with a timeout.

And if a site pulls something like that on me, then I just don't take their data. Joke is on them, soon if something is not visible to AI it will not 'exist', like it is now when you are delisted from Google.

hardwaresofton•8mo ago

Fantastic work by Xe here -- not the first but this seems like the most traction I've seen on a PoW anti-scraper project (with an MIT license to boot!).

PoW anti-scraper tools are a good first step, but why don't we just jump straight to the endgame? We're drawing closer to a point where information's value is actually fully realized -- people will stop sharing knowledge for free. It doesn't have to be that way, but it does in a world where people are pressed for economic means, knowledge becomes an obvious thing to convert to capital and attempt to extract rent on.

The simple way this happens is just a login wall -- for every website. It doesn't have to be a paid login wall of course (at first), but it's a super simple way to both legally and practically protect from scrapers.

I think high quality knowledge, source code (which is basically executable knowledge), being open in general is a miracle/luxury of functioning, economically balanced societies where people feel driven (for many possible reasons) to give back, or have time to think of more than surviving.

Don't get me wrong -- the doomer angle is almost always wrong -- every year humanity is almost always better off than we were the previous year on many important metrics, but it's getting harder to see a world where we cartwheel through another technological transformation that this time could possibly impact large percentages of the working population.

g-b-r•8mo ago

Yes, let's turn the whole web into Facebook, what a bright future

Schiendelman•8mo ago

I think we only have two choices here: 1) every webpage requires Facebook login, and then Facebook offers free hosting for the content. 2) every webpage requires some other method of login, but not locked into a single system.

I read the GP comment as suggesting we push on the second option there rather than passively waiting for the first option.

g-b-r•8mo ago

I hope you see that you shatter anonymity and the open web with that, single system or not

Schiendelman•8mo ago

Of course I do. But that's already gone.

g-b-r•8mo ago

You must be on a different internet than mine

Schiendelman•8mo ago

I'm not. But what are you claiming to be true?

hardwaresofton•8mo ago

anonymity and the open web are different things, and neither of them were promised/guaranteed to anyone on the internet.

For people that value anonymity, they'll create their own spaces. People that value openness will continue to be open.

What we're about to find out is what happens when the tide goes out and people show you what they really believe/want -- anything other than that is a form of social control, whether via browbeating or other means.

g-b-r•8mo ago

> anonymity and the open web are different things, and neither of them were promised/guaranteed to anyone on the internet. > > For people that value anonymity, they'll create their own spaces. People that value openness will continue to be open

Hardly anything of what's the internet today was promised, but who are you to decide what the internet has to become now, and that people with different ideas need to confine themselves in their own ghettos?

Everyone values privacy, it's just out of social pressure if most give up so much of it.

> What we're about to find out is what happens when the tide goes out and people show you what they really believe/want -- anything other than that is a form of social control, whether via browbeating or other means

No idea of what you're talking about there

immibis•8mo ago

IP addresses are not anonymous. Have you tried to make your IP address anonymous, e.g., with Tor or one of those NordVPN-like companies? (not picking on Nord, though they deserve to be picked on - they're just the most advertised.)

You'll find CAPTCHAs almost everywhere, outright 403s or dropped connections in a lot of places. Even Google won't serve you sometimes.

The reason you're not seeing that situation right now is that your IP address is identifiable.

reginald78•8mo ago

I see captchas all the time on my home internet connection without a VPN these days. That era seems to be ending, probably because AI scraping is now using residential IP blocks.

immibis•8mo ago

There's been talk on NANOG about whole residential ISPs getting marked as VPNs now. Turns out selling excessive security to businesses is easy, I guess. Like CrowdStrike.

g-b-r•8mo ago

IP addresses can be anonymous, and I do get CAPTCHAs almost everywhere they're used, without using Tor.

What cannot possibly be anonymous is a login with a verified identity.

hardwaresofton•8mo ago

You were right on the second one! Facebook wasn't even a thought in my mind per say (they're not unique in that every social network wants to build a walled garden).

My focus was more on the areas outside the large walled gardens -- they might be come a bunch of smaller... fenced backyards, to put it nicely.

pjc50•8mo ago

> people will stop sharing knowledge for free. It doesn't have to be that way

Yeah. People over-estimate the flashy threats from AI, but to me the more significant threat is killing the open exchange of knowledge and more generally the open, trusting society by flooding it with agents which are happy to press "defect" on the prisoner's dilemma.

> being open in general is a miracle/luxury of functioning, economically balanced societies where people feel driven (for many possible reasons) to give back, or have time to think of more than surviving

"High trust society". Something that took the West a very long time to construct through social practices, was hugely beneficial for economic growth, but is vulnerable to defectors. Think of it like a rainforest: a resource which can be burned down to increase quarterly profit.

hardwaresofton•8mo ago

> Yeah. People over-estimate the flashy threats from AI, but to me the more significant threat is killing the open exchange of knowledge and more generally the open, trusting society by flooding it with agents which are happy to press "defect" on the prisoner's dilemma.

I don't think societies are open/trusting by default -- it takes work and a lot of anti-intuitive thinking, sustained over long periods of time.

> "High trust society". Something that took the West a very long time to construct through social practices, was hugely beneficial for economic growth, but is vulnerable to defectors. Think of it like a rainforest: a resource which can be burned down to increase quarterly profit.

I think the trust is downstream of the safety (and importantly "economic safety", if we can call it that). Everyone trusts more when they're not feeling threatened. People "defect" from cultures that don't work for them -- people leave the culture they like and go to another one usually because of some manifestation of danger.

account42•8mo ago

Fantastic work? More like contributing to the enshittification of the web.

Analemma_•8mo ago

I mean, LLM scrapers set fire to the commons, and when you do that, now you have a flaming hole in the ground where the commons used to be. It's not the fault of website operaters who have to act in self-defense lest their site get DDoSed out of existence.

lr4444lr•8mo ago

This is inimical to the purpose of the Internet.

Maybe the dream of knowledge being free and open was always doomed to fail; if knowledge has value, and people are encouraged to spend more of their time and energy to create it rather than other kinds of work, they will have to be compensated in order to do it increasingly well.

It's kinda sad though, if you grew up in a world where you could actually discover stuff organically and through search.

immibis•8mo ago

I don't think it's inevitable doom, but a realignment of incentives will probably be needed. Perhaps in the form of payment. In several EU countries it's illegal to have any internet connection without linking it to your ID card or passport in some central database, so that could also be thing - people are generally reluctant to get arrested.

xena•8mo ago

Thanks! I'm gonna try and bootstrap this into a company. My product goal for the immediate future is unbranded Anubis (already implemented) with a longer term goal of being a Canadian-run Cloudflare competitor.

captainmuon•8mo ago

As somebody who does some scraping / crawling for legitimate uses, I'm really unhappy with this development. I understand people have valid cases why they don't want their content scraped. Maybe they want to sell it - I can understand that, although I don't like it. Maybe they are opposed to it for fundamental reasons. I for one would like my content to be spread maximally. I want my arguments to be incorporated into AIs, so I can reach more people. But of course that is just me when I'd write certain content, others have different goals.

It gets annoying when you have the right to scrape something - either because the owner of the data gave you the OK or because it is openly licensed. But then the webmaster can't be bothered to relax the rate limiter for you, and nobody can give you a nice API. Now people are putting their Open Educational Resources, their open source software, even their freaking essays about openness that they want the world to read behind Anubis. It makes me shake my head.

I understand perfectly it is annoying when badly written bots hammer your site. But maybe then HTTP and those bots are the problem. Maybe we should make it easier for site owners to push their content somewhere where we can scrape it easier?

berkes•8mo ago

Sounds like something IPFS could be nice solution for.

yladiz•8mo ago

> I understand people have valid cases why they don't want their content scraped. Maybe they want to sell it - I can understand that, although I don't like it.

To be frank: it’s not your content, it’s theirs, and it doesn’t matter if you like it or not, they can decide what they want to do with it, you’re not entitled to it. Yes there are some cases that you personally have permission to scrape, or the license explicitly permits it, but this isn’t the norm.

The bigger issue isn’t that people don’t want their content to be read it’s that they want it to be read and consumed by a human in most cases, and they want their server resources (network bandwidth, CPU, etc) to be used in a manageable way. If these bots were written to be respectful, then maybe we wouldn’t be in this situation. These bots poisoned the well, and they affect respectful bots because of their actions.

Analemma_•8mo ago

If you scrape at a reasonable rate and don't clear session cookies, your scraper can solve the Anubis POW same as a user and you're fine. Anubis is for distributed scrapers which make requests at absurd rates.

berkes•8mo ago

I doubt any "anti-scraper" system will actually work.

But if one is found, it will pave the way for a very dangerous counter-attack: Browser vendors with need for data (i.e. Google) simply using the vast fleet of installed browsers to do this scraping for them. Chrome, Safari, Edge, sending the pages you visit to their data-centers.

lionkor•8mo ago

This is why we need https://ladybird.org/

reginald78•8mo ago

This feels like it already was half happening anyway so it isn't to big of a leap.

I also think this is the endgame of things like Recall in windows. Steal the training data right off your PC, no need to wait for the sucker to upload it to the web first.

dxuh•8mo ago

I always thought that JavaScript cryptomining is a better alternative to ads for monetizing websites (as long as people don't depend on those websites and website owners don't take it too far). I'd much rather give you a second of my CPU instead of space in my brain. Why is this so frowned upon? And in the same way I thought Anubis should just mine crypto instead of wasting power.

thedanbob•8mo ago

> Why is this so frowned upon?

Maybe because while ad tech these days is no less shady than crypto mining, the concept of ads is something people understand. Most people don't really understand crypto so it gets lumped in with "hackers" and "viruses".

Alternatively, for those who do understand ad tech and crypto, crypto mining still subjectively feels (to me at least) more like you're being stolen from than ads. Same with Anubis, wasting power on PoW "feels" more acceptable to me than mining crypto. One of those quirks of the human psyche I guess.

matheusmoreira•8mo ago

Running proof of work on user machines without their consent is theft of their computing and energy resources. Any site doing so for any purpose whatsoever is serving malware and should be treated as such.

Advertising is theft of attention which is extremely limited in supply. I'd even say it's mind rape. They forcibly insert their brands and trademarks into our minds without our consent. They deliberately ignore and circumvent any and all attempts to resist. It's all "justified" though, business interests excuse everything.

x-complexity•8mo ago

> Advertising is theft of attention which is extremely limited in supply. I'd even say it's mind rape. They forcibly insert their brands and trademarks into our minds without our consent. They deliberately ignore and circumvent any and all attempts to resist.

(1): Attention from any given person is fundamentally limited. Said attention has an inherent value.

(2): Running *any* website costs money, doubly so for video playback. This is not even mentioning the moderation & copyright mechanisms that a video sharing platform like YouTube has to have in order to keep copyright lawsuits away from YouTube itself.

(3): Products do not spawn in with their presence known to the general population. For the product to be successful, people have to know it exists in the first place.

Advertising is the consequence of wanting attention to be drawn to (3), and willing to pay for said attention on a given platform (1). (2)'s costs, alongside any payouts to videographers that garner attention to their videos, can be paid for with the money in (1), by placing ads around/before the video itself.

You're allowed to not have advertising shown to you, but in exchange, the money to pay for (2) & the people who made the video have to come from somewhere.

matheusmoreira•8mo ago

> Said attention has an inherent value.

Yes, and it belongs to us. It's not theirs to sell to the highest bidder.

> Running any website costs money, doubly so for video playback.

> Products do not spawn in with their presence known to the general population.

Not our problem. Business needs do not excuse it. Let all those so called innovators find a way to make it without an attention economy. Let them go bankrupt if they can't.

x-complexity•8mo ago

> Yes, and it belongs to us. It's not theirs to sell to the highest bidder.

Your attention belongs to you, until you give it to someone else.

The videographer has the right to sell sponsorships on their videos in exchange the attention they've attracted. It is also their right to do so.

> Not our problem. Business needs do not excuse it. Let all those so called innovators find a way to make it without an attention economy. Let them go bankrupt if they can't.

Your logic has already been tried: It's called Netflix. And it was overtaken by YouTube.

YouTube has been the wellspring for indie videographers because they have a platform that could (a) handle the video hosting for them for free, where (b) they could post their experiments on without an upfront cost & where an audience can be found because the platform's free.

Your idea seeks upfront payment, which increases the risk cost dramatically from 0 to a fixed value. One-shot experiments with 0 funds are killed under your scheme.

To seek their bankruptcy is nothing short of a fetishistic desire for your ideals to trample on others your your own gloating. Go back to the DVD era.

matheusmoreira•8mo ago

> The videographer has the right to sell sponsorships on their videos in exchange the attention they've attracted.

As is my right to use uBlock Origin and Sponsor Block to automatically block and skip every single one of those segments. Won't be long until we have AI powered ad blocking that can edit ads out of video streams in real time.

We decide what we pay attention to. Making videos is not a license to dupe us into viewing advertising noise. Baiting us with some interesting topic only to switch to commercial nonsense is just rude, and that's the most charitable interpretation I can offer.

> YouTube has been the wellspring for indie videographers

Because of ads and surveillance capitalism. Those are the root causes of everything that is wrong with the web today. Blocking those will reduce their returns on investment, thereby fixing the web.

> One-shot experiments with 0 funds are killed under your scheme.

Nah. Only the money motivated people will leave. People have been creating things just for the joy of it since the dawn of humanity. Those humans with intrinsic motivation are the ones I really care about. Not these insipid profit driven "content creators".

captainbland•8mo ago

I'd imagine it's pretty much impossible to make a crypto system which doesn't introduce unreasonable latency/battery drain on low-end mobile devices which is also sufficiently difficult for scrapers running on bleeding edge hardware.

If you decide that low end devices are a worthy sacrifice then you're creating e-waste. Not to mention the energy burden.

ge96•8mo ago

I think some sites that stream content (illegally) do this

myself248•8mo ago

If the proof-of-work system is actually a crypto miner, such that visitors end up paying the site for the content they host, have we finally converged on a working implementation of the micropayments-for-websites concepts of decades ago?

diggan•8mo ago

> If the proof-of-work system is actually a crypto miner, such that visitors end up paying the site for the content they host

Unsure how that would work. If the proof you generate could be used for blockchain operations, so that the website operator could be paid by using that proof as generated by the website visitor, why shouldn't the visitor keep that proof to themselves and use it instead? Then they'd get the full amount, and the website operator gets nothing. So then there is no point for it, and the visitor might as well just run a miner locally :)

Retr0id•8mo ago

If the user mined it themselves and then paid the site owner before accessing the site, they'd have to pay a transaction fee and wait for a high-latency transaction to commit. The transaction fee could dwarf the actual payment value.

Mining on behalf of the site owner negates the need for a transaction entirely.

viraptor•8mo ago

(unnecessary)

Retr0id•8mo ago

I know this.

viraptor•8mo ago

Responded to wrong comment, sorry

odo1242•8mo ago

The company Coinhive used to do this before they shut down. Basically, in order to enter a website, you have to provide the website with a certain number of Monero hashes (usually around 1,024) that the website would send to Coinhive’s miner pool before letting the user through.

It kinda worked, except for the fact that hackers would try to “cryptojack” random websites by hacking them and inserting Coinhive’s miner into their pages. This caused everyone to block Coinhive’s servers. (Also you wouldn’t get very much money out of it - even the cryptojackers who managed to get tens of millions of page views out of hacked websites reported they only made ~$40 from the operation)

kbenson•8mo ago

If attackers only made ~$40 fora good amount of work, seems like it would have resolved itself if the scheme was left to run itself to conclusion before people started blocking coinhive in (what sounds like from your description) a knee-jerk reaction.

Then again, I'm sure there's quite a bit of tweaking that could be done to make clients submit far more hashes, but that would make it much more noticeable.

hoppp•8mo ago

That $40 now coud be in the thousands if they didn't spend it Xmr was cheaper back then.

viraptor•8mo ago

Have a look at how mining pools are implemented. The client only gets to change some part of the block and does the hashing from there. You can't go back from that to change the original data - you wouldn't get paid. Otherwise you could easily scam the mining pool and always keep the winning numbers to yourself while getting paid for the partials too.

mistercow•8mo ago

Just to make sure I understand this (and maybe clarify for others too), because my understanding of proof-of-work systems is very high level:

When you mine a block, you’re basically bundling up a bunch of meaningful data, and then trying to append some padding data that will e.g. result in a hash that has N leading 0 bits. One of the pieces of meaningful data in the block is “who gets the reward?”

If you’re mining alone, you would put data on that block that says “me” as who gets the reward. But if you’re mining for a pool, you get a block that already says “the pool” for who gets the reward.

So then I’m guessing the pool gives you a lesser work factor to hit, so some value smaller than N? You’ll be basically saying “Well, here’s a block that doesn’t have N leading zeroes, but does have M, leading zeroes”, and that proves how much you’re working for the pool, and entitles you to a proportion of the winnings.

If you changed the “who gets the reward?” from “the pool” to “me”, that would change the hash. So you can’t come in after the fact, say “Look at that! N leading zeroes! Let me just swap myself in to get the reward…” because that would result in an invalid block. And if you put yourself as the reward line in advance, the pool just won’t give you credit for your “partial” answers.

Is that about right?

mistercow•8mo ago

Just to tack onto this, the one thing you could do is just deny real rewards. If someone wants you to take part in their pool in order to load a webpage, you could just refuse to submit any block that will actually pay out. This wouldn’t benefit you, but it would punish them for trying to make you spend resources on their behalf.

I wonder if there’s a way to prevent that.

lurkshark•8mo ago

This system actually existed for awhile, it was called Coinhive. Each visitor would be treated like a node in a mining pool with “credit” for the resources going to the site owner. Somewhat predictably it became primarily used by hackers who would inject the code on high profile sites or use advertising networks.

https://krebsonsecurity.com/2018/03/who-and-what-is-coinhive...

xd1936•8mo ago

The domain is now owned by Troy Hunt!

https://www.troyhunt.com/i-now-own-the-coinhive-domain-heres...

SilasX•8mo ago

From my understanding: to pose the problem for miners, you hash the block you're planning to submit (which includes all the proposed transactions). Miners only get the hash. To claim the reward, you need the preimage (i.e. block to be submitted), which miners don't have.

In theory, you could watch the transactions being broadcast, and guess (+confirm) the corresponding block, but that would require you to see all the transactions the pool owner did, and put them in the same order (the possibilities of which scale exponentially with the number of transactions). There may be some other randomness you can insert into a block too -- someone else might know this.

Edit: oops, I forgot: the block also contains the address that the fees should be sent to. So even if you "stole" your solution and broadcast it with the block, the fee is still going to the pool owner. That's a bigger deal.

moralestapia•8mo ago

Yeah, but then they wouldn't get your content? Duh.

csense•8mo ago

> why shouldn't the visitor keep that proof to themselves and use it instead

Because they can't, if the website operator designs their JavaScript correctly. In detail:

If Alice goes to Bob's website, Bob tells Alice to find a hash with 20 leading zeros for a Bitcoin block that says "Send the newly printed bitcoins and transaction fees to Bob." (It will take Alice ~2^20 guesses to find such a block, so Bob picked the number 20 such that those ~2^20 guesses happen in a couple seconds for normal humans with normal web browsers on normal devices.)

Supposing the actual Bitcoin blockchain needs a hash with 50 leading zeros, one in every 2^30 Alice's will mine a valid block (worth ~$300k at current Bitcoin prices).

If Alice finds a block with 50 leading zeros and then tries to change the block to say "Send the newly printed bitcoins and transaction fees to Alice," her new block will have a different hash (that is very unlikely to have 2^50 leading zeros), and neither the website nor the blockchain will accept it.

Sure, Alice could change the block at the beginning before starting the search. But if she finds a block with 20 leading zeros that says "Send the newly printed bitcoins and transaction fees to Alice," Bob won't accept it for access to his website. The only way Alice gets anything for a block that sends the coins to herself is from the Bitcoin blockchain if she finds a 2^50 block -- at that point, Alice is just mining Bitcoins for herself and not interacting with Bob's website at all.

(If a 1 in a billion chance to win $300k is too risky for Bob's liking, he can get a lower payout with a higher probability by using a different proof-of-work blockchain and/or a mining pool.)

odo1242•8mo ago

Not really, because it takes a LOT of hashes to actually get any crypto out of the system. Yes, you’re technically taking the user’s power and getting paid crypto, but unless you’re delaying the user for a long time, you’re only really being paid about a ten thousandth of a cent for a page visit.

Also virus scanners and corporate networks would hate you, because hackers are probably trying to embed whatever library you’re using into other unsuspecting sites.

jfengel•8mo ago

What does one actually get per page impression from Google Ads? I gather that it's more than a ten thousandth of a cent, but perhaps not all that much more.

20after4•8mo ago

It used to be a few pennies per 1000 page views. But it's been more than 10 years since I had any insight into the details of internet ad revenue. (Also it's highly dependent on all kinds of contextual details about the content, the keywords, and the specific ads you wind up serving)

msgodel•8mo ago

It would be nice if this could get standardized http headers so bots could still use sites but they effectively pay for use. That seems like the best of all possible worlds to me, the whole point of HTML is that robots can read it, otherwise we'd just be emailing eachother pdfs.

ramses0•8mo ago

https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...

...and the requisite checklist: https://trog.qgl.org/20081217/the-why-your-anti-spam-idea-wo...

msgodel•8mo ago

Formalizing it doesn't change that it's being used. If it doesn't work it shouldn't be done, if it does it should be formalized.

overfeed•8mo ago

> bots could still use sites but they effectively pay for use. That seems like the best of all possible worlds to me

This would make the entire internet a maze of AI-slop content primarily made for other bots to consume. Humans may have to resort to emailing handwritten PDFs to avoid the thoroughly enshittified web.

dandelany•8mo ago

As opposed to what we have now?

overfeed•8mo ago

Yes - things could get worse from the status quo.

At the moment, ad networks don't pay for bot impressions when detected - so content farms tend to optimize for what passes for humans. All bets are off if human and bot visitors offer the same economic value via miners, or worse if it turns out that bots are more profitable due to human impatience.

Imagine an internet optimized for bot visitors, and indifferent to humans. It would be a kind of refined brainrot aimed at a brainless audience.

DrillShopper•8mo ago

They should have to set the evil bit

theamk•8mo ago

We already have a standardized system - robots.txt - and AI bots are already ignoring it. Why would more standardized headers matter? Bots will ignore them just as they do today, pretend to be regular users, and get content without paying.

(A secondary thing is that AI bots have basically zero benefit for most websites, so unless you are some low-cost crappy content farm, it'll be in your interest to raise the prices to the max so the bots are simply locked out. Which will bring us back to point 1, bots ignoring the headers)

msgodel•8mo ago

Being indexed in search engines has zero benefit?

Also robots.txt is a suggestion but hashcash is enforced server side. I agree it's a tragedy people have started to completely ignore it but you can't ignore server side behavior.

MyPasswordSucks•8mo ago

How do you propose the server distinguish between a bot and a human visitor?

theamk•8mo ago

Being indexed in search engines is very useful, but does not need any filtering - Google and all other major search engines respect robots.txt, use well-known UA and even publish IP address.

AI bots are not search engines, and they have no benefits for the website owners. This can be very clearly seen because they ignore robots.txt, pretend to be regular browsers, and use multiple proxies to avoid IP bans.

kmeisthax•8mo ago

The problem with micropayments was fourfold:

1. Banner ads made more money. This stopped being true a while ago, it's why newspapers all have annoying soft paywalls now.

2. People didn't have payment rails set up for e-commerce back then. Largely fixed now, at least for adults in the US.

3. Transactions have fixed processing costs that make anything <$1 too cheap to transact. Fixed with batching (e.g. buy $5 of credit and spend it over time).

4. Having to approve each micropurchase imposes a fixed mental transaction cost that outweighs the actual cost of the individual item. Difficult to solve ethically.

With the exception of, arguably[0], Patreon, all of these hurdles proved fatal to microtransactions as a means to sell web content. Games are an exception, but they solved the problem of mental transaction costs by drowning it in intensely unethical dark patterns protected by shittons of DRM[1]. You basically have to make someone press the spend button without thinking.

The way these proof-of-work systems are currently implemented, you're effectively taking away the buy button and just charging someone the moment they hit the page. This is ethically dubious, at least as ethically dubious as 'data caps[2]' in terms of how much affordance you give the user to manage their spending: none.

Furthermore, if we use a proof-of-work system that's shared with an actual cryptocurrency, so as to actually get payment from these hashes, then we have a new problem: ASICs. Cryptocurrencies have to be secured by a globally agreed-upon hash function, and changing that global consensus to a new hash function is very difficult. And those hashes have economic value. So it makes lots of sense to go build custom hardware just to crack hashes faster and claim more of the inflation schedule and on-chain fees.

If ASICs exist for a given hash function, then proof-of-work fails at both:

- Being an antispam system, since spammers will have better hardware than legitimate users[3]

- Being a billing system, since legitimate users won't be able to mine enough crypto to pay any economically viable amount of money

If you don't insist on using proof-of-work as billing, and only as antispam, then you can invent whatever tortured mess of a hash function is incompatible with commonly available mining ASICs. And since they don't have to be globally agreed-upon, everyone can use a different, incompatible hash function.

"Don't roll your own crypto" is usually good security advice, but in this case, we're not doing security, we're doing DRM. The same fundamental constants of computing that make stopping you from copying a movie off Netflix a fool's errand also make stopping scrapers theoretically impossible. The only reason why DRM works is because of the gap between theory and practice: technically unsophisticated actors can be stopped by theoretically dubious usages of cryptography. And boy howdy are LLM scrapers unsophisticated. But using the tried-and-true solutions means they don't have to be: they can just grab off-the-shelf solutions for cracking hashes and break whatever you use.

[0] At least until Apple cracked Patreon's kneecaps and made them drop support for any billing mode Apple's shitty commerce system couldn't handle.

[1] At the very least, you can't sell microtransaction items in games without criminalizing cheat devices that had previously been perfectly legal for offline use. Half the shit you sell in a cash shop is just what used to be a GameShark code.

[2] To be clear, the units in which Internet connections are sold should be kbps, not GB/mo. Every connection already has a bandwidth limit, so what ISPs are doing when they sell you a plan with a data cap is a bait and switch. Two caps means the lower cap is actually a link utilization cap, hidden behind a math problem.

[3] A similar problem has arisen in e-mail, where spammy domains have perfect DKIM/SPF, while good senders tend to not care about e-mail bureaucracy and thus look worse to antispam systems.

jaredwiener•8mo ago

Point 4 is often overlooked and I think the biggest issue.

Once there is ANY value exchanged, the user immediately wonders if it is worth it -- and if the payment/token/whatever is sent prior to the pageload, they have no way of knowing.

bee_rider•8mo ago

This is most true of books and other types of media (well, you can flip through a book at the store, but it isn’t a perfect thing…).

I dunno. Brands and other quality signals (imperfect as they tend to be, they still aren’t completely useless) could develop.

kpw94•8mo ago

Books have a back cover for that reason: so you can read it before buying.

Long-form articles could have a back cover summary too, or an enticing intro... and some substack paid articles do that already: they let you read an intro and cut before going in the interesting details.

But for short newspapers articles it becomes harder to do based on topic. If the summary has to give out 90% of the information to not be too vague, you may then feel robbed paying for it once you realize the remaining 10% wasn't that useful.

jaredwiener•8mo ago

Not to mention, the reporting that went into the headline or blurb is what is expensive. You got the value by reading it for free.

https://blog.forth.news/a-business-model-for-21st-century-ne...

wahern•8mo ago

> Once there is ANY value exchanged

There's always value exchanged--"If you're not paying for the product, you are the product".[1] For ads we've established the fiction that everybody knowingly understands and accepts this quid pro quo. For proof of work we'd settle on a similar fiction, though perhaps browsers could add a little graphic showing CPU consumption.

[1] This is true even for personal blogs, albeit the monetary element is much more remote.

hakfoo•8mo ago

Point 4 is solvable by selling a broad subscription rather than individual articles.

Streaming proves this. When people spend $10 per month on Netflix/Hulu/Crunchyroll they don't have to further ask "do I want to pay 7.5 cents for another episode" every 22 minutes. The math for who's getting paid how much for how many streams is entirely outside the customer's consideration, and the range is broad enough that it discourages one-and-done binging.

For individual content providers, you might need to form some sort of federated project. Media properties could organize through existing networks as an obvious framework ("all AP newspapers for $x per month") but we'd probably need new federations for online-first and less news-centric publishers.

x-complexity•8mo ago

> Furthermore, if we use a proof-of-work system that's shared with an actual cryptocurrency, so as to actually get payment from these hashes, then we have a new problem: ASICs. Cryptocurrencies have to be secured by a globally agreed-upon hash function, and changing that global consensus to a new hash function is very difficult. And those hashes have economic value. So it makes lots of sense to go build custom hardware just to crack hashes faster and claim more of the inflation schedule and on-chain fees.

> If ASICs exist for a given hash function, then proof-of-work fails at both:

> - Being an antispam system, since spammers will have better hardware than legitimate users[3]

> - Being a billing system, since legitimate users won't be able to mine enough crypto to pay any economically viable amount of money

Monero/XMR & Zcash break this part of the argument, along with ASIC/GPU-resistant algorithms in general (Argon2 being most well-known & recommended as a KDF).

Creating an ASIC-resistant coin is not impossible, as shown by XMR. The difficult part comes from creating & sustaining the network surrounding the coin, and those two are amongst the few that have done both. Furthermore, there's little actual need to create another coin to do so when XMR fulfills that niche.

------

> If you don't insist on using proof-of-work as billing, and only as antispam, then you can invent whatever tortured mess of a hash function is incompatible with commonly available mining ASICs. And since they don't have to be globally agreed-upon, everyone can use a different, incompatible hash function.

Counterpoint: People (and devs) want a pre-packaged solution that solves the mentioned antispam problem. For almost everyone, Anubis does its job as intended.

https://github.com/TecharoHQ/anubis

bdcravens•8mo ago

There were some Javascript-based embedded miners in the early days of Bitcoin

https://web.archive.org/web/20110603143708/http://www.bitcoi...

EGreg•8mo ago

“ because you can't particularly tell a JavaScript proof of work system from JavaScript that does other things. Letting your scraper run JavaScript means that it can”

LLMs can, LOL

One of the powerful use cases is that they can catch pretty much EVERY attempt at obfuscation now. Building a marketplace and want to let the participants chat but not take the deal off-site? LLM in the loop. Protecting children? LLM in the loop.

fennecfoxy•8mo ago

Hmmmm this seems like something that will be bad for the environment.

reginald78•8mo ago

AI scrapping, bloated javascript pages that are mostly text, ads, existing captchas, etc also all waste energy for no gain to the end user and we mostly just accept it or maybe run an ad blocker. I see people complain PoW solutions make it hard for low power devices and are a waste of energy, which is true. But that is also true of the status quo which is also an annoying waste of human time and often a privacy nightmare.

avastel•8mo ago

Reposting a similar point I made recently about CAPTCHA and scalpers, but it’s even more relevant for scrapers.

PoW can help against basic scrapers or DDoS, but it won’t stop anyone serious. Last week I looked into a Binance CAPTCHA solver that didn’t use a browser at all, just a plain HTTP client. https://blog.castle.io/what-a-binance-captcha-solver-tells-u...

The attacker had fully reverse engineered the signal collection and solved-state flow, including obfuscated parts. They could forge all the expected telemetry.

This kind of setup is pretty standard in bot-heavy environments like ticketing or sneaker drops. Scrapers often do the same to cut costs. CAPTCHA and PoW mostly become signal collection protocols, if those signals aren’t tightly coupled to the actual runtime, they get spoofed.

And regarding PoW: if you try to make it slow enough to hurt bots, you also hurt users on low-end devices. Someone even ported PerimeterX’s PoW to CUDA to accelerate solving: https://github.com/re-jevi/PerimiterXCudaSolver/blob/main/po...

benregenspan•8mo ago

At a media company, our web performance monitoring tool started flagging long-running clientside XHR requests, which I couldn't reproduce in a real browser. It turned out that an analytics vendor was injecting a script which checked if it looked like the client was a bot. If so, they would then essentially use the client as a worker to perform their own third-party API requests (for data like social share counts). So there's definitely some prior art for this kind of thing.

apitman•8mo ago

This is really interesting. One naive thought that immediately came to mind is that bots might be capable of making cross site requests. The logical conclusion of this entire arms race is that bots will eventually have no choice but to run actual browsers. Not sure that fact will appreciably reduce their scaping abilities though.

benregenspan•8mo ago

> The logical conclusion of this entire arms race is that bots will eventually have no choice but to run actual browsers

I think this is almost already the case now. Services like Cloudflare do a pretty good job of classifying primitive bots and if site operators want to block all (or at least vast majority), they can. The only reliable way through is a real browser. (Which does put a floor on resource needs for scraping)

dragonwriter•8mo ago

> The logical conclusion of this entire arms race is that bots will eventually have no choice but to run actual browsers.

I thought bots using (headless) browsers was an existing workaround for a number of existing issues with simpler bots, so this doesn't seem to be a big change.

akomtu•8mo ago

That's DRM essentially: watch, but do not copy. Except that in this case the corporations do piracy.

jameslk•8mo ago

What if we move some of the website backend workload to the bots, effectively using them as decentralized cloud infrastructure? web3, we are so back

kbenson•8mo ago

With regard to proof of work systems that provide revenue:

1) Making LLM (and other) scrapers pay for the resources they use seems perfectly fine to me. Also, as someone that manages some level of scraping (on the order of low tens of millions of requests a month), I'm fine with this. There's a wide range of scraping that the problem is not some resource cost, but the other side not wanting to deal with setting up APIs or putting so many hurdles on access that it's easier to just bypass it.

2) This seems like it might be an opportunity for Cloudflare. Let customers opt-in to requiring a proof of work when visitors already trip the cloudflare vetting page that runs additional checks to see if you're a bad actor, and apply any revenue to a service credit towards their monthly fee (or if on a free plan, as credit to be used for trying out additional for-pay features). There might be a perverse inventive to toggle on more stringent checking from cloudflare, but ultimately since it's all being paid for that's the site owner's choice on how they want to manage their site.

France's homegrown open source online office suite

British drivers over 70 to face eye tests every three years

Start all of your commands with a comma (2009)

Leisure Suit Larry's Al Lowe on model trains, funny deaths and Disney

Hoot: Scheme on WebAssembly

Software Factories and the Agentic Moment

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

First Proof

Stories from 25 Years of Software Development

Reinforcement Learning from Human Feedback

The Waymo World Model

Coding agents have replaced every framework I used

Vocal Guide – belt sing without killing yourself

A Fresh Look at IBM 3270 Information Display System

72M Points of Interest

StrongDM's AI team build serious software without even looking at the code

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Vinklu Turns Forgotten Plot in Bucharest into Tiny Coffee Shop

What Is Stoicism?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Making geo joins faster with H3 indexes

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Ga68, a GNU Algol 68 Compiler

Hackers (1995) Animated Experience

Sheldon Brown's Bicycle Technical Info

What Is Ruliology?

Show HN: I spent 4 years building a UI design tool with only the features I use

Show HN: If you lose your memory, how to regain access to your computer?

An Update on Heroku

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

A thought on JavaScript "proof of work" anti-scraper systems

Comments

France's homegrown open source online office suite

British drivers over 70 to face eye tests every three years

Start all of your commands with a comma (2009)

Leisure Suit Larry's Al Lowe on model trains, funny deaths and Disney

Hoot: Scheme on WebAssembly

Software Factories and the Agentic Moment

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

First Proof

Stories from 25 Years of Software Development

Reinforcement Learning from Human Feedback

The Waymo World Model

Coding agents have replaced every framework I used

Vocal Guide – belt sing without killing yourself

A Fresh Look at IBM 3270 Information Display System

72M Points of Interest

StrongDM's AI team build serious software without even looking at the code

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Vinklu Turns Forgotten Plot in Bucharest into Tiny Coffee Shop

What Is Stoicism?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Making geo joins faster with H3 indexes

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Ga68, a GNU Algol 68 Compiler

Hackers (1995) Animated Experience

Sheldon Brown's Bicycle Technical Info

What Is Ruliology?

Show HN: I spent 4 years building a UI design tool with only the features I use

Show HN: If you lose your memory, how to regain access to your computer?

An Update on Heroku

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev