Sacrificing accessibility for not getting web scraped

https://tilschuenemann.de/projects/sacrificing-accessibility-for-not-getting-web-scraped

33•tilschuenemann•1mo ago

Comments

ctoth•1mo ago

Congratulations, I guess? I can't read your content. But ... The machines can't either, so ... great job!

Although... Hmm! I just pasted it into Claude and got:

When text content gets scraped from the web, and used for ever-increasing training data to improve. Copyright laws get broken, content gets addressively scraped, and even though you might have deleted your original work, it might must show up because it got cached or archived at some point. Now, if you subscribe to the idea that your content shouldn't be used for training, you don't have much say. I wondered how I personally would mitigate this on a technical level. et tu, caesar? In my linear algebra class we discussed the caesar cipher[1] as a simple encryption algorithm: Every character gets shifted by n characters. If you know (or guess) the shift, you can figure out the original text. Brute force or character heuristics break this easily. But we can apply this substitution more generally to a font! A font contains a cmap (character map), which maps codepoints and glyphs. A codepoint defines the character, or complex symbol, and the glyph represents the visual shape. We scramble the font's codepoint-glyph-mapping, and adjust the text with the inverse of the scramble, so it stays intact for our readers. It displays correctly, but the inspected (or scraped) HTML stays scrambled. Theoretically, you could apply a different scramble to each request. This works as long as scrapers don't use OCR for handling edge cases like this, but I don't think it would be feasible. I also tested if ChatGPT could decode a ciphertext if I'd tell it that a substitution cipher was used, and after some back and forth, it gave me the result: "One day Alice went down a rabbit hole,

How accurate is this?

Did you seriously just make things worse for screen reader users and not even ... verify ... it worked to make things worse for AI?

lumirth•1mo ago

That’s the correct text of the article, as far as I can tell. Though not the entirety of it. The author goes on to say that ChatGPT wasn’t able to parse out the underlying text.

Part of the reason it might be useful is not because “no AI can ever read it” (because I’m sure a pentesting-focused Claude Code could get past almost any similar obfuscation), but rather that the completely automated and dumb scrapers stealing your content for the training of the AI models can’t read it. For many systems, that’s more than enough.

That said, I recently completely tore apart my website and rebuilt it from the ground up because I wasn’t happy with how inaccessible it was. For many like me, sacrificing accessibility is not just a bad look, but plainly unacceptable.

ctoth•1mo ago

I didn't use Claude Code. I just pasted it directly into the web interface and said "I can't read this, can you help?" and then I excerpted the result so you sighted folks didn't have to reread, you could just verify the content matched.

So basically this person has put up a big "fuck you" sign to people like me... while at the same time not protecting their content from actual AI (if this technique actually caught on it is trivial to reverse it in your data ingestion pipeline)

lumirth•1mo ago

I didn’t think you did use Claude Code! I was just saying that with AI agents these days, even more thoroughly obfuscated text can probably be de-obfuscated without much effort.

I suppose I don’t know data ingestion that well. Is de-obfuscating really something they do? If I was maintaining such a pipeline and found the associated garbage data, I doubt I’d bother adding a step for the edge case of getting the right caesar cipher to make text coherent. Unless I was fine-tuning a model for a particular topic and a critical resource/expert obfuscated their content, I’d probably just drop it and move on.

That said, after watching my father struggle deeply with the complex computer usage his job requires when he developed cataracts, I don’t see any such method as tenable. The proverbial “fuck you” to the disabled folks who interact with one’s content is deeply unacceptable. Accessible web content should be mandatory in the same way ramps and handicap parking are—if not more-so. For that matter, it shouldn’t take seeing a loved one slowly and painfully lose their able body to give a shit about accessibility. Point being, you’re right to be pissed and I’m glad this post had a direct response from somebody with direct personal experience needing accessible content so quickly after it went up.

tilschuenemann•1mo ago

Yes, it's worse for screenreaders, I listed that next to other drawbacks which I acknowledged. I don't intend to apply this method anywhere else due to these drawbacks, because accessibility matters.

It's a proof of concept, and maybe a starting point for somebody else who wants to tackle this problem.

Can LLMs detect and decode the text? Yes, but I'd wager for the case that data cleaning doesn't happen to the extent that it decodes the text after scraping.

flir•1mo ago

But it's "made with ♥" (the footer says so).

(He's broken mainstream browsers, too - ctrl+f doesn't work in the page.)

GPT 5.2 extracted the correct text, but it definitely struggled - 3m36s, and it had to write a script to do it, and it messed up some of the formatting. It actually found this thread, but rejected that as a solution in the CoT: "The search result gives a decoded excerpt, which seems correct, but I’d rather decode it myself using a font mapping."

I doubt it would be economic to decode unless significant numbers of people were doing this, but it is possible.

NewsaHackO•1mo ago

This is the point I was making downthread: no scraper will use 3m36s of frontier LLM time to get <100 KB of data. This is why his method would technically achieve what he asked for. Someone alluded to this further down the thread, but I wonder if one-to-one letter substitution specifically would still expose some extractable information to the LLM, even without decoding.

esquivalience•1mo ago

This is fairly highly accurate (from a skim read, close to but not quite 100%). The article describes fooling ChatGPT with a caeser cipher, but not a full test of the obfuscation in-practice.

fiddlerwoaroof•1mo ago

Gemini (3.0 Thinking) solves it too.

NewsaHackO•1mo ago

You are missing his point. He is not saying that the Caesar cipher is unbreakable by LLMs. These web scrapers are gathering a very large amount of data to train new LLMs. It is not feasible to use hundreds of thousands (millions?) of dollars to run petabytes of random, raw data into a frontier LLM model before using the data, just to catch one person possibly using a cipher to obfuscate their data. That is the value proposition: make your data slightly harder to scrape so that web scrapers for LLM training would rather let your data be unusable than make an investment to attempt to extract it.

mft_•1mo ago

Does (I’m assuming) your screen reader cope with text that’s displayed in (for example) a raster or a vector image?

Buttons840•1mo ago

Android's built-in OCR worked for me. I was able to copy the text.

lumirth•1mo ago

This is a fascinating piece. How feasible is something like this on any broader scale? I find that for my blog, losing copy-paste would immediately outweigh the benefits of not getting scraped. Am I missing the downsides of being the scrapee?

ArcHound•1mo ago

AFAIK at least the comet browser uses OCR, so I worry that the "OCR not feasible" argument is sadly wrong.

lumirth•1mo ago

The comet browser is different from scraping, though, no? Not that I’d ever use this, but the goal doesn’t seem to be “no AI can ever touch this” but rather “large scale training-data scrapers find useless garbage.”

ArcHound•1mo ago

I'd say it's a good PoC.

They want to have many users. So they are ok with using OCR for many users. And since they are sending the accessed content through their APIs, might as well send a copy of it to training.

In conclusion, it seems that mass OCR usage is within the scope of the AI companies.

darknoon•1mo ago

Here's the problem, you're still going to get scraped and the LLM will understand it anyway. Maybe at best you'll get filtered out of the dataset b/c it's high perplexity text?

lumirth•1mo ago

Is the problem with scraping the bandwidth usage or the stealing of the content? The point here doesn’t seem to be obfuscation from direct LLM inference (I mean, I use Shottr on my MacBook to immediately OCR my screenshots) but rather stopping you from ending up in the dataset.

Is there a reason you believe getting filtered out is only a “maybe?” Not getting filtered out would seem to me to imply that LLM training can naturally extract meaning from obfuscated tokens. If that’s the case, LLMs are more impressive than I thought.

201984•1mo ago

Do training scrapers really feed all their input through an LLM to decode it? That sounds expensive and most content probably doesn't need that. If they don't, then this method probably works to keep your stuff out of the training datasets.

cwillu•1mo ago

They don't need to decode it first, it can be passed in directly.

striking•1mo ago

I handed a PDF capture to Opus 4.5 (since web fetch returned a 403) and it was able to crack the cypher pretty quickly. Fun idea, but I think this only prevents humans from using the site.

Roark66•1mo ago

Jeez, all this "LLM avoidance" is so horribly silly. We will remember it in 10 years like we now remember the decision to setup our great European cookie popup laws. As well intentioned, but very detrimental.

You will not stop scrapers. Period. They will just pay for a service like firecrawl that will fix it for them. Here in Poland one of the most notorious sites implementing anti-bot tech is our domestic eBay competitor allegro.pl. I've been locked out of that site for "clicking too fast" more than once. They have the strictest, most inconvenient software possible (everyone uses the site). And yet firecrawl has no problem scraping them (although rather slowly).

Second argument against these "protections" is, there are people behind bots. Many bot requests are driven by a human asking "find me cheapest rtx5060 ti 16gb" today. If your site blocks it they will loose that sale.

creata•1mo ago

> You will not stop scrapers.

That's no reason to go down without a fight!

> Second argument against these "protections" is, there are people behind bots

That doesn't hold much water in the context of hosting a blog.

Hizonner•1mo ago

> That's no reason to go down without a fight!

I suppose, if you assign no value to your time and don't care at all about any collateral damage.

Otherwise, assuming it's true, it sure is a damned good reason to go down without a fight.

creata•1mo ago

> if you assign no value to your time

We're talking about a personal blog; this sort of fun is what they're made of.

1gn15•1mo ago

At least the blog author is self-aware about making accessibility worse? I just found it funny how reactionary and backfire-y this was.

(In politics, a reactionary is a person who favors a return to a previous state of society which they believe possessed positive characteristics absent from contemporary society.)

creata•1mo ago

I sympathize with the reactionary. Obviously there's no putting the genie back in the bottle, but it would be nice to live in that world where writing stuff helped human readers more than it helped billion-dollar corporations.

sureglymop•1mo ago

You do still help human readers. I've been absolutely having the joy of my life for example reading the advent of compiler optimization posts that surface here everyday.

Granted there is a lot of AI slop here also now but I'm still glad humans write so that I can read and we can discuss here!

consumer451•1mo ago

I have a dumb question: What if we put a simple password in front of every website that everyone knew, like "password". Upon click of login, the user agrees to the terms of service which exclude all automatic scraping.

I know this is a dumb idea, but I would love to know exactly why.

embedding-shape•1mo ago

Legal agreements won't stop companies who don't care about legal agreements, they'll just add "popup.write('password').submit()" and move on. Multiply by 1000x, and you've solved nothing, while making the UX for normal users worse.

consumer451•1mo ago

I should have been more clear with the goal.

I know it's technically very easy to get around, but would it give the content owner any stronger legal footing?

Their content is no longer on "the open Internet," which is the AI labs' main argument, is it not?

j1elo•1mo ago

Law is only as useful as your ability to enforce it. Would you have the legal means (and will) to go after a random foreigner who still scraped ypur site after accepting the terms? Multiply by 1,000 times a day.

j1elo•1mo ago

That's cool. Hopefully you never post any remotely interesting, because in my very human 2010's way of doing things, I cannot even select and copy some text to my personal notes.

This goes well beyond accessibility and bots. I guess the Reader mode, a basic web browser feature meant precisely to read articles, wasn't an expected use case either?

yjftsjthsd-h•1mo ago

Simple solution: Screenshot it, then ask your AI of choice to read that:)

(Or use any other OCR solution you like; I've got a prototype that takes a screenshot and runs it through tesseract.)

j1elo•1mo ago

That's funny :)

We'll come full-circle when web authors provide custom-made glasses to decipher their sites, as the plain rendering will be obfuscated to prevent OCR too.

creata•1mo ago

I can't wait for the entire website to be a Magic Eye image.

JC1NC•1mo ago

My RSS reader tried to scrape the page from the Hacker News feed so that I could read it and it was all garbled ): This breaks so many things, including translation and screen readers.

Would you be allowed to do this in some countries commercially because of accessibility laws?

oidar•1mo ago

I can read it on safari, but not on firefox.

nottorp•1mo ago

Well, at least it isn't Chrome only.

yjftsjthsd-h•1mo ago

Curious; it appears to render correctly in firefox for me. Any chance you flipped a privacy setting that blocks font shenanigans?

rpdillon•1mo ago

I mean, the core problem is the same as DRM: you want to distribute your data without actually distributing it. The endgame is simply rendering and OCRing, which will likely be feasible at scale, but in the meantime, you've cut off a bunch of your legit audience, broken RSS, broken search engine indexing, and AI can still scrape it when requested.

We Mourn Our Craft

Speed up responses with fast mode

Hoot: Scheme on WebAssembly

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Stories from 25 Years of Software Development

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Al Lowe on model trains, funny deaths and working with Disney

The AI boom is causing shortages everywhere else

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

The Waymo World Model

Reinforcement Learning from Human Feedback

Start all of your commands with a comma (2009)

I Write Games in C (yes, C)

Vocal Guide – belt sing without killing yourself

SectorC: A C Compiler in 512 bytes

France's homegrown open source online office suite

Coding agents have replaced every framework I used

A Fresh Look at IBM 3270 Information Display System

Selection Rather Than Prediction

History and Timeline of the Proco Rat Pedal (2021)

72M Points of Interest

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Learning from context is harder than we thought

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Making geo joins faster with H3 indexes

Software factories and the agentic moment

We Mourn Our Craft

Speed up responses with fast mode

Hoot: Scheme on WebAssembly

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Stories from 25 Years of Software Development

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Al Lowe on model trains, funny deaths and working with Disney

The AI boom is causing shortages everywhere else

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

The Waymo World Model

Reinforcement Learning from Human Feedback

Start all of your commands with a comma (2009)

I Write Games in C (yes, C)

Vocal Guide – belt sing without killing yourself

SectorC: A C Compiler in 512 bytes

France's homegrown open source online office suite

Coding agents have replaced every framework I used

A Fresh Look at IBM 3270 Information Display System

Selection Rather Than Prediction

History and Timeline of the Proco Rat Pedal (2021)

72M Points of Interest

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Learning from context is harder than we thought

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Making geo joins faster with H3 indexes

Software factories and the agentic moment

Sacrificing accessibility for not getting web scraped

Comments