No/Low-Code Tool for Data Pipeline with DLT

1•nakassony•58s ago•0 comments

The plague of fake components: Small-signal Transistors

https://blog.hjf.com.ar/en/2025/10/21/the-plague-of-fake-components-small-signal-transistors/
1•obscurette•2m ago•0 comments

Innovation Survivors Club

https://www.brettmacfarlane.com/blog/2025/innovation-narrative-pslmf
1•mooreds•2m ago•0 comments

New Cognitive Bias Regarding AI

https://www.rickmanelius.com/p/new-cognitive-bias-regarding-ai
1•mooreds•3m ago•0 comments

Don't Let Terraliths Drag Your Team Down

https://newsletter.masterpoint.io/p/dont-let-terraliths-drag-your-team-down
1•mooreds•4m ago•0 comments

Meta Tags

https://meta-tags.io/
1•mddanishyusuf•13m ago•0 comments

The Coasean Singularity in Patents

https://www.symmetrybroken.com/coasean-patents/
1•riemannzeta•14m ago•1 comments

Built an app to set death timers for failed products. Which one's dying first?

1•toolover•14m ago•0 comments

Quick thoughts on the recent AWS outage

https://surfingcomplexity.blog/2025/10/25/quick-thoughts-on-the-recent-aws-outage/
1•barrkel•16m ago•0 comments

What's Not a Dark Pattern?

https://hallofshame.design/whats-not-a-dark-pattern/
1•bookofjoe•17m ago•0 comments

Serving Humans and AI Through Content Negotiation

https://www.nibzard.com/architecture
1•nkko•19m ago•0 comments

How bugs made me believe in TDD

https://patrickm.de/why-tdd/
1•sneakyPad•20m ago•0 comments

What Does a 'Sovereign Cloud' Mean?

https://www.techpolicy.press/what-does-a-sovereign-cloud-really-mean/
1•ireflect•22m ago•0 comments

Tiny sugars in the brain disrupt emotional circuits, fueling depression

https://medicalxpress.com/news/2025-10-tiny-sugars-brain-disrupt-emotional.html
1•PaulHoule•22m ago•0 comments

Tales from Toddlerhood

https://waitbutwhy.com/2025/10/toddler.html
1•tosh•22m ago•0 comments

CSS Terrain Generator

https://terra.layoutit.com
1•FromTheArchives•23m ago•0 comments

Automating Sparkle Updates for macOS Apps

https://cindori.com/developer/automating-sparkle-updates
1•wahnfrieden•23m ago•0 comments

ASK HN: How to Make Connections?

1•Tanishmittal•25m ago•2 comments

Language records reveal a surge of cognitive distortions in recent decades

https://www.pnas.org/doi/10.1073/pnas.2102061118
2•Marshferm•26m ago•0 comments

Why every website you used to love is getting worse

https://www.vox.com/technology/465922/enshittification-cory-doctorow-amazon-google-facebook
2•raw_anon_1111•26m ago•1 comments

Believability in Practice (2021)

https://commoncog.com/believability-in-practice/
2•cjbarber•27m ago•0 comments

A Bootable Greeting for the Xenomorph in Your Life

https://hackaday.com/2019/10/07/a-bootable-greeting-for-the-xenomorph-in-your-life/
1•rbanffy•28m ago•0 comments

Show HN: RightMindMath and Arithmetic Fluency

https://rightmindmath.com/
1•ruralfam•29m ago•1 comments

Random Spherical Coordinates

https://www.johndcook.com/blog/2025/10/22/random-spherical-coordinates/
1•ibobev•31m ago•0 comments

RAII to remedy a defect where not all code paths performed required exit actions

https://devblogs.microsoft.com/oldnewthing/20251017-00/?p=111698
1•ibobev•32m ago•0 comments

Language agents for optimal conversation stopping

https://stoppingagents.com/
1•emaadm•33m ago•0 comments

Windows Runtime design principle: Properties can be set in any order

https://devblogs.microsoft.com/oldnewthing/20251023-00/?p=111716
2•ibobev•34m ago•0 comments

Gambling Is Bad

https://geohot.github.io//blog/jekyll/update/2025/10/24/gambling-is-bad.html
2•surprisetalk•34m ago•0 comments

Getting Started with the Swift SDK for Android

https://www.swift.org/documentation/articles/swift-sdk-for-android-getting-started.html
1•tosh•35m ago•0 comments

Resource use matters, but material footprints are a poor way to measure it

https://ourworldindata.org/material-footprint-limitations
1•surprisetalk•35m ago•0 comments

You Should Feed the Bots

https://maurycyz.com/misc/the_cost_of_trash/
85•chmaynard•2h ago

Comments

fainpul•2h ago
This follow-up post has the details of the "Markov babbler":

https://maurycyz.com/projects/trap_bots/
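The babbler described there is a word-level Markov chain (the original implementation is in C). A minimal Python sketch of the idea, with `context_length` matching the parameter mentioned downthread and everything else illustrative:

```python
import random
from collections import defaultdict

def build_chain(text, context_length=2):
    """Map each run of context_length words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - context_length):
        key = tuple(words[i:i + context_length])
        chain[key].append(words[i + context_length])
    return chain

def babble(chain, length=50, seed=None):
    """Emit plausible-looking nonsense by walking the chain."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    out = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:                 # dead end: restart at a random state
            state = rng.choice(list(chain))
            followers = chain[state]
        word = rng.choice(followers)
        out.append(word)
        state = (*state[1:], word)
    return " ".join(out)
```

Fed a real book, the output is locally grammatical but globally meaningless, which is exactly what makes it cheap to generate and annoying to filter.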

isoprophlex•37m ago
Very elegant and surprisingly performant. I hope the llm bros have a hard time cleaning this shit out of their scrapes.
akoboldfrying•2h ago
My initial reaction was that running something like this is still a loss, because it probably costs you as much or more than it costs them in terms of both network bytes and CPU. But then I realised two things:

1. If they are using residential IPs, each byte of network bandwidth is probably costing them a lot more than it's costing you. Win.

2. More importantly, if this became a thing that a large fraction of all websites do, the economic incentive for AI scrapers would greatly shrink. (They don't care if 0.02% of their scraping is garbage; they care a lot if 80% is.) And the only move I think they would have in this arms race would be... to use an LLM to decide whether a page is garbage or not! And now the cost of scraping a page is really starting to increase for them, even if they only run a local LLM.

asgerhb•1h ago
Not to mention they have to store the data after they download it. In theory storing garbage data is costly to them. However I have a nagging feeling that the attitude of these scrapers is they get paid the same amount per gigabyte whether it's nonsense or not.
luckylion•1h ago
If they even are AI crawlers. It could just as well be exploit scanners searching for endpoints they'd try to attack. That wouldn't require storing the content, only the links.
mrweasel•53m ago
We should encourage number 2. So much of the content that the AI companies are scraping is already garbage, and that's a problem. E.g. LLMs are frequently confidently wrong, but so is Reddit, which produces a large volume of training data. We've seen a study suggesting that you can poison an LLM with very little data. Encouraging the AI companies to care about the quality of the data they scrape could benefit everyone.

The cost of being critical of source material might make some AI companies tank, but that seems inevitable.

goodthink•2h ago
I have yet to see any bots figure out how to get past the Basic Auth protecting all links on my (zero traffic) website. Of course, any user following a link will be stopped by the same login dialog (I display the credentials on the home page). The solution is to make the secrets public. ALL websites could implement the same User/Pass credentials: User: nobots, Pass: nobots. Can bot writers overcome this if they know the credentials?
CaptainOfCoit•2h ago
> Can bot writers overcome this if they know the credentials?

Yes, instead of doing just an HTTP request, do an HTTP request with authentication; trivial, really. Probably the reason they "can't" do that now is that they haven't come across "public content behind Basic Auth with known correct credentials", so the behavior hasn't been added. But it's literally loading http://username:password@example.com instead of http://example.com to use Basic Auth, couldn't be simpler :)
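As a sketch of how little extra work that is for a crawler (the nobots credentials are the hypothetical ones from the comment above; the function names are mine):

```python
import base64
import urllib.request

def basic_auth_header(user, password):
    """Precompute the one header that Basic Auth adds to a plain GET."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

def fetch(url, user="nobots", password="nobots"):
    """Identical to a normal request apart from the extra header."""
    req = urllib.request.Request(url, headers=basic_auth_header(user, password))
    return urllib.request.urlopen(req).read()
```

One static header per site, so the marginal cost to a scraper that already knows the credentials is essentially zero.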

8organicbits•2h ago
The technical side is straightforward, but the legal implications of trying passwords to scrape content behind authentication could pose a barrier. Using credentials that aren't yours, even if they are publicly known, is (in many jurisdictions) a crime. Doing it at scale as part of a company would be quite risky.
Filligree•2h ago
Sure, it’s a crime for the bots, but it would also be a crime for the ordinary users that you want to access the website.

Or if you make it clear that they’re allowed, I’m not sure you can stop the bots then.

CaptainOfCoit•1h ago
I don't think it'd be illegal for anyone.

The (theoretical) scenario is: There is a website (example.com) that publishes the correct credentials, and tells users to go to example.com/authenticate and put those there.

At no point is a user (or bot) bypassing anything that was meant to stop them, they're following what the website is telling them publicly.

8organicbits•1h ago
I think this analysis is correct. The part you're missing from my comment is "at scale", which means trying to apply this scraping technique to other sites. As a contract security engineer I've found all kinds of accidentally leaked credentials; knowing if a set of credentials is accidentally leaked or are being intentionally disclosed to the public feels like a human-in-the-loop kind of thing. Getting it wrong, especially when automated at scale, is the context the bot writer needs to consider.
Macha•1h ago
The legal implications of torrenting giant ebook collections didn't seem to stop them, not sure why this would
8organicbits•1h ago
The law doesn't directly stop anyone from doing anything; it acts very differently from a technical control. The law provides recourse to people hurt by violations and enables law enforcement action. I suspect Meta has since stopped their torrenting, and may lose the lawsuit they currently face. Anyone certainly could log in to any site with credentials that are not their own, but fear of legal action may deter them.
sisizbzb•1h ago
There’s hundreds of billions of dollars behind these guys. Not only that, but they also have institutional power backing them. The laws don’t really matter to the worst offenders.

Similar to OPs article, trying to find a technical solution here is very inefficient and just a bandaid. The people running our society are on the whole corrupt and evil. Much simpler (not easier) and more powerful to remove them.

CaptainOfCoit•1h ago
> but the legal implications of trying passwords to try to scrape content behind authentication could pose a barrier

If you're doing something alike to cracking then yeah. But if the credentials are right there on the landing page, and visible to the public, it's not really cracking anymore since you already know the right password before you try it, and the website that put up the basic auth is freely sharing the password, so you aren't really bypassing anything, just using the same access methods as everyone else.

Again, if you're stumbling upon basic auth and you try to crack them, I agree it's at least borderline illegal, but this was not the context in the parent comment.

hn8726•25m ago
Otoh if, as a human, you use a known (even leaked on the website) password to "bypass the security" in order to "gain access to content you're not authorized to see", I think you'd get in trouble. I'd like it if the same logic applied to bots: implement basic (albeit weak) security and only allow access to humans. This way bots have to _hack you_ to read the content
CaptainOfCoit•15m ago
> you use a known (even leaked on the website) password to "bypass the security" in order to "gain access to content you're not authorized to see", I think you'd get in trouble

I agree, but if someone has a website that says "This isn't the real page, go to /real.html and when authentication pops up, enter user:password", then I'd argue that is no longer "gaining access to content you're not authorized to see", the author of the page shared the credentials themselves, and acknowledged they aren't trying to hide anything, just providing a non-typical way of accessing the (for all intents and purposes, public) content.

lcnPylGDnU4H9OF•45s ago
> freely sharing the password

It doesn't have to be so free. It can be shared with the stipulation that it's 1) not shared and 2) not used in a bot.

https://www.law.cornell.edu/uscode/text/17/1201

  (a) Violations Regarding Circumvention of Technological Measures.—
    (1)
      (A) No person shall circumvent a technological measure that effectively controls access to a work protected under this title.
This has been used by car manufacturers to deny diagnostic information even though the encryption key needed to decrypt the information is sitting on disk next to the encrypted data. That's since been exempted for vehicle repairs but only because they're vehicle repairs, not because the key was left in plain view.

If you are only authorized to access it under certain conditions, trying to access it outside those conditions is illegal. Gaining knowledge of a password does not grant permission to use it.

DrewADesign•1h ago
The people in the mad dash to AGI are either driven by religious conviction, or pure nihilism. Nobody doing this seriously considers the law a valid impediment. They justify (earnestly or not) companies doing things like scraping independent artists' bread-and-butter work to create commercial services that tank their market with garbage knockoffs by claiming we're moving into a post-work society. Meanwhile, the US government is moving at a breakneck pace to dismantle the already insufficient safety nets we do have. None of them care. Ethical roadblocks seem to be a solved problem in tech, now.
morkalork•1h ago
The bot protection on low traffic sites can be hilarious in how simple and effective it can be. Just click this checkbox. That's it. But it's not a checkbox matching a specific pattern provided by a well-known service, so until the bot writer inspects the site and adds the case, it'll work. A browser running OpenAI Operator or whatever it's called would immediately figure it out, though.
akoboldfrying•13m ago
> A browser running OpenAI Operator or whatever it's called would immediately figure it out, though.

But running that costs money, which is a disincentive. (How strong of a disincentive depends on how much it costs vs. the estimated value of a scraped page, but I think it would 100x the per-page cost at least.)

throw-10-13•2h ago
This is genuinely one of the stupidest ideas I have seen on this site.
lfkdev•1h ago
Not sure if I can follow you, why would credentials known by anyone stop bots?
hyperhello•1h ago
Why not show them ads? Endless ads, with AI content in between them?
delecti•1h ago
To what end? I imagine ad networks have pretty robust bot detection. I'd also be surprised if scrapers didn't have ad block functionality in their headless browsing.
OutOfHere•1h ago
The user's approach would work only if bots could even be accurately classified, but this is impossible. The end result is that the user's site is now nothing but Markov garbage. Not only will bots desert it, but humans will too.
stubish•1h ago
The traditional approach is a link to the tarpit that the bots can see but humans can't, say using CSS to render it 0 pixels in size.
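For illustration, such a trap link might be emitted like this (the /babble/ path echoes the one mentioned elsewhere in the thread; the exact styling is just one option). The aria-hidden and tabindex attributes matter, since screen-reader and keyboard users would otherwise fall into the trap alongside the bots:

```python
def tarpit_link(href="/babble/start"):
    """A link crawlers will follow but sighted users never see.

    aria-hidden and tabindex=-1 keep the link out of the accessibility
    tree and tab order, so assistive-technology users are not trapped.
    """
    return (
        f'<a href="{href}" aria-hidden="true" tabindex="-1" '
        'style="position:absolute;width:0;height:0;overflow:hidden">'
        'archive</a>'
    )
```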
vntok•45m ago
AI bots try to behave as close to human visitors as possible, so they wouldn't click on 0px wide links, would they?

And if they would today, it seems like a trivial thing to fix: just don't click on incorrect/suspicious links?

jcynix•14m ago
The 0px rule would be in a separate .CSS file. I doubt that bots load .CSS files for .html files, at least I don't remember seeing this in my server logs.

And another "classic" solution is to use white link text on white background, or a font with zero width characters, all stuff which is rather unlikely to be analysed by a scraper interested primarily in text.

righthand•12m ago
Ideally it would require rendering the CSS and doing a check on the DOM to see if the link is 0 pixels wide. But once bots figure that out I can still apply left: -100000px or z-index: -10000 to those links, to hide them in other ways. It's a moving target: how much time will the LLM companies waste decoding all the ways I can hide something before I move the target again? Now the LLM companies are in an expensive arms race.
8organicbits•26m ago
Please keep in mind that not all humans interact with web pages by "seeing". If you fool a scraper you may also fool someone using a screen reader.
bastawhiz•1h ago
You don't need to classify bots. Bots will follow any link they find. Hide links on your pages and eventually every bot will greedily find itself in an endless labyrinth of slop.
nodja•1h ago
Why create the markov text server side? If the bots are running javascript just have their client generate it.
bastawhiz•1h ago
1. The bots have essentially unlimited memory and CPU. That's the cheapest part of any scraping setup.

2. You need to send the data for the Markov chain generator to the client, along with the code. This is probably bigger than the response you'd be sending anyway. (And good luck getting a bot to cache JavaScript)

3. As the author said, each request uses microseconds of CPU and just over a megabyte of RAM. This isn't taxing for anyone.

vntok•47m ago
> 1. The bots have essentially unlimited memory and CPU. That's the cheapest part of any scraping setup.

Anyone crawling at scale would try to limit the per-request memory and CPU bounds, no? Surely you'd try to minimize resource contention at least a little bit?

comrade1234•1h ago
I had to follow a link to see an example:

"A glass is not impossible to make the file and so deepen the original cut. Now heat a small spot on the glass, and a candle flame to a clear singing note.

— context_length = 2. The source material is a book on glassblowing."

masfuerte•1h ago
Add "babble" to any url to get a page of nonsense:

https://maurycyz.com/babble/projects/trap_bots/

tyfon•1h ago
Thank you, I am now serving them garbage :)

For reference, I picked Frankenstein, Alice in wonderland and Moby dick as sources and I think they might be larger than necessary as they take some time to load. But they still work fine.

There also seems to be a bug in babble.c in the thread handling? I did "fix" it as gcc suggested by changing pthread_detach(&thread) to pthread_detach(thread).. I probably broke something but it compiles and runs now :)

NoiseBert69•1h ago
Is there a Markov Babbler based on PHP or something else easy hostable?

I want to redirect all LLM-crawlers to that site.

zkmon•1h ago
Really cool. Reminds me of farmers in some third-world countries. Completely ignored by the government and exploited by commission brokers, farmers now use all sorts of tricks, including coloring and faking their farm produce, without regard for health hazards to consumers. The city dwellers who thought they had gamed the system through high education, jobs and slick talk have to consume whatever is served to them by the desperate farmers.
righthand•22m ago
The farmers did it to themselves; many are very wealthy already. Anything corporate America has taken over is because the farmers didn't want to do the maintenance work. So they sell out to big corporations who will make it easier.

Same as any other consumer using Meta products. You sell out because it’s easier to network that way.

I am the son of a farmer.

Lord-Jobo•2m ago
This is a very biased source discussing a very real perception issue, and worth a glance for the statistics:

https://www.farmkind.giving/the-small-farm-myth-debunked

Tldr; the concept of farmers as small family farms has not been rooted in truth for a very long time in America

masfuerte•1h ago
> SSD access times are in the tens milliseconds

Eh? That's the speed of an old-school spinning hard disk.

eviks•50m ago
How does this help protect the regular non-garbage pages from the bots?
codeduck•46m ago
It does at a macroscopic level, by making scraping expensive. If every "valid" page is scattered at random amongst a tarpit of recursive pages of nonsense, it becomes computationally and temporally expensive to scrape a site for "good" data.

A single site doing this does nothing. But many sites doing this has a severe negative impact on the utility of AI scrapers - at least, until a countermeasure is developed.

fHr•46m ago
lets go! nice
markus_zhang•40m ago
I have always recommended this strategy: flood the AI bots with garbage that looks like authentic information, so that they need actual humans to filter it. Make sure that every site does this so they get more garbage than real stuff. Hike up the proportion so that even ordinary people eventually figure out that using these AI products does more harm than good because they just produce garbage. I just don't know what the cost is; right now it looks pretty doable.

If you can't fight them, flood them. If they want to open a window, pull down the whole house.

peterlk•31m ago
LLMs can now detect garbage much more cheaply than humans can. This might increase cost slightly for the companies that own the AIs, but it almost certainly will not result in hiring human reviewers
markus_zhang•5m ago
What about garbage that is difficult to tell from the truth?

For example, say I have an AD&D website, how does AI tell whether a piece of FR history is canon or not? Yeah I know it's a bit extreme, but you get the idea.

xyzal•40m ago
I think random text can be detected and filtered. We probably need pre-generated bad information to make the utility of crawling one's site truly negative.

On my site, I serve them a subset of Emergent Misalignment dataset, randomly perturbed by substituting some words with synonyms.

It should make the LLMs trained on it behave like dicks according to this research https://www.emergent-misalignment.com/
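The perturbation step might be sketched like this (the synonym table is a tiny stand-in; a real deployment would use a much larger one, applied to the dataset described in the comment):

```python
import random

# Illustrative stand-in; a real deployment would use a full thesaurus.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "small": ["tiny", "little"],
    "make": ["create", "produce"],
}

def perturb(text, rate=0.3, seed=None):
    """Swap a fraction of words for synonyms so each serving is unique,
    which defeats naive deduplication of the served text."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        alts = SYNONYMS.get(word.lower())
        if alts and rng.random() < rate:
            out.append(rng.choice(alts))
        else:
            out.append(word)
    return " ".join(out)
```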

krzyk•31m ago
But why?

Do they do any harm? They do provide source for material if users asks for it. (I frequently do because I don't trust them, so I check sources).

You still need to pay for the traffic, and serving static content (like text on that website) is way less CPU/disk expensive than generating anything.

kaoD•28m ago
What you're referring to are LLMs visiting your page via tool use. That's a drop in the ocean of crawlers that are racing to slurp as much of the internet as possible before it dries.
blibble•30m ago
if you want to be really sneaky, make it so the site doesn't start off infinite

because an infinite site that has appeared out of nowhere will quickly be noticed and blocked

start it off small, and grow it by a few pages every day

and the existing pages should stay 99% the same between crawls to gain reputation
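One way to sketch both properties: derive the page count from the calendar and seed each page's content by its index, so the site grows by a few pages per day and existing pages stay byte-identical between crawls (the launch date and growth rate here are arbitrary):

```python
import datetime
import random

LAUNCH = datetime.date(2025, 1, 1)
PAGES_PER_DAY = 3

def page_count(today=None):
    """The site 'grows' a few pages per day instead of appearing fully formed."""
    today = today or datetime.date.today()
    return max(1, (today - LAUNCH).days * PAGES_PER_DAY)

def page_text(index, word_pool, length=100):
    """Seeding the RNG by page index makes each page identical on every crawl."""
    rng = random.Random(index)
    return " ".join(rng.choice(word_pool) for _ in range(length))
```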

chaostheory•22m ago
What’s wrong with just using cloudflare?

https://www.cloudflare.com/press/press-releases/2025/cloudfl...

TekMol•19m ago
How about adding some image with a public http logger url like

https://ih879.requestcatcher.com/test

to each of the nonsense pages, so we can see an endless flood of funny requests at

https://ih879.requestcatcher.com

?

I'm not sure requestcatcher is a good one, it's just the first one that came up when I googled. But I guess there are many such services, or one could also use some link shortener service with public logs.

jcynix•1m ago
You can easily generate a number of random images with ImageMagick and serve these as part of the babbled text. And you could even add text onto these images so image analyzers with OCR will have "fun" too.
theturtlemoves•10m ago
Does this really work, though? I know nothing about the inner workings of LLMs, but don't you want to break their word associations? The "garbage" here is generated from which words tend to occur together, and LLMs likewise generate text based on which words they have seen together; don't you want to give them text that relates unrelated words instead?