
Miasma: A tool to trap AI web scrapers in an endless poison pit

https://github.com/austin-weeks/miasma
64•LucidLynx•3h ago

Comments

splitbrainhack•1h ago
-1 for the name
QuantumNomad_•1h ago
https://en.wikipedia.org/wiki/Miasma_theory

Seems a clever and fitting name to me. A poison pit would probably smell bad. And at the same time, the theory that this tool would actually cause “illness” (bad training data) in AI is not proven.

Imustaskforhelp•1h ago
I wish there were some regulation that could force companies who scrape for profit to reveal who they are to the websites they scrape. Many new AI companies don't seem to respect any decision made by the person who owns the website and shares their knowledge for other humans, only for it to get distilled for a few cents.
GaggiX•1h ago
These projects are the new "To-Do List" app.
meta-level•1h ago
Isn't posting projects like this the most visible way to report a bug and get it fixed as soon as possible?
suprfsat•1h ago
"disobeys robots.txt" is more of a feature
madeofpalk•1h ago
Is there any evidence or hints that these actually work?

It seems pretty reasonable that any scraper would already have mitigations for things like this as a function of just being on the internet.

sd9•1h ago
Even if it did work, I just can't bring myself to care enough. It doesn't feel like anything I could do on my site would make any material difference. I'm tired.
20k•1h ago
I definitely get this. The thing that gives me hope is that you only need to poison a very small % of content to damage AI models pretty significantly. It helps combat the mass scraping, because a significant chunk of the data they get will be useless, and it's very difficult to filter by hand.
nubg•1h ago
What kind of mitigations? How would you detect the poison fountain?
avereveard•53m ago
style="display: none;" aria-hidden="true" tabindex="1"

many scrapers already know not to follow these, as it's how sites used to "cheat" PageRank by serving keyword soups
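A minimal sketch of the hidden-link pattern being described (the path `/trap/start` and helper name are hypothetical): the link is invisible to humans and flagged as hidden to assistive tech, so only a naive scraper that ignores CSS and ARIA will follow it.

```python
# Hypothetical sketch: a trap link hidden from humans (display: none),
# from screen readers (aria-hidden), and from keyboard focus (tabindex=-1).
TRAP_LINK = (
    '<a href="/trap/start" '
    'style="display: none;" aria-hidden="true" tabindex="-1">'
    "archive</a>"
)


def inject_trap(page_html: str) -> str:
    """Insert the hidden trap link just before the closing body tag."""
    return page_html.replace("</body>", TRAP_LINK + "</body>")


print(inject_trap("<html><body><p>real content</p></body></html>"))
```

As the comment notes, this is exactly the trick keyword-stuffing sites used against PageRank, which is why seasoned crawlers already discount hidden elements.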

m00dy•5m ago
Google will give your website a penalty for doing this.
GaggiX•47m ago
Because the internet is noisy and not up to date, all recent LLMs are trained using Reinforcement Learning with Verifiable Rewards; if a model has learned the wrong signature of a function, for example, it would be apparent when executing the code.
phoronixrly•43m ago
It does work, on two levels:

1. Simple, cheap, easy-to-detect, badly-behaved bots will scrape the poison and feed links to expensive-to-run browser-based bots that you can't detect in any other way.

2. Once you see a browser visit a bullshit link, you insta-ban it: you now know it is a bot, because the only way to find that link is through the poisoned data.

My personal preference is using iocaine for this purpose though, in order to protect the entire server as opposed to a single site.

m00dy•5m ago
It won't work, especially on Gemini. Googlebot is very experienced when it comes to crawling. It might work for OpenAI and others, maybe.
rvz•1h ago
> Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

Can't the scrapers just ignore robots.txt or spoof their user agents anyway?

phoronixrly•48m ago
Well-behaved agents will obey robots.txt and not fall into the trap.
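The README's advice amounts to disallowing the trap path for everyone; well-behaved crawlers obey the rule and never see the poison, while rule-breakers fall in. A sketch (the `/trap/` path is hypothetical, not Miasma's actual mount point):

```text
# robots.txt — keep friendly bots out of the poison pit
User-agent: *
Disallow: /trap/
```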
snehesht•1h ago
Why not simply blacklist or rate-limit those bot IPs?
aduwah•34m ago
There are way too many to do that
phyzome•29m ago
Because punishment for breaking the robots.txt rules is a social good.
xprnio•21m ago
If you have real traffic and bot traffic, you still need to identify which is which. On top of that, bots very likely don't reuse the same IPs over and over again. If we knew all the bot-only IPs ahead of time, then yes, blacklisting would be simple. But although it's simple in theory, identifying what to blacklist in the first place is the part that isn't so simple in practice.
imdsm•1h ago
Applied model collapse
obsidianbases1•35m ago
Why do this though?

It's like if someone was trying to "trap" search crawlers back in the early 2000s.

Seems counterproductive

bilekas•30m ago
Because of bots that don't respect robots.txt.

If you're happy to have an AI bot crawl your website while you pay for that bandwidth, then you won't use the tool.

Forgeties79•28m ago
Web crawlers didn’t routinely take down public resources or use the scraped info to generate facsimiles that people are still having ethical debates over. Its presence didn’t even register and it was indexing that helped them. It isn’t remotely the same thing.

https://www.libraryjournal.com/story/ai-bots-swarm-library-c...

tasuki•15m ago
> If you have a public website, they are already stealing your work.

I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!

Overestimation of microplastics potentially caused by scientists' gloves

https://news.umich.edu/nitrile-and-latex-gloves-may-cause-overestimation-of-microplastics-u-m-stu...
155•giuliomagnifico•3h ago•62 comments

The Cloud: The dystopian book that changed Germany (2022)

https://www.bbc.com/culture/article/20221101-the-cloud-the-nuclear-novel-that-shaped-germany
23•leonidasrup•1h ago•11 comments

Founder of GitLab battles cancer by founding companies

https://sytse.com/cancer/
1171•bob_theslob646•19h ago•224 comments

Technology: The (nearly) perfect USB cable tester does exist

https://blog.literarily-starved.com/2026/02/technology-the-nearly-perfect-usb-cable-tester-does-e...
128•birdculture•3d ago•55 comments

AI overly affirms users asking for personal advice

https://news.stanford.edu/stories/2026/03/ai-advice-sycophantic-models-research
688•oldfrenchfries•23h ago•544 comments

I turned my Kindle into my own personal newspaper

https://manualdousuario.net/en/how-to-kindle-personal-newspaper/
79•rpgbr•2d ago•28 comments

CSS is DOOMed

https://nielsleenheer.com/articles/2026/css-is-doomed-rendering-doom-in-3d-with-css/
398•msephton•17h ago•93 comments

Sinclair Microvision (1977)

https://r-type.org/articles/art-452.htm
25•joebig•2d ago•8 comments

LinkedIn uses 2.4 GB RAM across two tabs

68•hrncode•4h ago•43 comments

Show HN: Create a full language server in Go with 3.17 spec support

https://github.com/owenrumney/go-lsp
12•rumno0•4d ago•2 comments

Alzheimer's disease mortality among taxi and ambulance drivers (2024)

https://www.bmj.com/content/387/bmj-2024-082194
157•bookofjoe•12h ago•102 comments

Show HN: Public transit systems as data – lines, stations, railcars, and history

https://publictransit.systems
26•qwertykb•6h ago•8 comments

OpenBSD on Motorola 88000 Processors

http://miod.online.fr/software/openbsd/stories/m88k1.html
110•rbanffy•1d ago•15 comments

Lat.md: Agent Lattice: a knowledge graph for your codebase, written in Markdown

https://github.com/1st1/lat.md
43•doppp•4h ago•9 comments

I decompiled the White House's new app

https://thereallo.dev/blog/decompiling-the-white-house-app
559•amarcheschi•22h ago•202 comments

Nonfiction Publishing, Under Threat, Is More Important

https://newrepublic.com/article/207659/non-fiction-publishing-threat-important-ever
21•Hooke•3d ago•9 comments

Further human + AI + proof assistant work on Knuth's "Claude Cycles" problem

https://twitter.com/BoWang87/status/2037648937453232504
225•mean_mistreater•19h ago•155 comments

Monado became the foundation for OpenXR runtimes

https://www.collabora.com/news-and-blog/news-and-events/how-monado-became-the-foundation-for-open...
23•mfilion•2d ago•3 comments

What if AI doesn't need more RAM but better math?

https://adlrocha.substack.com/p/adlrocha-what-if-ai-doesnt-need-more
79•adlrocha•5h ago•44 comments

A Verilog to Factorio Compiler and Simulator (Working RISC-V CPU)

https://github.com/ben-j-c/verilog2factorio
100•signa11•3d ago•10 comments

I Built an Open-World Engine for the N64 [video]

https://www.youtube.com/watch?v=lXxmIw9axWw
414•msephton•1d ago•70 comments

Android’s new sideload settings will carry over to new devices

https://www.androidauthority.com/android-sideload-carry-over-3652845/
120•croemer•17h ago•158 comments

A laser-based process that enables adhesive-free paper packaging

https://www.fraunhofer.de/en/press/research-news/2026/march-2026/sealing-paper-packaging-without-...
93•gnabgib•14h ago•40 comments

Show HN: Sheet Ninja – Google Sheets as a CRUD Back End for Vibe Coders

https://sheetninja.io
41•sxa001•1h ago•45 comments

OpenCiv1 – open-source rewrite of Civ1

https://github.com/rajko-horvat/OpenCiv1
164•caminanteblanco•19h ago•51 comments

The ANSI art "telecomics" of the 1992 election

https://breakintochat.com/blog/2026/03/25/don-lokke-and-mack-the-mouse/
60•Kirkman14•2d ago•7 comments

Linux is an interpreter

https://astrid.tech/2026/03/28/0/linux-is-an-interpreter/
218•frizlab•20h ago•52 comments

InpharmD (YC W21) Is Hiring – Senior Ruby on Rails Developer

https://inpharmd.com/jobs/senior-ruby-on-rails-engineer
1•tulasichintha•16h ago

Spanish legislation as a Git repo

https://github.com/EnriqueLop/legalize-es
767•enriquelop•1d ago•225 comments