frontpage.

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
604•klaussilveira•11h ago•180 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
912•xnx•17h ago•545 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
28•helloplanets•4d ago•21 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
100•matheusalmeida•1d ago•24 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
29•videotopia•4d ago•1 comment

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
207•isitcontent•12h ago•24 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
206•dmpetrov•12h ago•98 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
315•vecti•14h ago•138 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
354•aktau•18h ago•180 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
360•ostacke•18h ago•94 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
465•todsacerdoti•19h ago•232 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
4•kaonwarb•3d ago•1 comment

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
24•romes•4d ago•3 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
262•eljojo•14h ago•156 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
398•lstoll•18h ago•271 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
80•quibono•4d ago•20 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
54•kmm•4d ago•3 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
8•bikenaga•3d ago•2 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
238•i5heu•14h ago•181 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
49•gfortaine•9h ago•15 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
138•vmatsiiako•17h ago•60 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
273•surprisetalk•3d ago•37 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
126•SerCe•8h ago•107 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
28•gmays•7h ago•9 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
68•phreda4•11h ago•13 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
7•jesperordrup•2h ago•1 comment

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1051•cdrnsf•21h ago•432 comments

FORTH? Really!?

https://rescrv.net/w/2026/02/06/associative
61•rescrv•19h ago•22 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
171•limoce•3d ago•93 comments

Zlob.h: 100% POSIX and glibc compatible globbing lib that is faster and better

https://github.com/dmtrKovalenko/zlob
15•neogoose•4h ago•9 comments

AI scrapers request commented scripts

https://cryptography.dog/blog/AI-scrapers-request-commented-scripts/
266•ColinWright•3mo ago

Comments

rokkamokka•3mo ago
I'm not overly surprised, it's probably faster to search the text for http/https than parse the DOM
embedding-shape•3mo ago
Not probably. Searching through plaintext (which they seem to be doing) vs iterating over the DOM involve vastly different amounts of work in terms of resources used and performance; "probably" is way underselling the difference :)
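For illustration, a minimal sketch of that plaintext approach (the regex and function name are invented, not anything from the scrapers in question): it pulls every http/https URL out of the raw markup, comments included, without ever building a DOM.

```python
import re

# Scan raw markup for http/https URLs without building a DOM -- the
# "plaintext search" approach described above. Regex and names are illustrative.
URL_RE = re.compile(r'https?://[^\s"\'<>)]+')

def extract_urls_from_text(raw_html: str) -> list[str]:
    """Return every http/https URL in the raw markup, including ones
    that only appear inside HTML comments."""
    return URL_RE.findall(raw_html)

page = '<a href="https://example.com/a">a</a> <!-- https://example.com/not-linked -->'
print(extract_urls_from_text(page))
# ['https://example.com/a', 'https://example.com/not-linked']
```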
franktankbank•3mo ago
Reminds me of the shortcut that works for the happy path but is utterly fucked by real data. This is an interesting trap, can it easily be avoided without walking the dom?
embedding-shape•3mo ago
Yes, parse out HTML comments, which is also kind of trivial if you've ever done any sort of parsing: listen for "<!--" and, whenever you come across it, ignore everything until the next "-->". But then again, these people are using AI to build scrapers, so I wouldn't put too much pressure on them to produce high-quality software.
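A rough sketch of that comment-stripping idea, with an invented function name; it just scans for "<!--" and drops everything up to the matching "-->".

```python
def strip_html_comments(raw_html: str) -> str:
    """Drop everything between '<!--' and '-->' before extracting URLs."""
    out, i = [], 0
    while True:
        start = raw_html.find("<!--", i)
        if start == -1:
            out.append(raw_html[i:])
            return "".join(out)
        out.append(raw_html[i:start])
        end = raw_html.find("-->", start + 4)
        if end == -1:            # unterminated comment: drop the rest
            return "".join(out)
        i = end + 3
```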
stevage•3mo ago
Lots of other ways to include URLs in an HTML document that wouldn't be visible to a real user, though.
jcheng•3mo ago
It's not quite as trivial as that; one could start the page with a <script> tag that contains "<!--" without matching "-->", and that would hide all the content from your scraper but not from real browsers.

But I think it's moot, parsing HTML is not very expensive if you don't have to actually render it.
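For comparison, a sketch of the "just parse it" route using only Python's standard-library html.parser (the class and test markup are made up). The parser treats <script> contents as raw data, so a stray "<!--" inside a script doesn't swallow the rest of the page, and links that exist only inside comments are never reported.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from real <a> start tags only."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<script>var x = "<!--";</script>'
               '<a href="/visible">x</a>'
               '<!-- <a href="/commented-out">y</a> -->')
print(collector.links)  # ['/visible']
```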

marginalia_nu•3mo ago
The regex approach is certainly easier to implement, but honestly static DOM parsing is pretty cheap, just quite fiddly to get right. You're probably gonna be limited by network congestion (or ephemeral ports) before you run out of CPU time doing this type of crawling.
Noumenon72•3mo ago
It doesn't seem that abusive. I don't comment things out thinking "this will keep robots from reading this".
michael1999•3mo ago
Crawlers ignoring robots.txt is abusive. That they then start scanning all docs for commented urls just adds to the pile of scummy behaviour.
tveyben•3mo ago
Human behavior is interesting - me, me, me…
mostlysimilar•3mo ago
The article mentions using this as a means of detecting bots, not as a complaint that it's abusive.

EDIT: I was chastised, here's the original text of my comment: Did you read the article or just the title? They aren't claiming it's abusive. They're saying it's a viable signal to detect and ban bots.

pseudalopex•3mo ago
Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".[1]

[1] https://news.ycombinator.com/newsguidelines.html

woodrowbarlow•3mo ago
the first few words of the article are:

> Last Sunday I discovered some abusive bot behaviour [...]

mostlysimilar•3mo ago
> The robots.txt for the site in question forbids all crawlers, so they were either failing to check the policies expressed in that file, or ignoring them if they had.
foobarbecue•3mo ago
Yeah but the abusive behavior is ignoring robots.txt and scraping to train AI. Following commented URLs was not the crime, just evidence inadvertently left behind.
ang_cire•3mo ago
They call the scrapers "malicious", so they are definitely complaining about them.

> A few of these came from user-agents that were obviously malicious:

(I love the idea that they consider any python or go request to be a malicious scraper...)

latenightcoding•3mo ago
When I used to crawl the web, battle-tested Perl regexes were more reliable than anything else; commented URLs would have been added to my queue.
rightbyte•3mo ago
DOM navigation for fetching some data is for tryhards. Using a regex to grab the correct paragraph or div or whatever is fine and is more robust against things moving around on the page.
chaps•3mo ago
Doing both is fine! Just, once you've figured out your regex and such, hardening/generalizing demands DOM iteration. It sucks, but it is what it is.
horseradish7k•3mo ago
But not when crawling. You don't know the page format in advance - you don't even know what the page contains!
OhMeadhbh•3mo ago
I blame modern CS programs that don't teach kids about parsing. The last time I looked at some scraping code, the dev was using regexes to "parse" html to find various references.

Maybe that's a way to defend against bots that ignore robots.txt: include a reference to a honeypot HTML file with garbage text, but put the link to it inside a comment.
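Sketched below is one way that honeypot idea could look, using a hypothetical Flask app; the trap path and page markup are invented. The trap URL appears only inside an HTML comment, so any client that requests it was harvesting commented-out links.

```python
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()

PAGE = """<html><body>
<p>Real content with a <a href="/about">real link</a>.</p>
<!-- <a href="/trap-7f3a">link only a comment-reading scraper will find</a> -->
</body></html>"""

@app.route("/")
def index():
    if request.remote_addr in banned_ips:
        abort(403)               # previously flagged comment-readers get nothing
    return PAGE

@app.route("/trap-7f3a")
def trap():
    # Browsers never render or follow the commented-out link, so any client
    # requesting it was mining the raw markup; remember it and refuse.
    banned_ips.add(request.remote_addr)
    abort(403)
```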

tuwtuwtuwtuw•3mo ago
Do you think that if some CS programs taught parsing, the authors of the bot would parse the HTML to properly extract links, instead of just doing plain text search?

I doubt it.

OhMeadhbh•3mo ago
Sure. Ideally CS programs would be more than the trade schools they are today. When I was in school, the "CS Program" was a specialization in the math department. I was already in a Physics program when they created a "Computer Science & Engineering" degree. And I remember both the Math and CSE degrees being pretty decent. But that was back in the day when people took CS related courses because they were interested in computers, not because they wanted to get a super high paying job for a FANG company.

Maybe you're right. You can lead a horse to water, but you can't make it think deep thoughts about parsing.

ericmcer•3mo ago
How would you recommend doing it? If I was just trying to pull <a> tag links out, I feel like treating it like text and using regex would be way more efficient than a full-on HTML parser like JSDOM or something.
singron•3mo ago
You don't need JavaScript to parse HTML. Just use an HTML parser. They are very fast. HTML isn't a regular language, so you can't parse it with regular expressions.

Obligatory: https://stackoverflow.com/questions/1732348/regex-match-open...

zahlman•3mo ago
The point is: if you're trying to find all the URLs within the page source, it doesn't really matter to you what tags they're in, or how the document is structured, or even whether they're given as link targets or in the readable text or just what.
vaylian•3mo ago
The people who do this type of scraping to feed their AI are probably also using AI to write their scraper.
mikeiz404•3mo ago
It's been some time since I have dealt with web scrapers, but it takes fewer resources to run a regex than to parse the DOM (which may have syntactically incorrect parts anyway). This can add up when running many scraping requests in parallel. So depending on your goals, using a regex can be much preferred.
mrweasel•3mo ago
You don't need to teach parsing; that won't help much anyway. We need to teach people to be good netizens again. I'd argue that it was always viewed as reasonable to scrape content, as long as you didn't misrepresent it as your own and you scraped responsibly, backing off if the server started to slow down, or simply not crawling too fast to begin with.

Currently we have at least three problems:

1) Companies have no issue with not providing sources and not linking back.

2) There are too many scrapers, even if they behaved, some site would struggle to handle all of them.

3) Scrapers go full throttle 24/7, expecting the sites to rate-limit them if they are going too fast. Hammer a site into the ground, wait until it's back, and hammer it again, grabbing what you can before it crashes once more.

There's no longer a sense of the internet being for all of us and that we need to make room for each other. Website / human generated content exists as a resource to be strip mined.
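For contrast, a sketch of what the well-behaved version might look like (the user-agent string, delays, and helper name are all invented): check robots.txt before fetching, pace requests, and back off when the server signals overload instead of hammering it.

```python
import time
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

AGENT = "politebot/0.1"   # invented user-agent string

def polite_fetch(url: str, delay: float = 2.0, max_retries: int = 3) -> bytes | None:
    """Fetch a page only if robots.txt allows it, pacing requests and backing
    off when the server signals overload (429/503)."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urllib.parse.urljoin(url, "/robots.txt"))
    rp.read()
    if not rp.can_fetch(AGENT, url):
        return None                               # the site asked us not to
    for attempt in range(max_retries):
        time.sleep(delay * (2 ** attempt))        # fixed pace, exponential backoff
        try:
            req = urllib.request.Request(url, headers={"User-Agent": AGENT})
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            if exc.code not in (429, 503):        # only retry on overload signals
                raise
    return None
```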

sharkjacobs•3mo ago
Fun to see practical applications of interesting research[1]

[1]https://news.ycombinator.com/item?id=45529587

bakql•3mo ago
>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

XenophileJKO•3mo ago
What about people using an LLM as their web client? Are you now saying the website owner should be able to dictate what client I use and how it must behave?
aDyslecticCrow•3mo ago
> Are you now saying the website owner should be able to dictate what client I use and how it must behave?

Already pretty well established with ad blocking, actually. It's a pretty similar case, even. AIs don't click ads, so why should we accept their traffic? If it's disproportionately loading the server without contributing to the funding of the site, it gets blocked.

The server can set whatever rules it wants. If the maintainer hates google and wants to block all chrome users, it can do so.

XenophileJKO•3mo ago
That was kind of what I was really hinting at, as the HN community tends to embrace things like ad blockers and archive links on stories, but god forbid someone read a site using an LLM.
1gn15•3mo ago
Humans are usually hypocritical. They support whatever they personally use while opposing whatever inconveniences them, even though they're basically the same thing.

This whole thing has made me hate humans, so so much. Robots are much better.

aDyslecticCrow•3mo ago
I use adblock myself, and don't feel bad for using it (it's a security and privacy tool). But I don't blame websites that kick me out for it; hosting costs money.

Server owners should have all the right to set the terms of their server access. Better tools to control LLMs and scrapers are all good in my book.

I really wish ad platforms were better at managing malware, trackers, and fraud, though. It is rather difficult to fully argue for website owner authority with how bad ads actually are for the user.

grayhatter•3mo ago
Yes? I'd suggest that you understand that's not an unreasonable expectation either.

Suppose your browser has a bug: if you leave my webpage open in a tab, then because of that bug it closes the connection, reconnects (new TLS handshake and everything), and re-requests that page without any cache tag, every second, every day, for as long as you have the tab open.

That feels kinda problematic, right?

Web servers block well-formed clients all the time, and I agree with you, that's dumb. But servers should be allowed to serve only the traffic they wish. If you want to use some LLM client, but the way that client behaves puts undue strain on my server, what should I do - just accept that your client, and by proxy you, are an asshole?

You shouldn't put your rules on my webserver, exactly as much as my webserver shouldn't put my rules on yours. But I believe that ethically, we should both attempt to respect and follow the rules of the other, blocking traffic when it starts to behave abusively. It's not complex: just try to be nice and help the other as much as you reasonably can.

Calavar•3mo ago
I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."

robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.

It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

hsbauauvhabzb•3mo ago
How else do you tell the bot you do not wish to be scraped? Your analogy is lacking - you didn’t order a package, you never wanted a package, and the postman is taking something, not leaving it, and you’ve explicitly left a sign saying ‘you are not welcome here’.
bakql•3mo ago
Stop your http server if you do not wish to receive http requests.
vkou•3mo ago
Turn off your phone if you don't want to receive robo-dialed calls and unsolicited texts 300 times a day.

Fence off your yard if you don't want people coming by and dumping a mountain of garbage on it every day.

You can certainly choose to live in a society that thinks these are acceptable solutions. I think it's bullshit, and we'd all be better off if anyone doing these things would be breaking rocks with their teeth in a re-education camp, until they learn how to be a decent human being.

bigbuppo•3mo ago
Ah yes, and unplug the mail server to stop all spam. Great idea!
Calavar•3mo ago
If you are serving web pages, you are soliciting GET requests, kind of like ordering a package is soliciting a delivery.

"Taking" versus "giving" is neither here nor there for this discussion. The question is are you expressing a preference on etiquette versus a hard rule that must be followed. I personally believe robots.txt is the former, and I say that as someone who serves more pages than they scrape

yuliyp•3mo ago
Having a front door physically allows anyone on the street to come to knock on it. Having a "no soliciting" sign is an instruction clarifying that not everybody is welcome. Having a web site should operate in a similar fashion. The robots.txt is the equivalent of such a sign.
halJordan•3mo ago
No soliciting signs are polite requests that no one has to follow, and door-to-door salesmen regularly walk right past them.

No one is calling for the criminalization of door-to-door sales and no one is worried about how much door-to-door sales increases water consumption.

oytis•3mo ago
> door to door salesman regularly walk right past them.

Oh, now I understand why Americans can't see a problem here.

ahtihn•3mo ago
If a company was sending hundreds of salesmen to knock at a door one after the other, I'm pretty sure they could successfully get sued for harassment.
hsbauauvhabzb•3mo ago
Can’t Americans literally shoot each other for trespassing?
dragonwriter•3mo ago
Generally, legally, no, not just for ignoring a “no soliciting” sign.
hsbauauvhabzb•3mo ago
But they’re presumably trespassing.
dragonwriter•3mo ago
And, despite what ideas you may get from the media, mere trespass without imminent threat to life is not a justification for deadly force.

There are some states where the considerations for self-defense do not include a duty to retreat if possible, either in general ("stand your ground" law) or specifically in the home ("castle doctrine"), but all the other requirements for self-defense (imminent threat of certain kinds of serious harm, proportional force) remain part of the law in those states, and trespassing by/while disregarding a "no soliciting" sign would not, by itself, satisfy those requirements.

duskdozer•3mo ago
>No one is calling for the criminalization of door-to-door sales

Ok, I am, right now.

It seems like there are two sides here that are talking past one another: "people will do X and you accept it if you do not actively prevent it, if you can" and "X is bad behavior that should be stopped and shouldn't be the burden of individuals to stop". As someone who leans to the latter, the former just sounds like restating the problem being complained about.

distances•3mo ago
> No one is calling for the criminalization of door-to-door sales

Door-to-door sales absolutely are banned in many jurisdictions.

czscout•3mo ago
And a no soliciting sign is no more cosmically binding than robots.txt. It's a request, not an enforceable command.
hsbauauvhabzb•3mo ago
Tell me you work in an ethically bankrupt industry without telling me you work in an ethically bankrupt industry.
munk-a•3mo ago
I disagree strongly here - though not from a technical perspective. There's absolutely a legal concept of making your work available for viewing without making it available for copying, and AI scraping (while we can technically phrase it as just viewing a bunch of times) is effectively copying.

Let's say a large art hosting site realizes how damaging AI training on their data can be - should they respond by adding a paywall before any of their data is visible? If that paywall is added (let's just say $5/mo), can most of the artists currently on their site afford to stay there? Can they afford it if their potential future patrons are limited to just those folks who can pay $5/mo? Would the scraper be able to afford a one-time cost of $5 to scrape all of that data?

I think, as much they are a deeply flawed concept, this is a case where EULAs or an assumption of no-access for training unless explicitly granted that's actually enforced through the legal system is required. There are a lot of small businesses and side projects that are dying because of these models and I think that creative outlet has societal value we would benefit from preserving.

jMyles•3mo ago
> There's absolutely a legal concept of making your work available for viewing without making it available for copying

This "legal concept" is enforceable through legacy systems of police and violence. The internet does not recognize it. How much more obvious can this get?

If we stumble down the path of attempting to apply this legal framework, won't some jurisdiction arise with no IP protections whatsoever and just come to completely dominate the entire economy of the internet?

If I can spin up a server in copyleftistan with a complete copy of every album and film ever made, available for free download, why would users in copyrightistan use the locked down services of their domestic economy?

kelnos•3mo ago
> legacy systems of police and violence

You use "legacy" as if these systems are obsolete and on their way out. They're not. They're here to stay, and will remain dominant, for better or worse. Calling them "legacy" feels a bit childish, as if you're trying to ignore reality and base arguments on your preferred vision of how things should be.

> The internet does not recognize it.

Sure it does. Not universally, but there are a lot of things governments and law enforcement can do to control what people see and do on the internet.

> If we stumble down the path of attempting to apply this legal framework, won't some jurisdiction arise with no IP protections whatsoever and just come to completely dominate the entire economy of the internet?

No, of course not, that's silly. That only really works on the margins. Any other country would immediately slap economic sanctions on that free-for-all jurisdiction and cripple them. If that fails, there's always a military response they can resort to.

> If I can spin up a server in copyleftistan with a complete copy of every album and film ever made, available for free download, why would users in copyrightistan use the locked down services of their domestic economy?

Because the governments of all the copyrightistans will block all traffic going in and out of copyleftistan. While this may not stop determined, technically-adept people, it will work for the most part. As I said, this sort of thing only really works on the margins.

jMyles•3mo ago
I guess I'm more optimistic about the future of the human condition.

> You use "legacy" as if these systems are obsolete and on their way out. They're not.

I have serious doubts that nation states will still exist in 500 years. I feel quite certain that they'll be gone in 10,000. And I think it's generally good to build an internet for those time scales.

> base arguments on your preferred vision of how things should be.

I hope we all build toward our moral compass; I don't mean for arguments to fall into fallacies on this basis, but yeah, I think our internet needs to be resilient against the waxing and waning of the affairs of state. I don't know if that's childish... Maybe we need to have a more child-like view of things? The internet _is_ a child in the sense of its maturation timeframe.

> there are a lot of things governments and law enforcement can do to control what people see and do on the internet.

Of course there are things that governments do. But are they effective? I just returned from a throatsinging retreat in Tuva - a fairly remote part of Siberia. The Russian government has apparently quietly begun to censor quite a few resources on the internet, and it has caused difficulty in accessing the traditional music of the Tuvan people. And I was very happily astonished to find that everybody I ran into, including a shaman grandmother, was fairly adept at routing around this censorship using a VPN and/or SSH tunnel.

I think the internet is doing a wonderful job at routing around censorship - better than any innovation ever discovered by humans so far.

> Any other country would immediately slap economic sanctions on that free-for-all jurisdiction and cripple them. If that fails, there's always a military response they can resort to.

Again, maybe I'm just more optimistic, but I think that on longer time frames, the sober elder statesmen/women will prevail and realize that violence is not an appropriate response to bytes transiting the wire that they wish weren't.

And at the end of the day, I don't think governments even have the power here - the content creators do. I distribute my music via free channels because that's the easiest way to reach my audience, and because, given the high availability of compelling free content, there's just no way I can make enough money on publishing to even concern myself with silly restrictions.

It seems to me that I'm ahead of the curve in this area, not behind it. But I'm certainly open to being convinced otherwise.

dns_snek•3mo ago
> Again, maybe I'm just more optimistic, but I think that on longer time frames, the sober elder statesmen/women will prevail and realize that violence is not an appropriate response to bytes transiting the wire that they wish weren't.

Your framing is off because this notion of fairness or morality isn't something they concern themselves with. They're using violence because if they didn't, they would be allowing other entities to gain wealth and power at their expense. I don't think it's much more complex than that.

See how differently these same bytes are treated in the hands of Aaron Swartz vs OpenAI. One threatened to empower humanity at the expense of reducing profits for a few rich men, so he got crucified for it. The other is hoping to make humans redundant, concentrate the distribution of wealth even further, and strengthen the US world dominance, so all of the right wheels get greased for them and they get a license to kill - figuratively and literally.

jMyles•3mo ago
I mean... I agree with everything you've said here. I'm not sure what makes you think I've mis-framed the stakes.
andoando•3mo ago
Well yes, this is exactly what's happening as of now. But there SHOULD be a way to upload content without giving scrapers access to it.
kelnos•3mo ago
> If you are serving web pages, you are soliciting GET requests

So what's the solution? How do I host a website that welcomes human visitors, but rejects all scrapers?

There is no mechanism! The best I can do is a cat-and-mouse arms race where I try to detect the traffic I don't want, and block it, while the people generating the traffic keep getting more sophisticated about hiding from my detection.

No, putting up a paywall is not a reasonable response to this.

> The question is are you expressing a preference on etiquette versus a hard rule that must be followed.

Well, there really aren't any hard rules that must be followed, because there are no enforcement mechanisms outside of going nuclear (requiring login). Everything is etiquette. And I agree that robots.txt is also etiquette, and it is super messed up that we tolerate "AI" companies stomping all over that etiquette.

Do we maybe want laws that say everyone must respect robots.txt? Maybe? But then people will just move their scrapers to a jurisdiction without those laws. And I'm sure someone could make the argument that robots.txt doesn't apply to them because they spoofed a browser user-agent (or another user-agent that a site explicitly allows). So perhaps we have a new mechanism, or new laws, or new... something.

But this all just highlights the point I'm making here: there is no reasonable mechanism (no, login pages and http auth don't count) for site owners to restrict access to their site based on these sorts of criteria. And that's a problem.

davesque•3mo ago
If I order a package from a company selling a good, am I inviting all that company's competitors to show up at my doorstep to try and outbid the delivery person from the original company when they arrive, and maybe they all show up at the same time and cause my porch to collapse? No, because my front porch is a limited resource for which I paid for an intended purpose. Is it illegal for those other people to show up? Maybe not by the letter of the law.
pluto_modadic•3mo ago
ignoring a rate limit gets you blocked.
hsbauauvhabzb•3mo ago
Scrapers actively bypass this by rotating IP addresses.
davsti4•3mo ago
It's simple, and I'll quote myself - "robots.txt isn't the law".
ColinWright•3mo ago
Quoting Cervisia :

> robots.txt. This is not the law

In Germany, it is the law. § 44b UrhG says (translated):

(1) Text and data mining is the automated analysis of one or more digital or digitized works to obtain information, in particular about patterns, trends, and correlations.

(2) Reproductions of lawfully accessible works for text and data mining are permitted. These reproductions must be deleted when they are no longer needed for text and data mining.

(3) Uses pursuant to paragraph 2, sentence 1, are only permitted if the rights holder has not reserved these rights. A reservation of rights for works accessible online is only effective if it is in machine-readable form.

-- https://news.ycombinator.com/item?id=45776825

davsti4•3mo ago
It's the law pertaining to copyright. https://www.gesetze-im-internet.de/englisch_urhg/englisch_ur... ... and that's only in Germany.

If what you're protecting with robots.txt isn't copyright-able, then you'll need to find another legal means.

bigbuppo•3mo ago
Violating norms makes you an abusive jerk at best.
davsti4•3mo ago
That's a slippery slope opinion.

One person's norm can oppose another's norm. Who defines them? If my "norm" is to stop at a stop sign, and yours is to roll through it, are you the abusive jerk?

bigbuppo•3mo ago
No means no. Only yes means yes. Stop means stop. Why are people in tech so confused about this? Agree, or ask again in three days.
smsm42•3mo ago
"I will do anything that will not literally get me into jail" is a pretty low bar. Most decent people try to do better than that - and that's the only reason society still exists, because there's not enough cops to put all bad people into jail and never will be.
davsti4•3mo ago
Exactly - it's why laws are written. So, if it's enough of a problem, then there should be a law stopping/regulating it.

Are published TOSs enough to stop the owners of the bots?

smsm42•3mo ago
TOS wouldn't be enough, but laws wouldn't be enough either. We have thousands of pages of laws, and people violate them all the time, and very often with zero consequences.
nkrisc•3mo ago
Put your content behind authentication if you don’t want it to be requested by just anyone.
kelnos•3mo ago
But I do want my content accessible to "just anyone", as long as they are humans. I don't want it accessible to bots.

You are free to say "well, there is no mechanism to do that", and I would agree with you. That's the problem!

1gn15•3mo ago
What the hell? That is incredibly discriminatory. Fuck off. I support those that counter those discriminatory mechanisms.
Anamon•3mo ago
Discriminatory against bots? That doesn't even make any sense.
bigbuppo•3mo ago
They probably have stock options.
9rx•3mo ago
> as long as they are humans. I don't want it accessible to bots.

A curious position. There isn't a secondary species using the internet. There are only humans. Unless you foresee some kind of alien invasion or earthworm uprising, nothing other than humans will ever access your content. Rejecting the tools humans use to bridge their biological gaps is rather nonsensical.

> You are free to say "well, there is no mechanism to do that", and I would agree with you. That's the problem!

I suppose it would be pretty neat if humans were born with some kind of internet-like telepathy ability, but lacking that mechanism isn't any kind of real problem. Humans are well adept at using tools and have successfully used tools for millennia. The internet itself is a tool! Which, like before, makes rejecting the human use of tools nonsensical.

nkrisc•3mo ago
Even abusive crawlers and scrapers are acting as agents of real humans, just as your browser is acting as your agent. I don't even know how you could reliably draw a reasonable line in the sand between the two without putting some group of people on the wrong side of the line.

I suppose the ultimate solution would be browsers and operating systems and hardware manufacturers co-operating to implement some system that somehow cryptographically signs HTTP requests which attests that it was triggered by an actual, physical interaction with a computing device by a human.

Though you don't have to think for very long to come up with all kinds of collateral damage that would cause and how bad actors could circumvent it anyway.

All in all, this whole issue seems more like a legal problem than a technical one.

bigbuppo•3mo ago
Or the AI people could just stop being abusive jerks. That's an even easier solution.
9rx•3mo ago
While that is probably good advice in general, the earlier commenter wanted even the abusive jerks to have access to his content.

He just doesn't want tools humans use to access content to be used in association with his content.

What he failed to realize is that if you eliminate the tools, the human cannot access the content anyway. They don't have the proper biological interfaces. Had he realized that, he'd have come to notice that simply turning off his server fully satisfies the constraints.

nkrisc•3mo ago
That would be easier. Too bad it won't ever happen.
stray•3mo ago
You require something the bot won't have that a human would.

Anybody may watch the demo screen of an arcade game for free, but you have to insert a quarter to play — and you can have even greater access with a key.

> and you’ve explicitly left a sign saying ‘you are not welcome here’

And the sign said, "Long-haired freaky people need not apply"
So I tucked my hair up under my hat and I went in to ask him why
He said, "You look like a fine upstandin' young man, I think you'll do"
So I took off my hat and said, "Imagine that, huh, me workin' for you"

michaelt•3mo ago
> You require something the bot won't have that a human would.

Is this why the “open web” is showing me a captcha or two, along with their cookie banner and newsletter pop up these days?

bigbuppo•3mo ago
Up until people started making a big stink about CAPTCHAs being used for unpaid labor at scale, uh, well they had two purposes.
whimsicalism•3mo ago
There's an evolving morality around the internet that is very, very different from the pseudo-libertarian rule of the jungle I was raised with. Interesting to see things change.
sethhochberg•3mo ago
The evolutionary force is really just "everyone else showed up at the party". The Internet has gone from a capital-I thing that was hard to access, to a little-i internet that was easier to access and well known but still largely distinct from the real world, to now... just the real world in virtual form. Internet morality mirrors real world morality.

For the most part, everybody is participating now, and that brings all of the challenges of any other space with everyone's competing interests colliding - but fewer established systems of governance.

hdgvhicv•3mo ago
Based on the comments here, the polite world of the internet where people obeyed unwritten best practices is certainly over, in favour of “grab what you can, might makes right”.
whimsicalism•3mo ago
That was never the internet. The old internet was “information wants to be free, good luck if you want to restrict my access or resharing”.
bigbuppo•3mo ago
You're very much wrong. Two of the key tenets of libertarianism are that your rights end where my nose begins, and respect for property rights. If your AI bot is causing problems for me, then you should be compensating me for the damage or other expense you caused. But the AI bros think they should be able to take anything they want whenever they want without compensation, and they'll use every single shady behavior they can to make that happen. In other words, they're robber barons.
bigbuppo•3mo ago
Seriously. Did you see what that web server was wearing? I mean, sure it said "don't touch me" and started screaming for help and blocked 99.9% of our IP space, but we got more and they didn't block that so clearly they weren't serious. They were asking for it. It's their fault. They're not really victims.
jMyles•3mo ago
Sexual consent is sacred. This metaphor is in truly bad taste.

When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.

jraph•3mo ago
> When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.

"If you don't consent to me entering your house, change its logic so that picking the door's lock doesn't let me open the door"

Yeah, well…

As if the LLM scrapers didn't try everything under the sun, like using millions of different residential IPs, to prevent admins from "changing the logic of the server" so it doesn't "return a response with a 200-series status code" when they don't agree to this scraping.

As if there weren't broken assumptions that make "When you return a response with a 200-series status code, you've granted consent" very false.

As if technical details were good carriers of human intents.

ryandrake•3mo ago
The locked door is a ridiculous analogy when it comes to the open web. Pretty much all "door" analogies are flawed, but sure let's imagine your web server has a door. If you want to actually lock the door, you're more than welcome to put an authentication gate around your content. A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way.
jraph•3mo ago
Any analogy is flawed and you can kill most analogies very fast. They are meant to illustrate a point, hopefully efficiently, not to be mathematically true. They are not to everyone's taste (mine included, in most cases). They are mostly fine as long as they are not used to make a point, but only to illustrate it.

I agree with this criticism of this analogy, I actually had this flaw in mind from the start. There are other flaws I have in mind as well.

I have developed the point more, without the analogy, in the remainder of the comment. How about we focus on the crux of the matter?

> A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way

The point is that these scrapers use tricks so that it's difficult not to grant them access. What is unreasonable here is to think that 200 means consent, especially knowing about the tricks.

Edit:

> you're more than welcome to put an authentication gate around your content.

I don't want to. Adding auth so LLM providers don't abuse my servers and the work I meant to share publicly is not a working solution.

jack_pp•3mo ago
Here's my analogy: it's like you own a museum and you require entrance by "secret" password (your user-agent filtering or whatnot). The problem is the password is the same for everyone, so would you be surprised when someone figures it out or gets it from a friend and visits your museum? Either require a fee (processing power, captcha, etc.) or make a private password (auth).

It is inherently a cat-and-mouse game that you CHOOSE to play. Implement throttling / auth / captcha / JavaScript / whatever whenever a client consumes too many resources on your server. If the client still chooses to go through the hoops you implemented, then I don't see any issue. If you still have an issue, then implement more hoops until you're satisfied.

jraph•3mo ago
> Either require a fee (processing power, captcha etc) or make a private password (auth)

Well, I shouldn't have to work or make things worse for everybody because the LLM bros decided to screw us.

> It is inherently a cat and mouse game that you CHOOSE to play

No, let's not reverse the roles and blame the victims here. We sysadmins and authors are willing to share our work publicly to the world but never asked for it to be abused.

jack_pp•3mo ago
That's like saying you shouldn't have to sanitize your database inputs because you never asked for people to SQL inject your database. This stance is truly mind-boggling to me.
jraph•3mo ago
Would you defend attackers using SQL injections? Because it feels like people here, including you, are defending the LLM scrapers against sysadmins and authors who dare share their work publicly.

Ensuring basic security and robustness of a piece of software is simply not remotely comparable to countering the abuse these LLM companies carry out.

But it's not even the point. And preventing SQL injections (through healthy programming practices) doesn't make things worse for any legitimate user either.

catlifeonmars•3mo ago
It's both. You should sanitize your inputs because there are bad actors, but you also categorize attempts to SQL inject as abuse, and there is legal recourse.
ryandrake•3mo ago
People need to have a better mental model of what it means to host a public web site, and what they are actually doing when they run the web server and point it at a directory of files. They're not just serving those files to customers. They're not just serving them to members. They're not just serving them to human beings. They're not even necessarily serving files to web browsers. They're serving files to every IP address (no matter what machine is attached to it) that is capable of opening a socket and sending GET. There's no such distinct thing as a scraper--and if your mental model tries to distinguish between a scraper and a human user, you're going to be disappointed.

As the web server operator, you can try to figure out if there's a human behind the IP, and you might be right or wrong. You can try to figure out if it's a web browser, or if it's someone typing in curl from a command line, or if it's a massively parallel automated system, and you might be right or wrong. You can try to guess what country the IP is in, and you might be right or wrong. But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.

jraph•3mo ago
> There's no such distinct thing as a scraper--and if your mental model tries to distinguish between a scraper and a human user, you're going to be disappointed.

I disagree. If your mental model doesn't allow conceptualizing (abusive) scrapers, it is too simplistic to be useful for understanding and dealing with reality.

But I'd like to re-state the frame / the concern: it's not about any bot or any scraper, it is about the despicable behavior of LLM providers and their awful scrapers.

I'm personally fine with bots accessing my web servers, there are many legitimate use cases for this.

> But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.

It is not about denying access to the content to some and allowing access to others.

It is about having to deal with abuses.

Is a world in which people stop sharing their work publicly because of these abuses desirable? Hell no.

oytis•3mo ago
Technically, you are not serving anything - it's just voltage levels going up and down with no meaning at all.
Retric•3mo ago
> They're serving files to every IP address (no matter what machine is attached to it) that is capable of opening a socket and sending GET.

Legally in the US a “public” web server can have any set of usage restrictions it feels like even without a login screen. Private property doesn’t automatically give permission to do anything even if there happens to be a driveway from the public road into the middle of it.

The law cares about authorized access, not the specific technical implementation of access. Which has caused serious legal trouble for many people when they make the seemingly reasonable assumption that, say, access to someURL/A12.jpg also gives them permission to someURL/A13.jpg, etc.

jMyles•3mo ago
...but the matter of "what the law cares about" is not really the point of contention here - what matters here is what happens in the real world.

In the real world, these requests are being made, and servers are generating responses. So the way to change that is to change the logic of the servers.

Retric•3mo ago
> In the real world, these requests are being made, and servers are generating responses.

Except that’s not the end of the story.

If you’re running a scraper and risking serious legal consequences when you piss off someone running a server enough, then it suddenly matters a great deal independent of what was going on up to that point. Having already made these requests you’ve just lost control of the situation.

That’s the real world we’re all living in, you can hope the guy running a server is going to play ball but that’s simply not under your control. Which is the real reason large established companies care about robots.txt etc.

bigbuppo•3mo ago
How about AI companies just act ethically and obey norms?
tremon•3mo ago
The CFAA wants to have a word. The fact that a server responds with a 200 OK has no bearing on the legality of your request, there's plenty of precedent by now.
LexGray•3mo ago
Perhaps bad taste, but bots could also be purposely violating the most private or traumatizing moments a vulnerable person has, in any exploitative way they care to. I am not sure bad taste is enough of an excuse not to discuss the issue, as many people do in fact use the internet for sexual things. If anything, consent should be MORE important because it is easier to document and verify.

A vast hoard of personal information exists and most of it never had or will have proper consent, knowledge, or protection.

jMyles•3mo ago
> the most private or traumatizing moments a vulnerable person has

...and in this hypothetical, this person is serving them via an unauthenticated http server and hoping that clients will respect robots.txt?

bigbuppo•3mo ago
Robots are supposed to behave. It was a solved problem 30 years ago until AI bros unsolved it. Any entity that does not obey robots.txt is by definition a malicious actor.
mvc•3mo ago
Future rapist right here.
mxkopy•3mo ago
The metaphor doesn’t work. It’s not the security of the package that’s in question, but something like whether the delivery person is getting paid enough or whether you’re supporting them getting replaced by a robot. The issue is in the context, not the protocol.
kelnos•3mo ago
> robots.txt is a polite request to please not scrape these pages

People who ignore polite requests are assholes, and we are well within our rights to complain about them.

I agree that "theft" is too strong (though I think you might be presenting a straw man there), but "abuse" can be perfectly apt: a crawler hammering a server, requesting the same pages over and over, absolutely is abuse.

> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

That's a shitty world that we shouldn't have to live in.

wslh•3mo ago
> People who ignore polite requests are assholes, and we are well within our rights to complain about them.

If you are building a new search engine and the robots.txt only allows Google, are you an asshole for indexing the information?

kijin•3mo ago
Yes, because the site owner has clearly and explicitly requested that you don't scrape their site, fully accepting the consequence that their site will not appear in any search engine other than Google.

Whatever impact your new search engine or LLM might have in the world is irrelevant to their wishes.

DoctorOetker•3mo ago
Whenever one forms a sentence, it is worthwhile to try to form a sentence that you believe to be generally true.

If someone politely requests you to suck their genitalia, and you ignore that request, does that make you an asshole?

watwut•3mo ago
If you ignore a polite request, then it is perfectly OK to give you as much false data as possible. You have shown yourself not interested in good-faith cooperation; that means other people can and should treat you as a jerk.
grayhatter•3mo ago
> I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."

This feels like the kind of argument some would make as to why they aren't required to return their shopping cart to the bay.

> robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.

Well, no. That's an overly simplistic description which fits your argument, but doesn't accurately represent reality. Yes, robots.txt was created as a hint for robots, but it was never expected to be non-binding; the important detail, the one that matters for understanding why it's called robots.txt, is that the web server exists to serve the requests of humans. Robots are welcome too, but please follow these rules.

You can tell your description is inaccurate and non-representative of the expectations of the web as a whole, because every popular LLM scraper goes out of its way to both follow robots.txt and announce that it does.

> It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch.

It's nothing like that, it's more like a note that says no soliciting, or please knock quietly because the baby is sleeping.

> It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center.

Or, people could not be assholes? Yes, I get it: in the reality we live in, there are assholes. But the problem, as I see it, is not just the assholes, but the people who act as apologists for this clearly deviant behavior.

> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

Because it's your fault if you don't, right? That's victim blaming. I want to be able to host free, easy to access content for humans, but someone with more money, and more compute resources than I have, gets to overwhelm my server because they don't care... And that's my fault, right?

I guess that's a take...

There's a huge difference between suggesting mitigations for dealing with someone abusing resources, and excusing the abuse of resources, or implying that I should expect my server to be abused instead of being frustrated about the abuse.

smsm42•3mo ago
"Theft" may be wrong, but "abuse" certainly is not. Human interactions in general, and the web in particular, are built on certain set of conventions and common behaviors. One of them is that most sites are for consuming information at human paces and volumes, not downloading their content wholesale. There are specialized sites that are fine with that, but they say it upfront. Average, especially hobbyist site, is not that. People who do not abide by it are certainly abusing it.

> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

Yes, and if the rule of not dumping a ton of manure on your driveway is so important to you, you should live in a gated community and hire round-the-clock security. Some people do, but living in a society where the only way to not wake up with a ton of manure in your driveway is to spend excessive resources on security is not the world that I would prefer to live in. And I don't see why people would spend time to prove this is the only possible and normal world - it's certainly not the case, we can do better.

o11c•3mo ago
Theft is correct but for a different reason.

The #1 reason for all AI scrapers is to replace the content they are scraping. This means no "fair use" defense to the copyright infringement they inevitably commit.

bigiain•3mo ago
> robots.txt is a polite request to please not scrape these pages

At the same time, an HTTP GET request is a polite request to respond with the expected content. There is no binding agreement that my webserver sends you the webpage you asked for. I am at liberty to enforce my no-scraping rules however I see fit. I get to choose whether I'm prepared to accept the consequences of a "real user" tripping my web scraping detection thresholds and getting firewalled or served nonsense or zipbombed (or whatever countermeasure I choose). Perhaps that'll drive away a reader (or customer) who opens 50 tabs to my site all at once; perhaps Google will send a badly behaved bot and miss indexing some of my pages or even deindex my site. For my personal site I'm 100% OK with those consequences. For work's website I still use countermeasures but set the thresholds significantly more conservatively. For production webapps I use different but still strict thresholds and different countermeasures.

Anybody who doesn't consider typical AI company's webscraping behaviour over the last few years to qualify as "abuse" has probably never been responsible for a website with any volume of vaguely interesting text or any reasonable number of backlinks from popular/respected sites.
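As a concrete illustration of the kind of threshold being described (the window size and budget below are invented, and real deployments often do this at the proxy or firewall layer instead): count requests per client IP over a sliding window and flag anything a human reader plausibly wouldn't produce.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120            # ~2 requests/second sustained; tune per site

hits: dict[str, deque] = defaultdict(deque)

def is_abusive(ip: str, now: float | None = None) -> bool:
    """Record one request from `ip` and report whether it exceeded the
    per-window budget."""
    now = time.monotonic() if now is None else now
    window = hits[ip]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS
```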

overfeed•3mo ago
It may be naivete, but I love the standards-based open web as a software platform and as a fabric that connects people. It makes my blood boil that some solipsistic, predatory bastards are eager to turn the internet into a dark forest.
isodev•3mo ago
Ah yes, the “it’s ok because I can” school of thought. As if that was ever true.
munk-a•3mo ago
I think there's a massive shift in what the letter of the law needs to be to match the intent. The letter hasn't changed and this is all still quite legal - but there is a significant difference between what webscraping was doing to impact creative lives five years ago and today. It was always possible for artists to have their content stolen and for creative works to be reposted - but there were enough IP laws around image sharing (which AI disingenuously steps around), and other creative work wasn't monetarily efficient to scrape.

I think there is a really different intent between an action that reads something someone created (which is often a form of marketing) and one that reproduces-but-modifies someone's creative output (which competes against and starves the creative of income).

The world changed really quickly and our legal systems haven't kept up. It is hurting real people who used to have small side businesses.

Lionga•3mo ago
So if a house is not locked I can take whatever I want?
Ylpertnodi•3mo ago
Yes, but you may get caught and then suffer 'consequences'. I can drive well over 220 km/h on the autobahn (Germany, Europe), and also in France (also in Europe). One is acceptable, the other will get me Royale-e fucked. If they can catch me.
arccy•3mo ago
Yeah, all open HTTP servers are fair game for DDoS because, well, it's open, right?
sdenton4•3mo ago
The problem is that serving content costs money. LLM scraping is essentially DDoSing content meant for human consumption. DDoSing sucks.
2OEH8eoCRo0•3mo ago
Scraping is legal. DDoSing isn't.

We should start suing these bad actors. Why do techies forget that the legal system exists?

ColinWright•3mo ago
There is no way that you can sue the people responsible for DDoSing your system. Even if you can find them ... and you won't ... they're as likely as not outside your jurisdiction (they might be in Russia, or China, or Bolivia, or anywhere), and they will have a lot more money than you.

People here on HN are laughing at the UKs Online Safety Act for trying to impose restrictions on people in other countries, and yet now you're implying that similar restrictions can be placed on people in other countries and over whom you have neither power nor control.

anon10484810573•3mo ago
> Why do techies forget that the legal system exists?

Simple: the gamble is that it often makes business sense to "forget". Initially they may be unaware of the specific law, but after a certain point the only plausible motivation is that it's more convenient to ignore it.

Not all techies are like this.

Until the regulatory system corrects this calculus it will keep happening. Reputational or social costs sure won't correct it in this day and age.

herbst•3mo ago
Facebook and Bing are sometimes 80% of my daily hits and don't respect my IP bans and other bot filtering at all. You think I can just sue them and have any chance of winning before going broke?
dylan604•3mo ago
running the scraping bots costs money too.
QuadmasterXLII•3mo ago
what?
meepmorp•3mo ago
> Won’t somebody please think of the parasites?
jraph•3mo ago
When I open an HTTP server to the public web, I expect and welcome GET requests in general.

However,

(1) there's a difference between (a) a regular user browsing my websites and (b) robots DDoSing them. It was never okay to hammer a webserver. This is not new, and it's for this reason that curl has long had options to throttle repeated requests to servers. In real life there are many instances of things being offered for free; it's usually not okay to take it all. Yes, that would be abuse. And no, the correct answer to such a situation would not be "but it was free, don't offer it for free if you don't want it to be taken for free". Same thing here.

(2) there's a difference between (a) a regular user reading my website or even copying and redistributing my content as long as the license of this work / the fair use or related laws are respected, and (b) a robot counterfeiting it (yeah, I agree with another commenter, theft is not the right word, let's call a spade a spade)

(3) well-behaved robots are expected to respect robots.txt. This is not the law, this is about being respectful. It is only fair that badly behaved robots get called out.

Well-behaved robots do not usually use millions of residential IPs through shady apps to "Perform a get request to an open HTTP server".

Cervisia•3mo ago
> robots.txt. This is not the law

In Germany, it is the law. § 44b UrhG says (translated):

(1) Text and data mining is the automated analysis of one or more digital or digitized works to obtain information, in particular about patterns, trends, and correlations.

(2) Reproductions of lawfully accessible works for text and data mining are permitted. These reproductions must be deleted when they are no longer needed for text and data mining.

(3) Uses pursuant to paragraph 2, sentence 1, are only permitted if the rights holder has not reserved these rights. A reservation of rights for works accessible online is only effective if it is in machine-readable form.
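
For what it's worth, the machine-readable form most often pointed to in practice is a robots.txt stanza (the bot name below is just an example); whether that actually satisfies the reservation requirement is debated in the replies:

    # robots.txt - reserve rights against a named crawler, allow everything else
    User-agent: GPTBot
    Disallow: /

    User-agent: *
    Allow: /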

klntsky•3mo ago
> A reservation of rights for works accessible online is only effective if it is in machine-readable form.

What if MY machine can't read it though?

Y-bar•3mo ago
That’s your problem.

A solution has been offered and you can adhere to it, or stop doing that thing which causes problems for many of us.

luckylion•3mo ago
I doubt robots.txt would fit. robots.txt allows or disallows access, but it does not state any claim. You can put content you've licensed but don't own on your website and then exclude it in robots.txt, without that implying any claim of rights to that content.
Aloisius•3mo ago
> Well behaved robots do not usually use millions of residential IPs

Some antivirus and parental control software will scan links sent to someone, from their machine (or from access points/routers).

Even some antivirus services will fetch links from residential IPs in order to detect malware from sites configured to serve malware only to residential IPs.

Actually, I'm not entirely sure how one would tell the difference between a user's software scanning links to detect adult content/malware/etc., randos crawling the web searching for personal information/vulnerable sites/etc., and these supposed "AI crawlers" just from access logs.

While I'm certainly not going to dismiss the idea that these are poorly configured crawlers at some major AI company, I haven't seen much in the way of evidence that is the case.

kijin•3mo ago
Occasionally fetching a link will probably go unnoticed.

If your antivirus software hammers the same website several times a second for hours on end, in a way that is indistinguishable from an "AI crawler", then maybe it's really misbehaving and should be stopped from doing so.

Aloisius•3mo ago
Legitimate software that scans links is often well behaved in isolation. It's when that software is installed on millions of computers that, in aggregate, it can behave poorly. This isn't particularly new though. RSS software used to blow up small websites that couldn't handle it. Now, with some browsers speculatively loading links, you can be hammered simply because you're linked to from a popular site, even if no one actually clicks on the link.

Personally, I'm skeptical of blaming everything on AI scrapers. Everything people are complaining about has been happening for decades - mostly by people searching for website vulnerabilities/sensitive info who don't care if they're misbehaving, sometimes by random individuals who want to archive a site or are playing with a crawler and don't see why they should slow them down.

Even the techniques for poisoning aggressive or impolite crawlers are at least 30 years old.

kijin•3mo ago
Yes, and sysadmins have been quietly banning those misbehaving programs for the last 30 years.

The only thing that seems to have changed is that today's thread is full of people who think they have some sort of human right to access any website by any means possible, including their sloppy vibe-coded crawler. In the past, IIRC, people used to be a little more apologetic about consuming other people's resources and did their best to fly below the radar.

It's my website. I have every right to block anyone at any time for any reason whatsoever. Whether or not your use case is "legitimate" is beside the point.

ToucanLoucan•3mo ago
The entitlement of so many modern vibe coders (or as we called them before, script kiddies) is absolutely off the charts. Just because there is not a rule or law expressly against what you're doing doesn't mean it's perfectly fine to do. Websites are hosted by and funded by people, and if your shitty scraper racks up a ton of traffic on one of my sites, I may end up on the hook for that. I am perfectly within both my rights and ethical boundaries to block your IP(s).

And just to not leave it merely implied, I don't give a rats ass if that slows down your "innovation." Go away.

Razengan•3mo ago
> And no, the correct answer to such a situation would not be "but it was free, don't offer it for free if you don't want it to be taken for free".

The answer to THAT could be: "It is free but leave some for others you greedy fuck"

codyb•3mo ago
The sign on the door said "no scrapers", which as far as I know is not a protected class.
anon10484810573•3mo ago
This mindset really baffles me. Just because it is not illegal doesn't mean one should do it. And for anything truly innovative there are bound to be gaps in the current law.

It's pretty obvious that there is an asymmetry in benefit between those creating the models and those creating the content. If that doesn't bother you consider the fact that this currently undermines the economic and social model for open content creation on the internet.

What happens when the content significantly decreases?

Should those who create content not have some say in how their content is used?

davesque•3mo ago
I mean, it costs money to host content. If you are hosting content for bots, fine, but if the money you're paying to host it is meant to benefit human users (the reason for robots.txt), then yeah, you ought to ask permission. Content might also be copyrighted. Honestly, I don't even know why I'm bothering to mention these things because it just feels obvious. LLM scrapers obviously want as much data as they can get, whether or not they act like assholes (ignoring robots.txt) or criminals (ignoring copyright) to get it.
j2kun•3mo ago
You should not have to ask for permission, but you should have to honestly set your user-agent. (In my opinion, this should be the law and it should be enforced)
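
A minimal sketch of what "honest" looks like in practice, assuming Python's requests library; the bot name and contact URL are entirely made up:

    import requests

    # Identify the bot truthfully and give site operators a way to reach you.
    headers = {
        "User-Agent": "ExampleResearchBot/1.0 (+https://example.org/bot; contact: ops@example.org)"
    }
    resp = requests.get("https://example.com/page", headers=headers, timeout=10)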
gkbrk•3mo ago
> In my opinion, this should be the law and it should be enforced

You think people should go to prison if they go to their browser settings and change their user agent?

aendruk•3mo ago
Corporation, fined
grayhatter•3mo ago
If you're lying in the requests you send, to trick my server into returning the content you want, instead of what I would want to return to webscrapers, that's non-consensual.

You don't need my permission to send a GET request, I completely agree. In fact, by having a publicly accessible webserver, there's implied consent that I'm willing to accept reasonable, and valid GET requests.

But I have configured my server to spend server resources the way I want; you don't like how my server works, so you configure your bot to lie. If you get what you want only because you're willing to lie, where's the implied consent?

batch12•3mo ago
Browser user agents have a history of being lies from the earliest days of usage. Official browsers lied about what they were - and still do.
jraph•3mo ago
Lies in user agent strings were for bypassing bugs, poor workarounds and assumptions that became wrong; they are nothing like what we are talking about.
gkbrk•3mo ago
A server returning HTML for Chrome but not cURL seems like a bug, no?

This is why there are so many libraries to make requests that look like they came from a browser, to work around buggy servers or server operators with wrong assumptions.

grayhatter•3mo ago
> A server returning HTML for Chrome but not cURL seems like a bug, no?

tell me you've never heard of https://wttr.in/ without telling me. :P

It would absolutely be a bug if this site returned HTML to curl.

> This is why there are so many libraries to make requests that look like they came from a browser, to work around buggy servers or server operators with wrong assumptions.

This is a shallow take; the best counterexample is how googlebot has no problem identifying itself both in and out of the user agent. Do note that user-agent packing is distinctly different from a fake user agent selected randomly from a list of the most common ones.

The existence of many libraries with the intent to help conceal the truth about a request doesn't feel like proof that's what everyone should be doing. It feels more like proof that most people only want to serve traffic to browsers and real users. And it's the bots and scripts that are the fuckups.

batch12•3mo ago
Googlebot has no problem identifying itself because Google knows that you want it to index your site if you want visitors. It doesn't identify itself to give you the option to block it. It identifies itself so you don't.
grayhatter•3mo ago
I care much less about being indexed by Google than you might think.

Google bot doesn't get blocked from my server primarily because it's a *very* well behaved bot. It sends a lot of requests, but it's very kind, and has never acted in a way that could overload my server. It respects robots.txt, and identifies itself multiple times.

Google bot doesn't get blocked because it's a well behaved bot that eagerly follows the rules. I wouldn't underestimate how far that goes towards the reason it doesn't get blocked - much more than the power gained by being Google Search.

batch12•3mo ago
Yes, the client wanted the server to deliver content it had intended for a different client, regardless of what the service operator wanted, so it lied using its user agent. Exact same thing we are talking about. The difference is that people don't want companies to profit off of their content. That's fair. In this case, they should maybe consider some form of real authentication, or if the bot is abusive, some kind of rate limiting control.
grayhatter•3mo ago
> Yes, the client wanted the server to deliver content it had intended for a different client, regardless of what the service operator wanted, so it lied using its user agent.

I would actually argue it's not nearly the same type of misconfiguration. The reason scripts that have never been a browser omit their real identity is to evade bot detection. The reason browsers pack their UA with so much legacy data is misconfigured servers: the server owner wants to send data to users and their browsers, but through incompetence they've made a mistake. Browsers adapted by including extra strings in the UA to account for the expectations of incorrectly configured servers. Extra strings are the critical part; Google bot's UA is an example of this being done correctly.

jraph•3mo ago
Add "assumptions that became wrong" to "intended" and the perspective radically changes, to the point that omitting this part from my comment changes everything.

I would even add:

> the client wanted the server to deliver content it had intended for a different client

In most cases, the webmaster intended their work to look good, not really to send different content to different clients. That latter part is a technical means, a workaround. The intent of bringing the OK version to the end user was respected… even better with the user agent lies!

> The difference is that people don't want companies to profit off of their content.

Indeed¹, and also they don't want terrible bots bringing down their servers.

1: well, my open source work explicitly allows people to profit off of it - as long as the license is respected (attribution, copyleft, etc)

grayhatter•3mo ago
Can you give a single example of a browser with a user agent that lies about its real origin?

The best I can come up with is the Tor Browser, which will reduce the number of bits of information it returns, but I don't consider that to be misleading. It's a custom build of Firefox, that discloses it is Firefox, and otherwise behaves exactly as I would expect Firefox to behave.

wqaatwt•3mo ago
Somebody concealing or obfuscating various information a browser would send by standard for privacy or other reasons is also “lying” by that standard? Or someone using a VPN?
grayhatter•3mo ago
Someone using a VPN is not lying. The intent of a user agent is to identify the software sending the request. The IP address isn't sent by the browser and isn't part of the HTTP request; it's part of the routing information required to deliver the packet back to the client. If a client sent its "real" IP address as an HTTP header and I tried to respond to that IP instead of the IP address from the TCP packet, the response would never arrive.

There's a difference between sending no data and sending false data. I don't block requests without HTTP referrers for that very reason.

wqaatwt•3mo ago
IIRC Firefox (and I assume other browsers), when using privacy/no-tracking mode, does send fake data.
grayhatter•3mo ago
You're incorrect. I've never seen any browser, on its own, lie about its user agent. (I can set a custom string and lie with it, but that's not the agent doing it)

Do you have a specific / concrete example in mind? Or are you mistaking a feature from something other than a mainstream browser?

gkbrk•3mo ago
Firefox sends an incorrect version and operating system on its User-Agent when the privacy settings are turned on.

IIRC it defaults to a Windows user agent even when you use it on other operating systems.

grayhatter•3mo ago
You're incorrect. I have Firefox configured with the most strict privacy settings, and it returns `Mozilla/5.0 (X11; Linux x86_64; rv:142.0) Gecko/20100101 Firefox/142.0`. With the exception of it being Wayland instead of X11 it's entirely accurate. Would love to see whatever gaslit you about something so easy to test and validate.
gkbrk•3mo ago
Nope, as it turns out this was actually a thing until 2025-01-24, when a commit removed the "pretend to be Windows even on Linux platforms" behavior.

https://github.com/mozilla-firefox/firefox/commit/eb2f90f870...

This OS spoofing behavior was added in 2019-01-09 with this commit:

https://github.com/mozilla-firefox/firefox/commit/264fe08c09...

So Firefox has spoofed the User-Agent as a Windows machine on Linux for around 6 years, and only stopped doing it early this year. Would love to see whatever gaslit you into forgetting this easy to test and validate behavior.

grayhatter•3mo ago
https://support.mozilla.org/en-US/kb/resist-fingerprinting

This was part of the resist fingerprinting feature, which is an advanced user configuration. I can alter the user agent directly myself too.

Sigh

I regret getting tricked into arguing over such a pedantic specific, so I'd like to redirect to the actual point, which is that it's not meaningful if a Firefox browser pretends to be a slightly different Firefox browser; the problem is when something that's not a browser claims to be, and behave like, one.

Still, +1 for finding the commit, I'd forgotten about this feature. I thought only the tor browser was this foolish.

malfist•3mo ago
If I set out a bowl of candy for trick-or-treaters, I wouldn't be expected to be okay with the first adult strolling by and taking everything.
dylan604•3mo ago
and if they do, you have no recourse, just like with scrapers. With the candy example, you spend your time sitting near the candy bowl supervising; for servers, we have various anti-bot supervisors. However, some asshat with no scruples can still just walk right up to your bowl, empty the contents into their bag and walk away, even with you sitting right there. Unless you're willing to commit violence, there's nothing stopping them. Now you're the assailant and the asshat is the victim. You still lose.
righthand•3mo ago
Then cutting up the candy, taping pieces together in the most statistically pleasing way, and finally selling all of the stolen Frankenstein's-monster candy as innovative new candy and the future of humanity.
smsm42•3mo ago
You are still trying to pretend that accessing an HTTP server once and burying it under an avalanche of never-stopping bot crawlers are the same thing? And spam is the same as "sending an email" and should be treated the same? I thought in this day and age we're past that.
1gn15•3mo ago
If you're trying to say DDoS, just say that.
smsm42•3mo ago
DDoS is a very specific type of attack. To be abusive, you don't have to do exactly that - it could be any type of DoS, and in fact it doesn't even have to deny all service - it could just impose excessive costs, for example.
bigbuppo•3mo ago
Sounds like you should give the bots exactly what they want... a 512MB file of random data.
kelnos•3mo ago
Most people have to pay for their bandwidth, though. That's a lot of data to send out over and over.
jcheng•3mo ago
512MB file of incredibly compressible data, then?
QuadmasterXLII•3mo ago
Could I recommend https://cubes.hgreer.com/ssg/output.html ?

50:1 compression ratio, but it's legitimately an implementation of a Rubik's cube. I wasn't actually making it as any sort of trap, I just wasn't thinking about file size, so any rule that filters it out is going to have a nasty false positive rate

aDyslecticCrow•3mo ago
A scraper sinkhole of randomly generated, inter-linked files filled with AI poison could work. No human would click that link, so it leads straight to the "exclusive club".
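
A rough sketch of such a sinkhole, assuming Flask; every route and name here is illustrative, and the "poison" is just filler text:

    import random, string
    from flask import Flask

    app = Flask(__name__)

    def filler(words=200):
        # Nonsense text no human wants, but a greedy crawler will happily ingest.
        return " ".join(
            "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
            for _ in range(words)
        )

    @app.route("/maze/<token>")
    def maze(token):
        # Every page links only to more maze pages, so a crawler that follows
        # them never escapes.
        links = " ".join(f'<a href="/maze/{random.getrandbits(64):x}">more</a>' for _ in range(5))
        return f"<html><body><p>{filler()}</p>{links}</body></html>"

Linking to /maze/ only from somewhere no human would click (and disallowing it in robots.txt) keeps ordinary readers and polite bots out of the club.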
oytis•3mo ago
Outbound traffic normally costs more than inbound, so the asymmetry is set up wrong here. Data poisoning is probably the way.
zahlman•3mo ago
> Outbound traffic normally costs more than inbound one, so the asymmetry is set up wrong here.

That's what zip bombs are for.
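
A minimal sketch of that idea in Python: pre-compress a large run of zeros once and serve the small gzip blob with a Content-Encoding: gzip header, so your outbound bytes stay tiny while a decompressing client inflates it to 512 MB. (Obvious caveat: this punishes anything that fetches the URL, so it belongs behind paths only misbehaving bots reach.)

    import gzip, io

    # Compress ~512 MB of zeros once at startup; the result is well under 1 MB.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        chunk = b"\0" * (1024 * 1024)
        for _ in range(512):
            gz.write(chunk)
    bomb = buf.getvalue()

    # Serve `bomb` as-is with headers:
    #   Content-Encoding: gzip
    #   Content-Type: text/html
    # A client that transparently decompresses will expand it to 512 MB.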

kelseyfrog•3mo ago
That's leaving a lot of opportunity on the table.

The real money is in monetizing ad responses to AI scrapers so that LLMs are biased toward recommending certain products. The stealth startup I've founded does exactly this. Ad-poisoning-as-a-service is a huge untapped market.

bigbuppo•3mo ago
Now that's a paid subscription I can get behind, especially if it suggests that Meta should cut Rob Schneider a check for $200,000,000,000 to make more movies.
kelseyfrog•3mo ago
Contact info in bio. Always looking to make more happy customers.
AlienRobot•3mo ago
512 MB of saying your service is the best service.
stevage•3mo ago
The title is confusing, should be "commented-out".
pimlottc•3mo ago
Agree, I thought maybe this was going to be a script to block AI scrapers or something like that.
zahlman•3mo ago
I thought it was going to be AI scraper operators getting annoyed that they have to run reasoning models on the scraped data to make use of it.
mikeiz404•3mo ago
Two thoughts here when it comes to poisoning unwanted LLM training data traffic

1) A coordinated effort among different sites will have a much greater chance of poisoning a model's data, so long as they can avoid any post-scraping deduplication or filtering.

2) I wonder if copyright law can be used to amplify the cost of poisoning here. Perhaps if the poisoned content is something which has already been shown to be aggressively litigated against then the copyright owner will go after them when the model can be shown to contain that banned data. This may open up site owners to the legal risk of distributing this content though… not sure. A cooperative effort with a copyright holder may sidestep this risk but they would have to have the means and want to litigate.

Anamon•3mo ago
As for 1, it would be great to have this as a plugin for WordPress etc. that anyone could simply install and enable. Pre-processing images to dynamically poison them on each request should be fun, and also protect against a deduplication defense. I'd certainly install that.
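
A toy illustration of per-request perturbation, assuming Pillow; this is only enough to defeat naive hash-based deduplication, not a real poisoning scheme like Glaze or Nightshade:

    import io, random
    from PIL import Image

    def perturb(path: str) -> bytes:
        img = Image.open(path).convert("RGB")
        px = img.load()
        w, h = img.size
        # Flip the low bit of the red channel in a few hundred random pixels,
        # so every response hashes differently but looks identical to humans.
        for _ in range(300):
            x, y = random.randrange(w), random.randrange(h)
            r, g, b = px[x, y]
            px[x, y] = (r ^ 1, g, b)
        out = io.BytesIO()
        img.save(out, format="JPEG", quality=85)
        return out.getvalue()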
renegat0x0•3mo ago
Most web scrapers, even the illegal ones, are for... business. So they scrape Amazon, or shops. So yeah. Most unwanted traffic is from big tech, or from bad actors trying to sniff out vulnerabilities.

I know a thing or two about web scraping.

Some sites return status code 404 as protection, so that you skip them; my crawler then tries, like a hammer, several faster crawling methods (curlcffi).

Zip bombs also don't work on me. Reading the Content-Length header is enough to avoid reading the page/file. I set a byte limit to check that the response isn't too big for me; for other cases a read timeout is enough.

Oh, and did you know that the requests timeout is not really a timeout for the whole page read? A server can spoonfeed you bytes, one after another, and the timeout will never trigger.
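
A rough sketch of those checks with the requests library (the limits are illustrative):

    import time
    import requests

    MAX_BYTES = 5 * 1024 * 1024   # skip or abort anything bigger than this
    DEADLINE = 30                 # wall-clock seconds for the whole body

    def fetch_capped(url):
        with requests.get(url, stream=True, timeout=10) as r:
            length = r.headers.get("Content-Length")
            if length and int(length) > MAX_BYTES:
                return None       # declared too big: don't even read it
            body, start = b"", time.monotonic()
            for chunk in r.iter_content(chunk_size=65536):
                body += chunk
                # requests' timeout only covers connect and per-read gaps, so
                # enforce a total deadline against slow drip-feeding as well.
                if len(body) > MAX_BYTES or time.monotonic() - start > DEADLINE:
                    return None
            return body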

That is why I created my own crawling system to mitigate these problems, and to have one consistent means of running Selenium.

https://github.com/rumca-js/crawler-buddy

Based on the library

https://github.com/rumca-js/webtoolkit

hnav•3mo ago
content-length is computed after content-encoding
ahoka•3mo ago
If it’s present at all.
Mars008•3mo ago
Looks like it's time for in-browser scrapers. They will be indistinguishable from the server's side. With an AI driver they can pass even human tests.
overfeed•3mo ago
> Looks like it's time for in-browser scrapers.

If scrapers were as well-behaved as humans, website operators wouldn't bother to block them[1]. It's the abuse that motivates the animus and action. As the fine article spelt out, scrapers are greedy in many ways, one of which is trying to slurp down as many URLs as possible without wasting bytes. Not enough people know about Common Crawl, or know how to write multithreaded scrapers with high utilization across domains without suffocating any single one. If your scraper is a URL FIFO or stack in a loop, you're just DoSing one domain at a time (rough sketch below).

1. The most successful scrapers avoid standing out in any way
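
For illustration, the difference is roughly a per-host queue with its own politeness delay instead of one global FIFO; the names and delay are made up:

    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    PER_HOST_DELAY = 5.0          # seconds between hits to the same host

    queues = defaultdict(deque)   # host -> pending URLs
    next_ok = defaultdict(float)  # host -> earliest time we may hit it again

    def enqueue(url):
        queues[urlparse(url).netloc].append(url)

    def next_url():
        # Round-robin across hosts; any single domain sees at most one request
        # per PER_HOST_DELAY, no matter how many of its URLs are queued.
        now = time.monotonic()
        for host, q in queues.items():
            if q and next_ok[host] <= now:
                next_ok[host] = now + PER_HOST_DELAY
                return q.popleft()
        return None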

Mars008•3mo ago
The question is: who runs them? There are only a few big companies like MS, Google, OpenAI, Anthropic. But from the posts here it looks like hordes of buggy scrapers run by enthusiasts.
iamacyborg•3mo ago
Lots of “data” companies out there that want to sell you scraped data sets.
luckylion•3mo ago
Ad companies (even the small ones), "Brand Protection" companies, IP lawyers looking for images used without a license, brand marketing companies, and, where it matters, also your competitors, etc.
eur0pa•3mo ago
you mean OpenAI Atlas?
bartread•3mo ago
Not a new idea. For years now, on the occasions I’ve needed to scrape, I’ve used a set of ViolentMonkey scripts. I’ve even considered creating an extension, but have never really needed it enough to do the extra work.

But this is why lots of sites implement captchas and other mechanisms to detect, frustrate, or trap automated activity - because plenty of bots run in browsers too.

1vuio0pswjnm7•3mo ago
Is there a difference between "scraping" and "crawling"?
sokoloff•3mo ago
Well, if they’re going to request commented out scripts, serve them up some very large scripts…
throw_me_uwu•3mo ago
> most likely trying to non-consensually collect content for training LLMs

No, it's just background internet scanning noise

lucasluitjes•3mo ago
This.

If you were writing a script to mass-scan the web for vulnerabilities, you would want to collect as many HTTP endpoints as possible. JS files, regardless of whether they're commented out or not, are a great way to find endpoints in modern web applications.
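
As a toy illustration of why, even a crude regex over a downloaded bundle surfaces paths worth probing (the pattern is deliberately simplistic):

    import re

    with open("bundle.js", encoding="utf-8", errors="ignore") as f:
        js = f.read()

    # Pull quoted paths that look like API routes out of the bundle.
    endpoints = set(re.findall(r"""["'](/(?:api|v\d+)/[\w\-./{}]+)["']""", js))
    for path in sorted(endpoints):
        print(path)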

If you were writing a scraper to collect source code to train LLMs on, I doubt you would care as much about a commented-out JS file. I'm not sure you'd even want to train on random low-quality JS served by websites. Anyone familiar with LLM training data collection who can comment on this?