
Behind y-s2: serverless multiplayer rooms

https://s2.dev/blog/durable-yjs-rooms
1•infiniteregrets•9m ago•0 comments

Auditing Permissions for All Shared Files in Google Drive

https://blog.terrydjony.com/google-drive-audit-access-permissions/
1•terryds•9m ago•0 comments

$400M Machine that Prints COMPUTER CHIPS [video]

https://www.youtube.com/watch?v=VIAS5_mWgO4
1•CHB0403085482•12m ago•0 comments

OWASP Kubernetes Top Survey

https://docs.google.com/forms/d/e/1FAIpQLScL-yznr-YqGdg9SIcToptVPw9qEy7eUPZDefSSvhDT6aMjWQ/viewform
1•mooreds•14m ago•0 comments

Show HN: 24-hour Halloween radio station hosted by Dr. Eleven

https://ui.elevenlabs.io/radio
1•louisjoejordan•18m ago•0 comments

Lightweight 2D Framebuffer Library for Linux

https://github.com/lvntky/fbgl
1•leventkaya•20m ago•0 comments

These Are All the Same Thing

https://www.pbump.net/o/these-are-all-the-same-thing/
3•tastyface•21m ago•0 comments

Long-Term Asset Return Study – The Ultimate Guide to Long-Term Investing

https://www.dbresearch.com/PROD/RI-PROD/PDFVIEWER.calias?pdfViewerPdfUrl=PROD0000000000607211
1•rufus_foreman•22m ago•0 comments

Renovate 42 Is Coming

https://github.com/renovatebot/renovate/discussions/38841
1•jamietanna•24m ago•0 comments

Show HN: I Built Bookmarks Manager Online

https://bookmarks-manager.online/
1•nicojuhari•25m ago•1 comments

The Future of Routing with the Navigation API – Eduardo San Martin Morote [video]

https://www.youtube.com/watch?v=2z2HMwAIc0o
1•fabiancook•25m ago•1 comments

'It's quite useless to us': What autistic people want

https://www.theguardian.com/us-news/2025/oct/19/autistic-people-trump-administration-research
1•PaulHoule•26m ago•0 comments

Disney yanks channels from YouTube TV after parties fail to resolve dispute

https://www.cnn.com/2025/10/30/media/disney-youtube-deal-biz-hnk
1•1vuio0pswjnm7•28m ago•0 comments

Chemical additive slashes carbon emissions when creating synthetic fuels

https://www.science.org/content/article/chemical-additive-slashes-carbon-emissions-when-creating-...
2•arkensaw•29m ago•0 comments

Future of AI

2•pillionaut•32m ago•1 comments

Ask HN: Why I rarely see game dev startup here?

3•blindprogrammer•33m ago•0 comments

Businesses are running out of pennies in the US

https://www.bbc.com/news/articles/c20556ly45eo
4•1659447091•36m ago•2 comments

How the Substack feed is learning to understand your reading journey

https://mrkcohen.substack.com/p/how-the-substack-feed-is-learning
1•cjbest•40m ago•0 comments

Geometric Pattern Generator

https://github.com/pxl-pshr/geometric-pattern-generator
1•SuperHeavy256•40m ago•1 comments

How China Powers Its Electric Cars and High-Speed Trains

https://www.nytimes.com/2025/10/11/business/china-electric-grid.html
2•bookofjoe•43m ago•1 comments

Bluesky hits 40M users, introduces 'dislikes' beta

https://techcrunch.com/2025/10/31/bluesky-hits-40-million-users-introduces-dislikes-beta/
2•doener•44m ago•0 comments

Security Community Slams MIT-Linked Report Claiming AI Powers 80% of Ransomware

https://socket.dev/blog/security-community-slams-mit-linked-report-claiming-ai-powers-80-of-ranso...
2•bediger4000•47m ago•1 comments

Strix Halo's Memory Subsystem: Tackling iGPU Challenges

https://old.chipsandcheese.com/2025/10/31/37437/
1•pella•49m ago•0 comments

Strix Halo's Memory Subsystem: Tackling iGPU Challenges

https://chipsandcheese.com/p/strix-halos-memory-subsystem-tackling
1•rbanffy•49m ago•0 comments

We Have a Human Problem

https://www.heatpumped.org/p/we-have-a-human-problem
2•ssuds•51m ago•1 comments

FCC to rescind ruling that said ISPs are required to secure their networks

https://arstechnica.com/tech-policy/2025/10/fcc-dumps-plan-for-telecom-security-rules-that-intern...
2•throw0101a•51m ago•0 comments

The new American dream is to get rich *quick

https://www.dopaminemarkets.com/p/investing-is-entertainment-and-traders
1•_1729•53m ago•1 comments

YouTube's AI Moderator Pulls Windows 11 Workaround Videos, Calls Them Dangerous

https://www.theregister.com/2025/10/31/ai_moderation_youtube_windows11_workaround/
7•m463•54m ago•1 comments

Show HN: Historian – A simple shell history tool

https://github.com/Schachte/Historian
1•siamese_puff•56m ago•0 comments

I'm an IIT Madras Student. But to Some, I'm Diluting the Brand

https://ishan.page/blog/jeeification/
2•ishandotpage•58m ago•0 comments

AI scrapers request commented scripts

https://cryptography.dog/blog/AI-scrapers-request-commented-scripts/
159•ColinWright•7h ago

Comments

rokkamokka•5h ago
I'm not overly surprised; it's probably faster to search the text for http/https than to parse the DOM.
embedding-shape•5h ago
Not just "probably": searching through plaintext (which is what they seem to be doing) versus iterating over the DOM involve vastly different amounts of work in terms of resources and performance, so "probably" is way underselling the difference :)
franktankbank•3h ago
Reminds me of the shortcut that works for the happy path but is utterly fucked by real data. This is an interesting trap; can it easily be avoided without walking the DOM?
embedding-shape•3h ago
Yes: strip out HTML comments, which is also kind of trivial if you've ever done any sort of parsing. Watch for "<!--", and whenever you come across it, ignore everything until the next "-->". But then again, these people are using AI to build scrapers, so I wouldn't put too much pressure on them to produce high-quality software.
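A minimal sketch of that idea in Python, assuming the scraper is just scanning raw text for URLs; the regexes and the visible_urls helper here are illustrative, not from the article:

    import re

    COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)   # non-greedy: stop at the next -->
    URL_RE = re.compile(r"https?://[^\s\"'<>]+")

    def visible_urls(html: str) -> list[str]:
        # Drop everything between <!-- and --> before scanning for URLs,
        # so links that appear only inside comments never reach the queue.
        stripped = COMMENT_RE.sub("", html)
        return URL_RE.findall(stripped)

(As the reply below notes, this is still naive about comment-like sequences inside <script> tags.)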
stevage•1h ago
Lots of other ways to include URLs in an HTML document that wouldn't be visible to a real user, though.
jcheng•20m ago
It's not quite as trivial as that; one could start the page with a <script> tag that contains "<!--" without matching "-->", and that would hide all the content from your scraper but not from real browsers.

But I think it's moot; parsing HTML is not very expensive if you don't have to actually render it.

Noumenon72•5h ago
It doesn't seem that abusive. I don't comment things out thinking "this will keep robots from reading this".
michael1999•5h ago
Crawlers ignoring robots.txt is abusive. That they then start scanning all docs for commented urls just adds to the pile of scummy behaviour.
tveyben•5h ago
Human behavior is interesting - me, me, me…
mostlysimilar•5h ago
The article mentions using this as a means of detecting bots, not as a complaint that it's abusive.

EDIT: I was chastised, here's the original text of my comment: Did you read the article or just the title? They aren't claiming it's abusive. They're saying it's a viable signal to detect and ban bots.

pseudalopex•5h ago
Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".[1]

[1] https://news.ycombinator.com/newsguidelines.html

woodrowbarlow•5h ago
the first few words of the article are:

> Last Sunday I discovered some abusive bot behaviour [...]

mostlysimilar•5h ago
> The robots.txt for the site in question forbids all crawlers, so they were either failing to check the policies expressed in that file, or ignoring them if they had.
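(For reference, a robots.txt that forbids all crawlers is just two lines:

    User-agent: *
    Disallow: /

Whether crawlers honor it is, as this thread shows, another matter.)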
foobarbecue•5h ago
Yeah but the abusive behavior is ignoring robots.txt and scraping to train AI. Following commented URLs was not the crime, just evidence inadvertently left behind.
ang_cire•1h ago
They call the scrapers "malicious", so they are definitely complaining about them.

> A few of these came from user-agents that were obviously malicious:

(I love the idea that they consider any python or go request to be a malicious scraper...)

latenightcoding•4h ago
When I used to crawl the web, battle-tested Perl regexes were more reliable than anything else; commented-out URLs would have been added to my queue.
rightbyte•4h ago
DOM navigation for fetching some data is for tryhards. Using a regex to grab the correct paragraph or div or whatever is fine and is more robust against things moving around on the page.
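A hedged illustration of that trade-off; the div class and the helper name are made up for the example:

    import re

    # Grab the text of a hypothetical <div class="price"> without touching the DOM.
    # Brittle against nesting and attribute changes, but indifferent to where the
    # element sits on the page.
    PRICE_RE = re.compile(r'<div class="price">\s*(.*?)\s*</div>', re.DOTALL | re.IGNORECASE)

    def extract_price(html: str) -> str | None:
        match = PRICE_RE.search(html)
        return match.group(1) if match else None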
chaps•4h ago
Doing both is fine! Just, once you've figured out your regex and such, hardening/generalizing demands DOM iteration. It sucks but it is what it is.
horseradish7k•4h ago
But not when crawling. You don't know the page format in advance; you don't even know what the page contains!
OhMeadhbh•4h ago
I blame modern CS programs that don't teach kids about parsing. The last time I looked at some scraping code, the dev was using regexes to "parse" HTML to find various references.

Maybe that's a way to defend against bots that ignore robots.txt: include a reference to a honeypot HTML file with garbage text, but put the link to it in a comment.

tuwtuwtuwtuw•4h ago
Do you think that if some CS programs taught parsing, the authors of the bot would parse the HTML to properly extract links, instead of just doing plain text search?

I doubt it.

ericmcer•3h ago
How would you recommend doing it? If I was just trying to pull <a> tag links out, I feel like treating it like text and using regex would be way more efficient than a full-on HTML parser like jsdom or something.
singron•3h ago
You don't need JavaScript to parse HTML. Just use an HTML parser. They are very fast. HTML isn't a regular language, so you can't parse it with regular expressions.

Obligatory: https://stackoverflow.com/questions/1732348/regex-match-open...
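For instance, Python's standard-library html.parser can pull out link targets in a few lines, no JavaScript engine involved. Treat this as a sketch; html_text is assumed to be the already-fetched page:

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        # Collect href values from <a> tags. Comments never reach
        # handle_starttag, so URLs inside <!-- ... --> are skipped for free.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    extractor = LinkExtractor()
    extractor.feed(html_text)   # html_text: the fetched page as a string
    print(extractor.links)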

zahlman•49m ago
The point is: if you're trying to find all the URLs within the page source, it doesn't really matter to you what tags they're in, or how the document is structured, or even whether they're given as link targets or in the readable text or just what.
vaylian•1h ago
The people who do this type of scraping to feed their AI are probably also using AI to write their scraper.
mikeiz404•1h ago
It’s been some time since I have dealt with web scrapers, but it takes fewer resources to run a regex than it does to parse the DOM (which may have syntactically incorrect parts anyway). This can add up when running many scraping requests in parallel. So depending on your goals, using a regex can be preferable.
sharkjacobs•4h ago
Fun to see practical applications of interesting research[1]

[1]https://news.ycombinator.com/item?id=45529587

bakql•4h ago
>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

XenophileJKO•4h ago
What about people using an LLM as their web client? Are you now saying the website owner should be able to dictate what client I use and how it must behave?
aDyslecticCrow•1h ago
> Are you now saying the website owner should be able to dictate what client I use and how it must behave?

Already pretty well established with ad blocking, actually; it's a pretty similar case. AIs don't click ads, so why should we accept their traffic? If it's disproportionately loading the server without contributing to the funding of the site, it gets blocked.

The server can set whatever rules it wants. If the maintainer hates Google and wants to block all Chrome users, it can do so.

Calavar•4h ago
I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."

robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.

It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

hsbauauvhabzb•3h ago
How else do you tell the bot you do not wish to be scraped? Your analogy is lacking - you didn’t order a package, you never wanted a package, and the postman is taking something, not leaving it, and you’ve explicitly left a sign saying ‘you are not welcome here’.
bakql•3h ago
Stop your http server if you do not wish to receive http requests.
vkou•6m ago
Turn off your phone if you don't want to receive robo-dialed calls and unsolicited texts 300 times a day.

Fence off your yard if you don't want people coming by and dumping a mountain of garbage on it every day.

You can certainly choose to live in a society that thinks these are acceptable solutions. I think it's bullshit, and we'd all be better off if anyone doing these things would be breaking rocks with their teeth in a re-education camp, until they learn how to be a decent human being.

Calavar•3h ago
If you are serving web pages, you are soliciting GET requests, kind of like ordering a package is soliciting a delivery.

"Taking" versus "giving" is neither here nor there for this discussion. The question is are you expressing a preference on etiquette versus a hard rule that must be followed. I personally believe robots.txt is the former, and I say that as someone who serves more pages than they scrape

yuliyp•3h ago
Having a front door physically allows anyone on the street to come to knock on it. Having a "no soliciting" sign is an instruction clarifying that not everybody is welcome. Having a web site should operate in a similar fashion. The robots.txt is the equivalent of such a sign.
halJordan•3h ago
No soliciting signs are polite requests that no one has to follow, and door to door salesman regularly walk right past them.

No one is calling for the criminalization of door-to-door sales and no one is worried about how much door-to-door sales increases water consumption.

oytis•1h ago
> door to door salesman regularly walk right past them.

Oh, now I understand why Americans can't see a problem here.

ahtihn•57m ago
If a company was sending hundreds of salesmen to knock at a door one after the other, I'm pretty sure they could successfully get sued for harassment.
czscout•2h ago
And a no soliciting sign is no more cosmically binding than robots.txt. It's a request, not an enforceable command.
munk-a•3h ago
I disagree strongly here - though not from a technical perspective. There's absolutely a legal concept of making your work available for viewing without making it available for copying, and AI scraping (while we can technically phrase it as just viewing a bunch of times) is effectively copying.

Let's say a large art hosting site realizes how damaging AI training on their data can be - should they respond by adding a paywall before any of their data is visible? If that paywall is added (let's just say $5/mo), can most of the artists currently on their site afford to stay there? Can they afford it if their potential future patrons are limited to just those folks who can pay $5/mo? Would the scraper be able to afford a one-time cost of $5 to scrape all of that data?

I think, as much they are a deeply flawed concept, this is a case where EULAs or an assumption of no-access for training unless explicitly granted that's actually enforced through the legal system is required. There are a lot of small businesses and side projects that are dying because of these models and I think that creative outlet has societal value we would benefit from preserving.

jMyles•2h ago
> There's absolutely a legal concept of making your work available for viewing without making it available for copying

This "legal concept" is enforceable through legacy systems of police and violence. The internet does not recognize it. How much more obvious can this get?

If we stumble down the path of attempting to apply this legal framework, won't some jurisdiction arise with no IP protections whatsoever and just come to completely dominate the entire economy of the internet?

If I can spin up a server in copyleftistan with a complete copy of every album and film ever made, available for free download, why would users in copyrightistan use the locked down services of their domestic economy?

kelnos•1h ago
> legacy systems of police and violence

You use "legacy" as if these systems are obsolete and on their way out. They're not. They're here to stay, and will remain dominant, for better or worse. Calling them "legacy" feels a bit childish, as if you're trying to ignore reality and base arguments on your preferred vision of how things should be.

> The internet does not recognize it.

Sure it does. Not universally, but there are a lot of things governments and law enforcement can do to control what people see and do on the internet.

> If we stumble down the path of attempting to apply this legal framework, won't some jurisdiction arise with no IP protections whatsoever and just come to completely dominate the entire economy of the internet?

No, of course not, that's silly. That only really works on the margins. Any other country would immediately slap economic sanctions on that free-for-all jurisdiction and cripple them. If that fails, there's always a military response they can resort to.

> If I can spin up a server in copyleftistan with a complete copy of every album and film ever made, available for free download, why would users in copyrightistan use the locked down services of their domestic economy?

Because the governments of all the copyrightistans will block all traffic going in and out of copyleftistan. While this may not stop determined, technically-adept people, it will work for the most part. As I said, this sort of thing only really works on the margins.

andoando•2h ago
Well yes, this is exactly what's happening as of now. But there SHOULD be a way to upload content without giving scrapers access to it.
kelnos•1h ago
> If you are serving web pages, you are soliciting GET requests

So what's the solution? How do I host a website that welcomes human visitors, but rejects all scrapers?

There is no mechanism! The best I can do is a cat-and-mouse arms race where I try to detect the traffic I don't want, and block it, while the people generating the traffic keep getting more sophisticated about hiding from my detection.

No, putting up a paywall is not a reasonable response to this.

> The question is are you expressing a preference on etiquette versus a hard rule that must be followed.

Well, there really aren't any hard rules that must be followed, because there are no enforcement mechanisms outside of going nuclear (requiring login). Everything is etiquette. And I agree that robots.txt is also etiquette, and it is super messed up that we tolerate "AI" companies stomping all over that etiquette.

Do we maybe want laws that say everyone must respect robots.txt? Maybe? But then people will just move their scrapers to a jurisdiction without those laws. And I'm sure someone could make the argument that robots.txt doesn't apply to them because they spoofed a browser user-agent (or another user-agent that a site explicitly allows). So perhaps we have a new mechanism, or new laws, or new... something.

But this all just highlights the point I'm making here: there is no reasonable mechanism (no, login pages and http auth don't count) for site owners to restrict access to their site based on these sorts of criteria. And that's a problem.

davesque•1h ago
If I order a package from a company selling a good, am I inviting all that company's competitors to show up at my doorstep to try and outbid the delivery person from the original company when they arrive, and maybe they all show up at the same time and cause my porch to collapse? No, because my front porch is a limited resource for which I paid for an intended purpose. Is it illegal for those other people to show up? Maybe not by the letter of the law.
davsti4•3h ago
It's simple, and I'll quote myself - "robots.txt isn't the law".
nkrisc•3h ago
Put your content behind authentication if you don’t want it to be requested by just anyone.
kelnos•1h ago
But I do want my content accessible to "just anyone", as long as they are humans. I don't want it accessible to bots.

You are free to say "well, there is no mechanism to do that", and I would agree with you. That's the problem!

stray•3h ago
You require something the bot won't have that a human would.

Anybody may watch the demo screen of an arcade game for free, but you have to insert a quarter to play — and you can have even greater access with a key.

> and you’ve explicitly left a sign saying ‘you are not welcome here’

And the sign said "Long-haired freaky people Need not apply" So I tucked my hair up under my hat And I went in to ask him why He said, "You look like a fine upstandin' young man I think you'll do" So I took off my hat and said, "Imagine that Huh, me workin' for you"

michaelt•3h ago
> You require something the bot won't have that a human would.

Is this why the “open web” is showing me a captcha or two, along with their cookie banner and newsletter pop up these days?

whimsicalism•3h ago
There's an evolving morality around the internet that is very, very different from the pseudo-libertarian rule of the jungle I was raised with. Interesting to see things change.
sethhochberg•3h ago
The evolutionary force is really just "everyone else showed up at the party". The Internet has gone from a capital-I thing that was hard to access, to a little-i internet that was easier to access and well known but still largely distinct from the real world, to now... just the real world in virtual form. Internet morality mirrors real world morality.

For the most part, everybody is participating now, and that brings all of the challenges of any other space with everyone's competing interests colliding - but fewer established systems of governance.

hdgvhicv•1h ago
Based on the comments here, the polite world of the internet where people obeyed unwritten best practices is certainly over, in favour of "grab what you can, might makes right".
bigbuppo•2h ago
Seriously. Did you see what that web server was wearing? I mean, sure it said "don't touch me" and started screaming for help and blocked 99.9% of our IP space, but we got more and they didn't block that so clearly they weren't serious. They were asking for it. It's their fault. They're not really victims.
jMyles•2h ago
Sexual consent is sacred. This metaphor is in truly bad taste.

When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.

jraph•2h ago
> When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.

"If you don't consent to me entering your house, change its logic so that picking the door's lock doesn't let me open the door"

Yeah, well…

As if the LLM scrapers didn't try everything under the sun, like using millions of different residential IPs, to prevent admins from "changing the logic of the server" so it doesn't "return a response with a 200-series status code" when they don't agree to this scraping.

As if there weren't broken assumptions that make "When you return a response with a 200-series status code, you've granted consent" very false.

As if technical details were good carriers of human intents.

ryandrake•2h ago
The locked door is a ridiculous analogy when it comes to the open web. Pretty much all "door" analogies are flawed, but sure let's imagine your web server has a door. If you want to actually lock the door, you're more than welcome to put an authentication gate around your content. A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way.
jraph•2h ago
Any analogy is flawed and you can kill most analogies very fast. They are meant to illustrate a point hopefully efficiently, not to be mathematically true. They are not to everyone's taste, me included in most cases. They are mostly fine as long as they are not used to make a point, but only to illustrate it.

I agree with this criticism of this analogy, I actually had this flaw in mind from the start. There are other flaws I have in mind as well.

I developed the point further, without the analogy, in the rest of the comment. How about we focus on the crux of the matter?

> A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way

The point is that these scrapers use tricks so that it's difficult not to grant them access. What is unreasonable here is to think that 200 means consent, especially knowing about the tricks.

Edit:

> you're more than welcome to put an authentication gate around your content.

I don't want to. Adding auth so LLM providers don't abuse my servers and the work I meant to share publicly is not a working solution.

jack_pp•1h ago
Here's my analogy: it's like you own a museum and you require entrance by "secret" password (your user-agent filtering or whatnot). The problem is that the password is the same for everyone, so would you be surprised when someone figures it out, or gets it from a friend, and they visit your museum? Either require a fee (processing power, captcha, etc.) or make a private password (auth).

It is inherently a cat-and-mouse game that you CHOOSE to play. Either implement throttling for clients that consume too many resources for your server, or require auth / captcha / JavaScript / whatever whenever the client is using too many resources. If the client still chooses to go through the hoops you implemented, then I don't see any issue. If you still have an issue, then implement more hoops until you're satisfied.

jraph•1h ago
> Either require a fee (processing power, captcha etc) or make a private password (auth)

Well, I shouldn't have to work or make things worse for everybody because the LLM bros decided to screw us.

> It is inherently a cat and mouse game that you CHOOSE to play

No, let's not reverse the roles and blame the victims here. We sysadmins and authors are willing to share our work publicly to the world but never asked for it to be abused.

ryandrake•1h ago
People need to have a better mental model of what it means to host a public web site, and what they are actually doing when they run the web server and point it at a directory of files. They're not just serving those files to customers. They're not just serving them to members. They're not just serving them to human beings. They're not even necessarily serving files to web browsers. They're serving files to every IP address (no matter what machine is attached to it) that is capable of opening a socket and sending GET. There's no such distinct thing as a scraper--and if your mental model tries to distinguish between a scraper and a human user, you're going to be disappointed.

As the web server operator, you can try to figure out if there's a human behind the IP, and you might be right or wrong. You can try to figure out if it's a web browser, or if it's someone typing in curl from a command line, or if it's a massively parallel automated system, and you might be right or wrong. You can try to guess what country the IP is in, and you might be right or wrong. But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.

jraph•1h ago
> There's no such distinct thing as a scraper--and if your mental model tries to distinguish between a scraper and a human user, you're going to be disappointed.

I disagree. If your mental model doesn't allow conceptualizing (abusive) scrapers, it is too simplistic to be useful for understanding and dealing with reality.

But I'd like to re-state the frame / the concern: it's not about any bot or any scraper, it is about the despicable behavior of LLM providers and their awful scrapers.

I'm personally fine with bots accessing my web servers, there are many legitimate use cases for this.

> But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.

It is not about denying access to the content to some and allowing access to others.

It is about having to deal with abuses.

Is a world in which people stop sharing their work publicly because of these abuses desirable? Hell no.

oytis•48m ago
Technically, you are not serving anything - it's just voltage levels going up and down with no meaning at all.
Larrikin•2h ago
>I don't like how your metaphor is an effective metaphor for the situation so it's in bad taste.
jack_pp•1h ago
If you absolutely want a sexual metaphor, it's more like you snuck into the world record attempt for how many sexual partners a woman can take in 24h, and even though you aren't on the list you still got to smash.

The solution is the same: implement better security.

LexGray•50m ago
Perhaps bad taste, but bots could also be purposely violating the most private or traumatizing moments a vulnerable person has, in any exploitative way they care to. I am not sure "bad taste" is enough of an excuse not to discuss the issue, as many people do in fact use the internet for sexual things. If anything, consent should be MORE important because it is easier to document and verify.

A vast hoard of personal information exists and most of it never had or will have proper consent, knowledge, or protection.

mxkopy•2h ago
The metaphor doesn’t work. It’s not the security of the package that’s in question, but something like whether the delivery person is getting paid enough or whether you’re supporting them getting replaced by a robot. The issue is in the context, not the protocol.
kelnos•2h ago
> robots.txt is a polite request to please not scrape these pages

People who ignore polite requests are assholes, and we are well within our rights to complain about them.

I agree that "theft" is too strong (though I think you might be presenting a straw man there), but "abuse" can be perfectly apt: a crawler hammering a server, requesting the same pages over and over, absolutely is abuse.

> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

That's a shitty world that we shouldn't have to live in.

watwut•1h ago
If you ignore a polite request, then it is perfectly OK to give you as much false data as possible. You have shown yourself to be uninterested in good-faith cooperation, and that means other people can and should treat you as a jerk.
isodev•3h ago
Ah yes, the “it’s ok because I can” school of thought. As if that was ever true.
munk-a•3h ago
I think there's a massive shift in what the letter of the law needs to be to match the intent. The letter hasn't changed and this is all still quite legal - but there is a significant difference between what web scraping was doing to impact creative lives five years ago and today. It was always possible for artists to have their content stolen and for creative works to be reposted - but there were enough IP laws around image sharing (which AI disingenuously steps around), and other creative work wasn't monetarily efficient to scrape.

I think there is a really different intent behind an action that reads something someone created (which is often a form of marketing) and one that reproduces-but-modifies someone's creative output (which competes against the creator and starves them of income).

The world changed really quickly and our legal systems haven't kept up. It is hurting real people who used to have small side businesses.

Lionga•3h ago
So if a house is not locked I can take whatever I want?
Ylpertnodi•2h ago
Yes, but you may get caught, and then suffer 'consequences'. I can drive well over 220 km/h on the autobahn (Germany, Europe), and also in France (also in Europe). One is acceptable, the other will get me Royale-e fucked. If they can catch me.
arccy•3h ago
yeah all open HTTP servers are fair game for DDoS because well it's open right?
sdenton4•3h ago
The problem is that serving content costs money. LLM scraping is essentially DDoS'ing content meant for human consumption. DDoS'ing sucks.
2OEH8eoCRo0•2h ago
Scraping is legal. DDoSing isn't.

We should start suing these bad actors. Why do techies forget that the legal system exists?

ColinWright•2h ago
There is no way that you can sue the people responsible for DDoSing your system. Even if you can find them ... and you won't ... they're as likely as not outside your jurisdiction (they might be in Russia, or China, or Bolivia, or anywhere), and they will have a lot more money than you.

People here on HN are laughing at the UK's Online Safety Act for trying to impose restrictions on people in other countries, and yet now you're implying that similar restrictions can be placed on people in other countries over whom you have neither power nor control.

jraph•2h ago
When I open an HTTP server to the public web, I expect and welcome GET requests in general.

However,

(1) there's a difference between (a) a regular user browsing my websites and (b) robots DDoSing them. It was never okay to hammer a webserver. This is not new, and it's for this reason that curl has had options to throttle repeated requests to servers forever. In real life, there are many instances of things being offered for free, it's usually not okay to take it all. Yes, this would be abuse. And no, the correct answer to such a situation would not be "but it was free, don't offer it for free if you don't want it to be taken for free". Same thing here.

(2) there's a difference between (a) a regular user reading my website or even copying and redistributing my content as long as the license of this work / the fair use or related laws are respected, and (b) a robot counterfeiting it (yeah, I agree with another commenter, theft is not the right word, let's call a spade a spade)

(3) well-behaved robots are expected to respect robots.txt. This is not the law, this is about being respectful. It is only fair that badly behaved robots get called out. (A minimal example of what respecting robots.txt looks like follows below.)

Well-behaved robots do not usually use millions of residential IPs through shady apps to "perform a GET request to an open HTTP server".
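A sketch of what "respecting robots.txt" means in practice, using Python's standard-library robotparser; the user-agent string and URLs are placeholders:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # Only fetch when the site's published policy allows this user-agent.
    if rp.can_fetch("ExampleCrawler/1.0", "https://example.com/some/page"):
        ...  # go ahead and request the page
    else:
        ...  # skip it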

Cervisia•1h ago
> robots.txt. This is not the law

In Germany, it is the law. § 44b UrhG says (translated):

(1) Text and data mining is the automated analysis of one or more digital or digitized works to obtain information, in particular about patterns, trends, and correlations.

(2) Reproductions of lawfully accessible works for text and data mining are permitted. These reproductions must be deleted when they are no longer needed for text and data mining.

(3) Uses pursuant to paragraph 2, sentence 1, are only permitted if the rights holder has not reserved these rights. A reservation of rights for works accessible online is only effective if it is in machine-readable form.

codyb•2h ago
The sign on the door said "no scrapers", which as far as I know is not a protected class.
davesque•1h ago
I mean, it costs money to host content. If you are hosting content for bots, fine, but if the money you're paying to host it is meant to benefit human users (the reason for robots.txt), then yeah, you ought to ask permission. Content might also be copyrighted. Honestly, I don't even know why I'm bothering to mention these things, because it just feels obvious. LLM scrapers obviously want as much data as they can get, whether or not they act like assholes (ignoring robots.txt) or criminals (ignoring copyright) to get it.
j2kun•1h ago
You should not have to ask for permission, but you should have to honestly set your user-agent. (In my opinion, this should be the law and it should be enforced)
vale11_amo2•4h ago
Hackers
edm0nd•3h ago
one of the best movies, yes.
bigbuppo•2h ago
Sounds like you should give the bots exactly what they want... a 512MB file of random data.
kelnos•1h ago
Most people have to pay for their bandwidth, though. That's a lot of data to send out over and over.
jcheng•19m ago
512MB file of incredibly compressible data, then?
aDyslecticCrow•1h ago
A scraper sinkhole of randomly generated, inter-linked files filled with AI poison could work. No human would click that link, so it leads to the "exclusive club".
oytis•1h ago
Outbound traffic normally costs more than inbound one, so the asymmetry is set up wrong here. Data poisoning is probably the way.
zahlman•51m ago
> Outbound traffic normally costs more than inbound one, so the asymmetry is set up wrong here.

That's what zip bombs are for.
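A rough sketch of the "incredibly compressible data" idea: pre-compress a large run of zeros with gzip so only a few hundred kilobytes cross the wire, while a client that transparently decompresses the response ends up holding 512 MB. The helper name and sizes are illustrative:

    import gzip
    import io

    def make_decoy(size_mb: int = 512) -> bytes:
        # 512 MB of zeros gzip-compresses to roughly half a megabyte,
        # so the bandwidth cost lands on whoever inflates it.
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
            chunk = b"\0" * (1024 * 1024)
            for _ in range(size_mb):
                gz.write(chunk)
        return buf.getvalue()

    payload = make_decoy()
    # Serve this with "Content-Encoding: gzip" so HTTP clients that honor the
    # header decompress it automatically; honest visitors should never be routed here.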

winddude•2h ago
I wish I could downvote.
stevage•1h ago
The title is confusing, should be "commented-out".
pimlottc•1h ago
Agree, I thought maybe this was going to be a script to block AI scrapers or something like that.
zahlman•52m ago
I thought it was going to be AI scraper operators getting annoyed that they have to run reasoning models on the scraped data to make use of it.
hexage1814•1h ago
I like and support web scrapers. It is even funnier when the site owners don't like it
ang_cire•1h ago
Yep. Robots.txt is a framework intended for performance, not a legal or ethical imperative.

If you want to control how someone accesses something, the onus is on you to put access controls in place.

The people who put things on a public, un-restricted server and then complain that the public accessed it in an un-restricted way might be excusable if it's some geocities-esque Mom and Pop site that has no reason to know better, but 'cryptography dog' ain't that.

mikeiz404•1h ago
Two thoughts here when it comes to poisoning unwanted LLM training data traffic

1) A coordinated effort among different sites will have a much greater chance of poisoning the data of a model so long as they can avoid any post scraping deduplication or filtering.

2) I wonder if copyright law can be used to amplify the cost of poisoning here. Perhaps if the poisoned content is something which has already been shown to be aggressively litigated against then the copyright owner will go after them when the model can be shown to contain that banned data. This may open up site owners to the legal risk of distributing this content though… not sure. A cooperative effort with a copyright holder may sidestep this risk but they would have to have the means and want to litigate.

renegat0x0•47m ago
Most web scrapers, even the illegal ones, are for... business. So they scrape Amazon, or shops. So yeah, most unwanted traffic is from big tech, or from bad actors trying to sniff out vulnerabilities.

I know a thing or two about web scraping.

Sites sometimes return 404 status codes as protection, so that you skip them, so my crawler tries, as a hammer, several faster crawling methods (curl_cffi).

Zip bombs also don't work on me. Reading the Content-Length header is enough to decide not to read the page/file. I set a byte limit to check that the response is not too big for me. For other cases a read timeout is enough.

Oh, and did you know that the requests timeout is not really a timeout for the whole page read? A server can spoon-feed you bytes, one after another, and there will be no timeout.
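A minimal sketch of those two guards with the Python requests library; the limits and the fetch_limited helper are illustrative, not taken from the crawler linked below:

    import time
    import requests

    MAX_BYTES = 5 * 1024 * 1024   # refuse pages larger than 5 MB
    MAX_SECONDS = 30              # total wall-clock budget per page

    def fetch_limited(url: str) -> bytes | None:
        # timeout=(5, 5) bounds the connect and each socket read,
        # not the whole download, hence the manual budget below.
        resp = requests.get(url, stream=True, timeout=(5, 5))
        length = resp.headers.get("Content-Length")
        if length and int(length) > MAX_BYTES:
            return None
        body = b""
        start = time.monotonic()
        for chunk in resp.iter_content(chunk_size=8192):
            body += chunk
            if len(body) > MAX_BYTES or time.monotonic() - start > MAX_SECONDS:
                return None       # too big, or being spoon-fed
        return body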

That is why I created my own crawling system to mitigate these problems, and to have one consistent means of running Selenium.

https://github.com/rumca-js/crawler-buddy

Based on library

https://github.com/rumca-js/webtoolkit