EDIT: I was chastised, here's the original text of my comment: Did you read the article or just the title? They aren't claiming it's abusive. They're saying it's a viable signal to detect and ban bots.
> Last Sunday I discovered some abusive bot behaviour [...]
> A few of these came from user-agents that were obviously malicious:
(I love the idea that they consider any Python or Go request to be a malicious scraper...)
Maybe that's a way to defend against bots that ignore robots.txt: include a reference to a honeypot HTML file with garbage text, but put the link to it inside an HTML comment so human visitors never see it.
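Something like this, roughly (Flask purely as an illustration; the paths and markup are made up):

    # Sketch: hide a honeypot link inside an HTML comment. Humans never see it,
    # but naive scrapers that parse the raw HTML will follow it and get banned.
    from flask import Flask, abort, request

    app = Flask(__name__)
    banned_ips = set()

    @app.before_request
    def block_banned():
        if request.remote_addr in banned_ips:
            abort(403)

    @app.route("/")
    def index():
        return """<html><body>
          <h1>Welcome</h1>
          <!-- <a href="/trap-do-not-follow">nothing to see here</a> -->
        </body></html>"""

    @app.route("/trap-do-not-follow")
    def honeypot():
        # Anyone requesting this URL had to dig it out of a comment: a bot.
        banned_ips.add(request.remote_addr)
        return "garbage text " * 1000

(You'd also Disallow the trap path in robots.txt so polite crawlers never trip it.)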
I doubt it.
Obligatory: https://stackoverflow.com/questions/1732348/regex-match-open...
"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
Already pretty well established with ad blocking, actually. It's a pretty similar case, even: AIs don't click ads, so why should we accept their traffic? If it's disproportionately loading the server without contributing to the funding of the site, it gets blocked.
The server can set whatever rules it wants. If the maintainer hates Google and wants to block all Chrome users, they can do so.
robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
Fence off your yard if you don't want people coming by and dumping a mountain of garbage on it every day.
You can certainly choose to live in a society that thinks these are acceptable solutions. I think it's bullshit, and we'd all be better off if anyone doing these things would be breaking rocks with their teeth in a re-education camp, until they learn how to be a decent human being.
"Taking" versus "giving" is neither here nor there for this discussion. The question is are you expressing a preference on etiquette versus a hard rule that must be followed. I personally believe robots.txt is the former, and I say that as someone who serves more pages than they scrape
No one is calling for the criminalization of door-to-door sales and no one is worried about how much door-to-door sales increases water consumption.
Oh, now I understand why Americans can't see a problem here.
Let's say a large art hosting site realizes how damaging AI training on their data can be - should they respond by adding a paywall before any of their data is visible? If that paywall is added (let's just say $5/mo), can most of the artists currently on their site afford to stay there? Can they afford it if their potential future patrons are limited to just those folks who can pay $5/mo? Would the scraper be able to afford a one-time cost of $5 to scrape all of that data?
I think, as much as they are a deeply flawed concept, this is a case where EULAs, or an assumption of no access for training unless explicitly granted, actually enforced through the legal system, are required. There are a lot of small businesses and side projects that are dying because of these models, and I think that creative outlet has societal value we would benefit from preserving.
This "legal concept" is enforceable through legacy systems of police and violence. The internet does not recognize it. How much more obvious can this get?
If we stumble down the path of attempting to apply this legal framework, won't some jurisdiction arise with no IP protections whatsoever and just come to completely dominate the entire economy of the internet?
If I can spin up a server in copyleftistan with a complete copy of every album and film ever made, available for free download, why would users in copyrightistan use the locked down services of their domestic economy?
You use "legacy" as if these systems are obsolete and on their way out. They're not. They're here to stay, and will remain dominant, for better or worse. Calling them "legacy" feels a bit childish, as if you're trying to ignore reality and base arguments on your preferred vision of how things should be.
> The internet does not recognize it.
Sure it does. Not universally, but there are a lot of things governments and law enforcement can do to control what people see and do on the internet.
> If we stumble down the path of attempting to apply this legal framework, won't some jurisdiction arise with no IP protections whatsoever and just come to completely dominate the entire economy of the internet?
No, of course not, that's silly. That only really works on the margins. Any other country would immediately slap economic sanctions on that free-for-all jurisdiction and cripple them. If that fails, there's always a military response they can resort to.
> If I can spin up a server in copyleftistan with a complete copy of every album and film ever made, available for free download, why would users in copyrightistan use the locked down services of their domestic economy?
Because the governments of all the copyrightistans will block all traffic going in and out of copyleftistan. While this may not stop determined, technically-adept people, it will work for the most part. As I said, this sort of thing only really works on the margins.
So what's the solution? How do I host a website that welcomes human visitors, but rejects all scrapers?
There is no mechanism! The best I can do is a cat-and-mouse arms race where I try to detect the traffic I don't want, and block it, while the people generating the traffic keep getting more sophisticated about hiding from my detection.
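In practice the arms race looks something like this (a sketch; the user-agent patterns and thresholds are arbitrary, and real bots spoof their way around all of them):

    # Sketch of the cat-and-mouse game: guess at unwanted traffic from crude
    # signals, knowing the other side will adapt. Thresholds are arbitrary.
    import time
    from collections import defaultdict

    SUSPECT_AGENTS = ("python-requests", "go-http-client", "scrapy", "curl")
    hits = defaultdict(list)  # ip -> timestamps of recent requests

    def looks_unwanted(ip, user_agent, now=None):
        now = now or time.time()
        ua = (user_agent or "").lower()
        if any(s in ua for s in SUSPECT_AGENTS):
            return True
        recent = [t for t in hits[ip] if now - t < 60]
        recent.append(now)
        hits[ip] = recent
        return len(recent) > 120  # sustained ~2 req/s from one IP for a minute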
No, putting up a paywall is not a reasonable response to this.
> The question is are you expressing a preference on etiquette versus a hard rule that must be followed.
Well, there really aren't any hard rules that must be followed, because there are no enforcement mechanisms outside of going nuclear (requiring login). Everything is etiquette. And I agree that robots.txt is also etiquette, and it is super messed up that we tolerate "AI" companies stomping all over that etiquette.
Do we maybe want laws that say everyone must respect robots.txt? Maybe? But then people will just move their scrapers to a jurisdiction without those laws. And I'm sure someone could make the argument that robots.txt doesn't apply to them because they spoofed a browser user-agent (or another user-agent that a site explicitly allows). So perhaps we have a new mechanism, or new laws, or new... something.
But this all just highlights the point I'm making here: there is no reasonable mechanism (no, login pages and http auth don't count) for site owners to restrict access to their site based on these sorts of criteria. And that's a problem.
You are free to say "well, there is no mechanism to do that", and I would agree with you. That's the problem!
Anybody may watch the demo screen of an arcade game for free, but you have to insert a quarter to play — and you can have even greater access with a key.
> and you’ve explicitly left a sign saying ‘you are not welcome here’
And the sign said "Long-haired freaky people
Need not apply"
So I tucked my hair up under my hat
And I went in to ask him why
He said, "You look like a fine upstandin' young man
I think you'll do"
So I took off my hat and said, "Imagine that
Huh, me workin' for you"
Is this why the “open web” is showing me a captcha or two, along with their cookie banner and newsletter pop up these days?
For the most part, everybody is participating now, and that brings all of the challenges of any other space with everyone's competing interests colliding - but fewer established systems of governance.
When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.
"If you don't consent to me entering your house, change its logic so that picking the door's lock doesn't let me open the door"
Yeah, well…
As if the LLM scrapers didn't try everything under the sun, like using millions of different residential IPs, to prevent admins from "changing the logic of the server" so it doesn't "return a response with a 200-series status code" when they don't agree to this scraping.
As if there weren't broken assumptions that make "When you return a response with a 200-series status code, you've granted consent" very false.
As if technical details were good carriers of human intents.
I agree with this criticism of this analogy, I actually had this flaw in mind from the start. There are other flaws I have in mind as well.
I developed the point further, without the analogy, in the rest of the comment. How about we focus on the crux of the matter?
> A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way
The point is that these scrapers use tricks so that it's difficult not to grant them access. What is unreasonable here is to think that 200 means consent, especially knowing about the tricks.
Edit:
> you're more than welcome to put an authentication gate around your content.
I don't want to. Adding auth so LLM providers don't abuse my servers and the work I meant to share publicly is not a working solution.
It is inherently a cat and mouse game that you CHOOSE to play. Either implement throttling for clients that consume too many resources for your server, or require auth / captcha / JavaScript / whatever whenever the client is using too many resources. If the client still chooses to go through the hoops you implemented, then I don't see any issue. If you still have an issue, then implement more hoops until you're satisfied.
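A dumb per-client throttle is maybe fifteen lines (a sketch; the numbers are made up):

    # Minimal token-bucket throttle per IP; respond with 429 when allow() is False.
    import time

    RATE = 1.0    # tokens refilled per second
    BURST = 20    # maximum burst size
    buckets = {}  # ip -> (tokens, last_refill_time)

    def allow(ip):
        now = time.time()
        tokens, last = buckets.get(ip, (BURST, now))
        tokens = min(BURST, tokens + (now - last) * RATE)
        if tokens < 1:
            buckets[ip] = (tokens, now)
            return False
        buckets[ip] = (tokens - 1, now)
        return True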
Well, I shouldn't have to work or make things worse for everybody because the LLM bros decided to screw us.
> It is inherently a cat and mouse game that you CHOOSE to play
No, let's not reverse the roles and blame the victims here. We sysadmins and authors are willing to share our work publicly to the world but never asked for it to be abused.
As the web server operator, you can try to figure out if there's a human behind the IP, and you might be right or wrong. You can try to figure out if it's a web browser, or if it's someone typing in curl from a command line, or if it's a massively parallel automated system, and you might be right or wrong. You can try to guess what country the IP is in, and you might be right or wrong. But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.
I disagree. If your mental model doesn't allow conceptualizing (abusive) scrapers, it is too simplistic to be useful for understanding and dealing with reality.
But I'd like to re-state the frame / the concern: it's not about any bot or any scraper, it is about the despicable behavior of LLM providers and their awful scrapers.
I'm personally fine with bots accessing my web servers, there are many legitimate use cases for this.
> But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.
It is not about denying access to the content to some and allowing access to others.
It is about having to deal with abuses.
Is a world in which people stop sharing their work publicly because of these abuses desirable? Hell no.
The solution is the same: implement better security.
A vast hoard of personal information exists and most of it never had or will have proper consent, knowledge, or protection.
People who ignore polite requests are assholes, and we are well within our rights to complain about them.
I agree that "theft" is too strong (though I think you might be presenting a straw man there), but "abuse" can be perfectly apt: a crawler hammering a server, requesting the same pages over and over, absolutely is abuse.
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
That's a shitty world that we shouldn't have to live in.
I think there is a really different intent to an action to read something someone created (which is often a form of marketing) and to reproduce but modify someone's creative output (which competes against and starves the creative of income).
The world changed really quickly and our legal systems haven't kept up. It is hurting real people who used to have small side businesses.
We should start suing these bad actors. Why do techies forget that the legal system exists?
People here on HN are laughing at the UK's Online Safety Act for trying to impose restrictions on people in other countries, and yet now you're implying that similar restrictions can be placed on people in other countries over whom you have neither power nor control.
However,
(1) there's a difference between (a) a regular user browsing my websites and (b) robots DDoSing them. It was never okay to hammer a webserver. This is not new, and it's for this reason that curl has always had options to throttle repeated requests to servers. In real life, there are many instances of things being offered for free; it's usually not okay to take it all. Yes, that would be abuse. And no, the correct answer to such a situation would not be "but it was free, don't offer it for free if you don't want it to be taken for free". Same thing here.
(2) there's a difference between (a) a regular user reading my website or even copying and redistributing my content as long as the license of this work / the fair use or related laws are respected, and (b) a robot counterfeiting it (yeah, I agree with another commenter, theft is not the right word, let's call a spade a spade)
(3) well-behaved robots are expected to respect robots.txt. This is not the law, this is about being respectful. It is only fair that badly behaved robots get called out.
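And being well behaved costs a crawler almost nothing; the Python standard library will even do the robots.txt check for you (a sketch, the URLs are placeholders):

    # Sketch: what a well-behaved crawler does before fetching anything.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
        print("allowed to fetch")
    else:
        print("robots.txt asks us to stay away")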
Well behaved robots do not usually use millions of residential IPs through shady apps to "Perform a get request to an open HTTP server".
In Germany, it is the law. § 44b UrhG says (translated):
(1) Text and data mining is the automated analysis of one or more digital or digitized works to obtain information, in particular about patterns, trends, and correlations.
(2) Reproductions of lawfully accessible works for text and data mining are permitted. These reproductions must be deleted when they are no longer needed for text and data mining.
(3) Uses pursuant to paragraph 2, sentence 1, are only permitted if the rights holder has not reserved these rights. A reservation of rights for works accessible online is only effective if it is in machine-readable form.
That's what zip bombs are for.
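For the curious, the usual trick is a tiny gzip file that inflates to gigabytes, served with "Content-Encoding: gzip" only to clients you've already decided are abusive. A rough sketch of building one:

    # Rough sketch: a small gzip file that decompresses to ~10 GiB of zeros.
    # Serve it only to clients you already consider abusive.
    import gzip

    CHUNK = b"\x00" * (1024 * 1024)  # 1 MiB of zeros
    with gzip.open("bomb.gz", "wb", compresslevel=9) as f:
        for _ in range(10 * 1024):   # 10 GiB uncompressed
            f.write(CHUNK)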
If you want to control how someone accesses something, the onus is on you to put access controls in place.
The people who put things on a public, un-restricted server and then complain that the public accessed it in an un-restricted way might be excusable if it's some geocities-esque Mom and Pop site that has no reason to know better, but 'cryptography dog' ain't that.
1) A coordinated effort among different sites will have a much greater chance of poisoning the data of a model so long as they can avoid any post scraping deduplication or filtering.
2) I wonder if copyright law can be used to amplify the cost of poisoning here. Perhaps if the poisoned content is something which has already been shown to be aggressively litigated against then the copyright owner will go after them when the model can be shown to contain that banned data. This may open up site owners to the legal risk of distributing this content though… not sure. A cooperative effort with a copyright holder may sidestep this risk but they would have to have the means and want to litigate.
I know a thing or two about web scraping.
Some sites return status code 404 as a protection measure, so that you skip the site; in response my crawler hammers away with several faster crawling methods (curl_cffi).
Zip bombs are also not a problem for me: reading the Content-Length header is enough to decide not to read the page/file, and I set a byte limit to check whether a response is too big for me. For other cases a read timeout is enough.
Oh, and did you know that the requests timeout is not really a timeout for reading the whole page? A server can spoon-feed you bytes, one after another, and there will be no timeout.
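In practice my reading loop is something like this (a sketch using the requests library; the limits are arbitrary):

    # Sketch: read a response defensively -- check Content-Length, cap the bytes
    # actually read, and enforce a wall-clock deadline so a server spoon-feeding
    # one byte at a time cannot hold the crawler forever.
    import time
    import requests

    MAX_BYTES = 5 * 1024 * 1024  # 5 MiB
    DEADLINE = 30                # seconds for the whole body

    def fetch_capped(url):
        r = requests.get(url, stream=True, timeout=10)
        declared = int(r.headers.get("Content-Length") or 0)
        if declared > MAX_BYTES:
            r.close()
            return None
        body, start = b"", time.time()
        for chunk in r.iter_content(chunk_size=8192):
            body += chunk
            if len(body) > MAX_BYTES or time.time() - start > DEADLINE:
                r.close()
                return None
        return body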
That is why I created my own crawling system to mitigate these problems and to have one consistent means of running Selenium.
https://github.com/rumca-js/crawler-buddy
Based on library
But I think it's moot, parsing HTML is not very expensive if you don't have to actually render it.
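For instance, pulling the links out of a page with the standard library parser is a single pass over the markup, no rendering or JavaScript involved (a sketch):

    # Sketch: extract links with Python's stdlib HTML parser.
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    collector = LinkCollector()
    collector.feed("<html><body><a href='/about'>about</a></body></html>")
    print(collector.links)  # ['/about']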