Wikipedia has backups for this reason. AI companies ignore the readily available backups and instead crawl every page hundreds of times a day.
I think debian also recently spoke up about it.
Deny /honeypot in your robots.txt
Add <a href="/honeypot" style="display:none" aria-hidden="true">ban me</a> to your index.html
If an IP accesses that path, ban it.
Unrelated meta question but is the aria tag necessarily since display: none; should be removing the content from the flow?
While I doubt it does much today, that file really only matters to those that want to play by the rules which on the free web is not an awful lot of the web anymore I’m afraid.
I'll bite. It seems like a poor strategy to trust by default.
That aggressive crawling to train those on everything is insane.
A random scraper, on the other hand, just racks up my AWS bill and contributes nothing in return. You'd have to be very, very convincing in your bot description (yes, I do check out the link in the user-agent string to see what the bot claims to be for) in order to justify using other people's resources on a large scale and not giving anything back.
An open web that is accessible to all sounds great, but that ideal only holds between consenting adults. Not parasites.
It really won’t. It will steal your website’s content and regurgitate it back out in a mangled form to any lazy prompt that gets prodded into it. GPT bots are a perfect example of the parasites you speak of that have destroyed any possibility of an open web.
My somewhat silly take on seeing a bunch of information like emails in a user agent string is that I don't want to know about your stupid bot. Just crawl my site with a normal user agent and if there's a problem I'll block you based on that problem. It's usually not a permanent block, and it's also usually setup with something like fail2ban so it's not usually an instant request drop. If you want to identify yourself as a bot, fine, but take a hint from googlebot and keep the user agent short with just your identifier and an optional short URL. Lots of bots respect this convention.
But I'm just now reminded of some "Palo Alto Networks" company that started dumping their garbage junk in my logs, they have the audacity to include messages in the user agent like "If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com" or "find out more about our scans in [link]". I put a rule in fail2ban to see if they'd take a hint (how about your dumb bot detects that it's blocked and stops/slows on its own accord?) but I forgot about it until now, seems they're still active. We'll see if they stop after being served nothing but zipbombs for a while before I just drop every request with that UA. It's not that I mind the scans, I'd just prefer to not even know they exist.
Since it is impossible to know a priori which crawler are malicious, and many are malicious, it is reasonable to default to considering anything unknown malicious.
robots.txt main purpose back in the day was curtailing penalties in the search engines when you got stuck maintaining a badly-built dynamic site that had tons of dynamic links and effectively got penalized for duplicate content. It was basically a way of saying "Hey search engines, these are the canonical URLs, ignore all the other ones with query parameters or whatever that give almost the same result."
It could also help keep 'nice' crawlers from getting stuck crawling an infinite number of pages on those sites.
Of course it never did anything for the 'bad' crawlers that would hammer your site! (And there were a lot of them, even back then.) That's what IP bans and such were for. You certainly wouldn't base it on something like User-Agent, which the user agent itself controlled! And you wouldn't expect the bad bots to play nicely just because you asked them.
That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.
Or the Evil Bit proposal, to suggest that malware should identify itself in the headers. "The Request for Comments recommended that the last remaining unused bit, the "Reserved Bit" in the IPv4 packet header, be used to indicate whether a packet had been sent with malicious intent, thus making computer security engineering an easy problem – simply ignore any messages with the evil bit set and trust the rest."
Or maybe more like the opposite: robots.txt told bots what not to touch, while sitemaps point them to what should be indexed. I didn’t realize its original purpose was to manage duplicate content penalties though. That adds a lot of historical context to how we think about SEO controls today.
That wasn’t its original purpose. It’s true that you didn’t want crawlers to read duplicate content, but it wasn’t because search engines penalised you for it – WWW search engines had only just been invented and they didn’t penalise duplicate content. It was mostly about stopping crawlers from unnecessarily consuming server resources. This is what the RFC from 1994 says:
> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
Very much so.
Computation was still expensive, and http servers were bad at running cgi scripts (particularly compared to the streamlined amazing things they can be today).
SEO considerations came way way later.
They were also used, and still are, by sites that have good reasons to not want results in search engines. Lots of court files and transcripts, for instance, are hidden behind robots.txt.
Well, yes, the point is to tell the bots what you've decided to consider "bad" and will ban them for. So that they can avoid doing that.
Which of course only works to the degree that they're basically honest about who they are or at least incompetent at disguising themselves.
I always consider "good" a bot that doesn't disguise itself and follows the robots.txt rules. I may not consider good the final intent of the bot or the company behind it, but the crawler behaviour is fundamentally good.
Especially considering the fact that it is super easy to disguise a crawler and not follow the robots conventions
I admit I'm one of those people. After decades where I should perhaps be a bit more cynical, from time to time I am still shocked or saddened when I see people do things that benefit themselves over others.
But I kinda like having this attitude and expectation. Makes me feel healthier.
Trust by default, also by default, never ignoring suspicious signals.
Trust is not being naïve, I find the confusion of both very worrying.
Actually Veritasium has a great video about this. It's proven as the most effective strategy in monte carlo simulation.
EDIT: This one: https://youtu.be/mScpHTIi-kM
There are varying degrees of this through our lives, where the trust lies not in the fact that people will just follow the rules because they are rules, but because the rules set expectations, allowing everyone to (more or less) know what's going on and decide accordingly. This also makes it easier to single out the people who do not think the rules apply to them so we can avoid trusting them (and, probably, avoid them in general).
Naturally I am talking about cultures where that decision has not been taken away from their citizens.
But yes, I do the same. I just do not come here to pretend this is virtue.
It should be a matter of judgement and not following rules just because.
It's usually a bad default to assume incompetence on the part of others, especially when many experienced and knowledgeable people have to be involved to make a thing happen.
The idea behind the DNT header was to back it up with legislation-- and sure you can't catch and prosecute all tracking, but there are limitations on the scale of criminal move fast and break things before someone rats you out. :P
The GDPR requires consent for tracking. DNT is a very clear "I do not consent" statement. It's a very widely known standard in the industry. It would therefore make sense that a court would eventually find companies not respecting it are in breach of the GDPR.
That was a theory at least...
Fortunately, the post inspector helps you suss out what's missing in some cases, but c'mon, man, how much effort should I spend helping a social media site figure out how to render a preview? Once you get it right, and to quote my 13 year old: "We have arrived, father... but at what cost?"
It used to be this ultrafake eternal job interview site, but people now seem uninhibited to go on wild political rants even there.
Yes, there's an ancient google reference parser in C++11 (which is undoubtedly handy for that one guy who is writing crawlers in C++), but not a lot for the much more prevalent Python and JavaScript crawler writers who just want to check if a path is ok or not.
Even if bot writers WANT to be good, it's much harder than it should be, particularly when lots of the robots info isn't even in the robots.txt files, it's in the index.html meta tags.
rel=nofollow is a bad name. It doesn’t actually forbid following the link and doesn’t serve the same purpose as robots.txt.
The problem it was trying to solve was that spammers would add links to their site anywhere that they could, and this would be treated by Google as the page the links were on endorsing the page they linked to as relevant content. rel=nofollow basically means “we do not endorse this link”. The specification makes this more clear:
> By adding rel="nofollow" to a hyperlink, a page indicates that the destination of that hyperlink should not be afforded any additional weight or ranking by user agents which perform link analysis upon web pages (e.g. search engines).
> nofollow is a bad name […] does not mean the same as robots exclusion standards
Shouldn't be that hard if someone WANT to be good.
It’s surprising to see the author frame what seems like a basic consequence of their actions as some kind of profound realization. I get that personal growth stories can be valuable, but this one reads more like a confession of obliviousness than a reflection with insight.
And then they posted it here for attention.
Objectively, "I give you one (1) URL and you traverse the link to it so you can get some metadata" still counts as crawling, but I think that's not how most people conceptualize the term.
It'd be like telling someone "I spent part of the last year travelling." and when they ask you where you went, you tell them you commuted to-and-fro your workplace five times a week. That's technically travelling, although the other person would naturally expect you to talk about a vacation or a work trip or something to that effect.
You block things -> of course good actors will respect and avoid you -> of course bad actors will just ignore it as it's a piece of "please do not do this" not a firewall blocking things.
You aren't going to get advertising without also providing value - be that money or information. Google has over 2 trillion in capitalization based primarily on the idea of charging people to get additional exposure, beyond what the information on their site otherwise would get.
1)
You consider this about the Linkedin site but don't stop to think about other social networks. This is true about basically all of them. You may not post on Facebook, Bluesky, etc, but other people may like your links and post them there.
I recently ran into this as it turns out the Facebook entries in https://github.com/ai-robots-txt/ai.robots.txt also block the crawler FB uses for link previews.
2)
From your first post,
> But what about being closer to the top of the Google search results - you might ask? One, search engines crawling websites directly is only one variable in getting a higher search engine ranking. References from other websites will also factor into that.
Kinda .... it's technically true that you can rank in Google if you block them in robots.txt but it's going to take a lot more work. Also your listing will look worse (last time I saw this there was no site description, but that was a few years back). If you care about Google SEO traffic you maybe want to let them on your site.
If you don't want people to crawl your content, don't put it online.
There are so many consequences of disallowing robots -- what about the Internet Archive for example?
The author had blocked all bots because they wanted to get rid of AI scrapers. Then they wanted to unblock bots scraping for OpenGraph embeds so they unblocked...LinkedIn specifically. What if I post a link to their post on Twitter or any of the many Mastodon instances? Now they'd have to manually unblock all of their UA, which they obviously won't, so this creates an even bigger power advantage to the big companies.
What we need is an ability to block "AI training" but allow "search indexing, opengraph, archival".
And of course, we'd need a legal framework to actually enforce this, but that's an entirely different can of worms.
orionblastar•7h ago
xnx•7h ago
reaperducer•7h ago
dylan604•6h ago
therein•6h ago
PeterStuer•4h ago
bryanhogan•5h ago
happymellon•4h ago