
I was wrong about robots.txt

https://evgeniipendragon.com/posts/i-was-wrong-about-robots-txt/
74•EPendragon•7h ago

Comments

orionblastar•5h ago
This is a mistake that many websites make, trying to block all robots, and the robots that serve their blog posts to users can't function anymore.
xnx•5h ago
Agree. If you don't want it out there, put it in your journal or require a login.
reaperducer•5h ago
Not every web site is a blog. Not every web site can be legally put behind a login.
dylan604•4h ago
What kind of information legally cannot be put behind a login?
therein•3h ago
Maybe he is talking about stuff you're required by law to disclose but you don't really want to be seen too much. Like code of conduct, terms of service, retractions or public apologies.
PeterStuer•2h ago
Worst offenders I come across: official government information that needs to be public, placed behind Cloudflare, preventing even their M2M feeds (RSS, Atom, ...) from being accessed.
bryanhogan•3h ago
Yes, there's often not much reason to block bots that abide by the rules. It just makes your site not show up on other search indexes and introduces problems for users. Malicious bots won't care about your robots.txt anyway.
happymellon•2h ago
Most bots don't serve the blog post to users.
PaulKeeble•5h ago
The problem isn't the robots that do follow robots.txt, it's all the bots that don't. Robots.txt is largely irrelevant now: the bots that honor it don't represent most of the traffic problem. They certainly don't represent the bots that are going to hammer your site without any regard; those bots don't follow robots.txt.
anonnon•4h ago
Not sure why you were downvoted. I have zero confidence that OpenAI, Anthropic, and the rest respect robots.txt however much they insist they do. It's also clear that they're laundering their traffic through residential ISP IP addresses to make detection harder. There are plenty of third-parties advertising the service, and farming it out affords the AI companies some degree of plausible deniability.
zarzavat•2h ago
That's what honeypots are for.

Deny /honeypot in your robots.txt

Add <a href="/honeypot" style="display:none" aria-hidden="true">ban me</a> to your index.html

If an IP accesses that path, ban it.
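
A minimal sketch of the last step, assuming the server writes Common/Combined Log Format access logs (client IP first, request line in quotes). The function name `ips_to_ban` and the sample log format are illustrative assumptions, not something from the comment; how the resulting IPs actually get banned (firewall rule, fail2ban, and so on) is left out.

```python
import re

# Assumed log format: Common/Combined Log Format, where the client IP is the
# first field and the request line is quoted, e.g.
# 203.0.113.7 - - [17/Jul/2025:10:00:00 +0000] "GET /honeypot HTTP/1.1" 200 12
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*?"(?:GET|HEAD|POST) (?P<path>\S+)')

HONEYPOT_PATH = "/honeypot"  # must match the Disallow rule in robots.txt


def ips_to_ban(log_lines):
    """Return the set of client IPs that requested the honeypot path."""
    banned = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not look like access-log entries
        # Ignore any query string when comparing paths.
        path = match.group("path").split("?", 1)[0]
        if path == HONEYPOT_PATH:
            banned.add(match.group("ip"))
    return banned
```

Because the path is disallowed in robots.txt and the link is hidden from human visitors, anything that shows up in this set either ignored robots.txt or did the opposite of what it says.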

kwar13•4m ago
I like this. Adding now. Thanks!
s-mon•5h ago
Having worked on bot detection in the past, I saw some really simple, old-fashioned attacks that worked by doing the opposite of what the robots.txt file says.

While I doubt it does much today, that file really only matters to those that want to play by the rules which on the free web is not an awful lot of the web anymore I’m afraid.

dumbfounder•5h ago
I created a search engine that crawled the web way back in 2003. I used a proper user agent that included my email address. I got SO many angry emails about my crawler, which played as nice as I was able to make it play. Which was pretty nice I believe. If it’s not Google people didn’t want it. That’s a good way to prevent anyone from ever competing with Google. It isn’t just about that preview for LinkedIn, it’s about making sure the web is accessible by everyone and everything that is trying to make its way. Sure, block the malicious ones. But don’t just assume that every bot is malicious by default.
tomrod•4h ago
> But don’t just assume that every bot is malicious by default.

I'll bite. It seems like a poor strategy to trust by default.

ronsor•4h ago
I'll bite harder. That's how the public Internet works. If you don't trust clients at all, serve them a login page instead of content.
__loam•4h ago
It sucks that we're living in a landscape where bad actors take advantage of that way of doing things.
sltkr•3h ago
The really bad actors are going to ignore robots.txt entirely. You might as well be nice to the crawlers that respect robots.txt.
PeterStuer•2h ago
Even if you want to play nice, robots.txt is a catch-22, as accessing it is taken as a signal you are a 'bot' by malconfigured anti-bot 'solutions'.
chasebank•3h ago
Bad actors will always exploit whatever systems are available to them. Always have, always will.
KTibow•3h ago
It sucks more that Cloudflare/similar have responded to this with "if your handshake fingerprints more like curl than like Chrome/Firefox, no access for you".
edoceo•3h ago
Or getting a CAPTCHA from Chrome when visiting a site you've been to dozens of times (Stack Overflow). Now I just skip that content, probably in my LLM already anyway.
realusername•2h ago
It's the same thing as the anti pirate ads, you only annoy legit customers, this aggressive captcha campaign just makes Stack Overflow drop down even faster than it would normally by making it lower quality.
codingminds•1h ago
Keep in mind that those LLMs are one of the bigger reasons why we see more and more anti bot behaviour on sites like SO.

That aggressive crawling to train those on everything is insane.

tickettotranai•3h ago
In fairness this appears to be the direction we are headed anyway
TylerE•3h ago
That's easy to say when it's your bot, but I've been on the other side long enough to know that the problem isn't your bot, it's the 9000 other ones just like it, none of which will deliver traffic anywhere close to the resources consumed by scraping.
kijin•2h ago
True. Major search engines and bots from social networks have a clear value proposition: in exchange for consuming my resources, they help drive human traffic to my site. GPTBot et al. will probably do the same, as more people use AI to replace search.

A random scraper, on the other hand, just racks up my AWS bill and contributes nothing in return. You'd have to be very, very convincing in your bot description (yes, I do check out the link in the user-agent string to see what the bot claims to be for) in order to justify using other people's resources on a large scale and not giving anything back.

An open web that is accessible to all sounds great, but that ideal only holds between consenting adults. Not parasites.

NackerHughes•1h ago
> GPTBot et al. will probably do the same, as more people use AI to replace search.

It really won’t. It will steal your website’s content and regurgitate it back out in a mangled form to any lazy prompt that gets prodded into it. GPT bots are a perfect example of the parasites you speak of that have destroyed any possibility of an open web.

komali2•2m ago
I'm confused why scraping is so resource intensive - it hits every URL your site serves? For an individual ecommerce site that's maybe 10,000 hits?
Jach•2h ago
I guess back in 2003 people would expect an email to actually go somewhere, these days I would expect it to either go nowhere or just be part of a campaign to collect server admin emails for marketing/phishing purposes. Angry emails are always a bit much, but I wonder if they aren't sent as much anymore in general or if people just stopped posting them to point and laugh at and wonder what goes through people's minds to get so upset to send such emails.

My somewhat silly take on seeing a bunch of information like emails in a user agent string is that I don't want to know about your stupid bot. Just crawl my site with a normal user agent and if there's a problem I'll block you based on that problem. It's usually not a permanent block, and it's also usually setup with something like fail2ban so it's not usually an instant request drop. If you want to identify yourself as a bot, fine, but take a hint from googlebot and keep the user agent short with just your identifier and an optional short URL. Lots of bots respect this convention.

But I'm just now reminded of some "Palo Alto Networks" company that started dumping their garbage junk in my logs, they have the audacity to include messages in the user agent like "If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com" or "find out more about our scans in [link]". I put a rule in fail2ban to see if they'd take a hint (how about your dumb bot detects that it's blocked and stops/slows on its own accord?) but I forgot about it until now, seems they're still active. We'll see if they stop after being served nothing but zipbombs for a while before I just drop every request with that UA. It's not that I mind the scans, I'd just prefer to not even know they exist.

mytailorisrich•17m ago
It's just that people are suspicious of unknown crawlers, and rightly so.

Since it is impossible to know a priori which crawlers are malicious, and many are malicious, it is reasonable to default to considering anything unknown malicious.

Falkon1313•4h ago
This is kinda amusing.

robots.txt's main purpose back in the day was curtailing penalties in the search engines when you got stuck maintaining a badly-built dynamic site that had tons of dynamic links and effectively got penalized for duplicate content. It was basically a way of saying "Hey search engines, these are the canonical URLs, ignore all the other ones with query parameters or whatever that give almost the same result."

It could also help keep 'nice' crawlers from getting stuck crawling an infinite number of pages on those sites.

Of course it never did anything for the 'bad' crawlers that would hammer your site! (And there were a lot of them, even back then.) That's what IP bans and such were for. You certainly wouldn't base it on something like User-Agent, which the user agent itself controlled! And you wouldn't expect the bad bots to play nicely just because you asked them.

That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.

Or the Evil Bit proposal, to suggest that malware should identify itself in the headers. "The Request for Comments recommended that the last remaining unused bit, the "Reserved Bit" in the IPv4 packet header, be used to indicate whether a packet had been sent with malicious intent, thus making computer security engineering an easy problem – simply ignore any messages with the evil bit set and trust the rest."

pi_22by7•3h ago
So it did the same work that a sitemap does? Interesting.

Or maybe more like the opposite: robots.txt told bots what not to touch, while sitemaps point them to what should be indexed. I didn’t realize its original purpose was to manage duplicate content penalties though. That adds a lot of historical context to how we think about SEO controls today.

JimDabell•2h ago
> I didn’t realize its original purpose was to manage duplicate content penalties though.

That wasn’t its original purpose. It’s true that you didn’t want crawlers to read duplicate content, but it wasn’t because search engines penalised you for it – WWW search engines had only just been invented and they didn’t penalise duplicate content. It was mostly about stopping crawlers from unnecessarily consuming server resources. This is what the RFC from 1994 says:

> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

— https://www.robotstxt.org/orig.html

Quarrel•36m ago
> It was mostly about stopping crawlers from unnecessarily consuming server resources.

Very much so.

Computation was still expensive, and http servers were bad at running cgi scripts (particularly compared to the streamlined amazing things they can be today).

SEO considerations came way way later.

They were also used, and still are, by sites that have good reasons to not want results in search engines. Lots of court files and transcripts, for instance, are hidden behind robots.txt.

MiddleMan5•3h ago
It should be noted here that the Evil Bit proposal was an April Fools RFC https://datatracker.ietf.org/doc/html/rfc3514
tbrownaw•2h ago
> And you wouldn't expect the bad bots to play nicely just because you asked them.

Well, yes, the point is to tell the bots what you've decided to consider "bad" and will ban them for. So that they can avoid doing that.

Which of course only works to the degree that they're basically honest about who they are or at least incompetent at disguising themselves.

gbalduzzi•1h ago
I think it depends on the definition of bad.

I always consider "good" a bot that doesn't disguise itself and follows the robots.txt rules. I may not consider good the final intent of the bot or the company behind it, but the crawler behaviour is fundamentally good.

Especially considering the fact that it is super easy to disguise a crawler and not follow the robots conventions

atoav•40m ago
Well, you as the person running a website can define unilaterally what you consider good and bad. You may want bots to crawl everything, nothing or (most likely) something in between. Then you judge bots based on those guidelines. You know, like a solicitor who rings a bell that has a sign above it saying "No solicitors": certain assumptions can be made about those who ignore it.
ceautery•3h ago
LinkedIn is by far the worst offender in post previews. The doctype tag must be all lowercase. The HTML document must be well-formed (the meta tags must be in an explicit <head> block, for example). You must have OG meta tags for url, title, type, and image. The url meta tag gets visited, even if it's the same address the inspector is already looking at.

Fortunately, the post inspector helps you suss out what's missing in some cases, but c'mon, man, how much effort should I spend helping a social media site figure out how to render a preview? Once you get it right, and to quote my 13 year old: "We have arrived, father... but at what cost?"
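
A rough pre-flight check for the requirements listed in this comment, using only the Python standard library. It only looks for the four OG properties inside an explicit `<head>`; it is a sketch of the commenter's checklist, not a reproduction of what LinkedIn's inspector actually does, and the names `OpenGraphCollector` and `missing_og_tags` are made up for illustration.

```python
from html.parser import HTMLParser

# The four properties the comment says a preview needs.
REQUIRED_OG = {"og:url", "og:title", "og:type", "og:image"}


class OpenGraphCollector(HTMLParser):
    """Collect og:* meta properties that appear inside an explicit <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = set()

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            prop = attributes.get("property") or ""
            # Only count tags that actually carry a non-empty content value.
            if prop.startswith("og:") and attributes.get("content"):
                self.found.add(prop)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def missing_og_tags(html_text):
    """Return the required OG properties that the page does not declare."""
    collector = OpenGraphCollector()
    collector.feed(html_text)
    return REQUIRED_OG - collector.found
```

Calling `missing_og_tags(page_source)` on a rendered page returns an empty set when all four tags are present in the head, which makes it easy to wire into a build step or test.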

qwerty2000•3h ago
Dude can't even grammar.
babuloseo•3h ago
Isn't LinkedIn dead?
kookamamie•2h ago
One can dream.
nxpnsv•2h ago
It can still be more dead
PeterStuer•2h ago
I feel it is morphing into Twitter/Facebook/Instagram more each day.

It used to be this ultrafake eternal job interview site, but people now seem uninhibited to go on wild political rants even there.

yodon•3h ago
What astounds me is there are no readily available libraries crawler authors can reach for to parse robots.txt and meta robots tags, to decide what is allowed, and to work through the arcane and poorly documented priorities between the two robots lists, including what to do when they disagree, which they often do.

Yes, there's an ancient google reference parser in C++11 (which is undoubtedly handy for that one guy who is writing crawlers in C++), but not a lot for the much more prevalent Python and JavaScript crawler writers who just want to check if a path is ok or not.

Even if bot writers WANT to be good, it's much harder than it should be, particularly when lots of the robots info isn't even in the robots.txt files, it's in the index.html meta tags.

JimDabell•1h ago
robots.txt support is built into the Python stdlib as urllib.robotparser: https://docs.python.org/3/library/urllib.robotparser.html
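
For a crawler author who just wants to check whether a path is OK, a minimal sketch with `urllib.robotparser` might look like this. The robots.txt content and the `ExampleBot` user agent are invented for the example, and the rules are parsed from a string so nothing is fetched over the network; a real crawler would more likely use `set_url()` and `read()`. Note that this module covers robots.txt only, not meta robots tags.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block one named crawler entirely, and keep everyone
# else out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Build a parser from robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# can_fetch(useragent, url) answers "may this agent request this URL?"
for agent, url in [
    ("GPTBot", "https://example.com/posts/1"),
    ("ExampleBot", "https://example.com/posts/1"),
    ("ExampleBot", "https://example.com/private/notes"),
]:
    print(agent, url, parser.can_fetch(agent, url))
```

The same check also shows how a blanket block on a named agent can have side effects: whatever that agent is used for, including link previews, is refused along with crawling.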

rel=nofollow is a bad name. It doesn’t actually forbid following the link and doesn’t serve the same purpose as robots.txt.

The problem it was trying to solve was that spammers would add links to their site anywhere that they could, and this would be treated by Google as the page the links were on endorsing the page they linked to as relevant content. rel=nofollow basically means “we do not endorse this link”. The specification makes this more clear:

> By adding rel="nofollow" to a hyperlink, a page indicates that the destination of that hyperlink should not be afforded any additional weight or ranking by user agents which perform link analysis upon web pages (e.g. search engines).

> nofollow is a bad name […] does not mean the same as robots exclusion standards

— https://microformats.org/wiki/rel-nofollow

yodon•1h ago
Thanks for this!
codingminds•1h ago
I don't see a reason why a good bot operator couldn't build a parser lib in a different language and put it on a public repo.

Shouldn't be that hard if someone WANTS to be good.

elric•51m ago
Sure, but it's always easier to use a tool that's been tried and tested.
donohoe•3h ago
I try to stay away from negative takes here, so I’ll keep this as constructive as I can:

It’s surprising to see the author frame what seems like a basic consequence of their actions as some kind of profound realization. I get that personal growth stories can be valuable, but this one reads more like a confession of obliviousness than a reflection with insight.

And then they posted it here for attention.

zem•2h ago
it's mostly that they didn't think of the page preview fetcher as a "crawler", and did not intend for their robots.txt to block it. it may not be profound but it's at the least not a completely trivial realisation. and heck, an actual human-written blog post can only improve the average quality of the web.
spookie•2h ago
They posted it here because they wouldn't appear on Google otherwise (:
kookamamie•2h ago
You shouldn't worry about LinkedIn, the cancer of the internet.
acosmism•2h ago
if you are hosting a house party that invites the entire world robots.txt is a neon sign to guide guests to where the beers are, who's cooking what kind of burgers and on what grill; rules of the house etc - you'll still have to secure your gold chains and laptop in a safe somewhere or decide to even keep them in the same house yourself
dwaite•1h ago
This doesn't seem like a new discovery at all - this is what news publications have been dealing with ever since they went online.

You aren't going to get advertising without also providing value - be that money or information. Google has over 2 trillion in capitalization based primarily on the idea of charging people to get additional exposure, beyond what the information on their site otherwise would get.

NackerHughes•1h ago
This article could have been two lines. It takes some serious stretching of school-essay-writing muscles to inflate it to this many pages of waffle.
jarofgreen•51m ago
Hey OP,

1)

You consider this for the LinkedIn site but don't stop to think about other social networks. This is true of basically all of them. You may not post on Facebook, Bluesky, etc, but other people may like your links and post them there.

I recently ran into this as it turns out the Facebook entries in https://github.com/ai-robots-txt/ai.robots.txt also block the crawler FB uses for link previews.

2)

From your first post,

> But what about being closer to the top of the Google search results - you might ask? One, search engines crawling websites directly is only one variable in getting a higher search engine ranking. References from other websites will also factor into that.

Kinda .... it's technically true that you can rank in Google if you block them in robots.txt but it's going to take a lot more work. Also your listing will look worse (last time I saw this there was no site description, but that was a few years back). If you care about Google SEO traffic you maybe want to let them on your site.