
Facebook's Fascination with My Robots.txt

https://blog.nytsoi.net/2026/02/23/facebook-robots-txt
62•Ndymium•3h ago

Comments

Ndymium•3h ago
For some reason, Facebook has been requesting my Forgejo instance's robots.txt in a loop for the past few days, currently at a speed of 7700 requests per hour. The resource usage is negligible, but I'm wondering why it's happening in the first place and how many other robot files they're also requesting repeatedly. Perhaps someone at Meta broke a loop condition.
antonyh•1h ago
As facebookexternalhit is listed in the robots.txt, it does look like it's optimistically rechecking in the hope it's no longer disallowed. That rate of request is obscene though, and falls firmly into the category of Bad Bot.
mghackerlady•54m ago
That is probably the dumbest yet most genius solution to getting your scraper blocked I've ever seen
matja•2h ago
Did you try adding a Cache-Control response header?
mrweasel•1h ago
Even if they haven't added any cache control headers, what kind of lazy Meta engineer designed their crawler to just pull the same URL multiple times a second?

Is this where all that hardware for AI projects is going? To data centers that just uncritically hit the same URL over and over, without checking whether the content of a site or page has changed since the last visit and calculating a proper retry interval. Search engine crawlers 25-30 years ago could do this.

Hit the URL once per day; if it changes daily, try twice a day. If it hasn't changed in a week, maybe only retry twice per week.
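The adaptive schedule described above can be sketched in a few lines. This is a hypothetical illustration (names and backoff factors are made up, not anyone's actual crawler): keep a content hash and a per-URL interval, doubling it when content is unchanged and halving it when it changes, within the twice-a-day / twice-a-week bounds mentioned.

```python
import hashlib
import time

class CrawlState:
    """Hypothetical per-URL crawl state: last fetch time, content hash,
    and the current retry interval in seconds."""

    def __init__(self, interval=86400):  # start at once per day
        self.last_fetch = 0.0
        self.content_hash = None
        self.interval = interval

    def due(self, now=None):
        now = time.time() if now is None else now
        return now - self.last_fetch >= self.interval

    def record(self, body: bytes, now=None):
        now = time.time() if now is None else now
        new_hash = hashlib.sha256(body).hexdigest()
        if new_hash == self.content_hash:
            # Unchanged: back off, but retry at least twice a week.
            self.interval = min(self.interval * 2, 7 * 86400 // 2)
        else:
            # Changed: poll more often, at most twice a day.
            self.interval = max(self.interval // 2, 86400 // 2)
        self.content_hash = new_hash
        self.last_fetch = now
```

Even this naive scheme would cut 7700 requests/hour for a static robots.txt down to a handful per week.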

bot403•1h ago
It's not the "same" crawler. Probably each thread or each cluster machine instance of the crawler hitting it independently.
mrweasel•1h ago
I sincerely doubt that search engines run their crawlers on a single machine either, and they got it figured out.
OliverGuy•1h ago
That's still the same crawler system though. And it's lazy engineering to not build in something to track when you last requested a URL.

And it's quite a trivial feature at that.

Ndymium•1h ago
Forgejo does set "cache-control: private, max-age=21600", which is considerably more than one second, but I grant it uses the "private" keyword for no reason here.
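A polite crawler could honor exactly that header. A minimal sketch (hypothetical helper functions; parses only the `max-age` directive) of deciding whether a cached copy is still fresh:

```python
import re
import time

def max_age_seconds(cache_control: str):
    """Extract max-age from a Cache-Control header value, if present."""
    m = re.search(r"max-age\s*=\s*(\d+)", cache_control)
    return int(m.group(1)) if m else None

def is_fresh(fetched_at: float, cache_control: str, now=None) -> bool:
    """True if a response fetched at `fetched_at` is still fresh per max-age."""
    now = time.time() if now is None else now
    age = max_age_seconds(cache_control)
    if age is None:
        return False  # no max-age: revalidate on next visit
    return now - fetched_at < age
```

With `max-age=21600`, any crawler doing this check would refetch at most four times a day instead of 7700 times an hour.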
xg15•1h ago
Facebook just decided that instead of loading the robots.txt for every host they intend to crawl, they'll just ignore all the other robots.txt files and then access this one a million times to restore the average.
Vinnl•21m ago
Ah yes, robots_georg.txt.
Nextgrid•1h ago
> Perhaps someone at their end screwed up a loop conditional, but you'd think some monitoring dashboard somewhere would have a warning pop up because of this.

If you've been in any big company you'll know things perpetually run in a degraded, somewhat broken mode. They've even made up the term "error budget" because they can't be bothered to fix the broken shit so now there's an acceptable level of brokenness.

goodmythical•42m ago
>they can't be bothered to fix the broken shit

Surely it's more likely that it's just cheaper to pay for the errors than to pay to fix the errors.

Why fix 10k worth of errors if it'll cost me 100k to fix it?

DanielHB•32m ago
Orgs are not ruthless like that; anything below a certain % of org revenue is not worth bothering with, unless it creates _more_ work for the person responsible than fixing it would.

Add some % if the person who gets more work from the problem is not the same as the person who needs to fix it. People will happily leave things in a broken state if no one calls them out on it.

nazgulsenpai•9m ago
I'm in my 3rd year of enterprise now and have learned that there are many engineers who will purposely not fix or improve their problematic applications as a weird sort of job security. It kind of blew up in their faces last year when we moved most of the affected on-premise applications to the cloud. It seems that when you introduce tons of friction on-premise, it makes the cloud look even better to the suits.
tananaev•1h ago
Maybe they’re trying to DDoS it, and once an error is returned, they assume that no robots.txt file exists and then crawl everything else on the site?
Ndymium•1h ago
While 7700 per hour sounds big, pretty much any dinky server can handle it. So I don't think it's a matter of DDoS. At this point it's just... odd behaviour.
mghackerlady•51m ago
especially for a txt file. I don't know much about webdev, but I'm pretty sure serving 7700 plaintext files of roughly 10 lines each per hour isn't that demanding
evv•1h ago
Have you considered serving a zip bomb to this user agent?
delecti•1h ago
I'm sure their crawler can handle a zip bomb. Plus it might interpret that as "this site doesn't have a robots.txt" and start scraping what OP is trying to prevent with their current robots.txt.
1e1a•1h ago
Could allow only the path to the zip bomb for this user agent.
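For what it's worth, robots.txt does support per-agent Allow rules, so the idea would look something like this (path is hypothetical, and whether a misbehaving crawler honors Allow/Disallow at all is exactly the question):

```
User-agent: facebookexternalhit
Disallow: /
Allow: /zipbomb
```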
FartyMcFarter•1h ago
That will work once at most and then quickly get fixed.
esseph•51m ago
Are you so sure? :)
xp84•1m ago
Yeah it seems like this team takes a really tough stance on obvious bugs
dormento•1h ago
Has anyone done research on the topic of trying to block these bots by claiming to host illegal material or talking about certain topics? I mean having a few entries in your robots like "/kill-president", "/illegal-music-downloads", "/casino-lucky-tiger-777" etc.
pousada•1h ago
Yea I can’t see how that could backfire in any way
DetroitThrow•18m ago
FB crawler is used for national security reasons at times. The first would probably make it more active.
mghackerlady•49m ago
>my extreme LibreOffice Calc skillz

How does one learn these skills? I can see them being useful in the future

petee•38m ago
Do crawlers follow/cache 301 permanent redirects? I wonder if you could point the firehose back at Facebook, but it would mean they wouldn't get your robots.txt anymore (though I'd just blackhole that whole subnet anyway)
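Assuming the site sits behind nginx, the redirect idea could be sketched like this (illustrative config, not a recommendation; whether the crawler follows or caches the 301 is up to the crawler):

```nginx
# Redirect one crawler's robots.txt fetches back at its owner;
# everyone else still gets the real file.
location = /robots.txt {
    if ($http_user_agent ~* "facebookexternalhit") {
        return 301 https://www.facebook.com/robots.txt;
    }
}
```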
lloydatkinson•33m ago
I recently started maintaining a MediaWiki instance for a niche hobbyist community and we'd been struggling with poor server performance. I didn't set the server up, so came into it assuming that the tiny amount of RAM the previous maintainer had given it was the problem.

Turns out all of the major AI slop companies had been hounding our wiki constantly for months, and this had resulted in Apache spawning hundreds of instances, bringing the whole machine to a halt.

Millions upon millions of requests, hundreds of GB's of bandwidth. Thankfully we're using Cloudflare so could block all of them except real search engine crawlers and now we don't have any problems at all. I also made sure to constrain Apache's limits a bit too.

From what I've read, forums, wikis, and git repos are the primary targets of harassment by these companies for some reason. The worst part is these bots could just download a git repo or a wiki dump and do whatever they want with it, but instead they are designed to push maximum load onto their victims.

Our wiki, in total, is a few gigabytes. They crawled it thousands of times over.
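Constraining Apache so a crawler burst queues instead of spawning hundreds of processes is typically done via MPM limits. A sketch for mpm_event (values are illustrative, tune to your RAM):

```apache
# Cap total worker threads so a bot flood backlogs at the listen
# queue instead of exhausting memory. 4 procs x 25 threads = 100.
<IfModule mpm_event_module>
    ServerLimit            4
    ThreadsPerChild       25
    MaxRequestWorkers    100
</IfModule>
```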

13pixels•13m ago
Facebook is honestly the least interesting crawler misbehaving right now. The real shift is GPTBot, ClaudeBot, PerplexityBot and a dozen other AI crawlers that don't even identify themselves half the time.

I've been monitoring server logs across ~150 sites and the pattern is striking: AI crawler traffic increased roughly 8x in the last 12 months, but most site owners have no idea because it doesn't show up in analytics. The bots read everything, respect robots.txt maybe 60% of the time, and the content they index directly shapes what ChatGPT or Perplexity recommends to users.

The irony is that robots.txt was designed for a world where crawling meant indexing for search results. Now crawling means training data and real-time retrieval for AI answers. Completely different power dynamic and most robots.txt files haven't adapted.
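Spotting this in your own access logs is straightforward. A minimal sketch, assuming combined log format (the User-Agent is the last quoted field) and an illustrative bot list; undeclared crawlers won't match anything here:

```python
import re
from collections import Counter

# Substrings identifying self-declared AI crawlers (illustrative list).
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "facebookexternalhit"]

# Combined log format: User-Agent is the last double-quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def count_bots(log_lines):
    """Tally requests per known bot from iterable of access-log lines."""
    counts = Counter()
    for line in log_lines:
        m = UA_RE.search(line)
        if not m:
            continue
        ua = m.group(1)
        for bot in AI_BOTS:
            if bot in ua:
                counts[bot] += 1
                break
    return counts
```

Running something like this over a day of logs is usually how site owners discover the 8x growth, since none of it shows up in JavaScript-based analytics.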

VladVladikoff•12m ago
My bet is this is a threading bug rather than just a broken loop. Somehow the threads are failing to communicate with each other, or there's some sort of race condition, so it keeps putting the same task into the queue but missing the result. Something like that.

The Age Verification Trap, Verifying age undermines everyone's data protection

https://spectrum.ieee.org/age-verification
93•oldnetguy•55m ago•28 comments

Ladybird Browser adopts Rust

https://ladybird.org/posts/adopting-rust/
498•adius•3h ago•263 comments

The peculiar case of Japanese web design

https://sabrinas.space
35•montenegrohugo•49m ago•6 comments

Elsevier shuts down its finance journal citation cartel

https://www.chrisbrunet.com/p/elsevier-shuts-down-its-finance-journal
305•qsi•6h ago•56 comments

Sub-$200 Lidar could reshuffle auto sensor economics

https://spectrum.ieee.org/solid-state-lidar-microvision-adas
225•mhb•3d ago•303 comments

Magical Mushroom – Europe's first industrial-scale mycelium packaging producer

https://magicalmushroom.com/index
153•microflash•7h ago•55 comments

I built Timeframe, our family e-paper dashboard

https://hawksley.org/2026/02/17/timeframe.html
1314•saeedesmaili•20h ago•318 comments

0 A.D. Release 28: Boiorix

https://play0ad.com/new-release-0-a-d-release-28-boiorix/
217•jonbaer•3d ago•68 comments

femtolisp: A lightweight, robust, scheme-like Lisp implementation

https://github.com/JeffBezanson/femtolisp
27•tosh•2h ago•4 comments

Pipelined Relational Query Language, Pronounced "Prequel"

https://prql-lang.org/
34•dmit•2d ago•15 comments

Hacker News.love – 22 projects Hacker News didn't love

https://hackernews.love/
137•ohong•5h ago•87 comments

SETI@home: Data Acquisition and Front-End Processing (2025)

https://iopscience.iop.org/article/10.3847/1538-3881/ade5a7
38•tosh•5h ago•4 comments

Show HN: CIA World Factbook Archive (1990–2025), searchable and exportable

https://cia-factbook-archive.fly.dev/
406•MilkMp•18h ago•85 comments

Loops is a federated, open-source TikTok

https://joinloops.org/
494•Gooblebrai•20h ago•329 comments

Hetzner (European hosting provider) to increase prices by up to 38%

https://old.reddit.com/r/BuyFromEU/comments/1rce0lf/hetzner_european_hosting_provider_to_increase/
275•doener•3h ago•226 comments

What Is a Centipawn Advantage?

https://win-vector.com/2026/02/19/what-is-a-centipawn-advantage/
16•jmount•3d ago•2 comments

Facebook's Fascination with My Robots.txt

https://blog.nytsoi.net/2026/02/23/facebook-robots-txt
62•Ndymium•3h ago•32 comments

Show HN: AI Timeline – 171 LLMs from Transformer (2017) to GPT-5.3 (2026)

https://llm-timeline.com/
35•ai_bot•6h ago•26 comments

Pope tells priests to use their brains, not AI, to write homilies

https://www.ewtnnews.com/vatican/pope-leo-xiv-tells-priests-to-use-their-brains-not-ai-to-write-h...
362•josephcsible•7h ago•301 comments

Microspeak: Escrow

https://devblogs.microsoft.com/oldnewthing/20260217-00/?p=112067
4•ibobev•3d ago•1 comment

My journey to the microwave alternate timeline

https://www.lesswrong.com/posts/8m6AM5qtPMjgTkEeD/my-journey-to-the-microwave-alternate-timeline
298•jstanley•4d ago•127 comments

Bitmovin (YC S15) Is Hiring Interns in AI for Summer 2026 in Austria

https://bitmovin.com/careers/8023403002/
1•slederer•8h ago

VTT Test Donut Lab Battery Reaches 80% Charge in Under 10 Minutes [pdf]

https://pub-fee113bb711e441db5c353d2d31abbb3.r2.dev/VTT_CR_00092_26.pdf
62•sagyam•2h ago•57 comments

QRTape – Audio Playback from Paper Tape with Computer Vision (2021)

http://www.theresistornetwork.com/2021/03/qrtape-audio-playback-from-paper-tape.html
20•austinallegro•5h ago•10 comments

Don't host email yourself – your reminder in 2026

https://www.coinerella.com/dont-host-email-yourself-your-reminder-in-2026/
13•willy__•3h ago•11 comments

Crawling a billion web pages in just over 24 hours, in 2025

https://andrewkchan.dev/posts/crawler.html
83•pseudolus•11h ago•20 comments

Man accidentally gains control of 7k robot vacuums

https://www.popsci.com/technology/robot-vacuum-army/
353•Brajeshwar•1d ago•194 comments

Google restricting Google AI Pro/Ultra subscribers for using OpenClaw

https://discuss.ai.google.dev/t/account-restricted-without-warning-google-ai-ultra-oauth-via-open...
715•srigi•16h ago•595 comments

Six Math Essentials

https://terrytao.wordpress.com/2026/02/16/six-math-essentials/
265•digital55•19h ago•56 comments

The JavaScript Oxidation Compiler

https://oxc.rs/
223•modinfo•12h ago•109 comments