How I protect my Forgejo instance from AI web crawlers

https://her.esy.fun/posts/0031-how-i-protect-my-forgejo-instance-from-ai-web-crawlers/index.html
31•todsacerdoti•19h ago

Comments

immibis•19h ago
My issue with Gitea (which Forgejo is a fork of) was that crawlers would hit the "download repository as zip" link over and over. Each access creates a new zip file on disk which is never cleaned up. I disabled that (by setting the temporary zip directory to read-only, so the feature won't work) and haven't had a problem since then.

It's easy to assume "I received a lot of requests, therefore the problem is too many requests," but you can successfully handle many requests.

This is a clever way of doing a minimally invasive botwall, though. I like it.
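
For anyone who would rather keep the feature, a cron-able Python sketch of the cleanup that never happens; the archive path here is an assumption, so check where your instance actually writes archives before pointing anything at it:

  import os, time

  # ARCHIVE_DIR is an assumption: check your instance's configuration for
  # where repo archives are actually written before running this.
  ARCHIVE_DIR = "/var/lib/gitea/data/repo-archive"
  MAX_AGE = 24 * 3600  # seconds; delete archives older than a day

  def prune(root=ARCHIVE_DIR, max_age=MAX_AGE):
      cutoff = time.time() - max_age
      for dirpath, _dirnames, filenames in os.walk(root):
          for name in filenames:
              path = os.path.join(dirpath, name)
              if os.path.getmtime(path) < cutoff:
                  os.remove(path)  # drop the stale archive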

bob1029•2h ago
> you can successfully handle many requests.

There is a point where your web server becomes fast enough that the scraping problem becomes irrelevant. Especially at the scale of a self-hosted forge with a constrained audience. I find this to be a much easier path.

I wish we could find a way to not conflate the intellectual property concerns with the technological performance concerns. It seems like this is essential to keeping the AI scraping drama going in many ways. We can definitely make the self-hosted git forge so fast that anything short of ~a federal crime would have no meaningful effect.

idontsee•1h ago
> There is a point where your web server becomes fast enough that the scraping problem becomes irrelevant.

It isn't just the volume of requests, but also bandwidth. There have been cases where scraping represents >80% of a forge's bandwidth usage. I wouldn't want that to happen to the one I host at home.

spockz•1h ago
Maybe it is fast enough, but my objection is mostly to the gross inefficiency of crawlers. They request downloads of whole repositories over and over, wasting CPU cycles to create the archives, storage space to retain them on disk, and bandwidth to send them over the wire. Add to this the gross power consumption of AI and its hogging of physical compute hardware, and it is easy to see "AI" as wasteful.

userbinator•1h ago
> Each access creates a new zip file on disk which is never cleaned up.

That sounds like a bug.

csilker•1h ago
Cloudflare has a solution to protect routes from crawlers.

https://blog.cloudflare.com/introducing-pay-per-crawl/

roywashere•24m ago
Sure, but the whole point of self-hosting Forgejo is to not use these big cloud solutions. Introducing Cloudflare is a step back!

reconnecting•1h ago
tirreno (1) guy here.

Our open-source system can block IP addresses based on rules triggered by specific behavior.

Can you elaborate on what exact type of crawlers you would like to block? Like, a leaky bucket of a certain number of requests per minute?

1. https://github.com/tirrenotechnologies/tirreno
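
For reference, a minimal Python sketch of the leaky bucket floated above; the rate and burst numbers are illustrative, not anything tirreno ships:

  import time
  from collections import defaultdict

  RATE = 1.0    # average requests per second allowed (illustrative)
  BURST = 30.0  # bucket capacity: how large a burst is tolerated

  _level = defaultdict(float)  # ip -> current bucket fill level
  _last = defaultdict(float)   # ip -> timestamp of the last request

  def allow(ip):
      now = time.monotonic()
      # Leak the bucket at RATE since this IP's last request.
      _level[ip] = max(0.0, _level[ip] - (now - _last[ip]) * RATE)
      _last[ip] = now
      if _level[ip] >= BURST:
          return False  # bucket full: block or challenge this client
      _level[ip] += 1.0
      return True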

notachatbot123•1h ago
The article is about AI web crawlers. How can your tool help and how would one set it up for this specific context?

reconnecting•1h ago
I don't see how an AI crawler is different from any other crawler.

The simplest approach is to treat the UA as risky, or to flag multiple 404 errors or HEAD requests, and block on that. Those are rules we already have out of the box.

It's open source, so there's no pain in writing specific rules for rate limiting; hence my question.

Plus, we have developed a dashboard for manually choosing UA blocks by name, but we're still not sure whether this would really be helpful for website operators.
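
As a rough illustration of what such out-of-the-box rules amount to, in Python; the UA tokens and 404 threshold below are made up for the example, not tirreno's shipped config:

  from collections import Counter

  RISKY_UA_TOKENS = ("python-requests", "curl", "scrapy")  # illustrative list
  MAX_404S = 10  # per-IP threshold; made up for the example

  _404s = Counter()  # ip -> 404 count seen so far

  def is_suspicious(ip, user_agent, method, status):
      if any(tok in user_agent.lower() for tok in RISKY_UA_TOKENS):
          return True  # risky UA
      if method == "HEAD":
          return True  # flag HEAD requests, per the rule above
      if status == 404:
          _404s[ip] += 1
          if _404s[ip] >= MAX_404S:
              return True  # too many 404s from one IP
      return False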

Roark66•10m ago
> It's open source, so there's no pain in writing specific rules for rate limiting; hence my question.

Depends on the goal.

The author wants his instance not to get killed. Request rate limiting may achieve that easily, in a way that's transparent to normal users.

reconnecting•25m ago
I believe there is a slight misunderstanding regarding the role of 'AI crawlers'.

Bad crawlers have been around since the very beginning. Some of them look for known vulnerabilities; some scrape content for third-party services. Most of them spoof their UAs to pretend to be legitimate bots.

This is approximately 30–50% of traffic on any website.

userbinator•1h ago
> Unfortunately this means my website could only be seen if you enable JavaScript in your browser.

Or have a web-proxy that matches on the pattern and extracts the cookie automatically. ;-)

apples_oranges•1h ago
HTTP 412 would be better, I guess.
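
A minimal sketch of that idea, answering 412 Precondition Failed when an expected cookie is absent; the cookie name is hypothetical, not the one the article sets:

  from http.server import BaseHTTPRequestHandler, HTTPServer

  COOKIE_NAME = "challenge"  # hypothetical name, not the article's cookie

  class Handler(BaseHTTPRequestHandler):
      def do_GET(self):
          if COOKIE_NAME not in self.headers.get("Cookie", ""):
              self.send_response(412)  # Precondition Failed, no JS wall
              self.end_headers()
              return
          self.send_response(200)
          self.send_header("Content-Type", "text/plain")
          self.end_headers()
          self.wfile.write(b"ok\n")

  if __name__ == "__main__":
      HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
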
maelito•1h ago
I'm having lots of connections every day from Singapore. It's now the main country... despite the whole website being French-only. AI crawlers, for sure.

Thanks for this tip.

arjie•47m ago
Amazonbot does this despite my efforts in robots.txt to help it out. I look at all the Singapore requests and they’re Amazonbot trying to get various variants of the Special:RecentChanges page. You’re wasting your time, Amazonbot. I’m trying to help you.

input_sh•29m ago
Fun fact: you don't get rid of them even when you put a captcha on all visitors from Singapore. I still see the spike in traffic that perfectly matches the spike in served captchas, but this time it's geographically distributed across places like Iraq, Bangladesh, and Brazil.

Hopefully it at least costs them a little bit more.

andai•29m ago
Can someone help me understand where all this traffic is coming from? Are there thousands of companies all doing it simultaneously? How come even small sites get hammered constantly? At some point haven't you scraped the whole thing?

KronisLV•2m ago
We should just have some standard for crawlable archived versions of pages, with no back end or DB interaction behind them. For example, if there's a reverse proxy, whatever it outputs is archived, and the archive version wouldn't actually pass any call on to the back end. Same for translating the output of any dynamic JS into fully static HTML.

Then add some proof-of-work that works without JS and is a web standard (e.g. the server sends a challenge header, the client sends the correct response and gets access to the archive; see the sketch after this comment), mainstream a culture of low-cost hosting for such archives, and you're done. Also make sure this sort of feature is enabled in the most basic configuration of all web servers, and logged separately.

Obviously such a thing will never happen, because the web and culture went in a different direction. But if it were a mainstream thing, you'd get easy to consume archives (also for regular archival and data hoarding) and the "live" versions of sites wouldn't have their logs be bogged down by stupid spam.
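
For what it's worth, a hashcash-style Python sketch of that header challenge/response flow; the difficulty and encoding are arbitrary choices, not any existing standard:

  import hashlib, itertools, os

  DIFFICULTY_BITS = 20  # arbitrary: ~1M hashes on average for the client

  def make_challenge():
      # Server: fresh random challenge, e.g. sent in a response header.
      return os.urandom(16).hex()

  def leading_zero_bits(digest):
      bits = 0
      for byte in digest:
          if byte == 0:
              bits += 8
          else:
              bits += 8 - byte.bit_length()
              break
      return bits

  def solve(challenge):
      # Client: brute-force a nonce; this is the cost a crawler must pay.
      for nonce in itertools.count():
          d = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
          if leading_zero_bits(d) >= DIFFICULTY_BITS:
              return nonce

  def verify(challenge, nonce):
      # Server: a single hash to check, so verification stays cheap.
      d = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
      return leading_zero_bits(d) >= DIFFICULTY_BITS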

The ancient monuments saluting the winter solstice

https://www.bbc.com/culture/article/20251219-the-ancient-monuments-saluting-the-winter-solstice
18•1659447091•1h ago•3 comments

Inverse Parentheses

https://kellett.im/a/inverse-parentheses
40•mighty-fine•2h ago•33 comments

A guide to local coding models

https://www.aiforswes.com/p/you-dont-need-to-spend-100mo-on-claude
433•mpweiher•13h ago•231 comments

Programming languages used for music

https://timthompson.com/plum/cgi/showlist.cgi?sort=name&concise=yes
40•ofalkaed•1d ago•9 comments

Well Being in Times of Algorithms

https://www.ssp.sh/blog/well-being-algorithms/
7•articsputnik•42m ago•0 comments

Deliberate Internet Shutdowns

https://www.schneier.com/blog/archives/2025/12/deliberate-internet-shutdowns.html
191•WaitWaitWha•3d ago•70 comments

Build Android apps using Rust and Iced

https://github.com/ibaryshnikov/android-iced-example
89•rekireki•8h ago•18 comments

I'm just having fun

https://jyn.dev/i-m-just-having-fun/
360•lemper•5d ago•145 comments

How I protect my Forgejo instance from AI web crawlers

https://her.esy.fun/posts/0031-how-i-protect-my-forgejo-instance-from-ai-web-crawlers/index.html
31•todsacerdoti•19h ago•19 comments

Show HN: Books mentioned on Hacker News in 2025

https://hackernews-readings-613604506318.us-west1.run.app
460•seinvak•18h ago•167 comments

Webb observes exoplanet that may have an exotic helium and carbon atmosphere

https://science.nasa.gov/missions/webb/nasas-webb-observes-exoplanet-whose-composition-defies-exp...
62•taubek•2d ago•13 comments

Disney Imagineering Debuts Next-Generation Robotic Character, Olaf

https://disneyparksblog.com/disney-experiences/robotic-olaf-marks-new-era-of-disney-innovation/
194•ChrisArchitect•12h ago•79 comments

Kernighan's Lever

https://linusakesson.net/programming/kernighans-lever/index.php
64•xk3•2d ago•21 comments

Engineering dogmas it's time to retire

https://newsletter.manager.dev/p/5-engineering-dogmas-its-time-to
17•flail•3d ago•17 comments

Aliasing

https://xania.org/202512/15-aliasing-in-general
37•ibobev•6d ago•6 comments

Functional Flocking Quadtree in ClojureScript

https://www.lbjgruppen.com/en/posts/flocking-quadtrees
44•lbj•6d ago•3 comments

CO2 batteries that store grid energy take off globally

https://spectrum.ieee.org/co2-battery-energy-storage
247•rbanffy•19h ago•206 comments

Making the most of bit arrays in Gleam

https://gearsco.de/blog/bit-array-syntax/
26•crowdhailer•3d ago•1 comment

ONNX Runtime and CoreML May Silently Convert Your Model to FP16

https://ym2132.github.io/ONNX_MLProgram_NN_exploration
69•Two_hands•10h ago•15 comments

Lightning: Real-time editing for tiled map data

https://felt.com/blog/lightning-tiles
10•hinting•5d ago•2 comments

More on whether useful quantum computing is “imminent”

https://scottaaronson.blog/?p=9425
89•A_D_E_P_T•13h ago•72 comments

Rue: Higher level than Rust, lower level than Go

https://rue-lang.dev/
147•ingve•13h ago•108 comments

Show HN: Rust/WASM lighting data toolkit – parses legacy formats, generates SVGs

https://eulumdat.icu
28•holg•13h ago•0 comments

Show HN: WalletWallet – create Apple passes from anything

https://walletwallet.alen.ro/
384•alentodorov•18h ago•103 comments

I program on the subway

https://www.scd31.com/posts/programming-on-the-subway
222•evankhoury•5d ago•153 comments

Single-Pass Huffman Coding

https://doisinkidney.com/posts/2018-02-17-single-pass-huffman.html
21•todsacerdoti•6d ago•1 comment

QBasic64 Phoenix 4.3.0 Released

https://qb64phoenix.com/forum/showthread.php?tid=4244
30•jandeboevrie•3h ago•5 comments

Cursed circuits #3: true mathematics

https://lcamtuf.substack.com/p/cursed-circuits-3-true-mathematics
21•zdw•6h ago•3 comments

The Going Dark initiative or ProtectEU is a Chat Control 3.0 attempt

https://mastodon.online/@mullvadnet/115742530333573065
568•janandonly•16h ago•206 comments

Evaluating chain-of-thought monitorability

https://openai.com/index/evaluating-chain-of-thought-monitorability/
54•mfiguiere•3d ago•18 comments