Our open-source system can block IP addresses based on rules triggered by specific behavior.
Can you elaborate on what exact type of crawlers you would like to block? Like, a leaky bucket of a certain number of requests per minute?
The simplest approach is to count the UA as risky or flag multiple 404 errors or HEAD requests, and block on that. Those are rules we already have out of the box.
It's open source, there's no pain in writing specific rules for rate limiting, thus my question.
Plus, we have developed a dashboard for manually choosing UA blocks based on name, but we're still not sure if this is something that would be really helpful for website operators.
Depends on the goal.
Author wants his instance not to get killed. Request rate limiting may achieve that easily in a way transparent to normal users.
Bad crawlers have been there since the very beginning. Some of them looking for known vulnerabilities, some scraping content for third-party services. Most of them have spoofed UAs to pretend to be legitimate bots.
This is approximately 30–50% of traffic on any website.
Or have a web-proxy that matches on the pattern and extracts the cookie automatically. ;-)
Thanks for this tip.
Hopefully it at least costs them a little bit more.
Then add some proof-of-work that works without JS and is a web standard (e.g. server sends header, client sends correct response, gets access to archive) and mainstream the culture for low-cost hosting for such archives and you're done, also make sure that this sort of feature is enabled in the most basic configuration for all web servers and such, logged separately.
Obviously such a thing will never happen, because the web and culture went in a different direction. But if it were a mainstream thing, you'd get easy to consume archives (also for regular archival and data hoarding) and the "live" versions of sites wouldn't have their logs be bogged down by stupid spam.
immibis•19h ago
It's easy to assume "I received a lot of requests, therefore the problem is too many requests" but you can successfully handle many requests.
This is a clever way of doing a minimally invasive botwall though - I like it.
bob1029•2h ago
There is a point where your web server becomes fast enough that the scraping problem becomes irrelevant. Especially at the scale of a self-hosted forge with a constrained audience. I find this to be a much easier path.
I wish we could find a way to not conflate the intellectual property concerns with the technological performance concerns. It seems like this is essential to keeping the AI scraping drama going in many ways. We can definitely make the self hosted git forge so fast that anything short of ~a federal crime would have no meaningful effect.
idontsee•1h ago
It isn't just the volume of requests, but also bandwidth. There have been cases where scraping represents >80% of a forge's bandwidth usage. I wouldn't want that to happen to the one I host at home.
spockz•1h ago
userbinator•1h ago
That sounds like a bug.