Indie blog businesses are great for the health of the human internet, and I don't think surrendering preemptively will help things get better.
It would be like staying on Twitter or Reddit after their respective declines: you're only paying an opportunity cost with your own time and preventing the internet from evolving better alternatives.
That's the battle, and expression, people, their interests, and their communities are worth fighting for, _ESPECIALLY_ now that botnets and scrapers use services such as Infatica to mask themselves behind residential IP addresses and mimic human behavior to evade bot detection.
Nowadays there's a war on authenticity and people's authentic works, and on the reverse problem: determining whether a user is authentic at all.
1. It's become commonplace to not respect rate limits
2. Bots no longer identify themselves by UA
3. Bots use VPNs or similar tech to bypass IP-based rate limiting
4. Bots use tools like NobleTLS or JA3Cloak to get around JA3-based rate limiting
5. Some legitimate LLM companies seem to follow the same playbook to gather training data. We want them to know about our company, so we don't necessarily want to block them
I'm close to giving up on this front, tbh. There are no longer reliable methods of identifying malicious traffic at scale, and with the number of page variations we serve we can't statically generate them all. Even with a CDN cache (shoutout Fastly), our catalog is simply too broad to fully saturate the cache while still allowing pages to be updated in a timely manner.
I guess the solution is to just scale up the origin servers... /shrug
In all seriousness, I'd love it if we could somehow tell the bots about more efficient ways of fetching the data. Please use our open API for fetching book information instead of causing all that overhead by hitting the marketing pages.
Any halfway modern LLM could probably code the backend for this in a day or two and it'd run on a RasPi. Some org just has to take charge and provide the infra and advertisement.
It's mathematically similar to the "Shinigami Eyes" browser plug-in and database, which has been found to have unreliable data
I'm pretty pro AI, but these incompetent assholes ruin it for everybody.
As discussed elsewhere in this thread, residential devices being used as proxies behind CGNAT ruin this. Not getting rid of IPv4 years ago is finally coming back to bite us in the ass in a big way.
The facet links already had “nofollow” on them; now I’m just enforcing it.
Fake the data! Tell them Neil44 is a three-time Nobel prize winner, etc.
I only protect certain 'dangerous/expensive' (accidentally honeypot-like) paths in my app and leave alone the stuff I actually want crawlers to get, and for my app that's sufficient.
It's a tension because, yeah, I want crawlers to get much of my stuff for SEO (and I don't want to give Google a monopoly on it either; I want well-behaved crawlers I've never heard of to have access to it too), but not at the cost of resources I can't afford.
If you have an axe to grind with CF you can take it up with them, but it’s an option. Feel free to suggest others.
I then read something about a guy who deliberately put a honeypot in his robots.txt file, pointing to a completely bogus endpoint. The theory was: humans won't read robots.txt, so there's no danger, but bots and the like often will (at least to figure out what you have... they'll ignore the "deny" for the most part!), and if they then go to that fake endpoint you can be as close to 100% sure as possible that it's not a human, and you can ban them.
So I tried that.
I auto-generated a robots.txt file on the fly. It was cached for 60 seconds or so, as I didn't want to expend too many resources on it. When you asked for it, you either got the cached one or I created a new one. The CPU usage was negligible.
I changed the "deny" endpoint each time I rebuilt the file in case the baddies cached it, but every variant still went to the same ASP.NET controller method. Anyone hitting it got a 10GB zip bomb, and their IP was automatically added to the firewall block list.
It was quite simple: anyone who hit that endpoint MUST be dodgy... I believe I even had comments in the file for any humans who stumbled across it, letting them know that visiting the endpoint in a browser meant an automatic addition to the firewall blocklist.
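For anyone wanting to try something similar, here is a rough sketch of the idea in ASP.NET Core. This is not the original code: the `/t/` route prefix, the pre-generated `bomb.gz` file, and the in-memory ban set are placeholders of mine, and wiring the ban set into a real firewall block list is left out.

```csharp
// Sketch only: rotating robots.txt honeypot. Assumes controllers are mapped in
// Program.cs; the /t/ prefix, bomb.gz path, and in-memory ban set are assumptions.
using System.Collections.Concurrent;
using Microsoft.AspNetCore.Mvc;

[ApiController]
public class HoneypotController : ControllerBase
{
    private static string _trapToken = Guid.NewGuid().ToString("N");
    private static DateTime _builtAt = DateTime.UtcNow;
    private static readonly ConcurrentDictionary<string, DateTime> Banned = new();

    [HttpGet("/robots.txt")]
    public ContentResult Robots()
    {
        // Rebuild at most once a minute, rotating the "deny" path each time.
        if (DateTime.UtcNow - _builtAt > TimeSpan.FromSeconds(60))
        {
            _trapToken = Guid.NewGuid().ToString("N");
            _builtAt = DateTime.UtcNow;
        }
        return Content($"User-agent: *\nDisallow: /t/{_trapToken}\n", "text/plain");
    }

    [HttpGet("/t/{token}")]
    public IActionResult Trap(string token)
    {
        // Stale tokens from cached robots.txt copies still land here and get caught.
        var ip = HttpContext.Connection.RemoteIpAddress?.ToString() ?? "unknown";
        Banned[ip] = DateTime.UtcNow; // real version: push to the firewall block list

        // Serve a pre-generated gzip bomb: small on disk, huge once decompressed.
        Response.Headers["Content-Encoding"] = "gzip";
        return PhysicalFile("/var/www/bomb.gz", "text/html");
    }
}
```

Any request under the disallowed prefix, current token or stale, hits the same action, which lines up with the observation further down that old trap URLs kept getting requested long after they rotated out.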
Anyway... at first I caught a shitload of bad guys. There were thousands, and then the numbers dropped and dropped to only tens per day.
Anyway, this is a single data point but for me, it worked... I have no regrets about the zip bomb either :)
I have another site I'm working on, so I may evolve it a bit: you get banned for a short time, and if you come back to the dodgy endpoint then I know you're a bot, so into the abyss with you!
It's not perfect but it worked for me anyway.
Fun fact: Some people learn about new exploits by watching their incoming requests.
Definitely! I wasn't experiencing any issues; hell, it wasn't even for public consumption at the time, so no great loss to me. But I found a few things fascinating (and somewhat stupid!) about it:
1. The sheer number of automated requests to scrape my content
2. That a massive number of the bots openly had "bot" or some derivative in the user agent and they were accessing a page I'd explicitly denied! :D
3. That an equally large number were faking their user agents to look like regular users and still hitting a page that a regular user couldn't possibly ever hit!
Something I did notice towards the end, but didn't pursue (I should log it better next time for analysis!), was that even though the endpoint was dynamically generated and only existed in robots.txt for a short time, there were bots I caught later on, long after that auto-generated path was created (and after the IP was banned), that still went for that same page: clearly the same entities!
My spidey senses are tingling. Next time, I'm going to log the shit out of these requests and publish as much as I can for others to analyse and dissect... might be interesting.
This is approximately my approach minus the zip bomb. I use a piece of middleware in my AspNetCore pipeline that tracks logical resource consumption rates per IPv4. If a client trips any of the limits, their IP goes into a HashSet for a period of time. If a client has an IP in this set, they get a simple UTF8 constant string in the response body "You have exceeded resource limits, please try again later".
The other aspect of my strategy is to use AspNetCore (Kestrel). It is so fast that you can mostly ignore the noise as long as things are configured properly and you make reasonable attempts to address the edge case of an asshole trying to break your particular system on purpose. A HashSet<int> as the very first piece of middleware rejecting bad clients is exceedingly efficient. We aren't even into URL routing at this point.
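As a minimal sketch of that first-in-the-pipeline gate (not the commenter's actual middleware): assuming minimal-API hosting, IPv4-only clients, and that some other component populates the ban set, it could look roughly like this. The 429 status code is also an assumption.

```csharp
// Sketch of a banned-IP gate placed before routing. Assumes another piece of
// middleware fills the `banned` set when a client trips a resource limit.
using System.Net.Sockets;
using System.Text;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

var banned = new HashSet<int>();   // packed IPv4 addresses of misbehaving clients
var gate = new object();
var rejection = Encoding.UTF8.GetBytes(
    "You have exceeded resource limits, please try again later");

// Very first piece of middleware: reject known-bad clients before routing runs.
app.Use(async (context, next) =>
{
    var ip = context.Connection.RemoteIpAddress;
    if (ip != null && ip.AddressFamily == AddressFamily.InterNetwork)
    {
        var key = BitConverter.ToInt32(ip.GetAddressBytes(), 0);
        bool isBanned;
        lock (gate) { isBanned = banned.Contains(key); }
        if (isBanned)
        {
            context.Response.StatusCode = StatusCodes.Status429TooManyRequests;
            await context.Response.Body.WriteAsync(rejection);
            return;
        }
    }
    await next();
});

app.MapGet("/", () => "hello");
app.Run();
```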
I have found that attempting to catalog and record all of the naughty behavior my web server sees is the biggest DDoS risk so far. Logging lines like "banned client rejected" every time they try to come in the door is shooting yourself in the foot with regard to disk wear, IO utilization, et al. There is no reason you should be logging all of that background radiation to disk or even thinking about it. If your web server can't handle direct exposure to the hard vacuum of space, it can be placed behind a proxy/CDN (i.e., another web server that doesn't suck).
Would a simple 429 not do the same thing? You could log repeated 429's and banish accordingly.
I imagine they get a 429 response code, but if they don't, you may want to change that.
I do think you're on the right track in that it's important to let those requests get the correct error, so that if innocent people are affected, they at least get to see something is wrong.
I think an evolution would be some sort of exponential backoff, e.g. first-time offenders get banned for an hour, the second time it's 4 hours, and the third time you're sent into the abyss!
Still crude but fun to play about with.
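A tiny sketch of what that escalation could look like, using the one-hour, four-hour, and permanent thresholds from above; the per-IP offence counter and where it lives are assumptions of mine.

```csharp
// Sketch only: escalate ban duration per repeat offence (1h, 4h, then forever).
using System.Collections.Concurrent;

class BanPolicy
{
    private readonly ConcurrentDictionary<string, int> _offences = new();

    // Returns how long to ban this client, growing with each repeat offence.
    public TimeSpan NextBanDuration(string clientIp)
    {
        var count = _offences.AddOrUpdate(clientIp, 1, (_, c) => c + 1);
        return count switch
        {
            1 => TimeSpan.FromHours(1),
            2 => TimeSpan.FromHours(4),
            _ => TimeSpan.MaxValue // third strike: into the abyss
        };
    }
}
```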
Today there are far too many people scraping stuff that isn't intended to be scraped, for profit, and doing it in a heavy-handed way that actually does have a negative and continuous effect on the victim's capacity.
Everyone from AI services too lazy or otherwise unwilling to cache to companies exfiltrating some kind of data for their own commercial purposes.
But as I'm growing older I'm learning that the tech industry is mostly politically driven and relies on truth obfuscation, as explained by Peter Thiel, rather than real empowerment.
It’s facilitating accumulation of control and power at an unparalleled pace. If anything it’s proving to be more unjust than the feudal systems it promises to replace.
I run an e-commerce-specific scraping API that helps developers access SERP, PDP, and reviews data. I've noticed the web already has unspoken balances: certain traffic patterns and techniques are tolerated, others clearly aren't. Most sites handle reasonable, well-behaved crawlers just fine.
Platforms claim ownership of UGC and public data through dark patterns and narrative control. The current guidelines are a result of supplier convenience, and there are several cases where absolutely fundamental web services run by the largest companies in the world themselves breach those guidelines (including those funded by the fund running this site). We need standards that treat public data as a shared resource with predictable, ethical access for everyone, not just for those with scale or lobbying power.
Even if there is legislation or whatever, you can sue an OpenAI or a Microsoft, but starting a new company that does scraping and sells it on to the highest bidder is trivial.
And for the record, large companies regularly ignore robots.txt themselves: LinkedIn, Google, OpenAI, and plenty of others.
The reality is that it’s the big players who behave like the aggressors, shaping the rules and breaking them when convenient. Smaller developers aren’t the problem, they’re just easier to punish.
I spent about 30 seconds thinking about this, so it's clearly the perfect solution with zero drawbacks or tradeoffs.
A CDN. What you are describing is a CDN. We have CDNs today and the problem still exists because most of today's websites refuse to operate within the constraints. There is no need for new infrastructure to deploy this solution, we just need website operators to "give up" and operate in a more static way.
It's understandable in your case since you have traffic coming in constantly, but the first thing that came to my mind is a loop of constant reboots - again, very unlikely in your case. Sometimes such blanket rules hit me for the most unexpected reasons, like the proxy somehow failing to start serving traffic in the given timeframe.
Though I completely appreciate and agree with the 'ship something that works now' approach!
asplake•4h ago
Wild indeed, and potentially horrific for the owners of the affected devices also! Any corroboration for that out there?
VladVladikoff•4h ago
Recently I filed an abuse complaint directly with brightdata because I was getting hit with 1000s of requests from their bots. The funny part is they didn't even stop after acknowledging the complaint.
dataviz1000•4h ago
[0] https://www.youtube.com/watch?v=1a9HLrwvUO4&t=15s
kijin•2h ago
Actually, that might be one way to draw attention to the problem. Sign up to some of these shady "residential proxy" services, and access all sorts of nasty stuff through their IPs until your favorite three-letter agency takes notice.
TYPE_FASTER•2h ago
Oh, and they will sell you the datasets they've already scraped using mobile devices: https://brightdata.com/lp/web-data/datasets
This actually explains a phishing attack where I received a text from somebody purporting to be a co-worker asking for an Apple gift card. The name was indeed that of an employee from a different part of the large company I worked for at the time, and LinkedIn was the only even somewhat publicly available link between us I could figure out.
This should probably be required in all CS curriculum: https://ocw.mit.edu/courses/res-tll-008-social-and-ethical-r...
myaccountonhn•3h ago
https://lobste.rs/s/pmfuza/bro_ban_me_at_ip_level_if_you_don...
nurettin•3h ago
They indeed provided "high quality" residential and cellular IPs and "normal quality" data center IPs. You had to keep cycling the IP pool every 2-3 days, which cost extra. It felt super shady. It isn't their own bots; they lease connections to whoever is paying, and they don't care what people do with them.
yomismoaqui•1h ago
You had my curiosity ... but now you have my attention.
kaoD•4h ago
https://hola.org/legal/sdk
https://hola.org/legal/sla
> How is it free? > > In return for free usage of Hola Free VPN Proxy, Hola Fake GPS location and Hola Video Accelerator, you may be a peer on the Bright Data network. By doing so you agree to have read and accepted the terms of service of the Bright Data SDK SLA (https://bright-sdk.com/eula). You may opt out by becoming a Premium user.
This "VPN" is what powers these residential proxies: https://brightdata.com/
I'm sure there are many other companies like this.
piggg•2h ago
On top of that, lots of free TV/movie streaming apps also turn your device into a proxy/egress node. Sometimes you find it on TV/movie streaming devices sold online, where it's already loaded when the device arrives.
Cthulhu_•1h ago
edit: Actually, this is what I'm getting increasingly angry about: providers and platforms not doing anything against bots or low-value stuff (think Amazon dropshippers too), because any usage of their service, bots or otherwise, makes the metrics go up, and metrics going brrt means profit and shareholder interest.
ac29•50m ago
But yes, they also might not care if they're getting paid. If the SIMs are only being used for voice/text, as I suspect, it might put very minimal load on the network.
nix0n•1h ago
Depends on what they're doing from your connection.