Personally, I think the "network sharing" software bundled with apps should fall into the category of potentially unwanted applications, along with adware and spyware. All of the above "tag along" with something the user DID want to install, and quietly misuse the user's resources. Proxies like this definitely have an impact on metered/slow connections - I'm tempted to start Wireshark'ing my devices now to look for suspicious activity.
There should be a public repository of apps known to have these shady behaviours. Having done some light web scraping for archival/automation myself, it's a pity that such scraping will become collateral damage in the anti-AI-botfarm fight.
Is the premise that users should not be allowed to use vpns in order to participate in ecommerce?
[1] https://reports.exodus-privacy.eu.org/en/trackers/ [2] https://f-droid.org/packages/com.aurora.store/
I wouldn't mind reading a comprehensive report on SOTA with regard to bot-blocking.
Sure, there's Anubis (although someone elsethread called it a half-measure, and I'd like to know why), there are CAPTCHAs, there's relying on a monopoly (Cloudflare, etc.) that probably also wants to run its own bots at some point, but what else is there?
If the app isn't a web browser, none are legit?
I suspect that this goes for many different SDKs. Personally, I am really, really sick of hearing "That's a solved problem!", whenever I mention that I tend to "roll my own," as opposed to including some dependency, recommended by some jargon-addled dependency addict.
Bad actors love the dependency addiction of modern developers, and have learned to set some pretty clever traps.
The "network sharing" behavior in these SDKs is the sole purpose of the SDK. It isn't being included as a surprise along with some other desirable behavior. What needs to stop is developers including these SDKs as a secondary revenue source in free or ad-supported apps.
Doubt it. This is just one of many carrots used to entice developers to include dodgy software in their apps.
The problem is a lot bigger than these libraries. It's an endemic cultural issue. Much more difficult to quantify or fix.
Is that greed?
I can find many reasons to be critical of that developer, things like creating a product for a market segment that is saturated, and likely doing so because it is low-hanging fruit (both conceptually and in terms of complexity). I can be critical of their moral judgement for how they decided to generate income from their poor business judgment. But I don't think it's right to automatically label them as greedy. They may be greedy, but they may also be trying to generate income from their work.
Umm, yes? You are not owed anything in this life, certainly not income for your choice to spend your time on building a software product no one asked for. Not making money on it is a perfectly fine outcome. If you desperately need guaranteed money, don't build an app expecting it to sell; get a job.
Technically true, but a bit of perspective might help. The consumer market is distorted by free (as in beer) apps that do a bunch of shitty things that should in many cases be illegal or require much more informed consent than today, like tracking everything they can. Then you have VC-funded “free” as well, where the end game is to raise prices slowly to boil the frog. Then you have loss leaders from megacorps, and a general anti-competitive business culture.
Plus, this is not just in the Wild West shady places, like the old Pirate Bay ads. The top result for “timer” on the App Store (for me) is indeed a timer app, but with an IAP for an $800/year subscription… facilitated by Apple Inc, who gets 15-30% of the bounty.
Look, the point is it’s almost impossible to break into consumer markets because everyone else is a predator. It’s a race to the bottom, ripping off clueless customers. Everyone would benefit from a fairer market. Especially honest developers.
That’s got to be money laundering or something else illicit? No one is actually paying that for a timer app?
We could have people ask for software in a more convenient way.
Not making money could be an indication the software isn't useful, but what if it is? What can the collective do in that zone?
I imagine one could ask and pay for unwritten software then get a refund if it doesn't materialize before your deadline.
Why is discovery (of many creations) willingly handed over to a handful of megacorps?? They seem to think I want to watch and read about Trump and Elon every day.
Promoting something because it is good is a great example of a good thing that shouldn't pay.
Brings a new meaning to dependency injection.
My personal beef is that most of the time it acts like hidden global dependencies, and the configuration of those dependencies, along with their lifetimes, becomes harder to understand by not being traceable in the source code.
To me it’s rather anti-functional. Normally, when you instantiate a class, the resulting object’s behavior only depends on the constructor arguments you pass it (= the behavior is purely a function of the arguments). With dependency injection, the object’s behavior may depend on some hidden configuration, and not even inspecting the class’ source code will be able to tell you the source of that behavior, because there’s only an @Inject annotation without any further information.
Conversely, when you modify the configuration of which implementation gets injected for which interface type, you potentially modify the behavior of many places in the code (including, potentially, the behavior of dependencies your project may have), without having passed that code any arguments to that effect. A function executing that code suddenly behaves differently, without any indication of that difference at the call site, or traceable from the call site. That’s the opposite of the functional paradigm.
It sounds like you have a gripe with a particular DI framework and not the idea of Dependency Injection. Because
> Normally, when you instantiate a class, the resulting object’s behavior only depends on the constructor arguments you pass it (= the behavior is purely a function of the arguments)
With Dependency Injection this is generally still true, even more so than normal, because you're making the constructor's dependencies explicit in the arguments. If you have a class CriticalErrorLogger(), you can't directly tell where it logs to: is it using a flat file, stdout, or a network logger? If you instead have a class CriticalErrorLogger(logger *io.writer), then when you create it you know exactly what it's using to log, because you had to instantiate it and pass it in.
Or like Kortilla said, instead of passing in a class or struct you can pass in a function, so using the same example, something like CriticalErrorLogger(fn write)
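To make that concrete, here's a minimal Go sketch of constructor injection; CriticalErrorLogger comes from the example above, everything else is illustrative:

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// CriticalErrorLogger doesn't know or care where its output goes;
// the destination is injected through the constructor.
type CriticalErrorLogger struct {
	out io.Writer
}

func NewCriticalErrorLogger(out io.Writer) *CriticalErrorLogger {
	return &CriticalErrorLogger{out: out}
}

func (l *CriticalErrorLogger) Log(msg string) {
	fmt.Fprintf(l.out, "CRITICAL: %s\n", msg)
}

func main() {
	// The call site decides the destination: stdout here, but a file,
	// a network writer, or a test buffer would work exactly the same way.
	logger := NewCriticalErrorLogger(os.Stdout)
	logger.Log("disk is full")
}
```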
My issue with that is this: From the point of view of the code accessing the injected value (and from the point of view of that code's callers), the value appears like out of thin air. There is no way to trace back from that code where the value came from. Similarly, when defining which value will be injected, it can be difficult to trace all the places where it will be injected.
In addition, there are often lifetime issues involved, when the injected value is itself a stateful object, or may indirectly depend on mutable, cached, or lazy-initialized, possibly external state. The time when the value's internal state is initialized or modified, or whether or not it is shared between separate injection points, is something that can't be deduced from the source code containing the injection points, but is often relevant for behavior, error handling, and general reasoning about the code.
All of this makes it more difficult to reason about the injected values, and about the code whose behavior will depend on those values, from looking at the source code.
I agree with your definition except for this part: you don't need any framework to do dependency injection. It's simply the idea that instead of having an abstract base class CriticalErrorLogger, with concrete implementations StdOutCriticalErrorLogger, FileCriticalErrorLogger, and AwsCloudwatchCriticalErrorLogger which bake their dependency into the class design, you instead have a concrete class CriticalErrorLogger(dep *dependency) and create dependency objects externally that implement identical interfaces in different ways. You do text formatting, generate a traceback, etc., and then call dep.write(myFormattedLogString), and the dependency handles whatever that means.
I agree with you that most DI frameworks are too clever and hide too much, and some forms of DI like setter injection and reflection based injection are instant spaghetti code generators. But things like Constructor Injection or Method Injection are so simple they often feel obvious and not like Dependency Injection even though they are. I love DI, but I hate DI frameworks; I've never seen a benefit except for retrofitting legacy code with DI.
And yeah, it does add the issue of lifetime management. That's an easy place to F things up in your code using DI and requires careful thought in some circumstances. I can't argue against that.
But DI doesn't need frameworks or magic methods or attributes to work. And there's a lot of situations where DI reduces code duplication, makes refactoring and testing easier, and actually makes code feel less magical than using internal dependencies.
The basic principle is much simpler than most DI frameworks make it seem. Instead of initializing a dependency internally, receive the dependency in some way. It can be through overly abstracted layers or magic methods, but it can also be as simple as adding an argument to the constructor or a given method that takes a reference to the dependency and uses that.
edit: made some examples less ambiguous
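For completeness, here's a tiny sketch of the other simple form mentioned above, method injection, where the dependency is passed per call instead of being stored on the object (names are made up for illustration):

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// Report knows how to format itself, but the output destination is
// injected per call rather than stored on the struct.
type Report struct {
	Title string
	Rows  []string
}

// WriteTo receives its dependency (an io.Writer) as a method argument.
func (r Report) WriteTo(w io.Writer) error {
	_, err := fmt.Fprintf(w, "%s\n%s\n", r.Title, strings.Join(r.Rows, "\n"))
	return err
}

func main() {
	r := Report{Title: "Nightly errors", Rows: []string{"disk full", "timeout"}}
	// Same object, different destinations, no framework involved.
	_ = r.WriteTo(os.Stdout)
	var buf strings.Builder
	_ = r.WriteTo(&buf) // e.g. capturing output in a test
}
```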
The term Dependency Injection was coined by Martin Fowler with this article: https://martinfowler.com/articles/injection.html. See how it presents the examples in terms of wiring up components from a configuration, and how it concludes with stressing the importance of "the principle of separating service configuration from the use of services within an application". The article also presents constructor injection as only one of several forms of dependency injection.
That is how everyone understood dependency injection when it became popular 10-20 years ago: A way to customize behavior at the top application/deployment level by configuration, without having to pass arguments around throughout half the code base to the final object that uses them.
Apparently there has been a divergence of how the term is being understood.
[0] https://en.wikipedia.org/wiki/Strategy_pattern
[1] The fact that Car is abstract in the example is immaterial to the pattern, and a bit unfortunate in the Wikipedia article, from a didactic point of view.
DI in Java is almost completely disconnected from what the Strategy pattern is, so it doesn't make sense to use one to refer to the other there.
It's equivalent to partial application.
An uninstantiated class that follows the dependency injection pattern is equivalent to a family of functions with N+Mk arguments, where N is the number of constructor arguments and Mk is the number of parameters of method k.
Upon instantiation by passing constructor arguments, you've created a family of functions, each with a distinct set of Mk parameters and N arguments in common.
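A small illustrative Go sketch of that equivalence (hypothetical names): the constructor fixes the N shared arguments, and partial application does the same thing for the plain-function form.

```go
package main

import "fmt"

// "Class" form: the constructor fixes N = 2 shared arguments (host, timeout)...
type Client struct {
	host    string
	timeout int
}

func NewClient(host string, timeout int) Client { return Client{host, timeout} }

// ...and each method adds its own Mk arguments (here Mk = 1).
func (c Client) Get(path string) string {
	return fmt.Sprintf("GET %s%s (%ds)", c.host, path, c.timeout)
}

// Equivalent function form: all N+Mk arguments at once...
func get(host string, timeout int, path string) string {
	return fmt.Sprintf("GET %s%s (%ds)", host, path, timeout)
}

// ...and partial application recovers the "instantiated object".
func partialGet(host string, timeout int) func(string) string {
	return func(path string) string { return get(host, timeout, path) }
}

func main() {
	c := NewClient("example.com", 5)
	g := partialGet("example.com", 5)
	fmt.Println(c.Get("/a")) // GET example.com/a (5s)
	fmt.Println(g("/a"))     // same output, built by partial application
}
```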
That's the best way to think of it fundamentally. But the main implication is that at some point something has to know how to resolve those dependencies - i.e. they can't just be constructed and then injected from magic land. So global cradles/resolvers/containers/injectors/providers (depending on your language and framework) are also typically part and parcel of DI, and that can have some big implications on the structure of your code that some people don't like. Also, you can inject functions and methods, not just constructors.
This is all well and good, but you also need a bunch of code that handles resolving those dependencies, which oftentimes ends up being complex and hard to debug and will also cause runtime errors instead of compile time errors, which I find to be more or less unacceptable.
Edit: to elaborate on this, I’ve seen DI frameworks not be used in “enterprise” projects a grand total of zero times. I’ve done DI directly in personal projects and it was fine, but in most cases you don’t get to make that choice.
Just last week, when working on a Java project that’s been around for a decade or so, there were issues after migrating it from Spring to Spring Boot - when compiled through the IDE and with the configuration to allow lazy dependency resolution it would work (too many circular dependencies to change the code instead), but when built within a container by Maven that same exact code and configuration would no longer work and injection would fail.
I’m hoping it’s not one of those weird JDK platform bugs but rather an issue with how the codebase is compiled during the container image build, but the issue is mind boggling. More fun, if you take the .jar that’s built in the IDE and put it in the container, then everything works, otherwise it doesn’t. No compilation warnings, most of the startup is fine, but if you build it in the container, you get a DI runtime error about no lazy resolution being enabled even if you hardcode the setting to be on in Java code: https://docs.spring.io/spring-boot/api/kotlin/spring-boot-pr...
I’ve also seen similar issues before containers, where locally it would run on Jetty and use Tomcat on server environments, leading to everything compiling and working locally but throwing injection errors on the server.
What’s more, it’s not like you can (easily) put a breakpoint on whatever is trying to inject the dependencies - after years of Java and Spring I grow more and more convinced that anything that doesn’t generate code that you can inspect directly (e.g. how you can look at a generated MapStruct mapper implementation) is somewhat user hostile and will complicate things. At least modern Spring Boot is good in that more of the configuration is just code, because otherwise good luck debugging why some XML configuration is acting weird.
In other words, DI can make things more messy due to a bunch of technical factors around how it’s implemented (also good luck reading those stack traces), albeit even in the case of Java something like Dagger feels more sane https://dagger.dev/ despite never really catching on.
Of course, one could say that circular dependencies or configuration issues are project specific, but given enough time and projects you will almost inevitably get those sorts of headaches. So while the theory of DI is nice, you can’t just have the theory without practice.
Hidden dependencies are things like an untyped context variable or a global "service registry". Those are hidden because the only way to find out which dependencies a given module has is to carefully read its code and the code of all the functions it calls.
I'm talking more specifically about Aspect Oriented Programming though and DI containers in OOP, which seemed pretty clever in theory, but have a lot of issues in reality.
I take no issues with currying in functional programming.
There are other good uses of it, but it absolutely can get out of control, especially if implemented by someone who's just discovered it and wants to use it for everything.
But nobody seems to do this diligence. It’s just “we are in a rush. we need X. dependency does X. let’s use X.” and that’s it!
Wrong question. “Are you paid to audit this code?” And “if you fail to audit this code, whose problem is it?”
Have you ever worked anywhere that said "go ahead and slow down on delivering product features that drive business value so you can audit the code of your dependencies, that's fine, we'll wait"?
I haven't.
When was the last time producer of an app was held legally accountable for negligence, had to pay compensation and damages, etc?
AI is making this worse than ever though, I am constantly having to tell devs that their work is failing to meet requirements, because AI is just as bad as a junior dev when it comes to reaching for a dependency. It’s like we need training wheels for the prompts juniors are allowed to write.
I imagine that e.g. Youtube would be happy to agree with this. Not that it would turn them against AI generally.
[Cloudflare](https://developers.cloudflare.com/cache/troubleshooting/alwa...) tags the internet archive as operating from 207.241.224.0/20 and 208.70.24.0/21 so disabling the bot-prevention framework on connections from there should be enough.
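For example, a minimal Go sketch of that kind of allowlist check, using the two ranges quoted above; how it hooks into your bot-prevention layer is left open:

```go
package main

import (
	"fmt"
	"net"
)

// Ranges Cloudflare attributes to the Internet Archive, per the comment above.
var archiveRanges = []string{"207.241.224.0/20", "208.70.24.0/21"}

// isInternetArchive reports whether an IP falls inside those ranges.
func isInternetArchive(ipStr string) bool {
	ip := net.ParseIP(ipStr)
	if ip == nil {
		return false
	}
	for _, cidr := range archiveRanges {
		_, block, err := net.ParseCIDR(cidr)
		if err == nil && block.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isInternetArchive("207.241.229.10")) // true
	fmt.Println(isInternetArchive("8.8.8.8"))        // false
}
```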
New actors have the right to emerge.
There's no rule that you have to let anyone in who claims to be a web crawler.
The truth is that I sympathize with the people trying to use mobile connections to bypass such a cartel.
What Cloudflare is doing now is worse than the web crawlers themselves and the legality of blocking crawlers with a monopoly is dubious at best.
Kagi is welcome to scrape from their IP addresses. Other bots that behave are fine too (Huawei and various other Chinese bots don't and I've had to put an IP block on those).
On a separate note, I believe open web scraping has been a massive benefit to the internet on net, and almost entirely positive pre-2021. Web scraping & crawling enables search engines, services like Internet Archive, walled-garden-busting (like Invidious, yt-dlp, and Nitter), mashups (Spotube, IFTT, and Plaid would have been impossible to bootstrap without web scraping), and all kinds of interesting data science projects (e.g. scraping COVID-19 stats from local health departments to patch together a picture of viral spread for epidemiologists).
Or do you want a central authority that decides who can do new search engines?
Why are Anubis-type mitigations a half-measure?
It's a half-measure because:
1. You're slowing down scrapers, not blocking them. They will still scrape your site content in violation of robots.txt.
2. Scrapers with more compute than IP proxies will not be significantly bottlenecked by this.
3. This may lead to an arms race where AI companies respond by beefing up their scraping infrastructure, necessitating more difficult PoW challenges, and so on. The end result of this hypothetical would be a more inconvenient and inefficient internet for everyone, including human users.
To be clear: I think Anubis is a great tool for website operators, and one of the best self-hostable options available today. However, it's a workaround for the core problem that we can't reliably distinguish traffic from badly behaving AI scrapers from legitimate user traffic.
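For reference, here's a rough sketch of the proof-of-work idea these tools are built on (illustrative parameters, not Anubis's actual scheme): the client searches for a nonce whose hash clears a difficulty bar, and the server verifies it with a single hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts the leading zero bits of a SHA-256 digest.
func leadingZeroBits(sum [32]byte) int {
	n := 0
	for _, b := range sum {
		if b == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(b)
		break
	}
	return n
}

// solve finds a nonce such that SHA-256(challenge || nonce) has at least
// `difficulty` leading zero bits. The client pays this cost.
func solve(challenge string, difficulty int) uint64 {
	var nonce uint64
	for {
		buf := make([]byte, 8)
		binary.BigEndian.PutUint64(buf, nonce)
		sum := sha256.Sum256(append([]byte(challenge), buf...))
		if leadingZeroBits(sum) >= difficulty {
			return nonce
		}
		nonce++
	}
}

// verify is a single hash: the server pays almost nothing.
func verify(challenge string, nonce uint64, difficulty int) bool {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, nonce)
	sum := sha256.Sum256(append([]byte(challenge), buf...))
	return leadingZeroBits(sum) >= difficulty
}

func main() {
	const difficulty = 20 // roughly a million hashes on average
	nonce := solve("per-session-random-challenge", difficulty)
	fmt.Println("solved:", nonce, "valid:", verify("per-session-random-challenge", nonce, difficulty))
}
```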
What good is all the app vetting and sandbox protection in iOS (dunno about Android) if it doesn't really protect me from those crappy apps...
If you treat platforms like they are all-powerful, then that's what they are likely to become...
Network access settings should really be more granular for apps that have a legitimate need.
App store disclosure labels should also add network usage disclosure.
Maybe it's less convenient and more expensive and onerous. Do good things require hard work? Or did we expect everyone to ignore incentives forever while the trillion-dollar hyperscalers fought for an open and noble internet and then wrapped it in affordable consumer products to our delight?
It reminds me of the post here a few weeks ago about how Netflix used to be good and "maybe I want a faster horse" - we want things to be built for us, easily, cheaply, conveniently, by companies, and we want those companies not to succumb to enshittification - but somehow when the companies just follow the game theory and turn everything into a TikToky neural-networks-maximizing-engagement-infinite-scroll-experience, it's their fault, and not ours for going with the easy path while hoping the corporations would not take the easy path.
We are working on an open‑source fraud prevention platform [1], and detecting fake users coming from residential proxies is one of its use cases.
Trying to understand your product, where is it intended to sit in a network? Is it a standalone tool that you use to identify these IPs and feed into something else for blockage or is it intended to be integrated into your existing site or is it supposed to proxy all your web traffic? The reason I ask is it has fairly heavyweight install requirements and Apache and PHP are kind of old school at this point, especially for new projects and companies. It's not what they would commonly be using for their site.
Thank you for your question. tirreno is a standalone app that needs to receive API events from your main web application. It can work perfectly well with 512MB of RAM for Postgres, or even less; however, in most cases we're talking about millions of events, and that is what demands resources.
It's much easier to write a stable application without dependencies based on mature technologies. tirreno is fairly 'boring software'.
Finally, as mentioned earlier, there is no silver bullet that works for every type of online fraudster. For example, in some applications, a TOR connection might be considered a red flag. However, if we are talking about hn visitors, many of them use TOR on a daily basis.
I’ve found Tor browsing to be okay, but logins via Tor mostly to be a great alternative to snowshoeing for credential stuffing.
[1] https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
Regarding the first post, it's rare to see both datacenter network IPs and mobile proxy IP addresses used simultaneously. This suggests the involvement of more than one botnet. The main idea is to avoid using IP addresses as the sole risk factor. Instead, they should be considered as just one part of the broader picture of user behavior.
Both are pretty easy to mitigate with a geoip database and some smart routing. One "residential proxy" vendor even has session tokens so your source IP doesn't randomly jump between each request.
Why jump to that conclusion?
If a scraper clearly advertises itself, follows robots.txt, and has reasonable backoff, it's not abusive. You can easily block such a scraper, but then you're encouraging stealth scrapers because they're still getting your data.
I'd block the scrapers that try to hide and waste compute, but deliberately allow those that don't. And maybe provide a sitemap and API (which besides being easier to scrape, can be faster to handle).
Not sure how this could work for browsers, but the other 99% of apps I have on my phone should work fine with just a single permitted domain.
It should also do something similar for apps making chatty background requests to domains not specified at app review time. The legitimate use cases for that behaviour are few.
The system may have some such functions built in, and asking permission might be a reasonable thing to include by default.
I've used all of them, and it's a deluge: it is too much information to reasonably react to.
Broadly, your choice is either deny or accept, but there's no sane way to reliably know what you should do.
This is not and cannot be an individual problem: the easy part is building high fidelity access control, the hard part is making useful policy for it.
> it is too much information to reasonably react to.
Even if it asks, that does not necessarily mean it has to ask every time, if the user lets it keep the answer (either for the current session or until the user deliberately deletes this data). Also, if it asks too much because it tries to access too many remote servers, then it might be spyware, malware, etc. anyway, and is worth investigating in case that is what it is.
> the hard part is making useful policy for it.
What the default settings should be is a significant issue. However, changing the policies in individual cases for different uses is also something that a user might do, since the default settings will not always be suitable.
If whoever manages the package repository, app store, etc. is able to check for malware, then this is a good thing to do (although it should not prohibit the user from installing their own software or modifying the existing software), but security on the computer is also helpful, and neither of these is a substitute for the other; they work together.
I am waiting for Apple to enable /etc/hosts or something similar on iOS devices.
And, AFAIK, you already need special permission for anything other than HTTPS to specific domains on the public Internet. That's why apps ping you about permissions to access "local devices".
They should need special permission for that too.
That's how it works with other permissions most applications should not have access to, like accessing user locations. (And private entitlements third party applications can't have are one way Apple makes sure nobody can compete with their apps, but that's a separate issue.)
You mean, good bye using my bandwidth without my permission? That's good. And if I install a bittorrent client on my phone, I'll know to give it permission.
> such as companion apps for watches and other peripherals
That's just apple abusing their market position in phones to push their watch. What does it have to do with p2p?
What are you talking about?
> What does it have to do with p2p?
It’s an example of how, when you design sandboxes/firewalls, it’s very easy to assume all apps are one big homogeneous blob doing REST calls and everything else is malicious or suspicious. You often need strange permissions to do interesting things. Apple gives themselves these perms all the time.
> What are you talking about?
That’s the main use case for p2p in an application, isn’t it? Reducing the vendor’s bandwidth bill…
The equivalent would be to say that running local workloads or compute is there to reduce the vendor’s bill. It’s a very centralized view of the internet.
There are many reasons to do p2p, such as improving bandwidth and latency, circumventing censorship, improving resilience, and more. WebRTC is a good example of p2p used by small and large companies alike. None of this is any more ”without permission” than a standard app phoning home and tracking your fingerprint and IP.
Great respect for the user's resources.
I just brought it up as a technology that at the very least is both legitimate and common.
Except the platform providers hold the trump card. Fuck around, if they figure it out you'll be finding out.
TINFOIL: I've sometimes wondered if Azure or AWS used bots to push site traffic hits to generate money... they know you are hosted with them.. They have your info.. Send out bots to drive micro-accumulation. Slow boil..
GCE is rare in my experience. Most bots I see are on AWS. The DDOS-adjacent hyper aggressive bots that try random URLs and scan for exploits tend to be on Azure or use VPNs.
AWS is bad when you report malicious traffic. Azure has been completely unresponsive and didn't react, even for C&C servers.
People are jumping to conclusions a bit fast over here. Yes, technically it's possible, but this kind of behavior would be relatively easy to spot because the app would have to make direct connections to the website it wants to scrape.
Your calculator app for instance connecting to CNN.com ...
iOS has an App Privacy Report where one can check what connections are made by an app, how often, the last one, etc.
Android by Google doesn't have such a useful feature, of course, but you can run a third-party firewall like PCAPdroid, which I highly recommend.
macOS (Little Snitch).
Windows (Fort Firewall).
Not everyone runs these apps, obviously, only the most nerdy like myself, but we're also the kind of people who would report an app using our device to create what is, in fact, a zombie or bot network.
I'm not saying it's necessarily false but imo it remains a theory until proven otherwise.
How often is the average calculator app user checking their Privacy Report? My guess: not often!
That happens from time to time; the last one was not more than two weeks ago, when it was shown that many apps were able to read the list of all other apps installed on an Android device and that Google refused to fix that.
Do you really believe that an app used to make your device part of a bot network wouldn't be posted over here ?
^ edit: my mistake, the server logs I mentioned were from the authors prior blog post on this topic, linked to at the top of TFA: https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
Privacy reports do not include that information. They include broad areas of information the app claims to gather. There is zero connection between those claimed areas and what the app actually does unless app review notices something that doesn't match up. But none of that information is updated dynamically, and it has never actually included the domains the app connects to. You may be confusing it with the old domain declarations for less secure HTTP connections. Once the connections met the system standards you no longer needed to declare it.
Go to a data conference like Neudata and you will see. You can have scraped data from user devices, real-time locations, credit card, Google analytics, etc.
AKA "why do Cloudflare and Google make me fill out these CAPTCHAs all day"
I don't know why Play Protect/MS Defender/whatever Apple has for antivirus don't classify apps that embed such malware as such. It's ridiculous that this is allowed to go on when detection is so easy. I don't know a more obvious example of a trojan than an SDK library making a user's device part of a botnet.
This is even worse with CG-NAT if you don't have IPv6 to solve the CG-NAT problem.
I don't think the data they collect is used to train anything these days. Cloudflare is using AI generated images for CAPTCHAs and Google's actual CAPTCHAs are easier for bots than humans at this point (it's the passive monitoring that makes it still work a little bit).
Do nothing, win.
They are the primary beneficiaries buying this data, since they are the largest AI players.
Google could do this. I'm sure Apple could as well. Third parties could for a small set of apps
If instead we had a content addressed model, we could drop the uniqueness constraint. Then these AI scrapers could be gossiping the data to one another (and incidentally serving it to the rest of us) without placing any burden on the original source.
Having other parties interested in your data should make your life easier (because other parties will host it for you), not harder (because now you need to work extra hard to host it for them).
See https://arxiv.org/abs/1905.11880 [Hydras and IPFS: A Decentralised Playground for Malware]
That's not to say that it is a ready replacement for the web as we know it. If you have hash-linked everything then you wind up with problems trying to link things together, for instance. Once two pages exist, you can't after-the-fact create a link between them because if you update them to contain that link then their hashes change so now you have to propagate the new hash to people. This makes it difficult to do things like have a comments section at the bottom of a blog post. So you've got to handle metadata like that in some kind of extra layer--a layer which isn't hash linked and which might be susceptible to all the same problems that our current web is--and then the browser can build the page from immutable pieces, but the assembly itself ends up being dynamic (and likely sensitive to the users preference, e.g. dark mode as a browser thing not a page thing).
But I still think you could move maybe 95% of the data into an immutable hash-linked world (think of these as nodes in a graph), the remaining 5% just being tuples of hashes and public keys indicating which pages are trusted by which users, which ought to be linked to which others, which are known to be the inputs and output of various functions, and you know... structure stuff (these are our graph's edges).
The edges, being smaller, might be subject to different constraints than the web as we know it. I wouldn't propose that we go all the way to a blockchain where every device caches every edge, but it might be feasible for my devices to store all of the edges for the 5% of the web I care about, and your devices to store the edges for the 5% that you care about... the nodes only being summoned when we actually want to view them. The edges can be updated when our devices contact other devices (based on trust, like you know that device's owner personally) and ask "hey, what's new?"
I've sort of been freestyling on this idea in isolation, probably there's already some projects that scratch this itch. A while back I made a note to check out https://ceramic.network/ in this capacity, but I haven't gotten down to trying it out yet.
AI scrapers aren't trying to find things they already know exist, they're trying to discover what they didn't know existed.
"Content-addressable" has a broader meaning than what you seem to be thinking of -- roughly speaking, it applies if any function of the data is used as the "address". E.g., git commits are content-addressable by their SHA1 hashes.
It's a legit limitation on what content addressing can do, but it's one we can overcome by just not having everything be content addressed. The web we have now is like if you did a `git pull` every time you opened a file.
The web I'm proposing is like how we actually use git--periodically pulling new hashes as a separate action, but spending most of our time browsing content that we already have hashes for.
But there's a lot of middle ground to explore here. Loading a modern web page involves making dozens of requests to a variety of different servers, evaluating some JavaScript, and then doing it again a few times, potentially moving several MB of data. The part people want, the thing you don't already know exists, is hidden behind that rather heavy door. It doesn't have to be that way.
If you already know about one thing (by its cryptographic hash, say) and you want to find out which other hashes it's now associated with--associations that might not have existed yesterday--that's much easier than we've made it. It can be done:
- by moving kB, not MB; we're just talking about a tuple of hashes here, maybe a public key and a signature
- without placing additional burden on whoever authored the first thing, they don't even have to be the ones who published the pair of hashes that your scraper is interested in
Once you have the second hash, you can then reenter immutable-space to get whatever it references. I'm not sure if there's already a protocol for such things, but if not then we can surely make one that's more efficient and durable than what we're doing now.
It is entirely possible to serve a fully cached response that says "you already have this". The problem is...people don't implement this well.
If content were handled independently of server names, anyone who cares to distribute metadata for content they care about can do so. One doesn't need write access, or even to be on the same network partition. You could just publish a link between content A and content B because you know their hashes. Assembling all of this can happen in the browser, subject to the user's configs re: who they trust.
I know, as far as possible it's a good idea to have content-immutable URLs. But at some point, I need to make www.myexamplebusiness.com show new content. How would that work?
But as for updating, you just format your URLs like so: {my-public-key}/foo/bar
And then you alter the protocol so that the {my-public-key} part resolves to the merkle-root of whatever you most recently published. So people who are interested in your latest content end up with a whole new set of hashes whenever you make an update. In this way, it's not 100% immutable, but the mutable payload stays small (it's just a bunch of hashes) and since it can be verified (presumably there's a signature somewhere) it can be gossiped around and remain available even if your device is not.
You can soft-delete something just by updating whatever pointed to it to not point to it anymore. Eventually most nodes will forget it. But you can't really prevent a node from hanging on to an old copy if they want to. But then again, could you ever do that? Deleting something on the web has always been a bit of a fiction.
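A rough Go sketch of the two halves described above, assuming content addressed by SHA-256 plus a small Ed25519-signed pointer from a public key to the latest root (all names are illustrative):

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// Immutable side: content is addressed purely by its hash.
func address(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// Mutable side: a tiny signed record mapping a public key to the latest
// root hash. This is the only thing that changes on an update, and anyone
// can verify and gossip it without contacting the author.
type Pointer struct {
	Root      string
	Timestamp int64
	Sig       []byte
}

func publish(priv ed25519.PrivateKey, root string) Pointer {
	ts := time.Now().Unix()
	msg := fmt.Sprintf("%s|%d", root, ts)
	return Pointer{Root: root, Timestamp: ts, Sig: ed25519.Sign(priv, []byte(msg))}
}

func verify(pub ed25519.PublicKey, p Pointer) bool {
	msg := fmt.Sprintf("%s|%d", p.Root, p.Timestamp)
	return ed25519.Verify(pub, []byte(msg), p.Sig)
}

func main() {
	pub, priv, _ := ed25519.GenerateKey(rand.Reader)
	page := []byte("<html>v2 of my page</html>")
	ptr := publish(priv, address(page))
	fmt.Println("root:", ptr.Root[:16], "valid:", verify(pub, ptr))
}
```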
True in the absolute sense, but the effect size is much worse under the kind of content-addressable model you're proposing. Currently, if I download something from you and you later delete that thing, I can still keep my downloaded copy; under your model, if anyone ever downloads that thing from you and you later delete that thing, with high probability I can still acquire it at any later point.
As you say, this is by design, and there are cases where this design makes sense. I think it mostly doesn't for what we currently use the web for.
It's the same functionality you get with permalinks and sites like archive.org--forgotten unless explicitly remembered by anybody, dynamic unless explicitly a permalink. It's just built into the protocol rather than a feature to be inconsistently implemented over and over by many separate parties.
Unless complicit, tech leaders (Apple Google Microsoft) have a duty to respond swiftly and decisively. This has been going on far too long.
That's not good.
It is still a pretty good lay-of-the-land.
https://www.trendmicro.com/vinfo/us/security/news/vulnerabil...
But it just doesn't scale to internet size so I'm fucked if I know how we should fix it. We all have that cousin or dude in our highschool class who would do anything for a bit of money and introducing his 'friend' Paul who is in fact a bot whose owner paid for the lie. And not like enough money to make it a moral dilemma, just drinking money or enough for a new video game. So once you get past about 10,000 people you're pretty much back where we are right now.
Binary "X trusts Y" statements, plus transitive closure, can lead to long trust paths that we probably shouldn't actually trust the endpoints of. Could we not instead assign probabilities like "X trusts Y 95%", multiply probabilities along paths starting from our own identity, and take the max at each vertex? We could then decide whether to finally trust some Z if its percentage is more than some threshold T%. (Other ways of combining in-edges may be more suitable than max(); it's just a simple and conservative choice.)
Perhaps a variant of backprop could be used to automatically update either (a) all or (b) just our own weights, given new information ("V has been discovered to be fraudulent").
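A small Go sketch of the max-product propagation part (illustrative graph, the backprop idea left out): since every direct-trust value is at most 1, a Dijkstra-style search that always expands the currently most-trusted node finds the best path product for every reachable identity.

```go
package main

import "fmt"

// trust[x][y] = how much x directly trusts y, in [0, 1].
type Graph map[string]map[string]float64

// propagate returns, for each node, the maximum product of direct-trust
// values along any path from `self`.
func propagate(g Graph, self string) map[string]float64 {
	best := map[string]float64{self: 1.0}
	done := map[string]bool{}
	for {
		// pick the not-yet-finalized node with the highest score
		cur, curScore := "", -1.0
		for n, s := range best {
			if !done[n] && s > curScore {
				cur, curScore = n, s
			}
		}
		if cur == "" {
			return best
		}
		done[cur] = true
		for next, t := range g[cur] {
			if s := curScore * t; s > best[next] {
				best[next] = s
			}
		}
	}
}

func main() {
	g := Graph{
		"me":    {"alice": 0.95, "bob": 0.6},
		"alice": {"carol": 0.9},
		"bob":   {"carol": 0.99},
	}
	scores := propagate(g, "me")
	const threshold = 0.8
	// carol scores 0.95*0.9 = 0.855 via alice, which beats 0.6*0.99 via bob.
	fmt.Printf("carol: %.3f trusted: %v\n", scores["carol"], scores["carol"] >= threshold)
}
```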
How about restricting them to everyone-knows-everyone sized groups, of like a couple hundred people?
One can be a member of multiple groups so you're not actually limited. But the groups will be small enough to self regulate.
If you want to chat with a Dunbar number of people, get yourself a private Discord or Slack channel.
I'm hypothesising that any such large scale structure will be perverted by commercial interests, while having multiple Dunbar sized such structures will have a chance to be useful.
At least, that's the way I've always imagined it working. Maybe I need to read up.
So what are potential solutions? We're somehow still stuck with CAPTCHAs, a 25-year-old concept that wastes millions of human hours and billions in infra costs [0].
How can we enable beneficial automation while protecting against abusive AI crawlers?
Did you mean "against"?
I say we ask Google Analytics to count an AI crawler as a real view. Let’s see who’s most popular.
Not to mention that it's unknown if these are actually from AI companies, or from people pretending to be AI companies. You can set anything as your user agent.
It's more appropriate to mention the specific issue one has with the crawlers, like "they request things too quickly" or "they're overloading my server". Then from there, it is easier to come to a solution than just "I hate AI". For example, one would realize that things like Anubis have existed forever; they are just called DDoS protection, specifically those using proof-of-work schemes (e.g. https://github.com/RuiSiang/PoW-Shield).
This also shifts the discussion away from something that adds to the discrimination against scraping in general, and more towards what is actually the issue: overloading servers, or in other words, DDoS.
It's just like how not all DDoSes are actually hackers or bots. Sometimes a server just can't take the traffic of a large site flooding in. But the result is the same until something is investigated.
Running SHA hash calculations for a second or so once every week is not bad for users, but with scrapers constantly starting new sessions, they end up spending most of their time running useless JavaScript, slowing them down significantly.
The most effective alternative to proof of work calculations seems to be remote attestation. The downside is that you're getting captchas if you're one of the 0.1% who disable secure boot and run Linux, but the vast majority of web users will live a captcha free life. This same mechanism could in theory also be used to authenticate welcome scrapers rather than relying on pure IP whitelists.
It won't fully solve the problem, but with the problem relatively identified, you must then ask why people are engaging in this behavior. Answer: money, for the most part. Therefore, follow the money and identify the financial incentives driving this behavior. This leads you pretty quickly to a solution most people would reject out-of-hand: turn off the financial incentive that is driving the enshittification of the web. Which is to say, kill the ad-economy.
Or at least better regulate it while also levying punitive damages that are significant enough to both dissuade bad actors and encourage entities to view data breaches (or the potential therein) and "leakage"[0] as something that should actually be effectively secured against. After all, there are some upsides to the ad-economy that, without it, would present some hard challenges (e.g., how many people are willing to pay for search? what happens to the vibrant sphere of creators of all stripes that are incentivized by the ad-economy? etc.).
Personally, I can't imagine this would actually happen. Pushback from monied interests aside, most people have given up on the idea of data privacy or personal ownership of their data, if they ever even cared in the first place. So, in the absence of willingness to do something about the incentive for this maligned behavior, we're left with few good options.
0: https://news.ycombinator.com/item?id=43716704 (see comments on all the various ways people's data is being leaked/leached/tracked/etc)
The broad idea is to use zero knowledge proofs with certification. It sort of flips the public key certification system and adds some privacy.
To get into place, the powers in charge need to sway.
As for letting well-behaved crawlers in, I've had an idea for something like DKIM for crawlers. It should be possible to set up a fairly cheap cryptographic solution that gives crawlers a persistent identity that can't be forged.
Basically, put a header containing first a string including today's date, the crawler's IP, and a domain name, then a cryptographic signature of the string. The domain has a TXT record with a public key for verifying the identity. It's cheap because you really only need to verify the string once on the server side, and the crawler only needs to regenerate it once per day.
With that in place, crawlers can crawl with their reputation at stake. The big problem with these rogue scrapers are that they're basically impossible to identify or block, which means they don't have any incentives to behave well.
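A rough sketch of the signing side of such a scheme, assuming an Ed25519 key whose public half would sit in the domain's TXT record; the header format here is made up for illustration:

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"strings"
	"time"
)

// crawlerHeader builds the value described above: date, crawler IP and
// domain, plus a signature over that claim. A verifier would fetch the
// domain's TXT record, decode the public key, and check the signature
// once per day per IP, caching the result.
func crawlerHeader(priv ed25519.PrivateKey, ip, domain string) string {
	claim := fmt.Sprintf("%s|%s|%s", time.Now().UTC().Format("2006-01-02"), ip, domain)
	sig := ed25519.Sign(priv, []byte(claim))
	return claim + "|" + base64.StdEncoding.EncodeToString(sig)
}

func verifyHeader(pub ed25519.PublicKey, header string) bool {
	i := strings.LastIndex(header, "|")
	if i < 0 {
		return false
	}
	sig, err := base64.StdEncoding.DecodeString(header[i+1:])
	if err != nil {
		return false
	}
	// Everything before the last "|" is the signed claim: date|ip|domain.
	return ed25519.Verify(pub, []byte(header[:i]), sig)
}

func main() {
	pub, priv, _ := ed25519.GenerateKey(rand.Reader)
	h := crawlerHeader(priv, "203.0.113.7", "crawler.example.com")
	fmt.Println(h)
	fmt.Println("verified:", verifyHeader(pub, h))
}
```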
It wouldn't work to prevent the type of behavior shown in the title story.
That may well be true. But how many of those people are specifically against AI companies scraping the web? That’s not really an argument—it’s an assumption based on personal perception.
> Ask the average person if they want more or less of their lives recorded and stored.
What exactly is the "average person"? Also, I’ll admit my earlier claim was a bit exaggerated. But let’s be clear: this isn’t about recording personal data—it’s about collecting and structuring knowledge.
And beyond that: companies have been scraping the web for years. They still are. And they’re gathering far more personal data for online marketing, tracking, profiling—whatever the reason—and the so-called "average person" hasn’t raised much of a finger. People remain glued to platforms, willingly sharing their personal lives. And what do they get in return? Doomscrolling and five-second video clips.
Who said that?
There are basically two extremes:
1. We want access to all of human knowledge, now and forever, in order to monetise it and make more money for us, and us alone.
and
2. We don't want our freely available knowledge sold back to us, with no credits to the original authors.
2. You’re not paying just to have your own knowledge echoed back at you. You’re paying so that someone (or something) can read what you provide and, ideally, return improved knowledge or fresh insights. As I said above, you’re paying for the technology and its capabilities—not the knowledge itself. That’s how I see it.
You appear to be under the impression that there is only one hypocritical group.
The companies selling us computers that supposedly know everything should pay for their database, or they should give away the knowledge they gained for free. Right now, the scraping and copying is free and the knowledge is behind a subscription to access a proprietary model that forms the basis of their business.
Humanity doesn't benefit, the snake oil salesmen do.
I do agree with you on the point that we need to find better ways to compensate the people creating content—especially considering that parts of this "AI service," as we might call it, are subscription-based.
But in the long run, I’m quite sure that if everyone shared this opinion, it wouldn't move us forward technologically.
Also, a couple of other points:
Google and others have been scraping the internet for years, and no one complained then.
You're not paying the AI company for the knowledge itself—you're paying for the technology behind it, for the ability to access and use it effectively.
https://krebsonsecurity.com/?s=infatica
https://krebsonsecurity.com/tag/residential-proxies/
https://bright-sdk.com/ <- way bigger than infatica
This is yet another reason why we need to be wary of popular apps, add-ons, extensions, and so forth changing hands, by legitimate sale or more nefarious methods. Initially innocent utilities can be quickly coopted into being parts of this sort of scheme.
The first involved requests from 300,000 unique IPs in a span of a few hours. I analyzed them and found that ~250,000 were from Brazil. I'm used to using ASNs to block network ranges sending this kind of traffic, but in this case they were spread thinly over 6,000+ ASNs! I ended up blocking all of Brazil (sorry).
A few days later this same web server was on fire again. I performed the same analysis on IPs and found a similar number of unique addresses, but spread across Turkey, Russia, Argentina, Algeria and many more countries. What is going on?! Eventually I think I found a pattern to identify the requests, in that they were using ancient Chrome user agents. Chrome 40, 50, 60 and up to 90, all released 5 to 15 years ago. Then, just before I could implement a block based on these user agents, the traffic stopped.
In both cases the traffic from datacenter networks was limited because I already rate limit a few dozen of the larger ones.
Sysadmin life...
It's a reverse proxy that presents a proof-of-work challenge to every new visitor. It shifts the initial cost of accessing your server's resources back onto the client. Assuming your uplink can handle 300k clients requesting a single 70kB web page, it should solve most of your problems.
For science, can you estimate your peak QPS?
Crazy how I remember the HN post where Anubis's blog post was first made. Though, I always thought it was a bit funny with the anime, and that it was made out of frustration with (I think AWS?) AI scrapers who wouldn't follow general rules and were constantly hammering his git server with requests, actually taking it down, I guess?? I didn't expect it to blow up to ... the UN.
It was frustration at AWS' Alexa team and their abuse of the commons. Amusingly if they had replied to my email before I wrote my shitpost of an implementation this all could have turned out vastly differently.
Also didn't expect you to respond to my comment xD
I went through the slow realization, while reading this comment, that you are the creator of Anubis, and I had such a smile when I realized that you had replied to me.
Also, this project is really nice, but I actually want to ask: I haven't read the docs of Anubis, but could it be that the proof of work isn't wasted / could be used for something? (I know I might get downvoted because I am going to mention cryptocurrency, but Nano currency has a proof of work required for each transaction, so if Anubis actually did the proof of work to Nano's standards, then theoretically that proof of work could at least be somewhat useful.)
Looking forward to your comment!
The only way I see anything like that incorporated is a folding@home kind of thing that could help humanity as a whole.
Of course, if someone makes it work like you suggested, and it catches on, I will personally haunt your dreams forever. Don't give them any ideas.
It's a pretty effective attack because you get large numbers of individual browsers to contribute. Hosters don't care, so unless the site owners are technical enough, they can stay online quite a bit.
If they work with Referrer Policy, they should be able to mask themselves fairly well - the ones I saw back then did not.
Digging it up: https://www.washingtonpost.com/news/the-switch/wp/2015/04/10...
Very similar indeed. The attacks I witnessed were easy to block once you identified the patterns (the referrer was visible and they used predictable ?_=... query parameters to try and bypass caches), but very effective otherwise.
I suppose in the event of a hot war, the Internet will be cut quickly to defend against things like the "Great Cannon".
It's not a crime if we do it with an app
https://pluralistic.net/2025/01/25/potatotrac/#carbo-loading
If you are being bombarded by suspicious IP addresses, please consider using our free service and blocking IP addresses by ASN or Country. I think ASN is a common parameter for malicious IP addresses. If you do not have time to explore our services/tools (it is mostly just our CLI: https://github.com/ipinfo/cli), simply paste the IP addresses (or logs) in plain text, send it to me and I will let you know the ASNs and corresponding ranges to block.
In cybersecurity, decisions must be guided by objective data, not assumptions or biases. When you’re facing abuse, you analyze the IPs involved and enrich them with context — ASN, country, city, whether it’s VPN, hosting, residential, etc. That gives you the information you need to make calculated decisions: Should you block a subnet? Rate-limit it? CAPTCHA-challenge it?
Here’s a small snapshot from my own SSH honeypot:
Summary of 1,413 attempts
- Hosting IPs: 981 (69%)
- VPNs: 35
- Top ASNs:
- AS204428 (SS-Net): 152
- AS136052 (PT Cloud Hosting Indonesia): 83
- AS14061 (DigitalOcean): 76
- Top Countries:
- Romania: 238 (16.8%)
- United States: 150 (10.6%)
- China: 134 (9.5%)
- Indonesia: 115 (8.1%)
One single /24 from Romania accounts for over 10% of the attacks. That’s not about nationality or ethnicity — it's about IP space abuse from a specific network. If a network or country consistently shows high levels of hostile traffic and your risk tolerance justifies it, blocking or throttling it may be entirely reasonable. Security teams don’t block based on "where people come from" — they block based on where the attacks are coming from.
We even offer tools to help people explore and understand these patterns better. But if someone doesn’t have the time or resources to do that, I'm more than happy to assist by analyzing logs and suggesting reasonable mitigations.
I hope nobody does cybersecurity in 2025 by analysing and enriching IP addresses. Not on a market where a single residential proxy provider (which you fail to identify) offers 150M+ exit nodes. Even JA3 fingerprinting could be more useful than looking at IP addresses. I bet you the Romanian IPs were not operated by Romanians, yet you're banning all Romanians?
Cybersecurity is a probabilistic game. You build a threat model based on your business, audience, and tolerance for risk. Blocking combinations of metadata — such as ASN, country, usage type, and VPN/proxy status — is one way to make informed short-term mitigations while preserving long-term accessibility. For example:
If an ASN is a niche hosting provider in Indonesia, ask: “Do I expect real users from here?”
If a /24 from a single provider accounts for 10% of your attacks, ask: “Do I throttle it or add a CAPTCHA?”
The point isn’t to permanently ban regions or people. It’s to reduce noise and protect services while staying responsive to legitimate usage patterns.
As for IP enrichment — yes, it's still extremely relevant in 2025. Just like JA3, TLS fingerprinting, or behavioral patterns — it's one more layer of insight. But unlike opaque “fraud scores” or black-box models, our approach is fully transparent: we give you raw data, and you build your own model.
We intentionally don’t offer fraud scoring or IP quality scores. Why? Because we believe it reduces agency and transparency. It also risks penalizing privacy-conscious users just for using VPNs. Instead, we let you decide what “risky” means in your own context.
We’re deeply committed to accuracy and evidence-based data. Most IP geolocation providers historically relied on third-party geofeeds or manual submissions — essentially repackaging what networks told them. We took a different route: building a globally distributed network of nearly 1,000 probe servers to generate independent, verifiable measurements for latency-based geolocation. That’s a level of infrastructure investment most providers haven’t attempted, but we believe it's necessary for reliability and precision.
Regarding residential proxies: we’ve built our own residential proxy detection system (https://ipinfo.io/products/residential-proxy) from scratch, and it’s maturing fast. One provider may claim 150M+ exit nodes, but across a 90-day rolling window, we’ve already observed 40,631,473 unique residential proxy IPs — and counting. The space is noisy, but we’re investing heavily in research-first approaches to bring clarity to it.
IP addresses aren’t perfect but nothing is! But with the right context, they’re still one of the most powerful tools available for defending services at the network layer. We provide the context and you build the solution.
Anything incorporating anything like this is malware.
In most cases they are used for conducting real financial crimes, but the police investigators are also aware that there is a very low chance that sophisticated fraud is committed directly from a residential IP address.