Scaling that data model beyond projects the size of the Linux kernel was not critical for the original implementation. I do wonder if there are fundamental limits to scaling the model for use cases beyond “source code management for modest-sized, long-lived projects”.
Consider vcpkg. It’s entirely reasonable to download a tree named by its hash to represent a locked package. Git knows how to store exactly this, but git does not know how to transfer it efficiently.
Naïvely, I’d expect shallow clones to be this, so I was quite surprised by a mention of GitHub asking people not to use them. Perhaps Git tries too hard to make a good packfile?..
Meanwhile, what Nixpkgs does (and why “release tarballs” were mentioned as a potential culprit in the discussion linked from TFA) is request a gzipped tarball of a particular commit’s files from a GitHub-specific endpoint over HTTP rather than use the Git protocol. So that’s already more or less what you want, except even the tarball is 46 MB at this point :( Either way, I don’t think the current problems with Nixpkgs actually support TFA’s thesis.
Turns out Go module will not accept package hosted on my Forgejo instance because it asks for certificate. There are ways to make go get use ssh but even with that approach the repository needs to be accessible over https. In the end, I cloned the repository and used it in my project using replace directive. It's really annoying.
No, that's false. You don't need anything to be accessible over HTTP.
But even if it did, and you had to use mTLS, there's a whole bunch of ways to solve this. How do you solve this for any other software that doesn't present client certs? You use a local proxy.
So the phrase the article says "Package managers keep falling for this. And it keeps not working out" I feel that's untrue.
The most issue I have with this really is "flakes" integration where the whole recipe folder is copied into the store (which doesn't happen with non-flakes commands), but that's a tooling problem not an intrinsic problem of using git
Julia does the same thing, and from the Rust numbers on the article, Julia has about 1/7th the number of packages that Rust does[1] (95k/13k = 7.3).
It works fine, Julia has some heuristics to not re-download it too often.
But more importantly, there's a simple path to improve. The top Registry.toml [1] has a path to each package, and once donwloading everything proves unsustainable you can just download that one file and use it to download the rest as needed. I don't think this is a difficult problem.
[1] https://github.com/JuliaRegistries/General/blob/master/Regis...
... Should it be concerning that someone was apparently able to engineer an ID like that?
https://en.wikipedia.org/wiki/Universally_unique_identifier
> 00000000-1111-2222-3333-444444444444
This would technically be version 2, which would be built from the date-time and MAC address, and DCE security version.
But overall, if you allow any yahoo to pick a UUID, its not really a UUID, its just some random string that looks like one.
Right now I don't see the problem because the only criterion for IDs is that they are unique.
Apparently it is the former, and most developers independently generate random IDs because it's easy and is extremely unlikely to result in collisions. But it seems the dev at the top of the list had a sense of vanity instead.
Software engineers always make the excuse that what they're making now is unimportant, so who cares? But then everything gets built on top of that unimportant thing, and one day the world crashes down. Worse, "fixing the problem" becomes near impossible, because now everything depends on it.
But really the reason not to do it, is there's no need to. There are plenty of other solutions than using Git that work as well or better without all the pitfalls. The lazy engineer picks bad solutions not because it's necessarily easier than the alternatives, but because it's the path of least resistance for themselves.
Not only is this not better, it's often actively worse. But this is excused by the same culture that gave us "move fast and break things". All you have to do is use any modern software to see how that worked out. Slow bug-riddled garbage that we're all now addicted to.
> The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies. Cloning entire repositories to get a single file.
I have also had inconsistent performance with go get. Never enough to look closely at it. I wonder if I was running into the same issue?
Python used to have this problem as well (technically still does, but a large majority of things are available as a wheel and PyPI generally publishes a separate .metadata file for those wheels), but at least it was only a question of downloading and unpacking an archive file, not cloning an entire repo. Sheesh.
Why would Go need to do that, though? Isn't the go.mod file in a specific place relative to the package root in the repo?
My favorite hill to die on (externality) is user time. Most software houses spend so much time focusing on how expensive engineering time is that they neglect user time. Software houses optimize for feature delivery and not user interaction time. Yet if I spent one hour making my app one second faster for my million users, I can save 277 user hour per year. But since user hours are an externality, such optimization never gets done.
Externalities lead to users downloading extra gigabytes of data (wasted time) and waiting for software, all of which is waste that the developer isn't responsible for and doesn't care about.
Commons would be if it's owned by nobody and everyone benefits from its existence.
The "tragedy", if you absolutely need to find one, is only for unrestricted, free-for-all commons, which is obviously a bad idea.
Nonetheless, the concept is still alive, and anthropic global warming is here to remind you about this.
But that doesn't mean the tragedy of the commons can't happen in other scenarios. If we define commons a bit more generously it does happen very frequently on the internet. It's also not difficult to find cases of it happening in larger cities, or in environments where cutthroat behavior has been normalized
That works while the size of the community is ~100-200 people, when everyone knows everyone else personally. It breaks down rapidly after that. We compensate for that with hierarchies of governance, which give rise to written laws and bureaucracy.
New tribes break off old tribes, form alliances, which form larger alliances, and eventually you end up with countries and counties and vovoidships and cities and districts and villages, in hierarchies that gain a level per ~100x population increase.
This is sociopolitical history of the world in a nutshell.
Commons can fail, but the whole point of Hardin calling commons a "tragedy" is to suggest it necessarily fails.
Compare it to, say, driving. It can fail too, but you wouldn't call it "the tragedy of driving".
We'd be much better off if people didn't throw around this zombie term decades after it's been shown to be unfounded.
But I would make the following clarifications:
1. A private entity is still the steward of the resource and therefore the resource figures into the aims, goals, and constraints of the private entity.
2. The common good is itself under the stewardship of the state, as its function is guardian of the common good.
3. The common good is the default (by natural law) and prior to the private good. The latter is instituted in positive law for the sake of the former by, e.g., reducing conflict over goods.
The jerks get their free things for a while, then it goes away for everyone.
Remember how GTA5 took 10 minutes to start and nobody cared? Lots of software is like this.
Some Blizzard games download 137 MB file every time you run them and take few minutes to start (and no, this is not due to my computer).
The article mentions that most of these projects did use GitHub as a central repo out of convenience so there’s that but they could also have used self-hosted repos.
Explain to me how you self-host a git repo without spending any money and having no budget which is accessed millions of time a day from CI jobs pulling packages.
I've looked into self hosting and git repo that has horizontal scalability, and it is indeed very difficult. I don't have the time to detail it in a comment here, but for anyone who is curious it's very informative to look at how GitLab handled this with gitaly. I've also seen some clever attempts to use object storage, though I haven't seen any of those solutions put heavily to the test.
I'd love to hear from others about ideas and approaches they've heard about or tried
This is what people mean about speed being a feature. But "user time" depends on more than the program's performance. UI design is also very important.
You can implement entire features with 10 cents of tokens.
Companies which dont adapt will be left behind this year.
Also in case you're not aware, accusing people of shilling or astroTurfing is against the hacker news guidelines
The number of companies that have this much respect for the user is vanishingly small.
Native software being an optimum is mostly an engineer fantasy that comes from imagining what you can build.
In reality that means having to install software like Meta’s WhatsApp, Zoom, and other crap I’d rather run in a browser tab.
I want very little software running natively on my machine.
I think companies shifted to online apps because #1 it solved the copy protection problem. FOSS apps are not in any hurry to become centralized because they dont care about that issue.
Local apps and data are a huge benefit of FOSS and I think every app website should at least mention that.
"Local app. No ads. You own your data."
I have never been convinced by this argument. The aggregate number sounds fantastic but I don't believe that any meaningful work can be done by each user saving 1 second. That 1 second (and more) can simply be taken by me trying to stretch my body out.
OTOH, if the argument is to make software smaller, I can get behind that since it will simply lead to more efficient usage of existing resources and thus reduce the environmental impact.
But we live in a capitalist world and there needs to be external pressure for change to occur. The current RAM shortage, if it lasts, might be one of them. Otherwise, we're only day dreaming for a utopia.
A high usage one, absolutely improve the time of it.
Loading the profile page? Isn't done often so not really worth it unless it's a known and vocal issue.
https://xkcd.com/1205/ gives a good estimate.
First argument would be - take at least two 0's from your estimation, most of applications will have maybe thousands of users, successful ones will maybe run with 10's of thousands. You might get lucky to work on application that has 100's of thousands, millions of users and you work in FAANG not a typical "software house".
Second argument is - most users use 10-20 apps in typical workday, your application is most likely irrelevant.
Third argument is - most users would save much more time learning how to use applications (or to use computer) properly they use on daily basis, than someone optimizing some function from 2s to 1s. But of course that's hard because they have 10-20 apps daily plus god know how many other not on daily basis. Though still I see people doing super silly stuff in tools like Excel or even not knowing copy paste - so not even like any command line magic.
> "The Macintosh boots too slowly. You've got to make it faster!"
Mostly to avoid downloading the whole repo/resolve deltas from the history for the few packages most applications tend to depend on. Especially in today's CI/CD World.
It relies on a git repo branch for stable. There are yaml definitions of the packages including urls to their repo, dependencies, etc. Preflight scripts. Post install checks. And the big one, the signatures for verification. No binaries, rpms, debs, ar, or zip files.
What’s actually installed lives in a small SQLite database and searching for software does a vector search on each packages yaml description.
Semver included.
This was inspired by brew/portage/dpkg for my hobby os.
As it is, this comment is just letting out your emotion, not engaging in dialogue.
O(1) beats O(n) as n gets large.
If it didn't work we would not have these massive ecosystems upsetting GitHub's freemium model, but anything at scale is naturally going to have consequences and features that aren't so compatible with the use case.
Every, ...king time, when I read something like "RFC 2789 introduced a sparse HTTP protocol." my brain suffers from a short-circuit. BTW: RFC 2789 is a "Mail Monitoring MIB".
Not so smart, when we realize, that one of aspects of secure and reliable system is elimination of ambiguities.
The index is used for all lookups; it can also be generated or incrementally updated client-side to accommodate local changes.
This has worked fine for literally decades, starting back when bandwidth and CPU power was far more limited.
The problem isn’t using SCM, and the solutions have been known for a very long time.
This entire blog is just a waste of time for anyone reading it.
That is such an insane default, I'm at a loss for words.
Unfortunately, when you’re starting out, the idea of running a registry is a really tough sell. Now, on top of the very hard engineering problem of writing the code and making a world class tool, plus the social one of getting it adopted, I need to worry about funding and maintaining something that serves potentially a world of traffic? The git solution is intoxicating through this lense.
Fundamentally, the issue is the sparse checkouts mentioned by the author. You’d really like to use git to version package manifests, so that anyone with any package version can get the EXACT package they built with.
But this doesn’t work, because you need arbitrary commits. You either need a full checkout, or you need to somehow track the commit a package version is in without knowing what hash git will generate before you do it. You have to push the package update and then push a second commit recording that. Obviously infeasible, obviously a nightmare.
Conan’s solution is I think just about the only way. It trades the perfect reproduction for conditional logic in the manifest. Instead of 3.12 pointing to a commit, every 3.x points to the same manifest, and there’s just a little logic to set that specific config field added in 3.12. If the logic gets too much, they let you map version ranges to manifests for a package. So if 3.13 rewrites the entire manifest, just remap it.
I have not found another package manager that uses git as a backend that isn’t a terrible and slow tool. Conan may not be as rigorous as Nix because of this decision but it is quite pragmatic and useful. The real solution is to use a database, of course, but unless someone wants to wire me ten thousand dollars plus server costs in perpetuity, what’s a guy supposed to do?
So you need a decentralized database? Those exist (or you can make your own, if you're feeling ambitious), probably ones that scale in different ways than git does.
It’s really important that someone is able to search for the manifest one of their dependencies uses for when stuff doesn’t work out of the box. That should be as simple as possible.
I’m all ears, though! Would love to find something as simple and good as a git registry but decentralized
The point being, if you're not sure whether your project will ever need to scale, then it may not make sense to reinvent the wheel when git is right there (and then invent the solution for hosting that git repo, when Github is right there), letting you spend time instead on other, more immediate problems.
for every project that managed to out-grow ext4/git there were a hundred that were well-served and never needed to over-invest in something else
It being unoptimal bandwidth wise is frankly just a technical hurdle to get over it, with benefits well worth the drawback
> The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies.
This article is mixing two separate issues. One is using git as the master database storing the index of packages and their versions. The other is fetching the code of each package through git. They are orthogonal; you can have a package index using git but the packages being zip/tar/etc archives, you can have a package index not using git but each package is cloned from a git repository, you can have both the index and the packages being git repositories, you can have neither using git, you can even not have a package index at all (AFAIK that's the case for Go).
Sqlite data is paged and so you can get away with only fetching the pages you need to resolve your query.
https://phiresky.github.io/blog/2021/hosting-sqlite-database...
eviks•2h ago