I’ve never used it for anything serious but my understanding is that Mercurial handles binary files better? Like it supports binary diffs if I understand correctly.
Any reason Git couldn’t get that?
Are there many binaries that people would store in git where this would actually help? I assume most files end up with compression or some other form of randomization between revisions making deduplication futile.
LFS does break disconnected/offline/sneakernet operations, which wasn't mentioned and is not awesome, but those are niche workflows. It sounds like that would also be broken with promisors.
The partial clone (`git clone --filter=...`) examples are cool!
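For anyone who hasn't tried it, a partial clone looks roughly like this (the repo URL is just a placeholder):

```sh
# Blobless clone: fetch commits and trees now, defer file contents
# until a command actually needs them (e.g. at checkout).
git clone --filter=blob:none https://example.com/big-repo.git

# Or only skip blobs above a size threshold, e.g. 1 MiB:
git clone --filter=blob:limit=1m https://example.com/big-repo.git
```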
The description of Large Object Promisors makes it sound like they take the client-side complexity of LFS, move it server-side, and then increase the complexity? Instead of the client uploading to a git server and to an LFS server, it uploads to a git server which in turn uploads to an object store, but the client will download directly from the object store? Obviously different tradeoffs there. I'm curious how often people will get bitten by uploading to public git servers which upload to hidden promisor remotes.
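For context, Git already has promisor-remote plumbing that the large-object-promisor work builds on; client-side it looks something like this (the remote name and URL are made up, and how the server-to-object-store handoff will be spelled is up to the proposal):

```sh
# A separate remote that "promises" to serve missing objects on demand.
git config remote.blobs.url https://objects.example.com/big-repo
git config remote.blobs.promisor true
git config remote.blobs.partialclonefilter blob:none
```

With that in place, commands that hit a missing object fetch it lazily from the promisor remote instead of failing.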
I dunno if their solution is any better but it's fairly unarguable that LFS is bad.
While git LFS is just a kludge for now, writing a filter argument during the clone operation is not the long-term solution either.
Git clone is the very first command most people will run when learning how to use git. Emphasized for effect: the very first command.
Will they remember to write the filter? Maybe, if the tutorial to the cool codebase they're trying to access mentions it. Maybe not. What happens if they don't? It may take a long time without any obvious indication. And if they do? The cloned repo might not be compilable/usable since the blobs are missing.
Say they do get it right. Will they understand it? Most likely not. We are exposing the inner workings of git on the very first command they learn. What's a blob? Why do I need to filter on it? Where are blobs stored? It's classic abstraction leakage.
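To make the leakage concrete, here is roughly what a blobless clone looks like from the inside (assuming it was made with `--filter=blob:none`):

```sh
# History is complete, but most file contents are only "promised".
# List the holes (missing objects are printed with a leading "?"):
git rev-list --objects --missing=print HEAD | grep '^?' | head

# Anything that touches those contents (checkout, diff, log -p, blame)
# triggers an on-demand fetch, which needs network access.
```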
This is a solved problem: Rsync does it. Just port the bloody implementation over. It does mean supporting alternative representations or moving away from blobs altogether, which git maintainers seem unwilling to do.
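For anyone who hasn't seen it, this is the rsync behaviour being referred to: a rolling-checksum delta transfer that only ships the changed pieces of a file. A quick local demo with placeholder paths (`--no-whole-file` forces the delta algorithm, which rsync normally reserves for network transfers):

```sh
# Copy once, change a few bytes in a big file, then sync again.
rsync -a bigfiles/ mirror/
rsync -a --inplace --no-whole-file --stats bigfiles/ mirror/
# --stats will show how little literal data was actually sent.
```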
And yes, you can fix defaults without breaking backwards compatibility.
Not strictly true. They did change the default push behaviour from "matching" to "simple" in Git 2.0.
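Concretely, that was just a change of default for a setting users could already override themselves:

```sh
# Pre-2.0 default: push every branch that has a same-named branch on the remote.
git config --global push.default matching

# 2.0+ default: push only the current branch, to the same-named branch.
git config --global push.default simple
```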
I agree with GP. The git community is very fond of shipping checkbox fixes for team problems, fixes that aren't or can't be made defaults and so require constant user intervention to work. See also some of the sparse checkout systems and adding notes to commits after the fact. They only work if you turn every pull and push into a flurry of activity, which means they will never work from your IDE. Those are non-fixes that pollute the space for actual fixes.
Can you explain what the solution is? I don't mean the details of the rsync algorithm, but rather what it would look like from the users' perspective. What files are on your local filesystem when you do a "git clone"?
Only the historical versions of the blobs are filtered out; the blobs needed for the current checkout still get fetched.
A good version control system would support petabyte-scale history and terabyte-scale clones via a sparse virtual filesystem.
Git’s design is just bad for almost all projects that aren’t Linux.
I like this idea in principle, but I always wonder what that would look like in practice outside a FAANG company: How do you ensure the virtual filesystem works equally well on all platforms, without root access, possibly even inside containers? How do you ensure it's fast? What do you do in case of network errors?
Then you can clone without checking out all the unnecessary large files and still get a working build. This also helps on the legal side, making it easier to license your repos correctly.
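With today's tooling that would look roughly like this (URL and directory names are made up):

```sh
# Skip large blobs and start from a minimal checkout...
git clone --filter=blob:limit=1m --sparse https://example.com/big-repo.git
cd big-repo

# ...then materialise only the directories the build actually needs.
git sparse-checkout set src/ third_party/needed-lib/
```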
I'm struggling to see how this is a problem with git and not just antipatterns that arise from badly organized projects.
> if I git clone a repo with many revisions of a noisome 25 MB PNG file
FYI ‘noisome’ is not a synonym for ‘noisy’ - it’s more of a synonym for ‘noxious’; it means something smells bad.
Prolly trees are very similar to Merkle trees or the rsync algorithm, but they support mutation and version history retention with some nice properties. For example: you always obtain exactly the same tree (with the same root hash) irrespective of the order of incremental edit operations used to get to the same state.
In other words, two users could edit a subset of a 1 TB file, both could merge their edits, and both will then agree on the root hash without having to re-hash or even download the entire file!
Another major advantage on modern many-core CPUs is that Prolly trees can be constructed in parallel instead of having to be streamed sequentially on one thread.
Then the really big brained move is to store the entire SCM repo as a single Prolly tree for efficient incremental downloads, merges, or whatever. I.e.: a repo fork could share storage with the original not just up to the point-in-time of the fork, but all future changes too.
Does anyone have feedback from personally using DVC vs LFS?
These new features are pretty awesome too. Especially separate large object remotes. They will probably enable git to be used for even more things than it's already being used for. They will enable new ways to work with git.