It's a really cool system for hyper converged architecture where storage requests can pull data from the local machine and only hit the network when needed.
Erasure coding is another debate, for now we have chosen not to implement it, but I would personally be open to have it supported by Garage if someone codes it up.
> For the metadata storage, Garage does not do checksumming and integrity verification on its own, so it is better to use a robust filesystem such as BTRFS or ZFS. Users have reported that when using the LMDB database engine (the default), database files have a tendency of becoming corrupted after an unclean shutdown (e.g. a power outage), so you should take regular snapshots to be able to recover from such a situation.
It seems like you can also use SQLite, but a default database that isn't robust against power failure or crashes seems suprising to me.
If you really live somewhere with frequent outages, buy an industrial drive that has a PLP rating. Or get a UPS, they tend to be cheaper.
As I understood it, the capacitors on datacenter-grade drives are to give it more flexibility, as it allows the drive to issue a successful write response for cached data: the capacitor guarantees that even with a power loss the write will still finish, so for all intents and purposes it has been persisted, so an fsync can return without having to wait on the actual flash itself, which greatly increases performance. Have I just completely misunderstood this?
Unfortunately they do: https://news.ycombinator.com/item?id=38371307
Yes, otherwise those drives wouldn't work at all and would have a 100% warranty return rate. The reason they get away with it is that the misbehavior is only a problem in a specific edge-case (forgetting data written shortly before a power loss).
https://documents.westerndigital.com/content/dam/doc-library...
That doesn't even help if fsync() doesn't do what developers expect: https://danluu.com/fsyncgate/
I think this was the blog post that had a bunch more stuff that can go wrong too: https://danluu.com/deconstruct-files/
But basically fsync itself (sometimes) has dubious behaviour, then OS on top of kernel handles it dubiously, and then even on top of that most databases can ignore fsync erroring (and lie that the data was written properly)
So... yes.
If you use WITHOUT ROWID, you traverse only the BLOB->data tree.
Looking up lexicographically similar keys gets a huge performance boost since sqlite can scan a B-Tree node and the data is contiguous. Your current implementation is chasing pointers to random locations in a different b-tree.
I'm not sure exactly whether on disk size would get smaller or larger. It probably depends on the key size and value size compared to the 64 bit rowids. This is probably a well studied question you could find the answer to.
[1]: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/4efc8...
It's built specifically to run on object storage, currently relies on the `object_store` crate but we're consdering OpenDAL instead so if Garage works with those crates (I assume it does if its S3 compatible) it should just work OOTB.
Checksumming detects corruption after it happened. A database like Postgres will simply notice it was not cleanly shut down and put the DB into a consistent state by replaying the write ahead log on startup. So that is kind of my default expectation for any DB that handles data that isn't ephemeral or easily regenerated.
But I also likely have the wrong mental model of what Garage does with the metadata, as I wouldn't have expected that to be ever limited by Sqlite.
We do recommend SQLite in our quick-start guide to setup a single-node deployment for small/moderate workloads, and it works fine. The "real world deployment" guide recommends LMDB because it gives much better performance (with the current status of Garage, not to say that this couldn't be improved), and the risk of critical data loss is mitigated by the fact that such a deployment would use multi-node replication, meaning that the data can always be recovered from another replica if one node is corrupted and no snapshot is available. Maybe this should be worded better, I can see that the alarmist wording of the deployment guide is creating quite a debate so we probably need to make these facts clearer.
We are also experimenting Fjall as an alternate KV engine based on LSM, as it theoretically has good speed and crash resilience, which would make it the best option. We are just not recommending it by default yet, as we don't have much data to confirm that it works up to these expectations.
It’s worth noting too that B+ tree databases are not a fantastic match for ZFS - they usually require extra tuning (block sizes, other stuff like how WAL commits work) to get performance comparable to XFS/ext4. LSMs on the other hand naturally fit ZFS’s CoW internals like a glove.
Whether or not it's semantically "correct" because of usage of hyphen vs slash is irrelevant to that point.
A used-car lot
A value-added tax
A key-based access system
When you have two exclusive options, two sides to a situation, or separate things; you separate them with a slash: An on/off switch
A win/win situation
A master/slave arrangement
Therefore a key-value store and a key/value store are quite different.It's true that key–value store shouldn't be written with a hyphen. It should be written with an en dash, which is used "to contrast values or illustrate a relationship between two things [... e.g.] Mother–daughter relationship"
https://en.wikipedia.org/wiki/Dash#En_dash
I just didn't want to bother with typography at that level of pedanticism.
"...the slash is now used to represent division and fractions, as a date separator, in between multiple alternative or related terms"
-Wikipedia
And what is a key/value store? A store of related terms.
And if you had a system that only allowed a finite collection of key values, where might you put them? A key-value store.
Standard filesystems such as ext4 and xfs don't have data checksumming, so you'll have to rely on another layer to provide integrity. Regardless, that's not garage's job imo. It's good that they're keeping their design simple and focus their resources on implementing the S3 spec.
LMDB mode also runs with flush/syncing disabled
this is the reliability question no?
https://archive.fosdem.org/2024/schedule/event/fosdem-2024-3...
Slides are available here:
https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/4efc8...
Conditionnal writes : no, we can't do it with CRDTs, which are the core of Garage's design.
https://dd.thekkedam.org/assets/documents/publications/Repor... http://www.bailis.org/papers/ramp-sigmod2014.pdf
Garage looks really nice: I've evaluated it with test code and benchmarks and it looks like a winner. Also, very straightforward deployment (self contained executable) and good docs.
But no tags on objects is a pretty big gap, and I had to shelve it. If Garage folk see this: please think on this. You obviously have the talent to make a killer application, but tags are table stakes in the "cloud" API world.
I really, really appreciate that Garage accommodates running as a single node without work-arounds and special configuration to yield some kind of degraded state. Despite the single minded focus on distributed operation you no doubt hear endlessly (as seen among some comments here,) there are, in fact, traditional use cases where someone will be attracted to Garage only for the API compatibility, and where they will achieve availability in production sufficient to their needs by means other than clustering.
Arbitrary name+value pairs attached to S3 objects and buckets, and readily available via the S3 API. Metadata, basically. AWS has some tie-ins with permissions and other features, but tags can be used for any purpose. You might encode video multiple times at different bitrates, and store the rate in a tag on each object, for example. Tags are an affordance used by many applications for countless purposes.
The assumption Garage makes, which is well-documented, is that of 3 replica nodes, only 1 will be in a crash-like situation at any time. With 1 crashed node, the cluster is still fully functional. With 2 crashed nodes, the cluster is unavailable until at least one additional node is recovered, but no data is lost.
In other words, Garage makes a very precise promise to its users, which is fully respected. Database corruption upon power loss enters in the definition of a "crash state", similarly to a node just being offline due to an internet connection loss. We recommend making metadata snapshots so that recovery of a crashed node is faster and simpler, but it's not required per se: Garage can always start over from an empty database and recover data from the remaining copies in the cluster.
To talk more about concrete scenarios: if you have 3 replicas in 3 different physical locations, the assumption of at-most one crashed node is pretty reasonable, it's quite unlikely that 2 of the 3 locations will be offline at the same time. Concerning data corruption on a power loss, the probability to lose power at 3 distant sites at the exact same time with the same data in the write buffers is extremely low, so I'd say in practice it's not a problem.
Of course, this all implies a Garage cluster running with 3-way replication, which everyone should do.
If it's just the write buffer at risk, that's fine. But the chance of overlapping power loss across multiple sites isn't low enough to risk all the existing data.
It's downright stupid if you build a system that loses all existing data when all nodes go down uncleanly, not even simultaneously but just overlapping. What if you just happen to input a shutdown command the wrong way?
I really hope they meant to just say the write buffer gets lost.
Again, I'm not concerned for new writes, I'm concerned for all existing data from the previous months and years.
And getting in this situation only takes one out of a wide outage or a bad push that takes down the cluster. Even if that's stupid, it's a common enough stupid that you should never risk your data on the certainty you won't make that mistake.
You can't protect against everything, but you should definitely protect against unclean shutdown.
Also, garage gives you the possibility to automatically snapshot the metadata, advices on how to do the snapshotting at the filesystem level and to restore that.
How do filesystem level snapshots work if nodes might get corrupted by power loss? Booting from a snapshot looks exactly the same to a node as booting from a power loss event. Are you implying that it does always recover from power loss and you're defending a flaw it doesn't even have?
- you lose whatever writes to s3 haven't finished yet, if any
- the local node will need to repair itself a bit after rebooting
- the local node is now trashed and will have to copy all data back over
- all the nodes are now trashed and it's restore from backup time
I've been kicking the tyres for a bit and I think it's the happy case in the above, but lots of software out there completely falls apart on crashes so it's not generally a safe assumption. I think the behaviour is sqlite on zfs doesn't care about pulling the power cable out, lmdb is a bit further down the list.
Syncthing will synchronize a full folder between an arbitrary number of machines, but you still have to access this folder one way or another.
Garage provides an HTTP API for your data, and handles internally the placement of this data among a set of possible replica nodes. But the data is not in the form of files on disk like the ones you upload to the API.
Syncthing is good for, e.g., synchronizing your documents or music collection between computers. Garage is good as a storage service for back-ups with e.g. Restic, for media files stored by a web application, for serving personal (static) web sites to the Internet. Of course, you can always run something like Nextcloud in front of Garage and get folder synchronization between computers somewhat like what you would get with Syncthing.
But to answer your question, yes, Garage only provides a S3-compatible API specifically.
We’ve done some fairly extensive testing internally recently and found that Garage is somewhat easier to deploy in comparison to our existing use of MinIO, but is not as performant at high speeds. IIRC we could push about 5 gigabits of (not small) GET requests out of it, but something blocked it from reaching the 20-25 gigabits (on a 25g NIC) that MinIO could reach (also 50k STAT requests/s, over 10 nodes)
I don’t begrudge it that. I get the impression that Garage isn’t necessarily focussed on this kind of use case.
---
In addition:
Next time we come to this we are going to look at RustFS [1], as well as Ceph/Rook [2].
We can see we're going to have to move away from MinIO in the foreseeable future. My hope is that the alternatives get a boost of interest given the direction MinIO is now taking.
[0]: https://news.ycombinator.com/item?id=46140342
[1]: https://rustfs.com/
[2]: https://rook.io/
But it might be interesting to see where the time is spent. I suspect they may be doing fewer things in parallel than MinIO, but maybe it's something entirely different.
I wouldn't be surprised if this will be fixed sometime in the future.
My favorite thing about all of this is that I had just invested a ton of time in understanding MinIO and its Kubernetes operator and got everything into a state that I felt good about. I was nearly ready to deploy it to production when the announcement was released that they would not be supporting it.
I’m somewhat surprised that no one is forking it (or I haven’t heard about any organizations of consequence stepping up anyway) instead of all of these projects to rebuild it from scratch.
nothing prevents them from hiking pricing, so expectations are not clear.
> RustFS is a high-performance, distributed object storage software developed using Rust, the world's most popular memory-safe language.
I’m actually something of a Rust booster, and have used it professionally more than once (including working on a primarily Rust codebase for a while). But it’s hard to take a project’s docs seriously when it describes Rust as “the world’s most popular memory-safe language”. Java, JavaScript, Python, even C# - these all blow it out of the water in popularity and are unambiguously memory safe. I’ve had a lot more segfaults in Rust dependencies than I have in Java dependencies (though both are minuscule in comparison to e.g. C++ dependencies).
[0]
qed
Again the statement is probably still untrue and bad marketing, but I suspect this kind of reasoning was behind it
Of course Rust technically fails too since `unsafe` is a language featureThanks for the reality check on our documentation. We realize that some of our phrasing sounded more like marketing hype than a technical spec. That wasn’t our intent, and we are currently refining our docs to be more precise and transparent.
A few points to clarify where we’re coming from: 1. The Technical Bet on Rust: Rust wasn’t a buzzword choice for us. We started this project two years ago with the belief that the concurrency and performance demands of modern storage—especially for AI-driven workloads—benefit from a foundation with predictable memory behavior, zero-cost abstractions, and no garbage collector. These properties matter when you care about determinism and tail latency. 2. Language Safety vs. System Design: We’re under no illusion that using a memory-safe language automatically makes a system “100% secure.” Rust gives us strong safety primitives, but the harder problems are still in distributed systems design, failure handling, and correctness under load. That’s where most of our engineering effort is focused. 3. Giving Back to the Ecosystem: We’re committed to the ecosystem we build on. RustFS is a sponsor of the Rust Foundation, and as we move toward a global, Apache 2.0 open-source model, we intend to contribute back in more concrete ways over time.
We know there’s still work to do on the polish side, and we genuinely appreciate the feedback. If you have specific questions about our implementation details or the S3 compatibility layer, I’m happy to dive into the technical details.
At work I'm typically a consumer of such services from large cloud providers. I read in few places how "difficult" it is, how you need "4GB minimum RAM for most services" and how "friends do not let friends run Ceph below 10Gb".
But this setup runs on a non dedicated 2.5Gb interface (there is VLAN segmentation and careful QoSing).
My benchmarks show I'm primarily network latency and bandwidth limited. By the very definition you can't get better than that.
There were many factors why I chose Ceph and not Garage, Seaweed or MinIo. (One of the biggest is that ceph does 2 birds with one stone for me - block and object).
Also from our experience the docs outright lie about ceph's OSD memory usage and we've seen double or more than what docs claim (8-10GB instead of 4)
And also it is already rigged for a rug-pull
https://github.com/rustfs/rustfs/blob/main/rustfs/src/licens...
from https://rustfs.com/ if you click Documentation, it takes you to their main docs site. there's a nav header at the top, if you click Docs there...it 404s.
"Single Node Multiple Disk Installation" is a 404. ditto "Terminology Explanation". and "Troubleshooting > Node Failure". and "RustFS Performance Comparison".
on the 404 page, there's a "take me home" button...which also leads to a 404.
We’re actively reviewing the docs and cleaning up any 404s or navigation issues we can find. For the specific 404 you mentioned, we haven’t been able to reproduce it on our end so far, but we’re continuing to investigate in case it’s environment- or cache-related.
On the licensing side, we want to be clear that we’re fully committed to Apache 2.0 for the long term.
Previously I used LocalStack S3 but ultimately didn't like the lack of persistance thats not available on the OSS verison. MinIO OSS is apparently no longer maintained? Also looked at SeaweedFS and RustFS but from a quick reading into them this once was the easiest to set up.
Just run "weed sever -s3 -dir=..." to have an object store.
Does anyone know a good open source S3 alternarive that's easily extendable with custom storage backends?
For example, AWS offers IA and Glacier in addition to the defaults.
This is used for ransomware resistant backups.
https://garagehq.deuxfleurs.fr/blog/2022-ipfs/
Let's talk!
* no lifecycle management of any kind - if you're using it for backups you can't set "don't delete versions for 3 months", so if anyone takes hold of your key, you backups are gone. I relied on minio's lifecycle management for that but it's feature missing in garage (and to be fair, most other) S3
* no automatic mirroring (if you want to have second copy in something other than garage or just don't want to have a cluster but rather have more independent nodes)
* ACLs for access are VERY limited - can't make a key access only sub-path, can't make a "master key" (AFAIK, couldn't find an option) that can access all the buckets so the previous point is also harder - I can't easily use rclone to mirror entire instance somewhere else unless I write scrip iterating over buckets and adding them bucket by bucket to key ACK
* Web hosting features are extremely limited so you won't be able to say set CORS headers for the bucket
* No ability to set keys - you can only generate on inside garage or import garage-formatted one - which means you can't just migrate storage itself, you have to re-generate every key. It also makes automating it harder, in case of minio you can pre-generate key and pass then fed it to clients and to the minio key command, here you have to do the dance of "generate with tool" -> "scrape and put into DB" -> put onto clients.
Overall I like the software a lot but if you have setup that uses those features, beware.
If someone gets a hold of your key, can't they also just change your backup deletion policy, even if it supported one?
Minio have full on ACLs so you can just create a key that can only write/read but not change any settings like that.
So you just need to keep the "master key" that you use for setup away from potentially vulnerable devices, the "backup key" doesn't need those permissions.
I've spent a mostly pleasant day seeing whether I can reasonably use garage + rclone as a replacement for NFS and the answer appears to be yes. Not really a recommended thing to do. Garage setup was trivial, somewhat reminiscent of wireguard. Rclone setup was a nuisance, accumulated a lot of arguments to get latency down and I think the 1.6 in trixie is buggy.
Each node has rclone's fuse mount layer on it with garage as the backing store. Writes are slow and a bit async, debugging shows that to be wholly my fault for putting rclone in front of it. Reads are fast, whether pretending to be a filesystem or not.
Yep, I think I'm sold. There will be better use cases for this than replacing NFS. Thanks for sharing :)
SomaticPirate•1mo ago
https://www.repoflow.io/blog/benchmarking-self-hosted-s3-com... was useful.
RustFS also looks interesting but for entirely non-technical reasons we had to exclude it.
Anyone have any advice for swapping this in for Minio?
dpedu•1mo ago
https://github.com/versity/versitygw
I am also curious how Ceph S3 gateway compares to all of these.
zipzad•1mo ago
skrtskrt•1mo ago
They just completely swapped out the whole service from the stack and wrote one in Go because of how much better the concurrency management was, and Ceph's team and codebase C++ was too resistant to change.
jiqiren•1mo ago
Implicated•1mo ago
Able/willing to expand on this at all? Just curious.
NitpickLawyer•1mo ago
lima•1mo ago
I'm not sure if it even has any sort of cluster consensus algorithm? I can't imagine it not eating committed writes in a multi-node deployment.
Garage and Ceph (well, radosgw) are the only open source S3-compatible object storage which have undergone serious durability/correctness testing. Anything else will most likely eat your data.
KevinatRustFS•1mo ago
To clarify our architecture: RustFS is purpose-built for high-performance object storage. We intentionally avoid relying on general-purpose consensus algorithms like Raft in the data path, as they introduce unnecessary latency for large blobs.
Instead, we rely on Erasure Coding for durability and Quorum-based Strict Consistency for correctness. A write is strictly acknowledged only after the data has been safely persisted to the majority of drives. This means the concern about "eating committed writes" is addressed through strict read-after-write guarantees rather than a background consensus log.
While we avoid heavy consensus for data transfer, we utilize dsync—a custom, lightweight distributed locking mechanism—for coordination. This specific architectural strategy has been proven reliable in production environments at the EiB scale.
lima•1mo ago
It's really hard to solve this problem without a consensus algorithm in a way that doesn't sacrifice something (usually correctness in edge cases/network partitions). Data availability is easy(ish), but keeping the metadata consistent requires some sort of consensus, either using Raft/Paxos/..., using strictly commutative operations, or similar. I'm curious how RustFS solves this, and I couldn't find any documentation.
EiB scale doesn't mean much - some workloads don't require strict metadata consistency guarantees, but others do.
dewey•1mo ago
NitpickLawyer•1mo ago
> Beijing Address: Area C, North Territory, Zhongguancun Dongsheng Science Park, No. 66 Xixiaokou Road, Haidian District, Beijing
> Beijing ICP Registration No. 2024061305-1
dewey•1mo ago
misnome•1mo ago
Otherwise, the built in admin on one-executable was nice, and support for tiered storage, but single node parallel write performance was pretty unimpressive and started throwing strange errors (investigating of which led to the AI ticket discovery).
scottydelta•1mo ago
klooney•1mo ago
chrislusf•1mo ago
Why skipping SeaweedFS? It rank #1 on all benchmarks, and has a lot of features.
dionian•1mo ago
chrislusf•1mo ago
meotimdihia•1mo ago
magicalhippo•1mo ago
Not a concern for many use-cases, just something to be aware of as it's not a universal solution.
[1]: https://github.com/seaweedfs/seaweedfs?tab=readme-ov-file#st...
chrislusf•1mo ago
magicalhippo•1mo ago
ted_dunning•1mo ago
elvinagy•1mo ago
We know trust matters, especially for a newer project, and we try to earn it through transparency and external validation. we were excited to see RustFS recently added as an optional service in Laravel Sail’s official Docker environment (PR #822). Having our implementation reviewed and accepted by a major ecosystem like Laravel was an encouraging milestone for us.
If the “non-technical reasons” you mentioned are around licensing or governance, I’m happy to discuss our long-term Apache 2.0 commitment and path to a stable GA.