Recently I've been looking into Garage and liking the idea of it, but it seems to have a very different design (no erasure coding).
It's open source / free to boot. I have no direct experience with it myself however.
We would occasionally try cephFS, the POSIX shared network filesystem, but it couldn't match our gluster performance for our workload. But also, we built the ceph long term storage to maximize TB/$, so it was at a disadvantage compared to our gluster install. Still, I never heard of cephFS being used anywhere despite it being the original goal in the papers back at UCSC. Keep an eye on CERN for news about one of the bigger ceph installs with public info.
I love both of the systems, and am glad to see that gluster is still around.
It's the classic horizontal/vertical scaling trade-off; that's why flash tends to be more space- and cost-efficient for speedy access.
However, if you need high IOPS, you need flash for the MDS in Lustre, and for ZFS some dedicated log/cache SSDs (especially separate devices for the write log and the read cache).
Basically, I have a single big server with 80 high-capacity HDDs and 4 high-endurance NVMe drives, and it acts as an S3 endpoint that gets a lot of writes.
So yes, for now my best candidate is ZFS + Garage: that way I can get away with using replica=1 and rely on ZFS RAIDZ for data safety, and the NVMe drives can be sliced and diced to act as the fast metadata store for Garage, the "special" device / small-records store for ZFS, the ZIL/SLOG device, and so on.
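For a rough idea of what I mean, here's a dry-run sketch of the pool layout (device paths, vdev widths and the RAIDZ level are just placeholders for illustration, not my actual config); it only prints the zpool command it would issue:

    # Dry-run sketch only: device paths, vdev widths and RAIDZ level are
    # placeholders, not a real config. It just builds and prints the command.
    HDDS = [f"/dev/disk/by-id/hdd{i:02d}" for i in range(80)]   # 80 capacity drives
    NVMES = [f"/dev/disk/by-id/nvme{i}" for i in range(4)]      # 4 high-endurance NVMes

    def build_zpool_create(pool="tank"):
        parts = ["zpool", "create", pool]
        # split the 80 HDDs into 8 x 10-wide RAIDZ2 vdevs (one possible layout)
        for i in range(0, len(HDDS), 10):
            parts += ["raidz2", *HDDS[i:i + 10]]
        # mirrored "special" vdev on two NVMe partitions (metadata + small records)
        parts += ["special", "mirror", NVMES[0] + "-part1", NVMES[1] + "-part1"]
        # mirrored SLOG (ZIL) on the other two NVMe partitions
        parts += ["log", "mirror", NVMES[2] + "-part1", NVMES[3] + "-part1"]
        return " ".join(parts)

    print(build_zpool_create())
    # Garage's metadata store would then sit on an NVMe-backed dataset/partition,
    # with its data directory pointed at the RAIDZ2 pool.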
Currently it's a bit of a Frankenstein's monster: XFS+OpenCAS as the backing storage for an old version of MinIO (containerized to run as 5 instances). I'm looking to replace it with a simpler design and hopefully get better performance.
TrueNAS can handle the OpenZFS part (RAIDZ, caches, and logs), and you can deploy Garage or any other S3 gateway on top of it.
It could be an interesting experiment, and an 80-disk server is not too big for a TrueNAS installation.
I honestly figured that the standard tier must be powered by SSDs and that the slower tiers were the ones using HDDs or slower systems.
I assume it would have lots of queues, caches, and long-running workers.
Biasing away from lots of small services in favour of larger ones that handle more of the work, so that you avoid, as much as possible, the costs and latency of preparing, transmitting, receiving, and processing requests.
I know S3 has changed since I was there nearly a decade ago, so this is outdated. Off the top of my head it used to be about a dozen main services at that time. A request to put an object would only touch a couple of services en route to disk, and similar on retrieval. There were a few services that handled fixity and data durability operations, the software on the storage servers themselves, and then stuff that maintained the mapping between object and storage.
S3 is composed primarily of layers of Java-based web services. The hot-path operations (object get/put/list) are all served by synchronous API servers - no queues or workers. It is the best example I've seen in my career of how many transactions per second a pretty standard Java web service stack can handle.
For a get call, you first hit a fleet of front-end HTTP API servers behind a set of load balancers. Partitioning is based on the key name prefixes, although I hear they’ve done work to decouple that recently. Your request is then sent to the Indexing fleet to find the mapping of your key name to an internal storage id. This is returned to the front end layer, which then calls the storage layer with the id to get the actual bits. It is a very straightforward multi-layer distributed system design for serving synchronous API responses at massive scale.
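In rough pseudo-Python, the shape of that GET path looks something like this (class and method names are made up for illustration, not the actual internal services):

    # Made-up class/method names for illustration; not the real internal services.
    from dataclasses import dataclass

    @dataclass
    class IndexService:
        # maps external key names to internal storage ids
        index: dict

        def lookup(self, bucket, key):
            return self.index[f"{bucket}/{key}"]

    @dataclass
    class StorageService:
        # returns the actual bytes for an internal storage id
        blobs: dict

        def read(self, storage_id):
            return self.blobs[storage_id]

    @dataclass
    class FrontEnd:
        # front-end API layer: index lookup first, then storage read
        index: IndexService
        storage: StorageService

        def get_object(self, bucket, key):
            storage_id = self.index.lookup(bucket, key)   # hop 1: indexing fleet
            return self.storage.read(storage_id)          # hop 2: storage layer

    # tiny in-memory example of the two-hop synchronous flow
    fe = FrontEnd(
        index=IndexService({"photos/cat.jpg": "blob-0001"}),
        storage=StorageService({"blob-0001": b"...jpeg bytes..."}),
    )
    print(fe.get_object("photos", "cat.jpg"))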
The only novel bit is that all the backend communication uses a home-grown, stripped-down HTTP variant, called STUMPY if I recall. It was a dumb idea not to just use HTTP, but the service is ancient and was originally built back when principal engineers were allowed to YOLO their own frameworks and protocols, so now they are stuck with it. They might have done the massive lift to replace STUMPY with HTTP since my time.
Can you give some numbers, or at least a ballpark?
EwanToo•2h ago
https://www.allthingsdistributed.com/2023/07/building-and-op...