Recently I've been looking into Garage and liking the idea of it, but it seems to have a very different design (no erasure coding).
It's open source / free to boot. I have no direct experience with it myself however.
We would occasionally try cephFS, the POSIX shared network filesystem, but it couldn't match our gluster performance for our workload. But also, we built the ceph long term storage to maximize TB/$, so it was at a disadvantage compared to our gluster install. Still, I never heard of cephFS being used anywhere despite it being the original goal in the papers back at UCSC. Keep an eye on CERN for news about one of the bigger ceph installs with public info.
I love both of the systems, and am glad to see that gluster is still around.
It's the classic horizontal/vertical scaling trade-off; that's why flash tends to be more space- and cost-efficient for speedy access.
However, if you need high IOPS, you need flash for the MDS in Lustre, and for ZFS some dedicated log/cache SSDs (especially separate devices for the write log and the read cache).
Basically, I have a single big server with 80 high-capacity HDDs and 4 high-endurance NVMe drives, and it acts as an S3 endpoint that gets a lot of writes.
So yes, for now my best candidate is ZFS + Garage: that way I can get away with using replica=1 and rely on ZFS RAIDZ for data safety, and the NVMe drives can be sliced and diced to act as the fast metadata store for Garage, the "special" device / small-records store for ZFS, the ZIL/SLOG device, and so on.
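For a rough idea of what I mean, here's a dry-run sketch of the pool layout (device paths, vdev widths and the RAIDZ level are just placeholders for illustration, not my actual config); it only prints the zpool command it would issue:

    # Dry-run sketch only: device paths, vdev widths and RAIDZ level are
    # placeholders, not a real config. It just builds and prints the command.
    HDDS = [f"/dev/disk/by-id/hdd{i:02d}" for i in range(80)]   # 80 capacity drives
    NVMES = [f"/dev/disk/by-id/nvme{i}" for i in range(4)]      # 4 high-endurance NVMes

    def build_zpool_create(pool="tank"):
        parts = ["zpool", "create", pool]
        # split the 80 HDDs into 8 x 10-wide RAIDZ2 vdevs (one possible layout)
        for i in range(0, len(HDDS), 10):
            parts += ["raidz2", *HDDS[i:i + 10]]
        # mirrored "special" vdev on two NVMe partitions (metadata + small records)
        parts += ["special", "mirror", NVMES[0] + "-part1", NVMES[1] + "-part1"]
        # mirrored SLOG (ZIL) on the other two NVMe partitions
        parts += ["log", "mirror", NVMES[2] + "-part1", NVMES[3] + "-part1"]
        return " ".join(parts)

    print(build_zpool_create())
    # Garage's metadata store would then sit on an NVMe-backed dataset/partition,
    # with its data directory pointed at the RAIDZ2 pool.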
Currently it's a bit of a Frankenstein's monster: XFS+OpenCAS as the backing storage for an old version of MinIO (containerized to run as 5 instances). I'm looking to replace it with a simpler design and hopefully get better performance.
TrueNAS can handle the OpenZFS part (RAIDZ, caches, and logs), and you can deploy Garage or any other S3 gateway on top of it.
It could be an interesting experiment, and an 80-disk server is not too big for a TrueNAS installation.
I honestly figured that the standard tier must be powered by SSDs and that the slower tiers were the ones using HDDs or slower systems.
I assume it would have lots of queues, caches, and long-running workers.
Biasing away from lots of small services in favour of larger ones that handle more of the work, so that you avoid, as much as possible, the costs and latency of preparing, transmitting, receiving, and processing requests.
I know S3 has changed since I was there nearly a decade ago, so this is outdated. Off the top of my head it used to be about a dozen main services at that time. A request to put an object would only touch a couple of services en route to disk, and similar on retrieval. There were a few services that handled fixity and data durability operations, the software on the storage servers themselves, and then stuff that maintained the mapping between object and storage.
S3 is composed primarily of layers of Java-based web services. The hot-path operations (object get/put/list) are all served by synchronous API servers - no queues or workers. It is the best example I've seen in my career of how many transactions per second a pretty standard Java web service stack can handle.
For a get call, you first hit a fleet of front-end HTTP API servers behind a set of load balancers. Partitioning is based on the key name prefixes, although I hear they’ve done work to decouple that recently. Your request is then sent to the Indexing fleet to find the mapping of your key name to an internal storage id. This is returned to the front end layer, which then calls the storage layer with the id to get the actual bits. It is a very straightforward multi-layer distributed system design for serving synchronous API responses at massive scale.
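In rough pseudo-Python, the shape of that GET path looks something like this (class and method names are made up for illustration, not the actual internal services):

    # Made-up class/method names for illustration; not the real internal services.
    from dataclasses import dataclass

    @dataclass
    class IndexService:
        # maps external key names to internal storage ids
        index: dict

        def lookup(self, bucket, key):
            return self.index[f"{bucket}/{key}"]

    @dataclass
    class StorageService:
        # returns the actual bytes for an internal storage id
        blobs: dict

        def read(self, storage_id):
            return self.blobs[storage_id]

    @dataclass
    class FrontEnd:
        # front-end API layer: index lookup first, then storage read
        index: IndexService
        storage: StorageService

        def get_object(self, bucket, key):
            storage_id = self.index.lookup(bucket, key)   # hop 1: indexing fleet
            return self.storage.read(storage_id)          # hop 2: storage layer

    # tiny in-memory example of the two-hop synchronous flow
    fe = FrontEnd(
        index=IndexService({"photos/cat.jpg": "blob-0001"}),
        storage=StorageService({"blob-0001": b"...jpeg bytes..."}),
    )
    print(fe.get_object("photos", "cat.jpg"))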
The only novel bit is that all the backend communication uses a home-grown, stripped-down HTTP variant, called STUMPY if I recall. It was a dumb idea not to just use HTTP, but the service is ancient and was originally built back when principal engineers were allowed to YOLO their own frameworks and protocols, so now they are stuck with it. They might have done the massive lift to replace STUMPY with HTTP since my time.
Can you give some numbers, or at least a ballpark?
EwanToo•2h ago
https://www.allthingsdistributed.com/2023/07/building-and-op...