What Does a Database for SSDs Look Like?

https://brooker.co.za/blog/2025/12/15/database-for-ssd.html

148•charleshn•1mo ago

Comments

mrkeen•1mo ago

> Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random.

Overall speed is irrelevant, what mattered was the relative speed difference between sequential and random access.

And since there's still a massive difference between sequential and random access with SSDs, I doubt the overall approach of using buffers needs to be reconsidered.

crazygringo•1mo ago

Can you clarify? I thought a major benefit of SSDs is that there isn't any difference between sequential and random access. There's no physical head that needs to move.

Edit: thank you for all the answers -- very educational, TIL!

b112•1mo ago

Read up on IOPS, conjoined with requests for sequential reads.

threeducks•1mo ago

Lets take the Samsung 9100 Pro M.2 as an example. It has a sequential read rate of ~6700 MB/s and a 4k random read rate of ~80 MB/s:

https://i.imgur.com/t5scCa3.png

https://ssd.userbenchmark.com/ (click on the orange double arrow to view additional columns)

That is a latency of about 50 µs for a random read, compared to 4-5 ms latency for HDDs.

OptionOfT•1mo ago

At the 4K random reads impacted by the fact that you still cannot switch Samsung SSDs to 4K native clusters?

diroussel•1mo ago

I think that is a bigger impact on writes than reads, but certainly means there is some gap from optimal.

To me a 4k read seems anachronistic from a modern application perspective. But I gather 4kb pages are still common in many file systems. But that doesn’t mean the majority of reads are 4kb random in a real world scenario.

mgerdts•1mo ago

Datacenter storage will generally not be using M.2 client drives. They employ optimizations that win many benchmarks but sacrifice on consistency multiple dimensions (power loss protection, write performance degrades as they fill, perhaps others).

With SSDs, the write pattern is very important to read performance.

Datacenter and enterprise class drives tend to have a maximum transfer size of 128k, which is seemingly the NAND block size. A block is the thing that needs to be erased before rewriting.

Most drives seem to have an indirection unit size of 4k. If a write is not a multiple of the IU size or not aligned, the drive will have to do a read-modify-write. It is the IU size that is most relevant to filesystem block size.

If a small write happens atop a block that was fully written with one write, a read of that LBA range will lead to at least two NAND reads until garbage collection fixes it.

If all writes are done such that they are 128k aligned, sequential reads will be optimal and with sufficient queue depth random 128k reads may match sequential read speed. Depending on the drive, sequential reads may retain an edge due to the drive’s read ahead. My own benchmarks of gen4 U.2 drives generally backs up these statements.

At these speeds, the OS or app performing buffered reads may lead to reduced speed because cache management becomes relatively expensive. Testing should be done with direct IO using libaio or similar.

hinkley•1mo ago

That’s literally faster to do a full table scan below a particular table size.

yyyk•1mo ago

SSD controllers and VFSs are often optimized for sequential access (e.g. readahead cache) which leads to software being written to do sequential access for speed which leads to optimization for that access pattern, and so on.

PunchyHamster•1mo ago

SSD block size is far bigger than 4kB. They still benefit from sequential write

formerly_proven•1mo ago

SSDs have three block/page sizes:

- The access block size (LBA size). Either 512 bytes or 4096 bytes modulo DIF. Purely a logical abstraction.

- The programming page size. Something in the 4K-64K range. This is the granularity at which an erased block may be programmed with new data.

- The erase block size. Something in the 1-128 MiB range. This is the granularity at which data is erased from the flash chips.

SSDs always use some kind of journaled mapping to cope with the actual block size being roughly five orders of magnitude larger than the write API suggests. The FTL probably looks something like an LSM with some constant background compaction going on. If your writes are larger chunks, and your reads match those chunks, you would expect the FTL to perform better, because it can allocate writes contiguously and reads within the data structure have good locality as well. You can also expect for drives to further optimize sequential operations, just like the OS does.

(N.b. things are likely more complex, because controllers will likely stripe data with the FEC across NAND planes and chips for reliability, so the actual logical write size from the controller is probably not a single NAND page)

lazide•1mo ago

It depends on the side of read - most SSD’s have internal block sizes much larger than a typical (actual) random read, so they internally have to do a lot more work for a given byte of output in a random read situation than they would in a sequential one.

Most filesystems read in 4K chunks (or sometimes even worse, 512 byes), and internally the actual block is often multiple MB in size, so this internal read multiplication is a big factor in performance in those cases.

Note the only real difference between a random read and a sequential one is the size of the read in one sequence before it switches location - is it 4K? 16mb? 2G?

loeg•1mo ago

Some discussion in the FragPicker paper (2021) FWIW: https://dl.acm.org/doi/10.1145/3477132.3483593

> Our extensive experiments discover that, unlike HDDs, the performance degradation of modern storage devices incurred by fragmentation mainly stems from request splitting, where a single I/O request is split into multiple ones.

Mawr•1mo ago

That's not possible. It's not an SSD thing either, it always applies to everything [0].

Sequential access is just the simplest example of predictable access, which is always going to perform better than random access because it's possible to optimize around it. You can't optimize around randomness.

So if you give me your fanciest, fastest random access SSD, I can always hand you back that SSD but now with sequential access faster than the random access.

[0]: RAM access, CPU branch prediction, buying stuff in bulk...

Lwerewolf•1mo ago

Same with doing things in RAM as well. Sequential writes and cache-friendly reads, which b-trees tend to achieve for any definition of cache. Some compaction/GC/whatever step at some point. Nothing's fundamentally changed, right?

vegabook•1mo ago

pity Optane which solved for this quite well, was discontinued.

__turbobrew__•1mo ago

It really is a shame optane is discontinued. For durable low latency writes there really is nothing else out there.

londons_explore•1mo ago

Median database workloads are probably doing writes of just a few bytes per transaction. Ie 'set last_login_time = now() where userid=12345'.

Due to the interface between SSD and host OS being block based, you are forced to write a full 4k page. Which means you really still benefit from a write ahead log to batch together all those changes, at least up to page size, if not larger.

esperent•1mo ago

Don't some SSDs have 512b page size?

zokier•1mo ago

They might present 512 blocks to host, but internally the ssd almost certainly manages data in larger pages

cm2187•1mo ago

And the filesystem will also likely be 4k block size.

digikata•1mo ago

I would guess by now none have that internally. As a rule of thumb every major flash density increase (SLC, TLC, QLC) also tended to double internal page size. There were also internal transfer performance reasons for large sizes. Low level 16k-64k flash "pages" are common, and sometimes with even larger stripes of pages due to the internal firmware sw/hw design.

Sesse__•1mo ago

Also due to error correction issues. Flash is notoriously unreliable, so you get bit errors _all the time_ (correcting errors is absolutely routine). And you can make more efficient error-correcting codes if you are using larger blocks. This is why HDDs went from 512 to 4096 byte blocks as well.

Sesse__•1mo ago

A write-ahead log isn't a performance tool to batch changes, it's a tool to get durability of random writes. You write your intended changes to the log, fsync it (which means you get a 4k write), then make the actual changes on disk just as if you didn't have a WAL.

If you want to get some sort of sub-block batching, you need a structure that isn't random in the first place, for instance an LSM (where you write all of your changes sequentially to a log and then do compaction later)—and then solve your durability in some other way.

throw0101a•1mo ago

> A write-ahead log isn't a performance tool to batch changes, it's a tool to get durability of random writes.

¿Por qué no los dos?

Sesse__•1mo ago

Because it is in addition to your writes, not instead of them. That's what “ahead” points to.

_bohm•1mo ago

The actual writes don’t need to be persisted on transaction commit, only the WAL. In most DBs the actual writes won’t be persisted until the written page is evicted from the page cache. In this sense, writing WAL generally does provide better perf than synchronously doing a random page write

Tostino•1mo ago

Look up how "checkpointing" works in Postgres.

Sesse__•1mo ago

I know how checkpointing works in Postgres (which isn't very different from how it works in most other redo-log implementations). It still does not change that you need to actually update the heap at some point.

Postgres allows a group commit to try to combine multiple transactions to avoid the multiple fsyncs, but it adds delay and is off by default. And even so, it reduces fsyncs, not writes.

Tostino•1mo ago

But it turns those multiplied writes into two more sequential streams of writes. Yeah, it duplicates things, but the purpose is to allow as much sequential IO as possible (along with the other benefits and tradeoffs).

toolslive•1mo ago

you can unify database with write-ahead log using a persistent data structure. It also gives you cheap/free snapshots/checkpoints.

formerly_proven•1mo ago

WALs are typically DB-page-level physical logs, and database page sizes are often larger than the I/O page size or the host page size.

mythoughtsexact•1mo ago

I keep seeing this crap ass design where developers or some product a-hole decides their schema for users needs a field in their table where it shows the last_login_time. It's a shit design, it does not scale, nor is it necessary at all to get that one piece of information.

Use a damned log/journal table if you absolutely must, but make sure it's a fire away and forget non-transactional insert because you morons keep writing software that requires locking the whole users table and that causes huge performance issues.

jiggawatts•1mo ago

Your rant is mostly valid except for one thing: most DBMS systems don’t lock the whole table for a point update. Almost all of them will take a short lived row lock only.

danielfalbo•1mo ago

Reminds me of: Databases on SSDs, Initial Ideas on Tuning (2010) [1]

[1] https://www.dr-josiah.com/2010/08/databases-on-ssds-initial-...

zokier•1mo ago

Author could have started by surveying current state of art instead of just falsely assuming that DB devs have just been resting on the laurels for past decades. If you want to see (relational) DB for SSD just check out stuff like myrocks on zenfs+; it's pretty impressive stuff.

lazide•1mo ago

But then how would they have anything to do?

einpoklum•1mo ago

There has also been some significant academic study of DBMS design for persistent memory - which SSD technology can serve as (e.g. as NVDIMMs or abstractly) : Think of no distinction between primary and secondary storage, RAM and disk - there's just a huge amount of not-terribly-fast memory; and whatever you write to memory never goes away. It's an interesting model.

nextaccountic•1mo ago

> myrocks

anything like this, but for postgres?

actually, is it even possible to write a new db engine for postgres? like mysql has innodb, myisam, etc

evanelias•1mo ago

Postgres's strategy has traditionally been to focus on pluggable indexing methods which can be provided by extensions, rather than completely replacing the core heap storage engine design for tables.

That said, there are a few alternative storage engines for Postgres, such as OrioleDB. However due to limitations in Postgres's storage engine API, you need to patch Postgres to be able to use OrioleDB.

MySQL instead focused on pluggable storage engines from the get-go. That has had major pros and cons over the years. On the one hand, MyISAM is awful, so pluggable engines (specifically InnoDB) are the only thing that "saved" MySQL as the web ecosystem matured. It also nicely forced logical replication to be an early design requirement, since with a multi-engine design you need a logical abstraction instead of a physical one.

But on the other hand, pluggable storage introduces a lot of extra internal complexity, which has arguably been quite detrimental to the software's evolution. For example: which layer implements transactions, foreign keys, partitioning, internal state (data dictionary, users/grants, replication state tracking, etc). Often the answer is that both the server layer and the storage engine layer would ideally need to care about these concerns, meaning a fully separated abstraction between layers isn't possible. Or think of things like transactional DDL, which is prohibitively complex in MySQL's design so it probably won't ever happen.

koverstreet•1mo ago

bcachefs's btree still beats the pants off of the entire rocksdb lineage :)

evanelias•1mo ago

Rocksdb / myrocks is heavily used by Meta at extremely massive scale. For sake of comparison, what's the largest real-world production deployment of bcachefs?

koverstreet•1mo ago

We're talking about database performance here, not deployment numbers. And personally, I don't much care what Meta does, they're not pushing the envelope on reliability anywhere that I know of.

evanelias•1mo ago

Many other companies besides Meta use RocksDB; they're just the largest.

Production adoption at scale is always relevant as a measure of stability, as well as a reflection of whether a solution is applicable to general-purpose workloads.

There's more to the story than just raw performance anyway; for example Meta's migration to MyRocks was motivated by superior compression compared to other alternatives.

loeg•1mo ago

Aren't B-trees and LSM-trees fundamentally different tradeoffs? B-trees will always win in some read-biased workloads, and LSM-trees in other write-biased workloads (with B epsilon (Bε) trees somewhere in the middle).

koverstreet•1mo ago

For on disk data structures, yes.

LSM-trees do really badly at multithreaded update workloads, and compaction overhead is really problematic when there isn't much update locality.

On the other hand, having most of your index be constant lets you use better data structures. Binary search is really bad.

For pure in memory indexes, according to the numbers I've seen it's actually really hard to beat a pure (heavily optimized) b-tree; for in-memory you use a much smaller node size than on disk (I've seen 64 bytes, I'd try 256 if I was writing one).

For on disk, you need to use a bigger node size, and then binary search is a problem. And 4k-8k as is still commonly used is much too small; you can do a lockless or mostly lockless in-memory b-tree, but not if it's persistent, so locking overhead, cache lookups, all become painful for persistent b-trees at smaller node sizes, not to mention access time on cache miss.

So the reason bcachefs's (and bcache's) btree is so fast is that we use much bigger nodes, and we're actually a hybrid compacting data structure. So we get the benefits of LSM-trees (better data structures to avoid binary search for most of a lookup) without the downsides, and having the individual nodes be (small, simple) compacting data structures is what makes big btree nodes (avoiding locking overhead, access time on node traversal) practical.

B-epsilon btrees are dumb, that's just taking the downsides of both - updating interior nodes in fastpaths kills multithreaded performance.

raggi•1mo ago

It may not matter for clouds with massive margins but there are substantial opportunities for optimizing wear.

joek1301•1mo ago

I would think hyperscalers stand to benefit the most from optimizing wear!

loeg•1mo ago

We care about wear to the extent we can get the expected 5 years out of SSDs as a capital asset, but below that threshold it doesn't really matter to us.

ljosifov•1mo ago

Not for SSD specifically, but I assume the compact design doesn't hurt: duckdb saved my sanity recently. Single file, columnar, with builtin compression I presume (given in columnar even simplest compression maybe very effective), and with $ duckdb -ui /path/to/data/base.duckdb opening a notebook in browser. Didn't find a single thing to dislike about duckdb - as a single user. To top it off - afaik can be zero-copy 'overlayed' on the top of a bunch of parquet binary files to provide sql over them?? (didn't try it; wd be amazing if it works well)

dist1ll•1mo ago

Is there more detail on the design of the distributed multi-AZ journal? That feels like the meat of the architecture.

PunchyHamster•1mo ago

> WALs, and related low-level logging details, are critical for database systems that care deeply about durability on a single system. But the modern database isn’t like that: it doesn’t depend on commit-to-disk on a single system for its durability story. Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).

And then a bug crashes your database cluster all at once and now instead of missing seconds, you miss minutes, because some smartass thought "surely if I send request to 5 nodes some of that will land on disk in reasonably near future?".

I love how this industry invents best practices that are actually good then people just invent badly researched reasons to just... not do them.

dist1ll•1mo ago

> "surely if I send request to 5 nodes some of that will land on disk in reasonably near future?"

That would be asynchronous replication. But IIUC the author is instead advocating for a distributed log with synchronous quorum writes.

formerly_proven•1mo ago

But we know this is not actually robust because storage and power failures tend to be correlated. The most recent Jepsen analysis again highlights that it's flawed thinking: https://jepsen.io/analyses/nats-2.12.1

dist1ll•1mo ago

The Aurora paper [0] goes into detail of correlated failures.

> In Aurora, we have chosen a design point of tolerating (a) losing an entire AZ and one additional node (AZ+1) without losing data, and (b) losing an entire AZ without impacting the ability to write data. [..] With such a model, we can (a) lose a single AZ and one additional node (a failure of 3 nodes) without losing read availability, and (b) lose any two nodes, including a single AZ failure and maintain write availability.

As for why this can be considered durable enough, section 2.2 gives an argument based on their MTTR (mean time to repair) of storage segments

> We would need to see two such failures in the same 10 second window plus a failure of an AZ not containing either of these two independent failures to lose quorum. At our observed failure rates, that’s sufficiently unlikely, even for the number of databases we manage for our customers.

[0] https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmo...

PunchyHamster•1mo ago

I believe testing over paper claims

lazide•1mo ago

Happens all the time (the ignores best practices because it’s convenient or ‘just because’ to do something different), literally everywhere including normal society.

Frankly, it’s shocking anything works at all.

sreekanth850•1mo ago

The biggest lie we’ve been told is that databases require global consistency and a global clock. Traditional databases are still operating with Newtonian assumptions about absolute time, while the real world moves according to Einstein’s relativistic theory, where time is local and relative. You dont need global order, you dont need global clock.

dotancohen•1mo ago

That's why we use UUIDv7 primary keys. Relativity be damned, our replication strategy does not depend upon the timestamp factor.

PunchyHamster•1mo ago

Till the financial controller shows up at the very least.

Also even if not required makes reasoning about how systems work a hell lot easier. So for vast majority that doesn't need massive throughtputs sacrificing some speed for easier to understand consistency model is worthy tradeoff

ayende•1mo ago

All financial systems don't care about time.

Prety much all financial transactions are settled with a given date, not instantly. Go sell some stocks, it takes 2 days to actually settle. (May be hidden by your provider, but that how it works).

For that matter, the ultimate in BASE for financial transactions is the humble check.

That is a great example of "money out" that will only be settled at some time in the future.

There is a reason there is this notion of a "business day" and re-processing transactions that arrived out of order.

sreekanth850•1mo ago

The deeper problem isnt global clocks or even strict consistency, it’s the assumption that synchronous coordination is the default mechanism for correctness.That’s the real Newtonian mindset, a belief that serialization must happen before progress is allowed. Synchronous coordination can enforce correctness, but it should not be the only mechanism to achieve it. Physics actually teaches the opposite assumption, time is relative and local, not globally ordered. Yet traditional databases were designed as if absolute time and global serialization were fundamental laws, rather than conveniences.We treat global coordination as inevitable when it’s really just a historical design choice, not a requirement for correctness.

jandrewrogers•1mo ago

You need a clock but you can have more than one. This is an important distinction.

Arbitrating differences in relative ordering across different observer clocks is what N-temporal databases are about. In databases we usually call the basic 2-temporal case “bitemporal”. The trivial 1-temporal case (which is a quasi-global clock) is what we call “time-series”.

The complexity is that N-temporality turns time into a true N-dimensional data type. These have different behavior than the N-dimensional spatial data types that everyone is familiar with, so you can’t use e.g. quadtrees as you would in the 2-spatial case and expect it to perform well.

There are no algorithms in literature for indexing N-temporal types at scale. It is a known open problem. That’s why we don’t do it in databases except at trivial scales where you can just brute-force the problem. (The theory problem is really interesting but once you start poking at it you quickly see why no one has made any progress on it. It hurts the brain just to think about it.)

sreekanth850•1mo ago

Unpopular Opinion: Database were designed for 1980-90 mechanics, the only thing that never innovates is DB. It still use BTree/LSM tree that were optimized for spinning disc. Inefficiency is masked by hardware innovation and speed (Moores Law).

nly•1mo ago

Optimising hardware to run existing software is how you sell your hardware.

The amount of performance you can extract from a modern CPU if you really start optimising cache access patterns is astounding

High performance networking is another area like this. High performance NICs still go to great lengths to provide a BSD socket experience to devs. You can still get 80-90% of the performance advantages of kernel bypass without abandoning that model.

gethly•1mo ago

> The amount of performance you can extract from a modern CPU if you really start optimising cache access patterns is astounding

I think this was one, and I want to emphasise this, of the main points behind Odin programming language.

cmrdporcupine•1mo ago

There's plenty of innovation in DB storage tech, but the hardware interface itself is still page-based.

It turns out that btrees are still efficient for this work. At least until the hardware vendors deign to give us an interface to SSD that looks more like RAM.

Reading over https://www.cs.cit.tum.de/dis/research/leanstore/ and associated papers and follow up work is recommended.

In the meantime with RAM prices sky rocketing, work and research in buffer & page management for greater-than-main-memory-sized DBs is set to be Hot Stuff again.

I like working in this area.

sreekanth850•1mo ago

Btrees are not optimal for SSD, and the only reason we still use them is legacy constraints of page-oriented storage and POSIX block interfaces.We pay a lot of unnecessary write amplification, metadata churn, and small random writes because we’re still force-fitting tree structures into a block device abstraction.

cmrdporcupine•1mo ago

I don't think we're disagreeing. But the issue is at the boundary between software and hardware, which the hardware device manufacturers have dictated, not further up.

sscdotopen•1mo ago

Umbra: A Disk-Based System with In-Memory Performance, CIDR'20

https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf

cmrdporcupine•1mo ago

Yep, and the work on https://www.cs.cit.tum.de/dis/research/leanstore/ that preceded it.

And CedarDB https://cedardb.com/ the more commercialized product that is following up on some of this research, including employing many of the key researchers.

pmontra•1mo ago

A tangent:

> Companies are global, businesses are 24/7

Only a few companies are global, so only a few of them should optimize for those kind of workload. However maybe every startup in SV must aim to becoming global, so probably that's what most of them must optimize for, even the ones that eventually fail to get traction.

24/7 is different because even the customers of local companies, even B2B ones, mighty feel like doing some work at midnight once in a while. They'll be disappointed to find the server down.

evanelias•1mo ago

> Only a few companies are global, so only a few of them should optimize for those kind of workload

A massive number of companies have global customers, regardless of where the company itself has employees.

For example my b2b business is relatively tiny, yet my customer base spans four continents. Or six continents if you count free users!

rafabulsing•1mo ago

(I believe) OP's point is about a company being global relative to amount of users, not just their geography. If you have single digit thousands of users or less, you still don't need those optimizations even if those users are located all around the world.

gethly•1mo ago

> I’d move durability, read and write scale, and high availability into being distributed

So, essentially just CQRS, which is usually handled in the application level with event sourcing and similar techniques.

hyperman1•1mo ago

Postgres allows you to choose a different page size (at initdb time? At compile time?). The default is 8K. I've always wondered if 32K wouldn't be a better value, and this article points in the same direction.

taffer•1mo ago

On the other hand, smaller pages mean that more pages can fit in your CPU cache. Since CPU speed has improved much more than memory bus speed, and since cache is a scarce resource, it is important to use your cache lines as efficiently as possible.

Ultimately, it's a trade-off: larger pages mean faster I/O, while smaller pages mean better CPU utilisation.

wpietri•1mo ago

I should add that the bond between relational databases and spinning rust goes back further. My dad, who started working as a programmer in the 60s with just magtape as storage, talked about the early era of disks as a big step forward but requiring a lot of detailed work to decide where to put the data and how to find it again. For him, databases were a solution to the problems that that disks created for programmers. And I can certainly imagine that. Suddenly you have to deal with way more data stored in multiple dimensions (platter, cylinder, sector) with wildly nonlinear access times (platter rotation, head movement). I can see how commercial solutions to that problem would have been wildly popular, but also build around solving a number of problems that don't matter.

saghm•1mo ago

I'm not sure I totally understand the timeline you're describing, but my understanding is that relational databases themselves were only invented in the 1970s. Is your reference to the 60s just giving context for when he started but before this link happened (with the idea that the problems predated the solution)?

evanelias•1mo ago

Non-relational databases existed in the 60s, and many programmers who worked in the 60s presumably continued working into the 70s, so either way I don't see any problems with the timeline GP mentions.

saghm•1mo ago

Sure, I never claimed that relational databases were the first ones. I was confused because they were trying to explain a specific timeline different from the article, but only mentioned a single time period that didn't seem likely to be what they intended, and I thought it might make sense to clarify what that time period was. I'm sure this isn't the case for everyone, but at least to me, it's surprising not to be explicit in the information they were trying to claim the article glossed over.

formerly_proven•1mo ago

Hierarchical databases, which are often more similar to what we would consider a file system today, predate relational ones by a decade or two.

saghm•1mo ago

The parent comment was pretty explicit in what types of databases it was talking about:

> I should add that the bond between relational databases and spinning rust goes back further.

It's certainly possible I failed to infer the point they were trying to make, but personally I would find it confusing for them to only mention one type of database (relational) and one time period (the 60s) if they actually meant a different type of database or a different time period in their central point of there being a relationship with, in their words, relational databases and spinning rust, in a specific time period other than what the article describes.

formerly_proven•1mo ago

Relational databases didn't exist in the 60s but if you change/strike that one word, the entire comment becomes cromulent.

saghm•1mo ago

There are some circumstances where striking a word pretty clearly changes the intent (e.g. if I removed the word "relational" from your comment in the same way, it would be pretty clearly not what you're trying to say), and there are others where it's fine. If you're capable of figuring this out easily, these types of questions probably don't seem necessary to you, but when I ask questions like the one I did to gain clarify, I'm not being pedantic for it's own sake. Trying harder to understand what others are saying is a genuine attempt to improve communication on my part.

Without going out of my way to infer more intent than you're stating, it's not clear if you're actually saying that you're confident this is what the parent comment was intending or if you're just proposing it's one plausible explanation. I can think of a few other ones as well, and my point in asking was to try to understand which one they actually meant. If I were to go out on a limb and try to infer your actual intent though, it honestly seems like you're intentionally trying to come across as dismissive based on the presumption that I was intentionally trying to be difficult. Having to make this sort of judgment is exactly what I'm trying to avoid in the first place by asking clarifying questions though, so it's a bit jarring to be getting these types of terse responses from someone who wasn't the original commenter I was responding to that don't really help me understand the situation at all and only make me more confused about why someone would take the time to respond to what I said without really addressing my actual question.

toolslive•1mo ago

but... but... SSD/MVMes are not really block devices. Not wrangling them into a block device interface but using the full set of features can already yield major improvements. Two examples: metadata and indexes need smaller granularities compared to data and an NVMe can do this quite naturally. Another example is that the data can be sent directly from the device to the network, without the CPU being involved.

cornholio•1mo ago

You know you need to be careful when an Amazon engineer will argue for a database architecture that fully leverages (and makes you dependent of) the strengths of their employer's product. In particular:

> Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).

This is surely true for certain use cases, say financial applications which must guarantee 100% uptime, but I'd argue the vast, vast majority of applications are perfectly ok with local commit and rapid recovery from remote logs and replicas. The point is, the cloud won't give you that distributed consistency for free, you will pay for it both in money and complexity that in practice will lock you in to a specific cloud vendor.

I.e, make cloud and hosting services impossible to commoditize by the database vendors, which is exactly the point.

amluto•1mo ago

Skipping flushing the local disk seems rather silly to me:

- A modern high end SSD commits faster than the one way time to anywhere much farther than a few miles away. (Do the math. A few tens of microseconds specified write latency is pretty common. NVDIMMs (a sadly dying technology) can do even better. The speed of light is only so fast.

- Unfortunate local correlated failures happen. IMO it’s quite nice to be able to boot up your machine / rack / datacenters and have your data there.

- Not everyone runs something on the scale of S3 or EBS. Those systems are awesome, but they are (a) exceedingly complex and (b) really very slow compared to SSDs. If I’m going to run an active/standby or active/active system with, say, two locations, I will flush to disk in both locations.

adgjlsfhk1•1mo ago

I think it is fair to argue that there is a strong correlation between criticality of data and network scale. Most small buisnesses don't need anything S3 scale, but they also don't need 24 hour uptime, and losing the most recent day of data is annoying rather than catastrophic, so they can probably get away without flushing but with daily asynchronous backups to a different machine and a 1 minute UPS to allow for safe storage in the event of a power outage.

ayende•1mo ago

Committing to NVMe drive properly is really costly. I'm talking using O_DIRECT | OSYNC or fsync here. Can be in the order of whole milliseconds, easily. And it is much worse if you are using cloud systems.

ffsm8•1mo ago

Isn't that why a WAL exists, so you didn't actually need to do that with eg postgres and other rdbms?

SigmundA•1mo ago

You must still commit the WAL to disk, this is why the WAL exists it writes ahead to the log on durable storage. Its doesn't have to commit the main storage to disk only the WAL which is better since its just an append to end rather than placing correctly in the table storage which is slower.

You must have a single flushed write to disk to be durable, but it doesn't need the second write.

Palomides•1mo ago

I just tested the mediocre enterprise nvme I have sitting on my desk (micron 7400 pro), it does over 30000 fsyncs per second (over a thunderbolt adapter to my laptop, even)

hawk_•1mo ago

If you tested this on macos, be careful. The fsync on it lies.

packetlost•1mo ago

fsync on most OSes lie to some degree

Palomides•1mo ago

nope, linux python script that writes a little data and calls os.fsync

ncruces•1mo ago

What's a little data?

In many situations, fsync flushes everything, including totally uncorrelated stuff that might be running on your system.

saltcured•1mo ago

Another complexity here besides syncs per second is the size of the requests and duration of this test, since so many products will have faster cache/buffer layers which can be exhausted. The effect is similar whether this is a "non-volatile RAM" area on a traditional RAID controller, intermediate write zones in a complex SSD controller, or some logging/journaling layer on another volume storage abstraction like ZFS.

It is great as long as your actual workload fits, but misleading if a microbenchmark doesn't inform you of the knee in the curve where you exhaust the buffer and start observing the storage controller as it retires things from this buffer zone to the other long-term storage areas. There can also be far more variance in this state as it includes not just slower storage layers, but more bookkeeping or even garbage-collection functions.

seastarer•1mo ago

It is actually very cheap if done right. Enterprise SSDs have write-through caches, so an O_DIRECT|O_DSYNC write is sufficient, if you set things up so the filesystem doesn't have to also commit its own logs.

rdtsc•1mo ago

> Skipping flushing the local disk seems rather silly to me

It is. Coordinated failures shouldn't be a surprise these days. It's kind of sad to here that from an AWS engineer. Same data pattern fills the buffers and crashes multiple servers, while they were all "hoping" that others fsynced the data, but it turns out they all filled up and crashed. That's just one case there are others.

hawk_•1mo ago

Durability always has an asterisk i.e. guaranteed up to N number of devices failing. Once that N is set, your durability is out the moment those N devices all fail together. Whether that N counts local disks or remote servers.

rdtsc•1mo ago

This is about not even trying durability before returning a result ("Commit-to-disk on a single system is [...] unnecessary") it's hoping that servers won't crash and restart together: some might fail but others will eventually commit. However that assumes a subset of random (uncoordinated) hardware failures, maybe a cosmic ray blasts the ssd controller. That's fine, but it fails to account for coordinated failure where, a particular workload leads to the same overflow scenario on all servers the same. They all acknowledge the writes to the client but then all crash and restart.

zrm•1mo ago

To some extent the only way around that is to use non-uniform hardware though.

Suppose you have each server commit the data "to disk" but it's really a RAID controller with a battery-backed write cache or enterprise SSD with a DRAM cache and an internal capacitor to flush the cache on power failure. If they're all the same model and you find a usage pattern that will crash the firmware before it does the write, you lose the data. It's little different than having the storage node do it. If the code has a bug and they all run the same code then they all run the same bug.

rdtsc•1mo ago

Yeah good point, at least if you wait till you get an acknowledgement for the fsync on N nodes it's already in an a lot better position. Maybe overkill but you can also read the back the data and reverify the checksum. But yeah in general you make a good point, I think that's why some folks deliberately use different drive models and/or raid controllers to avoid cases like that.

amluto•1mo ago

Interestingly, on bare metal or old-school VMs, durability of local storage was pretty good. If the rack power failed, your data was probably still there. Sure, maybe it was only 99% or 99.9%, but that’s not bad if power failures are rare.

AWS etc, in contrast, have really quite abysmal durability for local disk storage. If you want the performance and cost benefits of using local storage (as opposed to S3, EBS, etc), there are plenty of server failure scenarios where the probability that your data is still there hovers around 0%.

bee_rider•1mo ago

This is an aside, but has anyone tried NVDIMMs as the disk, behind in-package HBM for ram? No idea if it would be any good, just kind of a funny thought. It’s like everything shifted one slot closer to the cores, haha, nonvolatile memory where the RAM use to live, memory pretty close to the core.

amluto•1mo ago

I think this entire design approach is on its way out. It turns out that the DIMM protocol was very much designed for volatile RAM, and shoehorning anything else in involves changes through a bunch of the stack (CPU, memory controller, DIMMs), which are largely proprietary and were never intended to support complex devices from other vendors according to published standards. Sure, every CPU and motherboard attempts to work with every appropriately specced DIMM, but that doesn’t mean that the same “physical” bits land in the same DRAM cells if you change your motherboard. Beyond interoperability issues, the entire cache system on most CPUs was always built on the assumption that, if the power fails, the contents of memory do not need to retain any well-defined values. Intel had several false starts trying to build a reliable mechanism to flush writes all the way to the DIMM.

Instead the industry seems to be moving toward CXL for fancy memory-like-but-not-quite-memory. CXL is based on PCIe, and it doesn’t have these weird interoperability and caching issues. Flushing writes all the way to PCIe has never been much of a problem, since basically every PCI or PCIe device ever requires a mechanism by which host software can communicate all the way to the device without the IO getting stalled in some buffer on the way.

pragmatic•1mo ago

Yes, my first thought here was how to build a database that locks you into the cloud vs "for ssds".

beAbU•1mo ago

Not just any old amazon engineer. He's been with Amazon since at least 2008, and he's from Cape Town.

It's very likely that he was part of the team that invented EC2.

ritcgab•1mo ago

SSDs are more of a black box per se. FTL adds another layer of indirection and they are mostly proprietary and vendor-specific. So the performance of SSDs are not generalizable.

exabrial•1mo ago

> Commit-to-disk on a single system is both unnecessary

If you believe this, then what you want already exists. For example: MySQL has in memory tables, but also this design pretty much sounds like NDB.

I don’t think I’d build a database the way they are describing for anything serious. Maybe a social network or other unimportant app where the consequences of losing data aren’t really a big deal.

Havoc•1mo ago

I'm a little bit surprised enterprise isn't sticking to optane for this. It's EoL tech at this point, but it'll still smoke top of the line nvmes for small Q1 which I'd think you'd want for some databases.

dbzero•1mo ago

Please give a try to dbzero. It eliminates the database from the developer's stack completely - by replacing a database with the DISTIC memory model (durable, infinite, shared, transactional, isolated, composable). It's build for the SSD/NVME drive era.

ghqqwwee•1mo ago

I’m a bit disappointed the article doesn’t mention Aerospike. It’s not a rdbms but a kvdb commonly used in adtech, and extremely performant on that use case. Anyway, it’s actually designed for ssds, which makes it possible to persist all writes even when the nic is saturated with write operations. Of course the aggregated bandwidth of the attached ssd hardware needs to be faster than the throughput of the nic, but not much, there’s very little overhead in the software.

CraigJPerry•1mo ago

How does that work? Is that an open source solution like the ZCRX stuff with io uring or does it require proprietary hardware setups? I'm hopeful that the open source solutions today are competitive.

I was familiar with Solarflare and Mellanox zero copy setups in a previous fintech role, but at that time it all relied on black boxes (specifically out of tree kernel modules, delivered as blobs without DKMS or equivalent support, a real headache to live with) that didn't always work perfectly, it was pretty frustrating overall because the customer paying the bill (rightfully) had less than zero tolerance for performance fluctuations. And fluctuations were annoyingly common, despite my best efforts (dedicating a core to IRQ handling, bringing up the kernel masked to another core, then pinning the user space workloads to specific cores and stuff like that) It was quite an extreme setup, GPS disciplined oscillator with millimetre perfect antenna wiring for the NTP setup etc we built two identical setups one in Hong Kong and one in new york. Ah very good fun overall but frustrating because of stack immaturity at that time.

firesteelrain•1mo ago

At first glance this reads like a storage interface argument, but it’s really about media characteristics. SSDs collapse the random vs sequential gap, yet most DB engines still optimize for throughput instead of latency variance and write amplification. That mismatch is the interesting part

ksec•1mo ago

It may be worth pointing out, current highest capacity EDSFF drive offers ~8PB in 1U. That is 320PB per rack, and current roadmaps in 10 years time up to 1000+ PB or 1EB per rack.

Design Database for SSD would still go a very very long way before what I think the author is suggesting which is designing for cloud or datacenter.

adsharma•1mo ago

Re: keeping the relational model

This made sense for product catalogs, employee dept and e-commerce type of use cases.

But it's an extremely poor fit for storing a world model that LLMs are building in an opaque and probabilistic way.

Prediction: a new data model will take over in the next 5 years. It might use some principles from many decades of relational DBs, but will also be different in fundamental ways.

didgetmaster•1mo ago

Back in college (for me the 80s), I learned that storing table data in rows would greatly increase performance due to high seek times on hard disks. SELECT * FROM table WHERE ... could read in the entire row in a single seek. This was very valuable when your table has 100 columns.

However; a different query (e.g. SELECT name, phone_number FROM table) might result in fewer seeks if the data is stored by column instead of by row.

The article only seems to address data structures with respect to indexes, and not for the actual table data itself.

javaunsafe2019•1mo ago

AI slop for sure

grogers•1mo ago

An often underlooked aspect of SSDs for databases is write endurance. With the wrong workload random writes can burn through your drive in months instead of years. This is the hidden gem of LSM over b-tree - low write amplification, not just better write performance. Maybe doesn't really matter on AWS since you can just pitch your instance once you've trashed the SSDs

Al Lowe on model trains, funny deaths and working with Disney

Hoot: Scheme on WebAssembly

First Proof

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Stories from 25 Years of Software Development

Reinforcement Learning from Human Feedback

The Waymo World Model

Start all of your commands with a comma (2009)

France's homegrown open source online office suite

Vocal Guide – belt sing without killing yourself

The AI boom is causing shortages everywhere else

Coding agents have replaced every framework I used

Software factories and the agentic moment

A Fresh Look at IBM 3270 Information Display System

72M Points of Interest

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Learning from context is harder than we thought

Monty: A minimal, secure Python interpreter written in Rust for use by AI

British drivers over 70 to face eye tests every three years

Making geo joins faster with H3 indexes

Hackers (1995) Animated Experience

Sheldon Brown's Bicycle Technical Info

Ga68, a GNU Algol 68 Compiler

Show HN: I spent 4 years building a UI design tool with only the features I use

An Update on Heroku

Show HN: If you lose your memory, how to regain access to your computer?

What Is Ruliology?

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Al Lowe on model trains, funny deaths and working with Disney

Hoot: Scheme on WebAssembly

First Proof

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Stories from 25 Years of Software Development

Reinforcement Learning from Human Feedback

The Waymo World Model

Start all of your commands with a comma (2009)

France's homegrown open source online office suite

Vocal Guide – belt sing without killing yourself

The AI boom is causing shortages everywhere else

Coding agents have replaced every framework I used

Software factories and the agentic moment

A Fresh Look at IBM 3270 Information Display System

72M Points of Interest

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Learning from context is harder than we thought

Monty: A minimal, secure Python interpreter written in Rust for use by AI

British drivers over 70 to face eye tests every three years

Making geo joins faster with H3 indexes

Hackers (1995) Animated Experience

Sheldon Brown's Bicycle Technical Info

Ga68, a GNU Algol 68 Compiler

Show HN: I spent 4 years building a UI design tool with only the features I use

An Update on Heroku

Show HN: If you lose your memory, how to regain access to your computer?

What Is Ruliology?

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

What Does a Database for SSDs Look Like?

Comments