frontpage.

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
591•klaussilveira•11h ago•173 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
897•xnx•16h ago•544 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
93•matheusalmeida•1d ago•22 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
20•helloplanets•4d ago•13 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
27•videotopia•4d ago•0 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
201•isitcontent•11h ago•24 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
199•dmpetrov•11h ago•91 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
312•vecti•13h ago•136 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
353•aktau•18h ago•176 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
354•ostacke•17h ago•92 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
23•romes•4d ago•3 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
458•todsacerdoti•19h ago•229 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
7•bikenaga•3d ago•1 comment

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
80•quibono•4d ago•18 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
258•eljojo•14h ago•155 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
391•lstoll•17h ago•264 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
53•kmm•4d ago•3 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
231•i5heu•14h ago•177 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
122•SerCe•7h ago•101 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
45•gfortaine•9h ago•13 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
136•vmatsiiako•16h ago•59 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
68•phreda4•11h ago•12 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
271•surprisetalk•3d ago•37 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
25•gmays•6h ago•7 comments

Zlob.h 100% POSIX and glibc compatible globbing lib that is faster and better

https://github.com/dmtrKovalenko/zlob
13•neogoose•4h ago•8 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1043•cdrnsf•20h ago•431 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
171•limoce•3d ago•91 comments

FORTH? Really!?

https://rescrv.net/w/2026/02/06/associative
60•rescrv•19h ago•22 comments

Show HN: Smooth CLI – Token-efficient browser for AI agents

https://docs.smooth.sh/cli/overview
89•antves•1d ago•66 comments

Show HN: ARM64 Android Dev Kit

https://github.com/denuoweb/ARM64-ADK
14•denuoweb•1d ago•2 comments

Faster Index I/O with NVMe SSDs

https://www.marginalia.nu/log/a_123_index_io/
172•ingve•5mo ago

Comments

marginalia_nu•5mo ago
I urge you to read the papers and articles I linked at the end if any of this is your jam. They are incredible bangers, all of them.
6r17•5mo ago
Thanks for sharing this!
kvemkon•5mo ago
> 256 KB vs 512 B

> A counter argument might be that this drives massive read amplification,

For that, one needs to know the true minimal block size the SSD controller is able to physically read from flash. Asking for less than this wouldn't avoid the amplification.

jeffbee•5mo ago
Fun post. One unmentioned parameter is the LBA format being used. Most devices come from the factory configured for 512 B, so you can boot NetWare or satisfy some other dumb compatibility concern. But there isn't a workload from this century where this makes sense, so it pays to explore the performance impact of the LBA formats your device offers. Using a larger one can mean your device manages I/O backlogs more efficiently.
kvemkon•5mo ago
> 128 KB appears a point of diminishing returns, larger block sizes yield similar or worse performance.

Indeed, 128 KB has long been a well-known optimal buffer size [1], [2].

It was only recently (07.04.2024) that it was increased to 256 KB [3].

[1] https://github.com/MidnightCommander/mc/commit/e7c01c7781dcd...

[2] https://github.com/MidnightCommander/mc/issues/2193

[3] https://github.com/MidnightCommander/mc/commit/933b111a5dc7d...

marginalia_nu•5mo ago
I wonder if a more robust option is to peek at the sysfs queue info on Linux.

It has some nice information about hardware I/O operation limits, and also an optimal_io_size hint.

https://www.kernel.org/doc/html/v5.3/block/queue-sysfs.html
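
For illustration, a minimal sketch (not from the article) of reading those sysfs hints, assuming a device named nvme0n1:

    // read_queue_hints.cpp -- print block-queue I/O hints from sysfs
    #include <fstream>
    #include <initializer_list>
    #include <iostream>
    #include <string>

    // Read a single integer from a sysfs attribute; returns -1 if unreadable.
    static long read_sysfs_long(const std::string& path) {
        std::ifstream f(path);
        long value = -1;
        f >> value;
        return value;
    }

    int main() {
        const std::string q = "/sys/block/nvme0n1/queue/";  // adjust device name as needed
        // Note: optimal_io_size is often 0 when the device reports no hint.
        for (const char* attr : {"logical_block_size", "minimum_io_size",
                                 "optimal_io_size", "max_sectors_kb"}) {
            std::cout << attr << ": " << read_sysfs_long(q + attr) << "\n";
        }
    }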

jandrewrogers•5mo ago
This doesn't generalize.

In 2014, the common heuristic was 256kB based on measurements in many systems, so the 128kB value is in line with that. At the time, optimal block sizing wasn't that sensitive to the I/O architecture so many people arrived at the same values.

In 2024, the optimal block size based on measurement largely reflects the quality and design of your I/O architecture. Vast improvements in storage hardware expose limitations of the software design to a much greater extent than a decade ago. As a general observation, the optimal I/O sizing in sophisticated implementations has been trending toward smaller sizes over the last decade, not larger.

The seeming optimality of large block sizes is often a symptom of an I/O scheduling design that can't keep up with the performance of current storage hardware.

marginalia_nu•5mo ago
I think what you're trying to accomplish is a factor here.

If you just want to saturate the bandwidth, to move some coherent blob of data from point A to point B as fast as possible (say you're implementing the `cp` command), then using large buffers is the best and easiest way. Small buffers confer no additional benefit other than driving more complicated designs, forcing io_uring with registered buffers and fds, etc.

If you want to maximize IOPS, then precisely because large buffers already saturate the bandwidth, small buffers are the only viable option; but then you need to whittle down the per-read overhead, and you end up with io_uring or even more specialized tools.

codeaether•5mo ago
Actually, to fully utilize NVMe performance, one really needs to avoid OS overhead by leveraging async I/O such as io_uring. In fact, 4 KB pages work quite well if you can issue enough outstanding requests. See the paper linked below by the TUM folks.

https://dl.acm.org/doi/abs/10.14778/3598581.3598584
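
As a rough illustration of that pattern (not code from the paper or the article), here is a minimal liburing sketch that keeps a batch of 4 KB reads in flight against a hypothetical index file; the file name, queue depth, and offsets are placeholders and error handling is trimmed. Build with something like g++ -O2 read_batch.cpp -luring.

    // read_batch.cpp -- issue a batch of 4 KB reads with io_uring (liburing)
    #include <liburing.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main() {
        constexpr unsigned QUEUE_DEPTH = 64;   // outstanding requests
        constexpr size_t   BLOCK = 4096;       // 4 KB reads

        int fd = open("index.dat", O_RDONLY | O_DIRECT);  // hypothetical index file
        if (fd < 0) { perror("open"); return 1; }

        struct io_uring ring;
        if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) {
            std::fprintf(stderr, "io_uring_queue_init failed\n");
            return 1;
        }

        // One aligned buffer per in-flight request (O_DIRECT needs alignment).
        std::vector<void*> bufs(QUEUE_DEPTH);
        for (auto& b : bufs) b = std::aligned_alloc(BLOCK, BLOCK);

        // Queue QUEUE_DEPTH independent 4 KB reads at different offsets.
        for (unsigned i = 0; i < QUEUE_DEPTH; i++) {
            struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], BLOCK, (unsigned long long)i * BLOCK);
        }
        io_uring_submit(&ring);

        // Reap completions.
        for (unsigned i = 0; i < QUEUE_DEPTH; i++) {
            struct io_uring_cqe* cqe;
            io_uring_wait_cqe(&ring, &cqe);
            if (cqe->res < 0) std::fprintf(stderr, "read failed: %d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        for (auto b : bufs) std::free(b);
        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }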

dataflow•5mo ago
SPDK is what folks who really care about this use, I think.
vlovich123•5mo ago
SPDK requires taking over the device. OP is correct if you want to have a multi-tenant application where the disk is also used for other things.
dataflow•5mo ago
Not an expert on this but I think that's... half-true? There is namespace support which should allow multiple users I think (?), but it does still require direct device access.
vlovich123•5mo ago
Namespaces are a hack device manufacturers came up with to try to make this work anyway. Namespaces at the device level are a terrible idea IMO because it's still not multi-tenant - you're just carving up a single drive into logically separated chunks that you have to decide on up front. So you have to say "application X gets Y% of the drive while application A gets B%". It's an expensive static allocation that's not self-adjusting based on actual dynamic usage.
10000truths•5mo ago
Dynamic allocation implies the ability to shrink as well as grow. How do you envision shrinking an allocation of blocks to which your tenant has already written data that is (naturally) expected to be durable in perpetuity?
vlovich123•5mo ago
You mean something filesystems do as a matter of course? Setting aside resizing, which is also supported through other technologies, I'm not talking about partitioning a drive. You can have different applications sharing a filesystem just fine, with each application's space usage growing or shrinking naturally with actual use. Partitioning and namespaces are similar (namespaces are significantly more static) in that you have to make decisions about the future really early, versus a normal file on a filesystem growing over time.
10000truths•5mo ago
If you're assuming that every tenant's block device is storing a filesystem, then you're not providing your tenant a block device, you're providing your tenant a filesystem. And if you're providing them a filesystem, then you should use something like LVM for dynamic partitioning.

The point of NVMe namespaces is to partition at the block device layer. To turn one physical block device into multiple logical block devices, each with their own queues, LBA space, etc. It's for when your tenants are interacting with the block device directly. That's not a hack, that's intended functionality.

jandrewrogers•5mo ago
The only thing SPDK buys you is somewhat lower latency, which isn't that important for most applications because modern high-performance I/O schedulers usually are not that latency sensitive anyway.

The downside of SPDK is that it is unreasonably painful to use in most contexts. When it was introduced there were few options for doing high-performance storage I/O but a lot has changed since then. I know many people that have tested SPDK in storage engines, myself included, but none that decided the juice was worth the squeeze.

electricshampo1•5mo ago
Depending on the IOPS rate of your app, SPDK can result in less CPU time spent issuing I/O and reaping completions compared to, e.g., io_uring.

See https://www.vldb.org/pvldb/vol16/p2090-haas.pdf ("What Modern NVMe Storage Can Do, And How To Exploit It: High-Performance I/O for High-Performance Storage Engines") for actual data on this.

Of course, if your block size is large enough and/or your design is batching enough that you already don't spend much time issuing I/O and reaping completions, then as you say, SPDK will not provide much of a gain.

__turbobrew__•5mo ago
I believe seastar uses it and that is the base of scylladb storage engine: https://seastar.io/

I believe the next generation ceph OSD is built on seastar as well: https://docs.ceph.com/en/reef/dev/crimson/crimson/

With something like Ceph, latency is everything, as writes need to be synchronously committed to each OSD replica before the writing client is unblocked. I think for Ceph they are trying to move to NVMe-oF to basically bypass the OS for remote NVMe access. I'm not sure how this will work security-wise, however, as you cannot just have any node on the network reading and writing random blocks of NVMe-oF devices.

lossolo•5mo ago
> I believe seastar uses it and that is the base of scylladb storage engine: https://seastar.io/

They use DPDK (optionally) for network IO, not SPDK.

marginalia_nu•5mo ago
In the problem domain of index lookups, issuing multiple requests at the same time is not possible, except as part of some entirely guess-based readahead scheme that may indeed drive up disk utilization but is unlikely to do much else. Large blocks are a solution with that constraint as a given.

That paper seems to mostly focus on throughput via concurrent independent queries, rather than single-query performance. It's arriving at a different solution because it's optimizing for a different variable.

Veserv•5mo ago
Large block reads are just a readahead scheme where you prefetch the next N small blocks. So you are just stating that contiguous readahead is close enough to arbitrary readahead especially if you tune your data structure appropriately to optimize for larger regions of locality.
marginalia_nu•5mo ago
Well I mean yes, you can use io_uring to read the 128 KB blocks as 32 4 KB blocks, but that's a very roundabout way of doing it that doesn't significantly improve your performance, since with either method the operation time is more or less the same. If a 128 KB read takes roughly the same time as a 4 KB read, 32 parallel 4 KB reads aren't going to be faster with io_uring.

Also, an index with larger block sizes is not equivalent to a structure with smaller block sizes plus readahead. The index structure is not the same: having larger coherent blocks gives you better precision in your indexing structure for the same number of total forward pointers. Since there's no need to index within each 128 KB block, the forward-pointer resolution that would have gone to distinguishing between 4 KB blocks can instead help you rapidly find the next relevant 128 KB block.
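
As a hypothetical back-of-the-envelope illustration of that pointer-resolution point (not the actual Marginalia index layout): with a fixed budget of forward pointers per index page, each page spans proportionally more posting data when the data blocks are larger.

    #include <cstdio>

    int main() {
        // Hypothetical figures, purely for illustration.
        constexpr long PTRS_PER_PAGE = 4096;        // forward pointers per index page
        constexpr long SMALL_BLOCK   = 4   * 1024;  // 4 KB data blocks
        constexpr long LARGE_BLOCK   = 128 * 1024;  // 128 KB data blocks

        // Each pointer addresses one data block, so the data span covered by
        // a single index page grows linearly with the block size.
        std::printf("span per index page, 4 KB blocks:   %ld MB\n",
                    PTRS_PER_PAGE * SMALL_BLOCK / (1024 * 1024));   // 16 MB
        std::printf("span per index page, 128 KB blocks: %ld MB\n",
                    PTRS_PER_PAGE * LARGE_BLOCK / (1024 * 1024));   // 512 MB
    }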

throwaway81523•5mo ago
In most search engines the top few tree layers are in a RAM cache and can also hold disk addresses for the next levels. So maybe that can let you start some concurrent requests.
ozgrakkurt•5mo ago
4 KB is much slower than 512 KB if you are using the whole data. Smaller should be better if there is read amplification.
mgerdts•5mo ago
> Modern enterprise NVMe SSDs are very fast…. This is a simple benchmark on a Samsung PM9A1 on a with a theoretical maximum transfer rate of 3.5 GB/s. … It should be noted that this is a sub-optimal setup that is less powerful than what the PM9A1 is capable of due to running on a downgraded PCIe link.

Samsung has client, datacenter, and enterprise lines. The PM9A1 is part of the OEM client segment and is about the same as a 980 Pro. Its top speeds (about 7 GB/s read, 5 GB/s write) are better than those of the comparable datacenter-class drive, the PM9A3. Those top speeds come with less consistent performance than you get with a PM9A3 or an enterprise drive like a PM1733 from the same era (early PCIe Gen 4 drives).

dataflow•5mo ago
Beginner(?) question: why is the model

  map<term_id, 
      list<pair<document_id, positions_idx>>
     > inverted_index;
and not

  map<term_id, 
      map<document_id, list<positions_idx>>
     > inverted_index;
(or using set<> in lieu of list<> as appropriate)?
marginalia_nu•5mo ago
This is meant to be metaphorical, to give a mental model for the actual data structures on disk, so there's some tradeoff in finding the most accurate metaphor for what is happening.

I actually think you are right, list<pair<...>> is a bit of a weird choice that doesn't convey the data structures quite well. Map is better.

The most accurate thing would probably be something like map<term_id, map<document_id, pair<document_id, positions_idx>>>, but I corrected it to just a map<document_id, positions_idx> to avoid making things too confusing.

sour-taste•5mo ago
Currently it looks like this:

    map<term_id, 
      map<pair<document_id, positions_idx>>
      inverted_index;
list<positions> positions;

Think you also meant to remove the pair in map<pair>?

marginalia_nu•5mo ago
Haha, apparently very hard to get this right. Fixed again.