In some situations, the “logical” block size can differ. For example, buffered writes go through the page cache, which operates in PAGE_SIZE blocks (usually 4K). Or your RAID stripe size might be misconfigured, stuff like that. Otherwise the logical and device block sizes should be equal for best results.
In general, we want it to be as small as possible!
NVMe drives have at least four "hardware block sizes":
- The LBA size, which determines what size IO transfers the OS must exchange with the drive. It can be re-configured on some drives; usually 512B and 4kB are the options.
- The underlying page size of the NAND flash, which is more or less the granularity of individual read and write operations, usually something like 16kB or more.
- The underlying erase block size of the NAND flash, which comes into play when overwriting data or doing wear leveling, usually several MB.
- The granularity of the SSD controller's Flash Translation Layer, which determines the smallest write the SSD can handle without doing a read-modify-write cycle: usually 4kB regardless of the LBA format selected, but 32kB or more on some special-purpose drives.
And then there's an assortment of hints the drive can provide to the OS about preferred granularity and alignment for best performance, or requirements for atomic operations. These values will generally be a consequence of the above values, and possibly also influenced by the stripe and parity choices the SSD vendor made.
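The part of this that the OS gets to see (the LBA size plus the granularity/alignment hints; the raw NAND page and erase-block sizes usually aren't exposed) can be queried on Linux without vendor tools. A small sketch in C, assuming a device node like /dev/nvme0n1 and permission to open it:

    /* Print the block sizes and IO hints the kernel exposes for a block device.
       Build: cc -o blkinfo blkinfo.c
       Run:   ./blkinfo /dev/nvme0n1   (usually needs root) */
    #include <fcntl.h>
    #include <linux/fs.h>   /* BLKSSZGET, BLKPBSZGET, BLKIOMIN, BLKIOOPT */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s /dev/DEVICE\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        int lba = 0;            /* logical block (LBA) size the OS must use */
        unsigned int phys = 0;  /* physical block size reported by the drive */
        unsigned int io_min = 0, io_opt = 0;  /* granularity hints */

        ioctl(fd, BLKSSZGET, &lba);
        ioctl(fd, BLKPBSZGET, &phys);
        ioctl(fd, BLKIOMIN, &io_min);
        ioctl(fd, BLKIOOPT, &io_opt);

        printf("logical (LBA): %d\nphysical: %u\nminimum IO: %u\noptimal IO: %u\n",
               lba, phys, io_min, io_opt);

        /* Same values, no root needed:
           /sys/block/nvme0n1/queue/{logical_block_size,physical_block_size,
                                     minimum_io_size,optimal_io_size} */
        close(fd);
        return 0;
    }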
If you have bigger files, then having bigger blocks means less fixed overhead from syscalls and NVMe/SATA requests.
If your native device block size is 4KiB and you fetch 512-byte blocks, you need storage-side RAM to hold the smaller blocks and you have to address each block independently. Meanwhile, if you go bigger than the device block size you end up with fewer requests and syscalls. If the requested block size turns out to be too large for the device, the OS can split your large request into smaller, device-appropriate requests, since it knows the hardware characteristics.
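To put rough numbers on the fixed per-request cost, reading a 16 GiB file (the size used in the fio runs discussed below) takes:

    16 GiB / 512 B   = 33,554,432 requests
    16 GiB / 4 KiB   =  4,194,304 requests
    16 GiB / 512 KiB =     32,768 requests

so the small-block case pays the syscall/command overhead tens of millions of times instead of tens of thousands.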
The most difficult case to optimize is the one where you issue many parallel requests to the storage device using asynchronous file IO for latency hiding. In that case, knowing the device's exact block size is important, because you are IOPS-bottlenecked, and a block size close to what the device supports natively means fewer device operations per request.
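For reference, the latency-hiding pattern looks roughly like this with liburing (a minimal sketch, not the library's actual code; the data.bin filename, queue depth of 64 and fixed 4 KiB block size are assumptions):

    /* Keep many reads in flight so the device's queue stays full.
       Build: cc -o randread randread.c -luring
       Assumes data.bin exists and is at least QD*BS bytes. */
    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define QD 64                   /* requests kept in flight */
    #define BS 4096                 /* block size per request */

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }
        off_t file_size = lseek(fd, 0, SEEK_END);

        struct io_uring ring;
        io_uring_queue_init(QD, &ring, 0);

        char *bufs;
        if (posix_memalign((void **)&bufs, 4096, (size_t)QD * BS)) return 1;

        /* Submit QD random block-aligned reads in one batch. */
        for (int i = 0; i < QD; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            off_t off = (off_t)(rand() % (file_size / BS)) * BS;
            io_uring_prep_read(sqe, fd, bufs + (size_t)i * BS, BS, off);
        }
        io_uring_submit(&ring);

        /* Reap completions; a real loop would resubmit here to stay at depth QD. */
        for (int i = 0; i < QD; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            if (cqe->res < 0) fprintf(stderr, "read failed: %d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }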
Just opted to use fixed 512 byte alignment in this library since all decent SSDs I have encountered so far are ok with 512 byte file alignment and 64 byte memory alignment.
This makes the code a bit simpler both in terms of the allocator and the file operations.
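For anyone wondering why the alignment matters at all here: with O_DIRECT the kernel enforces alignment rules on the buffer address, file offset and length (exactly what it accepts varies by kernel and device, which is why the 512-byte/64-byte combination above works on the drives tested). A minimal sketch of a direct read, using 4096 everywhere to stay on the conservative side:

    /* Direct IO bypasses the page cache, so buffer, offset and length all
       have to satisfy the device's alignment requirements or the read
       fails with EINVAL.  Build: cc -o direct_read direct_read.c */
    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ALIGN 4096              /* conservative; many drives accept 512 */

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, ALIGN, ALIGN)) return 1;  /* aligned buffer */

        /* Offset (0 here) and length must also be multiples of ALIGN. */
        ssize_t n = pread(fd, buf, ALIGN, 0);
        if (n < 0) perror("pread");
        else printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }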
There are some config examples here [0], but they would be different for AIO, so you'd need to check the fio documentation to find the corresponding config for AIO.
Their referenced previous post [1] demonstrates ~240,000 IO/s when using basic settings. Even that seems pretty low, but is still more than enough to completely trivialize this benchmark and saturate the hardware IO with zero tuning.
Planning to add random reads with 4K and 512-byte block sizes to the example so I can measure IOPS too.
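Something along these lines should produce those IOPS numbers (a sketch of a fio job file, not taken from the post; the io_uring engine, queue depth and file name are assumptions):

    [randread]
    ; 4K random reads; change bs to 512 for the second data point
    ; use ioengine=libaio instead for AIO
    filename=data.bin
    ioengine=io_uring
    direct=1
    rw=randread
    bs=4k
    iodepth=64
    size=16g
    runtime=30
    time_based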
It didn't make any difference when I was benchmarking with fio, but I didn't use many threads, so I'm not sure.
I added it anyway since I saw this comment, and the io_uring documentation also says it should make a difference.
4.083 / 3.802 = 1.0739
2^30 / 10^9 = 1.0737
I think the same rate was likely achieved but there is confusion between GiB and GB.
Fio seems to interpret `16g` as 16GiB so it creates a 16GiB ~= 17.2GB file. But not sure if it is reading/writing the whole thing.
It seems like the max performance of the SSD is 7GB/s in the spec, so it is kind of confusing.
NVMe drive vendors always market size in GB (or TB) and data rates in GB/s.
I have this config:

    bs=512KB size=16GB

But fio interprets these as KiB and GiB, so this was causing my confusion.
The IOPS and timing are basically identical.
So the output seems fine, but it always interprets the parameters as SI.
Edit: Actually, after looking into it more, it seems like there is a good chance that fio reports GiB and KiB in the output and also does the calculation based on that, but in reality it uses GB/KB, so the measurements are a bit off.
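For scale, the two possible readings of that config differ like this:

    bs=512KB:  512,000 B (SI) vs 524,288 B (binary), ~2.4% apart
    size=16GB: 16,000,000,000 B (SI) vs 17,179,869,184 B (binary), ~7.4% apart

which is the same 2^30 / 10^9 factor as in the throughput numbers above, so whichever way fio actually interprets the input, the reported bandwidth can only be off by that ratio.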
laserbeam•1d ago
I see it’s 0.15.1 in the zon file, but that should also be part of the post somewhere.
ozgrakkurt•14h ago