Strictly speaking, the bottleneck was latency, not bandwidth.
Compressing the kernel makes it load into RAM faster, even though it still has to run the decompression step. Why?
Loading from disk into RAM is a bigger bottleneck than decompressing on the CPU.
The same applies to algorithms: always find the largest bottleneck in your chain of dependent steps and apply changes there, since the rest of the pipeline waits on it. Often picking the right algorithm "solves it", but sometimes the bottleneck is something else, like waiting on IO or coordinating across actors (mutexes, if concurrency is done the traditional way).
That's also part of the counterintuitive take that more concurrency brings more overhead, not necessarily faster execution (a topic widely discussed a few years ago around async concurrency and immutable data structures).
Stupid question: why can't we get a syscall to load an entire directory into an array of file descriptors (minus an array of paths to ignore), instead of calling open() on every individual file in that directory? Seems like the simplest solution, no?
Otherwise you can open a directory and pass its fd to openat() together with a path relative to it, to reduce the kernel overhead of resolving absolute paths for each file.
Notice that this avoids `lstat` calls; for symlinks you may still need to do a stat call if you want to stat the target.
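A minimal sketch of that openat() pattern (hypothetical paths, most error handling trimmed):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Open the directory once; later lookups resolve relative to it. */
    int dirfd = open("/some/deep/path/to/dir", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) { perror("open dir"); return 1; }

    /* The kernel only has to resolve the final component per file. */
    int fd = openat(dirfd, "file.txt", O_RDONLY);
    if (fd < 0) { perror("openat"); close(dirfd); return 1; }

    char buf[4096];
    ssize_t n = read(fd, buf, sizeof buf);
    printf("read %zd bytes\n", n);

    close(fd);
    close(dirfd);
    return 0;
}
```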
You mean like a range of file descriptors you could use if you want to save files in that directory?
You could use io_uring, but IMO that API is annoying and I remember hitting limitations. One thing you can do with io_uring is use openat (the op, not the syscall) with the directory fd (which you get from the regular syscall), so you can asynchronously open and read files. However, you couldn't open directories that way for some reason. There's a chance I'm remembering wrong.
https://man7.org/linux/man-pages/man3/io_uring_prep_open.3.h...
https://man7.org/linux/man-pages/man2/readdir.2.html
Note that the prep open man page is a (3) page. You could of course construct the SQEs yourself.
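For reference, a rough liburing sketch of that async openat (build with -luring; the file names are hypothetical, and this doesn't probe the directory-opening limitation mentioned above):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int dirfd = open(".", O_RDONLY | O_DIRECTORY);  /* dir fd from the plain syscall */
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    /* Queue an asynchronous openat relative to the directory fd. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_openat(sqe, dirfd, "file.txt", O_RDONLY, 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    int fd = cqe->res;                 /* new fd on success, -errno on failure */
    io_uring_cqe_seen(&ring, cqe);
    printf("openat result: %d\n", fd);

    if (fd >= 0) close(fd);
    close(dirfd);
    io_uring_queue_exit(&ring);
    return 0;
}
```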
Even if you assume it takes 1µs per mode switch, which would be insanely slow, you'd be looking at roughly 0.3s of the 17s for syscall overhead.
It's not obvious to me where the overhead is, but random seeks are still expensive, even on SSDs.
For example, our integration test suite for a particular service has become quite slow, but it's not particularly clear where the time is going. I suspect a decent amount of it is spent talking to Postgres, but I'd like a low-touch way to profile this.
There are a few challenges here:
- Off-CPU time doesn't get the sampling interrupt that on-CPU profilers use to collect stack traces, so you either instrument a full timeline as threads move on and off the CPU, or periodically walk every thread for its stack trace.
- Applications have many idle threads, and waiting for IO is a common threadpool case, so it's harder to tell a thread waiting on a pool doing delegated IO apart from idle worker-pool threads.
Some solutions:
- I've used Nsight Systems for non-GPU work to visualize off-CPU time on equal footing with on-CPU time.
- `gdb thread apply all bt` is slow but does full call-stack walking. In Python, we have `py-spy dump` for supported interpreters.
- Remember that anything you can represent as call stacks and integers can easily be converted to a flamegraph, e.g. taking strace durations by tid (and maybe fd) and aggregating them into a flamegraph.
https://github.com/golang/go/issues/28739#issuecomment-10426...
https://stackoverflow.com/questions/64656255/why-is-the-c-fu...
https://github.com/valhalla/valhalla/issues/1192
https://news.ycombinator.com/item?id=13628320
Not sure what the root cause is, though.
That said, I'm certainly no expert on filesystems or OS kernels, so I wouldn't know if Linux would perform faster or slower... But it would be very interesting to see a comparison, possibly even with a hypervisor adding overhead.
Of course syscalls suck, slurping the whole file at once always wins, and in this case all files at once.
Kernels suck in general. You don't really need one for high perf and low space.
That said, it wouldn't explain why a MacBook (which should already have the SSD on the fastest, dedicated pathway) would be this slow, unless something else in the OS is the bottleneck.
I think we're just scratching the surface here and there is more to this story waiting to be discovered. But yeah, to get the job done: package it into fewer files for the OS, preload into RAM or use mmap, then profit.
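For the mmap route, a minimal sketch (hypothetical file name, assuming the data has already been packed into one file):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("bundle.bin", O_RDONLY);   /* one big packed file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; the kernel pages it in on demand. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Optionally ask the kernel to read ahead aggressively. */
    madvise(data, st.st_size, MADV_WILLNEED);

    printf("first byte: %d, size: %lld\n", data[0], (long long)st.st_size);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```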
Then I moved to Windows, and Linux. Each had its own idiosyncrasies, like how everything is a file on Linux and you're supposed to write programs by chaining existing executables together; or how, on the desktop, both Win32 and X11 started out with their own versions of UI elements, so X11 or Win32 would know where a 'button' was, and the OS was responsible for event handling and drawing.
Eventually both Windows and Linux programs moved to a model where the OS just gave you the window as a drawing surface, and you were supposed to fill it.
Similarly, all other OS supplied abstractions slowly fell out of use beyond the bare minimum.
Considering this, I wonder if it's time to design a new, much lower-level abstraction, in this case for file systems: a way to mmap an entire directory into the process's address space, where each file is a struct holding a list of pointers to its pages on disk, and each directory is a list of such entries, again stored in a data structure you can access directly. Synchronizing reads and writes would be orchestrated by the kernel somehow (I'm thinking locking/unlocking the pages being written to).
So that way there'd be no difference between traversing an in-memory data structure and reading the disk.
I know this approach isn't super compatible with the async/await style of I/O, but I'm not 100% convinced that's the correct approach either (disk paging is a fundamental feature of all OSes, yet it's essentially inexpressible in programming terms).
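One way to picture that proposal, as purely hypothetical structs rather than any existing API:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical view of a file after mmap-ing a whole directory: a struct
 * whose page pointers the kernel keeps valid, faulting pages in from disk
 * on first access. None of this exists today; it only illustrates the idea. */
struct mapped_file {
    const char *name;
    size_t      size;
    void      **pages;       /* one pointer per on-disk page */
    size_t      page_count;
};

/* A directory is then just a list of such entries. */
struct mapped_dir {
    struct mapped_file *entries;
    size_t              entry_count;
};

/* Traversal looks like walking any in-memory data structure. */
static size_t total_size(const struct mapped_dir *d) {
    size_t sum = 0;
    for (size_t i = 0; i < d->entry_count; i++)
        sum += d->entries[i].size;
    return sum;
}

int main(void) {
    struct mapped_file f = { "README", 4096, NULL, 1 };
    struct mapped_dir  d = { &f, 1 };
    printf("total bytes in dir: %zu\n", total_size(&d));
    return 0;
}
```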
Bring back the "segmented" memory architecture. It was not evil because of segments, but because of segment size. If any segment can be any size the bad aspects fall away.
File handles aren't needed anymore. You open a file, you get back a selector rather than an ID. You reference memory from that selector, the system silently swaps pages in as needed.
You could probably do the same thing with directories but I haven't thought about it.
> You could probably do the same thing with directories but I haven't thought about it.
For example, in the FAT filesystem a directory is just a file with a special flag set in its directory entry, and inside that file there is simply a list of directory entries. Not sure if something so simple would be a good idea, but it certainly works and has worked IRL.
If you follow this model, how do you solve the accessibility issue?
https://199.233.217.201/pub/pkgsrc/distfiles/dictd-1.13.3.ta...
Wikipedia:
"In order to efficiently store dictionary data, dictzip, an extension to the gzip compression format (also the name of the utility), can be used to compress a .dict file. Dictzip compresses file in chunks and stores the chunk index in the gzip file header, thus allowing random access to the data."
So as someone who has watched a long stretch of technological advancement over the years, I can confidently tell you that chips have far outpaced peripheral components.
It's the scenario where compute has to be fast enough anyway to keep up with I/O, so in that sense it always has to be faster; what I'm saying is that it has exceeded those expectations.
marginalia_nu•1w ago
You can still get the compression benefits by serving files with Content-Encoding: gzip or whatever. Though the format has built-in compression, you can simply not use it and apply external compression instead, especially over the wire.
It's pretty widely used, though often dressed up as something else. JAR files or APK files or whatever.
I think the article's complaints about lacking unix access rights and metadata are a bit strange. That seems like a feature more than a bug, as I wouldn't expect this to be something that transfers between machines. I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.
stabbles•1w ago
SquashFS with zstd compression is used by various container runtimes, and is popular in HPC where filesystems often have high latency. It can be mounted natively or with FUSE, and the decompression overhead is not really felt.
1718627440•1w ago
> I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.
I'm the opposite: when I pack and unpack something, I want the files to be identical, including attributes. Why should I throw away all the timestamps just because the files were temporarily in an archive?
rustyhancock•1w ago
If your archive drops it you can't get it back.
If you don't want it you can just chmod -R u=rw,go=r,a-x
1718627440•1w ago
Hence, the common archive format is tar, not zip.
password4321•1w ago
In case anyone is unaware, you don't have to throw away all the timestamps when using "zip with no compression". The metadata for each zipped file includes one timestamp (originally rounded to an even number of seconds, in local time).
I am a big fan of last-modified timestamps and am often discouraged that scp, git, and even many zip utilities are not (at least by default).
nh2•1w ago
ZIP retains timestamps. This makes sense because timestamps are a global concept: consider them an attribute that depends only on the file itself, similar to the file's name.
Owners and permissions, by contrast, also depend on the computer the files are stored on. User "john" might have a different user ID on another computer, or not exist there at all, or be a different John. So there isn't one obvious way to handle them, while there is with timestamps. Archiving tools have to pick a particular way of handling it, so you need to pick the tool that implements the specific way you want.
pwg•1w ago
It does, but unless the zip archive creator being used makes use of the extensions for high-resolution timestamps, the basic ZIP format retains only old MS-DOS style timestamps (rounded to the nearest two seconds). So one may lose some precision in one's timestamps when passing files through a zip archive.
zahlman•1w ago
I'm not aware of standards language mandating it, but build tools generally do compress wheels and sdists.
If you're thinking of zipapps, those are not actually common.
Dylan16807•1w ago
I would expect modified dates to stay the same, and other dates to change, similar to copying a directory. I think this is the normal experience with zip?
For creation dates, Linux usually doesn't even track those at all. There's partial support on BTRFS and ZFS, and on ext4 there's nowhere to store it at all.
marginalia_nu•1w ago
With tar you need to scan the entire file start-to-finish before you know where the data is located, as it's literally a tape archiving format, designed for a storage medium with no random access reads.
st_goliath•1w ago
The format used by `ar` is quite simple, somewhat like tar: files glued together, with a short header in between and no index.
Early Unix eventually introduced a program called `ranlib` that generates an index for libraries (also containing the extracted symbols) and appends it, to speed up linking. The index is simply embedded as a file with a special name.
The GNU version of `ar`, as well as some later Unix descendants, supports doing that directly instead.
zahlman•1w ago
JAR files generally do/did use compression, though. I imagine you could forgo it, but I didn't see it being done. (But maybe that was specific to the J2ME world where it was more necessary?)
paulddraper•1w ago
Maybe non-UNIX machines I suppose.
But I 100% need executable files to be executable.
dwattttt•1w ago
On the other hand, tie the container structure to your OS metadata structure, and your (hopefully good) container format is now stuck with portability issues between other OSes that don't have the same metadata layout, as well as your own OS in the past & future.
paulddraper•1w ago
Just an id,blob format?
The purpose of tar (or competitors) is to serialize files and their metadata.
dwattttt•1w ago
Tar's purpose was to serialise files and metadata in 1979, accounting for tape foibles such as fixed or variable data block size.
hinkley•1w ago
The real one-two punch is make your parser faster and then spend the CPU cycles on better compression.
spwa4•1w ago
Zip has two tricks. First, compression is per-file, allowing extraction of single files without decompressing anything else.
Second, the "directory" is at the end, not the beginning, and the file ends with the offset of the start of the directory. That means two disk seeks (which matter even on SSDs) and you can show the user all the files.
From there you know exactly which bytes belong to which file, and everything's fast. It also means you can lop the directory off the end of the zip and append new files without modifying the rest of the file, which can be extended to allow arbitrary modification of the contents, although you may need to "defragment" the file.
And I believe encryption is also per-file, meaning that to decrypt a file you need both the password and the directory entry. So if you delete a file and rewrite just the directory, the data is unrecoverable, without requiring a total rewrite of the bytes.
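A sketch of how a reader typically finds that trailing directory: read the last chunk of the file, scan backwards for the end-of-central-directory signature, and pull the central-directory offset out of it (hypothetical archive name, little-endian host assumed, error handling for truncated archives omitted):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* End-of-central-directory signature: "PK\x05\x06" (little-endian). */
#define EOCD_SIG 0x06054b50u

int main(void) {
    FILE *f = fopen("archive.zip", "rb");
    if (!f) { perror("fopen"); return 1; }

    fseek(f, 0, SEEK_END);
    long size = ftell(f);

    /* The EOCD record is at most 22 bytes plus a 64 KiB comment from the end. */
    long tail = size < 22 + 65536 ? size : 22 + 65536;
    unsigned char *buf = malloc(tail);
    fseek(f, size - tail, SEEK_SET);
    fread(buf, 1, tail, f);

    /* Scan backwards for the signature (this read is effectively seek #1). */
    for (long i = tail - 22; i >= 0; i--) {
        uint32_t sig;
        memcpy(&sig, buf + i, 4);
        if (sig == EOCD_SIG) {
            uint32_t cd_offset;
            memcpy(&cd_offset, buf + i + 16, 4);  /* central directory offset */
            printf("central directory starts at byte %u\n", cd_offset);
            /* Seek #2 would be: fseek(f, cd_offset, SEEK_SET) and read entries. */
            break;
        }
    }
    free(buf);
    fclose(f);
    return 0;
}
```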
thaumasiotes•1w ago
How do you access a particular file without seeking through the entire file? You can't know where anything is without first seeking through the whole file.
thaumasiotes•1w ago
Where does that begin?
> Read the last block
You mean the last 4KB chunk defined by the file system, or what? The comment can be up to 64KB long.
Dylan16807•1w ago
Okay, the last 65KB.
Are you nitpicking now that you learned about the directory, or did you know about it before your first comment and pretended not to for some reason?