Compression: for anything that ends up large, it's probably desirable. But consider both the algorithm and the compression level carefully, based on the use case. Even a simple algorithm can make things faster when it comes time to transfer the data or write it to permanent storage. A high-cost search to squeeze out yet more redundancy is probably worth it for something that will be copied and/or decompressed many times, but might not be worth it for that locally compiled kernel you'll boot at most ten times before replacing it with another.
>This also allows for easy concatenation.
How would it be easier than putting it at the front?
So if you keep the index at the head of the file and it overflows its padding, you may have to rewrite everything that comes after it to push it further down the file. That makes appending an extremely slow operation.
Whereas seeking to end, and then rewinding, is not nearly as costly.
But if you're writing indices, there's a good chance that you do care about performance.
Which is what dbm, bdb, Windows search indexes, IBM datasets, and many, many other standards do.
In theory, files should be just unrolled linked lists (or trees) of bytes, but I guess a lot of internal code still assumes full, aligned blocks.
Have you ever wondered why `tar` is the Tape Archive? Tape. Magnetic recording tape. You stream data to it, and rewinding is Hard, so you put the list of files you just dealt with at the very end. This now-obsolete hardware expectation touches us decades later.
Reading the index from the end of the file is also quick; where you read next depends on what you are trying to find in it, which may not be the start.
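A trailing index is easy to sketch. Here's a minimal example in Python — the layout (raw records, then a JSON index, then an 8-byte footer holding the index offset) is entirely made up for illustration. The point is that appending only rewrites the index and footer, never the records before them:

```python
import json
import os
import struct

FOOTER = struct.Struct("<Q")  # trailing 8 bytes: offset of the JSON index


def read_index(path):
    """Return (index dict, offset where record data ends)."""
    with open(path, "rb") as f:
        f.seek(-FOOTER.size, os.SEEK_END)
        index_end = f.tell()  # index ends where the footer starts
        (index_off,) = FOOTER.unpack(f.read(FOOTER.size))
        f.seek(index_off)
        return json.loads(f.read(index_end - index_off)), index_off


def append_record(path, name, payload):
    """Append one record, then rewrite only the index and footer."""
    if os.path.exists(path):
        index, data_end = read_index(path)
        mode = "r+b"
    else:
        index, data_end = {}, 0
        mode = "w+b"
    with open(path, mode) as f:
        f.seek(data_end)
        index[name] = [data_end, len(payload)]
        f.write(payload)
        index_off = f.tell()
        f.write(json.dumps(index).encode())
        f.write(FOOTER.pack(index_off))
        f.truncate()  # drop any leftover bytes from the old, longer index


def read_record(path, name):
    index, _ = read_index(path)
    off, length = index[name]
    with open(path, "rb") as f:
        f.seek(off)
        return f.read(length)
```

With an index at the front, `append_record` would instead have to shift every record whenever the index outgrew its padding.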
[1] SQLite's own sqlar format is a bad idea for this reason.
It ends up having some overhead compared to established formats, but the ability to query over the attributes of tens of thousands of files is pretty nice, and definitely faster than the worst case of tar.
My archiver could even keep up with 7z in some cases (for size and access speed).
Implementing it is also not particularly tricky, and SQLite even allows streaming the blobs.
Making readers for such a format seems more accessible to me.
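For context, a minimal SQLite-as-archive sketch looks something like this — the schema here is hypothetical (not the real sqlar schema), and the payoff is that attribute queries are just SQL:

```python
import sqlite3
import zlib


def create_archive(path):
    """Open (or create) an archive database with a simple files table."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS files (
        name  TEXT PRIMARY KEY,
        mode  INTEGER,
        mtime INTEGER,
        size  INTEGER,   -- uncompressed size, so queries don't need to inflate
        data  BLOB)""")
    return db


def add_file(db, name, data, mode=0o644, mtime=0):
    db.execute("INSERT OR REPLACE INTO files VALUES (?,?,?,?,?)",
               (name, mode, mtime, len(data), zlib.compress(data)))
    db.commit()


def read_file(db, name):
    row = db.execute("SELECT data FROM files WHERE name=?", (name,)).fetchone()
    return zlib.decompress(row[0])
```

Listing every file over a size threshold is then `SELECT name FROM files WHERE size > ?` — no linear scan through the archive, which is exactly the tar worst case mentioned above.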
> My archiver could even keep up with 7z in some cases (for size and access speed).
7z might feel slow because it enables solid compression by default, which trades decompression speed for compression ratio. I can't imagine 7z having only a similar compression ratio with the correct options, though. Was your input incompressible?
It just may not always be the most performant option. For example, for map tiles there is alternatively the PMTiles binary format, which is optimized for HTTP range requests.
https://shapeof.com/archives/2025/4/acorn_file_format.html
The author notes that an advantage is that other programs can easily read the file format and extract information from it.
> Acorn’s native file format is used to losslessly store layer data, editable text, layer filters, an optional composite of the image, and various metadata. Its advantage over other common formats such as PNG or JPEG is that it preserves all this native information without flattening the layer data or vector graphics.
As I've mentioned, this is a good use case for SQLite as a container. But ZIP would work equally well here.
[1] https://flyingmeat.com/acorn/docs/technotes/ACTN002.html
A friend wanted a newer save viewer/editor for Dragonball Xenoverse 2, because there's about a total of two, and they're slow to update.
I thought it'd be fairly easy to spin up something to read it, because I've spun up a bunch of save editors before, and they're usually trivial.
XV2 save files change over versions. They're also just arrays of structs [0] that don't properly identify themselves, so for some parts of them you're just guessing. Each chunk can also contain chunks - some of which are actually a network request to get more chunks from elsewhere in the codebase!
[0] Also encrypted before dumping to disk, but the keys have been known since about the second release, and they've never switched them.
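Parsing files like that usually comes down to running `struct.unpack` over a byte buffer, chunk by chunk. A sketch with a completely hypothetical record layout (the real XV2 structures aren't public, so the tag, count, and fields here are invented):

```python
import struct

# Hypothetical chunk layout: 4-byte tag, uint32 record count,
# then `count` fixed-size records of (uint32 id, float32 stat).
HEADER = struct.Struct("<4sI")
RECORD = struct.Struct("<If")


def parse_chunk(buf, offset=0):
    """Parse one chunk; returns (tag, records, offset past the chunk)."""
    tag, count = HEADER.unpack_from(buf, offset)
    offset += HEADER.size
    records = []
    for _ in range(count):
        records.append(RECORD.unpack_from(buf, offset))
        offset += RECORD.size
    return tag, records, offset
```

When the structs don't self-identify (as described above), the `tag` check is exactly the part you end up guessing at, version by version.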
- Agreed that human-readable formats have to be dead simple, otherwise binary formats should be used. Note that textual numbers are surprisingly complex to handle, so any formats with significant number uses should just use binary.
- Chunking is generally good for structuring and incremental parsing, but do not expect it to magically provide reorderability or backward/forward compatibility. Unless those are explicitly designed in, they do not exist. Consider PNG, for example: PNG chunks were designed to be quite robust, but nowadays some exceptions [1] do exist. Versioning is much more crucial for that.
[1] https://www.w3.org/TR/png/#animation-information
- Making a new file format from scratch is always difficult. As already mentioned, you should really consider using existing file formats as a container first. Some formats are even explicitly designed for this purpose, like sBOX [2] or RFC 9277 CBOR-labeled data tags [3].
E.g. if you are interested in storing significant amounts of structured floating point data, choosing something like HDF5 will not only make your life easier it will also make it easy to communicate what you have done to others.
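The earlier point about textual numbers being surprisingly complex is easy to demonstrate: a fixed number of decimal digits silently loses precision, while a binary encoding round-trips exactly:

```python
import struct

x = 0.1 + 0.2  # 0.30000000000000004 in IEEE 754 double precision

# Text with a "reasonable" number of digits silently loses information:
assert float(f"{x:.6f}") != x

# A faithful textual round-trip needs repr() (up to 17 significant digits):
assert float(repr(x)) == x

# Binary is exact, fixed-size, and locale-independent:
assert struct.unpack("<d", struct.pack("<d", x))[0] == x
```

And that's before getting into locale decimal separators, leading zeros, exponent forms, and the other parsing edge cases a text format has to specify.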
Is there a reason not to use a lot more characters? If your application's name is MustacheMingle, call the file foo.mustachemingle instead of foo.mumi?
This would decrease the probability of collision to almost zero. I am unaware of any operating system that doesn't allow it, and it would be 100% clear to the user which application the file belongs to.
It will be less aesthetically pleasing than a shorter extension, but that's probably mainly a matter of habit. We're just not used to longer file name extensions.
Any reason why this is a bad idea?
The most popular operating system hides the extension from the user, so clarity would not improve in that case. At least one other (Linux) doesn't really use "extensions" and instead relies on magic headers inside the files to determine the format.
Otherwise I think the decision is largely aesthetic. If you value absolute clarity, then I don't see any reason it won't work; it'll just be a little "ugly".
Mostly for executable files.
I doubt many Linux apps look inside a .py file to see if it's actually a JPEG they should build a thumbnail for.
https://wiki.archlinux.org/title/XDG_MIME_Applications
A lot of apps implement this (including most file managers).
When under pixel pressure, a graphical file manager might choose to prioritize displaying the file extension and truncate only the base filename. This would help the user identify file formats. However, the longer the extension, the less space remains for the base name. So a low-entropy file extension with too many characters can contribute to poor UX.
You could go the whole Java way, then: foo.com.apache.mustachemingle
> Any reason why this is a bad idea
The focus should be on the name, not on the extension.
Some cop-out (not necessarily in a bad way) file formats:
1. Don’t have a file format, just specify a directory layout instead. Example: CinemaDNG. Throw a bunch of particularly named DNGs (a file for each frame of the footage) in a directory, maybe add some metadata file or a marker, and you’re good. Compared to the likes of CRAW or BRAW, you lose in compression, but gain in interop.
2. Just dump runtime data. Example: Mnemosyne’s old format. Do you use Python? Just dump your state as a Python pickle. (Con: dependency on a particular runtime, good luck rewriting it in Rust.)
3. Almost dump runtime data. Example: Anki, or newer Mnemosyne, with their SQLite dumps. (Something suggests to me that they might be using SQLite at runtime.) A step up from a pickle in terms of interop; it somewhat opens the door, for you but also for others, to alternative implementations in any runtime that can read SQLite. I hope that if you use this you don't think the presence of an SQL schema makes the format self-documenting.
4. One or more of the above, except also zip or tar it up. Example: VCV, Anki.
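Options 2 and 3 above can be contrasted in a few lines of Python (the `Deck` class here is a made-up stand-in for real runtime state):

```python
import pickle
import sqlite3


class Deck:
    """Hypothetical runtime state for a flashcard app."""
    def __init__(self):
        self.cards = [{"front": "2+2", "back": "4"}]


# Option 2: dump the runtime object directly. Trivial to write,
# but the file is now coupled to Python and to this exact class.
blob = pickle.dumps(Deck())
restored = pickle.loads(blob)

# Option 3: flatten into SQLite. More work up front, but the result
# is readable from any runtime with an SQLite library.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cards (front TEXT, back TEXT)")
db.executemany("INSERT INTO cards VALUES (?, ?)",
               [(c["front"], c["back"]) for c in restored.cards])
db.commit()
```

Note that the SQLite version still tells a reader nothing about *semantics* (what a "card" means, which fields are required) — hence the warning that a schema alone doesn't make a format self-documenting.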
A similar discussion is delimited fields vs. length prefixes. Delimited fields are nicer to write, but length-prefixed fields are nicer to read. I think most new formats use length prefixes, so I'd start there. I wrote a blog post about combining the value and length into a VLI that also handles floating point and bit/byte strings: https://tommie.github.io/a/2024/06/small-encoding
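A minimal length-prefix framing looks like this — using a plain LEB128-style unsigned varint for the length, not the combined value/length VLI from the post:

```python
def write_uvarint(n):
    """Encode an unsigned int, 7 bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)


def read_uvarint(buf, offset=0):
    """Decode a varint; returns (value, offset past the varint)."""
    result = shift = 0
    while True:
        b = buf[offset]
        offset += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, offset
        shift += 7


def write_field(payload):
    """Frame a payload with its length prefix."""
    return write_uvarint(len(payload)) + payload
```

The reader's advantage is visible here: it learns the payload size before touching the payload, so it can skip fields or preallocate buffers — a delimiter-based reader must scan (and escape) every byte instead.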