Shortcomings of Parquet are mentioned as being overcome by this, but which ones? Certainly not wide tool support...
Why should one leave Parquet or ORC for this structure?
In the "future."
Nimble? Lance? Also in the future. Maybe.
I'll use Parquet in the present.
I think you might get some traction if you post the advantages over Parquet and other formats directly in the README, so that if someone goes to https://github.com/future-file-format/f3 they see why they should try it.
Mention the advantages and post metrics. Cherry pick the metrics! There's probably a good use case for this but, from the current readme, it's not clear who should use this and why.
> "Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable."
A file is a bag of bytes. You can send those bytes to different things, like a text editor's content-stream, or as the input to a WASM interpreter.
What you decide to do with the bytes in a file is your own prerogative. Each byte is whatever you make of it.
F3: Open-source data file format for the future
Previous discussion:
Also, f3 is already “fight-flash-fraud”.
Like, can this file be efficiently mmap'd? Maybe if it emulates tar internally, but you don't know until you run it. Can it be seeked to specific bytes to only decompress part? It only supports a pre-release version of ISO-36898533 seeking, and your file library dropped support for it 6 years ago. If I rewrite 1MB in the middle, can it only change those pages on disk (and maybe an index), or do I have to rewrite the whole thing? Well the wasm blob supports 97 different APIs for it (there are 35 copies of one with different names), so it's larger than the data (but nobody paid attention to that), so you have 19 options that you recognize, but your CPU's native WASM accelerator only handles two or three so you've still got to specialize your code heavily.
At least with "*.tar.gz" you have some idea of what's possible.
Additionally, putting the decoding logic inside a WASM binary introduces an active execution layer into what should be cold storage.
Backward compatibility doesn't seem like a worthwhile use case, so I guess the main one is forward compatibility: say I come up with a video compression scheme that's better than H.265, but not all platforms support it, so I embed a decoder that would allow me to play it back on legacy hardware. But I guess that also shows the weakness of the idea: it's unlikely that legacy hardware will perform well doing software-only decode for video formats from the future.
It's also a maintenance nightmare: if your decoder has a bug that needs fixing, how do you patch all the files that already embed it?
All in all, it sounds like we're adding megabytes of code and a considerable attack surface to every format parser. It's more opportunities for remote code execution, resource exhaustion attacks, and so on. What do we realistically get in return?
runs screaming
I admit I only skimmed the beginning of the paper, and maybe the format is less general than it sounds.
What context am I missing?
Has nimble/velox had any better luck lately? I forget what stories someone shared, but, it seemed to have such big intent, then real trouble actually getting released. I want to say someone was saying the lawyers ended up not letting a lot of the work get released. Nimble is the one competitor benchmarked against here that beats them, and is also extensible (to some degree?), so I'd love to know how things have gone for the past 6-12 months for nimble/velox.
Doing `head foo.exe` is quite different than `run foo.exe`
If I encode executable instructions in "image.png" and then send them to an interpreter that runs those instructions, the file extension doesn't matter.
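To make the "extension doesn't matter" point concrete, here is a minimal sketch. It assumes a harmless one-line Python payload and a temporary file named image.png (both made up for illustration), and shows the same bytes being inspected in one case and interpreted in the other.

```python
import os
import tempfile

# Hypothetical payload: valid Python source stored under a misleading name.
PAYLOAD = b"result = 6 * 7\n"


def write_payload(directory: str) -> str:
    """Write the payload to a file called image.png and return its path."""
    path = os.path.join(directory, "image.png")
    with open(path, "wb") as f:
        f.write(PAYLOAD)
    return path


def inspect(path: str, n: int = 6) -> bytes:
    """The `head` case: look at the first n bytes, execute nothing."""
    with open(path, "rb") as f:
        return f.read(n)


def interpret(path: str) -> dict:
    """The `run` case: hand the very same bytes to an interpreter."""
    with open(path, "rb") as f:
        raw = f.read()
    namespace: dict = {}
    exec(compile(raw, path, "exec"), namespace)
    return namespace


with tempfile.TemporaryDirectory() as tmp:
    png = write_payload(tmp)
    first_bytes = inspect(png)
    executed = interpret(png)
```

The file name never enters the decision. What matters is which consumer the bytes are handed to.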
I'd say at worst it's a setup for poor security
The WASM encoders/decoders are embedded resources that exist as byte offsets in the file metadata, not header info.
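A rough sketch of what "embedded resources at byte offsets" can look like. The layout below (magic value `TOY1`, a fixed 36-byte footer, the field order) is entirely hypothetical and is not F3's actual format; it only illustrates that a reader can locate the data without touching, let alone executing, the decoder blob.

```python
import struct

MAGIC = b"TOY1"  # hypothetical magic value, not F3's
# Footer fields: magic, data offset, data length, decoder offset, decoder length
FOOTER = struct.Struct("<4sQQQQ")


def pack(data: bytes, decoder: bytes) -> bytes:
    """Lay the file out as [data][decoder][footer]."""
    footer = FOOTER.pack(MAGIC, 0, len(data), len(data), len(decoder))
    return data + decoder + footer


def locate(blob: bytes) -> dict:
    """Read only the fixed-size footer and return the byte ranges it names."""
    body_len = len(blob) - FOOTER.size
    magic, d_off, d_len, c_off, c_len = FOOTER.unpack(blob[-FOOTER.size:])
    if magic != MAGIC:
        raise ValueError("not a toy container")
    # Never trust offsets from the file: check they stay inside the body.
    if d_off + d_len > body_len or c_off + c_len > body_len:
        raise ValueError("footer points outside the file")
    return {"data": (d_off, d_len), "decoder": (c_off, c_len)}


def read_data(blob: bytes) -> bytes:
    """Fetch the data range; the decoder bytes are never read or run."""
    off, length = locate(blob)["data"]
    return blob[off:off + length]


blob = pack(b"hello", b"\x00asm")
ranges = locate(blob)
```

In this sketch the decoder is just another byte range. Whether it ever gets executed is a decision made by the reading library, not by the file.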
Compare that to JSON. The parser NEVER needs to execute arbitrary instructions. Parser might have bugs, but it avoids a whole class of issues.
> the attacker can embed whatever WASM payload they want into the file since the file will be “opened” by “execute this offset into the file”.
And then do what with it? WASM physically cannot interact with the underlying host or perform I/O -- you need a WASI environment for that.
Yes, which is why nobody uses PDFs.
Maybe I would pick the eBPF VM instead, with all its limiting and verifying mechanics.
[1] https://www.bleepingcomputer.com/news/security/facebook-disc...
[2] https://blakecrosley.com/blog/truetype-hinting-swift-migrati...
> This security update resolves a publicly disclosed vulnerability in Microsoft Windows. The vulnerability could allow remote code execution if a user opens a specially crafted document or visits a malicious Web page that embeds TrueType font files.
> This security update is rated Critical for all supported releases of Microsoft Windows. For more information, see the subsection, Affected and Non-Affected Software, in this section.
> The security update addresses the vulnerability by modifying the way that a Windows kernel-mode driver handles TrueType font files. For more information about the vulnerability, see the Frequently Asked Questions (FAQ) subsection for the specific vulnerability entry under the next section, Vulnerability Information.
(And as if we didn't trust the compiler enough already!)
You need to run a WASI environment for that.
If this will pan out security-wise I don't know. I'm more worried that it will be so slow that no one will use it. Interesting idea, though, and I can see applications outside of the "big data" realm this apparently targets.
They don't currently either, do they? It's the tight coupling of the interface layer, no? I'm not sure this would be faster or more secure, so reliability might be the best use case?
You could have some kind of OOM killer, but that will be a "footgun" that people who are actually doing "big data" will constantly shoot.
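The budget idea can be sketched with an ordinary decompressor instead of a Wasm runtime. This is an analogy only: `bounded_decompress` and `DecodeBudgetExceeded` are made-up names, and zlib stands in for whatever decoder a file might embed.

```python
import zlib


class DecodeBudgetExceeded(Exception):
    """Raised when a decoder tries to produce more output than allowed."""


def bounded_decompress(payload: bytes, max_output: int) -> bytes:
    """Decompress payload, refusing to produce more than max_output bytes."""
    if max_output < 0:
        raise ValueError("max_output must be non-negative")
    decoder = zlib.decompressobj()
    # Ask for one byte more than the budget so "exactly at the limit"
    # can be told apart from "over the limit".
    out = decoder.decompress(payload, max_output + 1)
    if len(out) > max_output:
        raise DecodeBudgetExceeded(f"output exceeds {max_output} bytes")
    return out


small = zlib.compress(b"abc")
bomb = zlib.compress(b"\0" * 10_000_000)  # about 10 MB from a few KB of input

ok = bounded_decompress(small, 10)
try:
    bounded_decompress(bomb, 1_000_000)
    bomb_rejected = False
except DecodeBudgetExceeded:
    bomb_rejected = True
```

The limit has to come from the caller, which is exactly the footgun: set it low and legitimate large inputs fail, set it high and it stops protecting anything.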
This pretty much kills any ingestion pipeline where the source is untrusted.
Hell, Node.js didn't even get this ability until LAST MONTH:
https://nodejs.org/en/blog/release/v26.1.0
You'd have to write a second library to interface the C ABI with Node via NAPI just to consume it.
What do you mean by C bindings? C bindings to what?
When I'm working with data I'm working in a specific set of languages. Usually one. Yeah, other people might be working in other languages, but no individual author really needs a language-agnostic way of accessing data beyond compile time. Add to that the likely runtime boundaries that may need to be crossed instead of e.g. inlined by the compiler because it's in-language and dealing with known offsets or tags (depends on the data format of course). To the other commenter's point, am I going to have to sandbox all data access code just to be sure it's not able to do something unexpected? There's a lot of complexity here. And the inherent risk is going to slow down the operation that should be the simplest and fastest: interpreting bytes.
So this is really about making a file that is forwards compatible in a way that lets you push the standards more than existing formats.
Arainach•1h ago
It doesn't explain what the project does (a file format for what? Name dropping other things I haven't heard of isn't useful)
There are no examples. It links to a flatbuffer schema which is at least well commented, but is full of deep implementation details.
The point is that within 2-3 minutes I'm not convinced why I should care, and I still don't know enough about what this is to even think back to it if I encounter a scenario in the future where it would be useful.
> designed with efficiency, interoperability, and extensibility in mind. It provides a data organization that rectifies the layout shortcomings of the last-generation formats like Parquet,
This is all marketing speak that says nothing.
> maintaining good interoperability and extensibility (a.k.a future-proof) via embedded Wasm decoders
What does this even mean? Providing a decoder is no guarantee of futureproofness.
adammarples•1h ago