F3

https://github.com/future-file-format/f3

286•tosh•1h ago

Comments

Arainach•1h ago

This project README is not particularly useful:

It doesn't explain what the project does (a file format for what? Name dropping other things I haven't heard of isn't useful)

There are no examples. It links to a flatbuffer schema which is at least well commented, but is full of deep implementation details.

The point is that within 2-3 minutes I'm not convinced why I care and still don't know enough about what this is to even think back to it if I encounter a scenario in the future where it would be useful.

> designed with efficiency, interoperability, and extensibility in mind. It provides a data organization that rectifies the layout shortcomings of the last-generation formats like Parquet,

This is all marketing speak that says nothing.

> maintaining good interoperability and extensibility (a.k.a future-proof) via embedded Wasm decoders

What does this even mean? Providing a decoder is no guarantee of futureproofness.

adammarples•1h ago

Tabular data, it wants to replace Parquet

largbae•1h ago

This could use a bit more "why".

Shortcomings of Parquet are mentioned as overcome by this, which ones? Certainly not wide tool support...

Why should one leave Parquet or ORC for this structure?

altairprime•58m ago

The ‘why’ is referenced in the bibliography at the end of the readme; this repo is not meant to be consumed standalone. Start with the paper instead:

https://doi.org/10.1145/3749163

skrtskrt•53m ago

Yeah it seems like most of this can be handled by some more dev hours to Parquet

dj_axl•23m ago

Paper mentions Parquet, ORC, Nimble, Lance, TSFile, Bullion, and BtrBlocks.

thisisauserid•1h ago

Great! I'll use it.

In the "future."

Nimble? Lance? Also in the future. Maybe.

I'll use Parquet in the present.

adammarples•1h ago

No commits in 8 months?

yung_lean•40m ago

Yeah this was a research project, it doesn't look like this is getting any adoption

owentbrown•1h ago

Nice! The world can always use a better data format.

I think you might get some traction if you post the advantages over parquet and other files directly on the readme, so that if someone goes to https://github.com/future-file-format/f3 they see why they should try it.

Mention the advantages and post metrics. Cherry pick the metrics! There's probably a good use case for this but, from the current readme, it's not clear who should use this and why.

gavinray•1h ago

This bit is quite genius: rather than depend on a language-specific SDK/lib for working with the formats, you can fall back to exported WASM methods if none exist:

  > "Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable. "

verdverm•58m ago

Is embedding executable code into a file a security risk? My assumption is a yes

mirashii•56m ago

That would be why it chose a VM that is explicitly designed for sandboxing rather than native executable code or similar, the risk can be minimized by reducing the surface area available to that executable code to almost nothing.

gavinray•56m ago

There is no concept of "executable" vs "non-executable" content in a file.

A file is a bag of bytes. You can send those bytes to different things, like a text editor's content-stream, or as the input to a WASM interpreter.

What you decide to do with the bytes in a file is your own prerogative. Each byte is whatever you make of it.

outside1234•53m ago

I mean can't we say the same thing about sending around a .exe though?

krzyk•1h ago

File format for what? Text, graphics, compiled code?

ghkbrew•51m ago

For columnar data storage I think. The description references Parquet and they appear to benchmark against Parquet, Vortex, and Lance.

meindnoch•50m ago

The future.

ChrisArchitect•58m ago

A more descriptive title would be helpful OP:

F3: Open-source data file format for the future

Previous discussion:

2025 https://news.ycombinator.com/item?id=45437759

meta-level•55m ago

Don't know why but I had to think of https://xkcd.com/2116/

drdexebtjl•49m ago

Probably not a good idea to name your project “future” anything, if you expect that future to become the present.

Also, f3 is already “fight-flash-fraud”.

GolDDranks•49m ago

I love the idea, and I developed something similar myself in the past (https://github.com/golddranks/kobuta), but... this reeks of slop. With Rust code, edition="2021" is a dead giveaway.

corvad•48m ago

https://xkcd.com/927/

Groxx•42m ago

Hm. I can kinda see it replacing self-extracting EXEs, but a lot of why you choose specific file formats is for specific features they offer - any self-describing system can fall into "there are too many competing features and nobody handles them all" exactly as easily as any other format.

Like, can this file be efficiently mmap'd? Maybe if it emulates tar internally, but you don't know until you run it. Can it be seeked to specific bytes to only decompress part? It only supports a pre-release version of ISO-36898533 seeking, and your file library dropped support for it 6 years ago. If I rewrite 1MB in the middle, can it only change those pages on disk (and maybe an index), or do I have to rewrite the whole thing? Well the wasm blob supports 97 different APIs for it (there are 35 copies of one with different names), so it's larger than the data (but nobody paid attention to that), so you have 19 options that you recognize, but your CPU's native WASM accelerator only handles two or three so you've still got to specialize your code heavily.

At least with "*.tar.gz" you have some idea of what's possible.

coffeecoders•35m ago

If I am archiving PBs of data for 10+ years, I don't want to rely on a WASM interpreter being available and performant in the future just to read a file. I want a dead-simple, heavily documented byte specification like Parquet.

Additionally, putting the decoding logic inside a WASM binary introduces an active execution layer into what should be cold storage.

0xbadcafebee•30m ago

You don't want to run a custom 10-year-old data parsing function every time you read a single data record?

MoonWalk•34m ago

Is what?

zerobees•27m ago

Some folks described it as genius. My initial reaction is that it's somewhat silly. A major aspect of any data format is what you do with the data once decoded. An audio file represents something completely different than an HTML document, which is completely different from model weights. An embedded VM that converts one bit stream to another solves just a small piece of the puzzle. It doesn't let you play videos with a text editor. So what's the goal?

Backward compatibility doesn't seem like a worthwhile use case, so I guess the main one is forward compatibility: say I come up with a video compression scheme that's better than H.265, but not all platforms support it, so I embed a decoder that would allow me to play it back on legacy hardware. But I guess that also shows the weakness of the idea: it's unlikely that legacy hardware will perform well doing software-only decode for video formats from the future.

It's also a maintenance nightmare: if your decoder has a bug that needs fixing, how do you patch all the files that already embed it?

All in all, it sounds like we're adding megabytes of code and a considerable attack surface to every format parser. It's more opportunities for remote code execution, resource exhaustion attacks, and so on. What do we realistically get in return?

lowbloodsugar•24m ago

>via embedded Wasm decoders

runs screaming

amluto•20m ago

One nice thing about some modern formats is that there are tools that read them at extraordinarily high effective speed. For example, DuckDB can do all manner of nifty optimizations while reading its own native format or Parquet. And I’m not sure that those optimizations can be effectively applied to a format that needs a WASM blob to be run to understand it. By the time you run a non-SIMD or even a SIMD-optimized pass over all the data, if that pass doesn’t understand your query, you may have already lost.

I admit I only skimmed the beginning of the paper, and maybe the format is less general than it sounds.

hahahacorn•19m ago

My understanding is it’s a fallback mechanism

ShinyLeftPad•18m ago

To save a click it's a file format for columnar data specifically (like Parquet), which they very generically named Future-proof File Format. Most of this could fit in the title instead of just "F3" that is a bit of bait.

antisthenes•16m ago

The description mentions shortcomings of the previous file types like parquet, but it isn't really evident to me what those shortcomings are, or if the use cases for parquet and F3 have really that much of an overlap to make this comparison valid in the first place.

anygivnthursday•15m ago

My concern is, if decode fails I need to debug WASM added by some other party maybe containing random bugs. Maybe a library of standard decoders maintained and tested by the project could help, but then not sure if it kills the advantage of the flexibility it provides.

sph•9m ago

I don’t know what people are commenting on. I see a README with little to no information about what this is, what problems it solves, just links to its Flatbuffer description and a directory full of source code.

What context am I missing?

burkaman•6m ago

There is a linked paper: https://dl.acm.org/doi/epdf/10.1145/3749163

jauntywundrkind•7m ago

The wasm decoder thing was also done in Anyblox. https://github.com/AnyBlox

Has nimble/velox had any better luck lately? I forget what stories someone shared, but, it seemed to have such big intent, then real trouble actually getting released. I want to say someone was saying the lawyers ended up not letting a lot of the work get released. Nimble is the one competitor benchmarked against here that beats them, and is also extensible (to some degree?), so I'd love to know how things have gone for the past 6-12 months for nimble/velox.

gavinray•50m ago

Double-clicking an ".exe" (or running it via a shell) is not the same as "bag of bytes", it's "send these bytes to an executable environment".

Doing `head foo.exe` is quite different than `run foo.exe`

If I encode executable instructions in "image.png" and then send them to an interpreter that runs those instructions, the file extension doesn't matter.

bluejekyll•48m ago

.exe has bindings to OS ABI and system calls, WASM doesn’t have this by default, it’s up to the VM to provide whatever environment the WASM executable needs, ideally there should be no system calls, no stdio, just instructions on how to interpret the file format.

jastanton•52m ago

Gotcha, so the vulnerability will be in some common library where attackers force a wasm fallback path with custom wasm instructions that, when executed, do something nefarious.

I'd say at worst it's a setup for poor security

jedberg•51m ago

Sure, but when the standard says "read this file and execute the instructions you find at the beginning" that is more dangerous than "this is a file with data and your program needs to figure out how to read it".

gavinray•48m ago

I guess it's a good thing that the F3 standard does not say "read this file and execute the instructions you find at the beginning", then?

The WASM encoders/decoders are embedded resources that exist as byte offsets in the file metadata, not header info.

jedberg•28m ago

Ok if you want to be pedantic, the standard says, "if you can't read this file, go to the offset and then execute the code you find" which isn't functionally different from what I said.

ratorx•46m ago

There’s a big difference in the expected use of a file. If the file is attacker provided, and the fallback path is being used, the attacker can embed whatever WASM payload they want into the file since the file will be “opened” by “execute this offset into the file”.

Compare that to JSON. The parser NEVER needs to execute arbitrary instructions. Parser might have bugs, but it avoids a whole class of issues.

gavinray•45m ago

  >  the attacker can embed whatever WASM payload they want into the file since the file will be “opened” by “execute this offset into the file”.

And then do what with it?

WASM physically cannot interact with the underlying host or perform I/O -- you need a WASI environment for that.

ratorx•36m ago

Putting aside the WASM sandboxing (I’m not familiar enough with it to understand how sandboxing works) there’s a DoS vector at least. Even regexes have had many DoS issues, and I can’t imagine WASM being easier to sandbox for DoS risk.

7373737373•28m ago

There exist Wasm interpreters capable of limiting the number of instructions executed.
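To illustrate the idea of instruction limiting, here is a minimal sketch of a metered interpreter. It is a toy stack machine, not Wasm, and `run_with_fuel`, `OutOfFuel`, and the opcodes are made up for this example; the only point is that the host charges a unit of "fuel" per instruction and stops the guest when the budget runs out.

```python
class OutOfFuel(Exception):
    """Raised when the guest program exceeds its instruction budget."""


def run_with_fuel(program, fuel):
    """Run a toy stack program, charging one unit of fuel per instruction.

    `program` is a list of (opcode, argument) tuples. This is NOT Wasm,
    only an illustration of instruction metering.
    """
    stack = []
    pc = 0
    while pc < len(program):
        if fuel <= 0:
            raise OutOfFuel(f"budget exhausted at pc={pc}")
        fuel -= 1
        op, arg = program[pc]
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "jmp":
            pc = arg  # unconditional jump; loops are possible
            continue
        else:
            raise ValueError(f"unknown opcode {op!r}")
        pc += 1
    return stack


# A well-behaved program finishes within its budget.
print(run_with_fuel([("push", 2), ("push", 3), ("add", None)], fuel=10))  # [5]

# An infinite loop is cut off instead of hanging the reader.
try:
    run_with_fuel([("jmp", 0)], fuel=1000)
except OutOfFuel as exc:
    print("stopped:", exc)  # stopped: budget exhausted at pc=0
```

A budget like this bounds CPU time per decode call, which addresses the DoS concern above, though picking a budget that legitimate large columns never hit is a separate problem.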

msla•49m ago

> Is embedding executable code into a file a security risk?

Yes, which is why nobody uses PDFs.

nine_k•44m ago

TrueType and OpenType fonts include code executed by a VM to even render them. This wasn't a viable source of attacks so far, due to the properly limited nature of the VMs.

Maybe I would pick the eBPF VM instead, with all its limiting and verifying mechanics.

tedd4u•38m ago

There are many documented, exploited-in-the-wild font-file attacks (one example in [1]). Apple is re-writing their font interpreter specifically to improve security. [2]

[1] https://www.bleepingcomputer.com/news/security/facebook-disc...

[2] https://blakecrosley.com/blog/truetype-hinting-swift-migrati...

cmiles74•38m ago

https://learn.microsoft.com/en-us/security-updates/SecurityB...

> This security update resolves a publicly disclosed vulnerability in Microsoft Windows. The vulnerability could allow remote code execution if a user opens a specially crafted document or visits a malicious Web page that embeds TrueType font files.

> This security update is rated Critical for all supported releases of Microsoft Windows. For more information, see the subsection, Affected and Non-Affected Software, in this section.

> The security update addresses the vulnerability by modifying the way that a Windows kernel-mode driver handles TrueType font files. For more information about the vulnerability, see the Frequently Asked Questions (FAQ) subsection for the specific vulnerability entry under the next section, Vulnerability Information.

jasonjayr•58m ago

So attackers don't have to craft specially corrupted files? They can just include the code to perform the attack in the data file itself?

arcfour•53m ago

Yes...my first thought. No way in hell anyone actually trusts this.

(And as if we didn't trust the compiler enough already!)

Omega359•7m ago

Meh, it's not that bad. Pretty simple to block inline wasm and to use well known external decoders.

doctorpangloss•49m ago

But the WASM runs in the sandbox! It only has access to some files, your display, inputs, ... nothing insecure at all!

gavinray•44m ago

WASM runs in a confined memory space allocated for the program. There is no I/O or host address space access.

You need to run a WASI environment for that.

nine_k•48m ago

Does WASM have built-in I/O? If not, all that a decoder would be able to do is to decode into a buffer.

weinzierl•43m ago

WASM has strong tried and proven sandboxing. We basically can build on nearly 30 years of experience. The decoders don't need a lot of access, they can basically be pure functions.

If this will pan out security-wise I don't know. I'm more worried that it will be so slow that no one will use it. Interesting idea, though, and I can see applications outside of the "big data" realm this apparently targets.

bilekas•35m ago

> The decoders don't need a lot of access, they can basically be pure functions

They don't currently either do they? It's the tight coupling of the interface layer no? I'm not sure this would be faster, or more secure so reliability might be the best usecase?

ok123456•25m ago

How do you prevent compression bomb attacks when files can define their own compression functions?

You could have some kind of OOM killer, but that will be a "footgun" that people who are actually doing "big data" will constantly shoot.

This pretty much kills any ingestion pipeline where the source is untrusted.

johncolanduoni•8m ago

OOM killing in WebAssembly is trivial, since it’s all in a growable linear memory. All the runtimes I’m aware of have a simple maximum memory setting, and they’ll trap any allocation requests after that point.
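A rough sketch of why this is simple, modelled in plain Python rather than a real runtime: the guest's whole heap is one byte array that grows in 64 KiB pages, so the host only has to check one number. `LinearMemory` and `max_pages` are names invented for this example. In core Wasm, `memory.grow` reports failure by returning -1; whether the guest's allocator then aborts or the runtime traps is up to them.

```python
PAGE_SIZE = 64 * 1024  # Wasm linear memory grows in 64 KiB pages


class LinearMemory:
    """Toy model of a growable linear memory with a host-imposed cap."""

    def __init__(self, initial_pages, max_pages):
        self.max_pages = max_pages
        self.data = bytearray(initial_pages * PAGE_SIZE)

    @property
    def pages(self):
        return len(self.data) // PAGE_SIZE

    def grow(self, delta_pages):
        """Return the previous size in pages, or -1 if the cap would be exceeded."""
        old = self.pages
        if delta_pages < 0 or old + delta_pages > self.max_pages:
            return -1  # refused: memory is left unchanged
        self.data.extend(bytes(delta_pages * PAGE_SIZE))
        return old


mem = LinearMemory(initial_pages=1, max_pages=4)
print(mem.grow(2))    # 1  (grew from 1 page to 3)
print(mem.grow(100))  # -1 (would exceed the 4-page cap)
print(mem.pages)      # 3
```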

Retr0id•6m ago

WASM implementations are fairly mature now, but if there was e.g. an image file format with embedded WASM that needed to execute before you could view it, it would become the new low-hanging-fruit target for 0-click RCEs - whether it's exploiting the WASM engine itself or some other attack surface that's influenceable via it (See also, the FORCEDENTRY JBIG2 exploit).

grodes•57m ago

How is wasm better than C bindings?

gavinray•54m ago

Many languages don't have ergonomic experiences for working with C ABI's without explicit wrapper code.

Hell, Node.js didn't even get this ability until LAST MONTH:

https://nodejs.org/en/blog/release/v26.1.0

You'd have to write a second library to interface the C ABI with Node via NAPI just to consume it.

bluejekyll•52m ago

WASM is platform independent.

What do you mean by C bindings? C bindings to what?

andrewstuart2•53m ago

I would call it clever. I'm not sure I'd call it genius.

When I'm working with data I'm working in a specific set of languages. Usually one. Yeah, other people might be working in other languages, but no individual author really needs a language-agnostic way of accessing data beyond compile time. Add to that the likely runtime boundaries that may need to be crossed instead of e.g. inlined by the compiler because it's in-language and dealing with known offsets or tags (depends on the data format of course). To the other commenter's point, am I going to have to sandbox all data access code just to be sure it's not able to do something unexpected? There's a lot of complexity here. And the inherent risk is going to slow down the operation that should be the simplest and fastest: interpreting bytes.

yung_lean•35m ago

A big problem with parquet, which this aims to replace, is that it's hard to add new encodings because everyone wants to stay compatible with old readers. Embedding the decoders in the file as WASM solves this problem since in theory, old readers will be able to read new files by just using the provided WASM to decode a column whose format the reader doesn't recognize.

So this is really about making a file that is forwards compatible in a way that lets you push the standards more than existing formats.
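As a sketch of that fallback logic (hypothetical names throughout, not F3's actual API or file layout): the reader prefers a native decoder when it recognizes the encoding, and only reaches for the decoder shipped with the file when it doesn't. The embedded decoder is a plain Python callable here, standing in for a sandboxed Wasm function.

```python
from itertools import accumulate


def decode_plain(raw):
    """Each byte is one value."""
    return list(raw)


def decode_rle(raw):
    """Expand (count, value) byte pairs."""
    out = []
    for i in range(0, len(raw) - 1, 2):
        out.extend([raw[i + 1]] * raw[i])
    return out


# Encodings this (old) reader knows natively. Hypothetical registry.
NATIVE_DECODERS = {"plain": decode_plain, "rle": decode_rle}


def decode_column(encoding, raw, embedded_decoder=None):
    """Prefer a native decoder; fall back to the one embedded in the file."""
    native = NATIVE_DECODERS.get(encoding)
    if native is not None:
        return native(raw)
    if embedded_decoder is not None:
        # Stand-in for invoking the file's sandboxed Wasm decoder.
        return embedded_decoder(raw)
    raise ValueError(f"no decoder available for encoding {encoding!r}")


# Known encoding: the native path is used.
print(decode_column("rle", bytes([3, 7, 2, 9])))  # [7, 7, 7, 9, 9]

# Encoding invented after this reader shipped: the file brings its own decoder.
delta_decoder = lambda raw: list(accumulate(raw))
print(decode_column("delta", bytes([10, 1, 1, 2]), delta_decoder))  # [10, 11, 12, 14]
```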

rebeccajae•32m ago

It sounds neat, but feels like it might fall apart with higher-complexity formats. What does an embedded decoder for a PDF look like? I guess since they are tightly-coupled to the file bytes themselves, the author of the file gets to choose what formats make sense, but not all formats have a one-true-decode-step.

aseipp•19m ago

Despite the name seemingly implying otherwise, F3 is an alternative to columnar storage formats like Parquet; the goal is not to support every conceivable encoding of every file type such as a PDF. Think of the use cases being more like "What if you used a specialized compressor and need a custom block decompression algorithm" or "Decode internal format into Arrow output" or something like that.

cbm-vic-20•25m ago

Applets redux.
