The reason I gave (which was accepted) was that the process of creating a proof of concept and iterating on it rapidly is vastly easier in Python (for me) than it is in Go. In essence, it would have taken me at least a week, possibly more, to write the program I ended up with in Golang, but it only took me a day to write it in Python, and, now that I understand the problem and have a working (production-ready) prototype, it would probably only take me another day to rewrite it in Golang.
Also, a large chunk of the functionality in this Python script seems to be libraries - pillow for image processing, but also pytorch and related vision/audio/codec libraries. Even if similar production-ready Rust crates are available (I'm not sure if they are), this kind of thing is something Python excels at and which these modules are already optimized for. Most of the "work" happening here isn't happening in Python, by and large.
Maybe with micro-kernels we'll finally fix this.
Persistent file systems are essentially key-value stores, usually with optimizations for enumerating keys under a namespace (also known as listing the files in a directory). IMO a big problem with POSIX filesystems is the lack of atomicity and lock guarantees when editing a file. This and a complete lack of consistent networked API are the key reasons few treat file systems as KV stores. It's a pity, really.
"Userspace vs not" is a different argument from "consistency vs not" or "atomicity vs not" or "POSIX vs not". Someone still needs to solve that problem. Sure instead of SQLite over POSIX you could implement POSIX over SQLite over raw blocks. But you haven't gained anything meaningful.
> Persistent file systems are essentially key-value stores
I think this is reductive enough to be equivalent to "a key-value store is a thin wrapper over the block abstraction, as it already provides a key-value interface, which is just a thin layer over taking a magnet and pointing it at an offset".
Persistent filesystems can be built over key-value stores; this is especially common in distributed filesystems. But they can also circumvent a key-value abstraction entirely.
> IMO a big problem with POSIX filesystems is the lack of atomicity
Atomicity requires write-ahead logging + flushing a cache. I fail to see why this needs to be mandatory, when it can be effectively implemented at a higher layer.
> This and a complete lack of consistent networked API
A consistent networked API would require you to hit the metadata server for every operation. No caching. Your system would grind to a halt.
Finally, nothing in the POSIX spec prohibits an atomic filesystem or consistency guarantees. It is just that no one wants to implement these things that way because it overprovisions for one property at the expense of others.
This was an attempt to possibly explain the microkernel point GP made, which only really matters below the FS.
> I think this is reductive enough to be equivalent to "a key-value store is a thin wrapper over the block abstraction, as it already provides a key-value interface, which is just a thin layer over taking a magnet and pointing it at an offset".
I disagree with this premise. Key-value stores are an API, not an abstraction over block storage (though many are or can be configured to be so). File systems are essentially a superset of a KV API with a multitude of "backing stores". Saying KV stores are always backed by blocks is overly reductive, no?
> Atomicity requires write-ahead logging + flushing a cache. I fail to see why this needs to be mandatory, when it can be effectively implemented at a higher layer.
You're confusing durability with atomicity. You don't need a log to implement atomicity, you just need a way to lock one or more entities (whatever the unit of atomic updates is). A CoW filesystem in direct mode (zero page caching) would need neither but could still support atomic updates to file (names).
> A consistent networked API would require you to hit the metadata server for every operation. No caching. Your system would grind to a halt.
Sorry, I don't mean consistent in the ACID context, I mean consistent in the loosely defined API shape context. Think NFS or 9P.
I also disagree with this to some degree: pipelined operations would certainly still be possible and performant, but they would be rather clunky. End-to-end latency for get->update->write, the common mode of operation, would be pretty awful.
> Finally, nothing in the POSIX spec prohibits an atomic filesystem or consistency guarantees. It is just that no one wants to implement these things that way because it overprovisions for one property at the expense of others.
I didn't say it did, but it doesn't require it, which means it effectively doesn't exist as far as users of FS APIs are concerned. Rename is the only operation for which POSIX requires atomicity. And without a CAS-like operation you can't safely implement a lock without several extra syscalls.
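For what it's worth, here's roughly what both workarounds look like in practice. A hedged Python sketch (the function names are mine, not from any library): atomic publication rides on rename, and the closest thing to CAS the classic API gives you is O_CREAT | O_EXCL.

```python
import os

# Atomic publish: write a temp file, then rename over the target.
# POSIX guarantees rename atomicity, so readers see either the old
# or the new content, never a torn write.
def atomic_write(path: str, data: bytes) -> None:
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # durability; a separate concern from atomicity
    os.replace(tmp, path)

# Poor man's lock: create-iff-absent is the one CAS-like primitive
# available, and releasing/recovering it costs extra syscalls.
def try_lock(path: str) -> bool:
    try:
        os.close(os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
        return True
    except FileExistsError:
        return False  # held by someone else; no way to wait except polling
```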
The safest way to put the FS on a level playing field with other interfaces is to make the kernel not know about it, just as it doesn't know about, say, SQL.
Would you store all your ~/ in something like a SQLite database?
Actually yeah that sounds pretty good.
For Desktop/Finder/Explorer you'd just need a nice UI.
Searching Documents/projects/etc would be the same just maybe faster?
All the arbitrary stuff like ~/.npm/**/* would stop cluttering up my ls -la in ~ and could be stored in their own tables whose names I genuinely don't care about. (This was the dream of ~/Library, no?)
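A strawman sketch of what that could look like, just to make it concrete (the schema and all the names are made up):

```python
import sqlite3

# One table for the whole home dir: paths are plain strings, so a
# "directory listing" is just a prefix query.
db = sqlite3.connect("home.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS files (
        path  TEXT PRIMARY KEY,  -- e.g. 'Documents/projects/notes.txt'
        data  BLOB,
        mtime REAL
    )
""")
db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
           ("Documents/notes.txt", b"hello", 0.0))

# 'ls Documents/' becomes:
for (path,) in db.execute(
        "SELECT path FROM files WHERE path LIKE 'Documents/%'"):
    print(path)
db.commit()
```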
[edit] Ooooh, I get it now. This doesn't solve namespacing or traversal.
Almost all of the operations done on actual filesystems are not database-like; they stay close to the underlying hardware for practical reasons. If you want a database view, add one in an upper layer.
There was no query language for updating files, or even for inspecting anything about a file that was not published in the EAs (extended attributes) or implicitly so, as with adapters. There were no multi-file transactions, no joins, nothing. Just rich metadata support in the FS.
Could you provide reference information to support this background assertion? I'm not totally familiar with filesystems under the hood, but at this point doesn't storage hardware maintain an electrical representation relatively independent from the logical given things like wear leveling?
Mature database implementations also bypass a lot of kernel machinery to get closer to the underlying block devices. The layering of DB on top of FS is a failure.
Common usage does this by convention, but that's just sloppy thinking and populist extensional defining. I posit that any rigorous, thought-out, not-overfit intensional definition of a database will, as a matter of course, also include file systems.
- You can reason about block offsets. If your writes are 512B-aligned, you are assured minimal write amplification.
- If your writes are append-only and log-structured, SSD compaction becomes a lot more straightforward.
- No caching guarantees by default. Again, even SSDs cache writes. Block writes are not atomic even with SSDs. The only way to guarantee atomicity is via write-ahead logs.
- The NVMe layer exposes async submission/completion queues, to control the io_depth the device is subjected to, which is essential to get max perf from modern NVMe SSDs. Although you need to use the right interface to leverage it (libaio/io_uring/SPDK).
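A minimal, Linux-only illustration of the alignment point (assumes a 4 KiB logical block size; O_DIRECT rejects misaligned buffers, offsets, or lengths with EINVAL):

```python
import mmap, os

BLOCK = 4096
# O_DIRECT bypasses the page cache but demands that the buffer
# address, file offset, and length all be block-aligned; an anonymous
# mmap gives us a page-aligned buffer for free.
buf = mmap.mmap(-1, BLOCK)
buf[:5] = b"hello"

fd = os.open("data.bin", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    os.pwrite(fd, buf, 0)  # aligned offset, aligned length
finally:
    os.close(fd)
```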
I'll try to give an example. The kernel doesn't currently know about SQL. Instead, you e.g. connect to a socket and start talking to postgres. Imagine if FS stuff worked the same way: you connect to a socket, and then issue various commands to read and write files. Ignore perf for a moment; it works, right?
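Something like this, say (a toy client for an imaginary userspace file server; the line protocol and socket path are invented for illustration):

```python
import socket

# Talk to a hypothetical userspace "file daemon" the way a client
# talks to postgres: connect, send a command, read the reply.
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect("/run/fsd.sock")  # the socket itself still lives in a namespace...
s.sendall(b"READ /etc/motd\n")
print(s.recv(65536).decode())
s.close()
```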
Now, one counter-argument might be: "hold up, what is this socket you need to connect to, isn't that part of a file system? Is there now an all-userspace inner filesystem and a still kernel-supported 'meta filesystem'?" Well, the answer to that is maybe the Unix idea of putting communication channels like pipes and (to a lesser extent) sockets in the filesystem was a bad idea. Or rather, there may be nothing wrong with saying a directory can have a child which is such a communication channel, but there is a problem with saying that every such communication channel should live inside some directory.
1. Distributed filesystems do often use databases for metadata (FoundationDB for 3FS being a recent example)
2. Using a B+ tree for metadata is not much different from having a sorted index
3. Filesystems are a common enough use case that skipping the abstraction complexity to co-optimize the stack is warranted
Does VectorVFS do retrieval, or store embeddings in EXT4?
Is retrieval logic obscured by VectorVFS?
If VectorVFS did retrieval with non-opaque embeddings, how would one debug why a file surfaced?
Let me ask another question: is this intended for production use, or is it more of a research project? Because as a user I care about things like speed, simplicity, flexibility, and robustness.
When BigQuery was still in alpha I had to ingest ~15 billion HTTP requests a day (headers, bodies, and metadata). None of the official tooling was ready, so I wrote a tiny bash script that:
1. uploaded the raw logs to Cloud Storage, and
2. tracked state with three folders: `pending/`, `processing/`, `done/`.
A cron job cycled through those directories and quietly pushed petabytes every week without dropping a byte. Later, Google's own pipelines, and third-party stacks like Logstash, never matched that script's throughput or reliability. Lesson: reach for the filesystem first; add services only once you've proven you actually need them.
[1] https://en.wikipedia.org/wiki/Everything_is_a_file [2] https://en.wikipedia.org/wiki/Unix_philosophy
We were building Reblaze (started 2011), a cloud WAF / DDoS-mitigation platform. Every HTTP request—good, bad, or ugly—had to be stored for offline anomaly-detection and clustering.
Traffic profile
- Baseline: ≈ 15 B requests/day
- Under attack: the same 15 B can arrive in 2-3 hours
Why BigQuery (even in alpha)? It was the only thing that could swallow that firehose and stay query-able minutes later, which is crucial when you're under attack and your data source must not melt down.
Pipeline (all shell + cron)
Edge nodes write JSON logs locally; a local cron pushes them to Cloud Storage
Tiny VM with a cron loop
- Scans `pending/`, composes many small blobs into one “max-size” blob in `processing/`.
- Executes `bq load …` into the customer’s isolated dataset.
- On success, moves the blob to `done/`; on failure, drops it back to `pending/`.
Downstream ML/alerting pulls straight from BigQuery.

That handful of `gsutil`, `bq`, and `mv` commands moved multiple petabytes a week without losing a byte. Later pipelines (Dataflow, Logstash, etc.) never matched its throughput or reliability.
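A condensed sketch of that loop in Python (bucket, dataset, and table names are placeholders; the blob-composition step and the `bq` schema flags are omitted for brevity):

```python
import subprocess

def run_once(bucket: str, dataset: str) -> None:
    # State lives entirely in object prefixes, so a crashed run leaves
    # blobs in pending/ or processing/ to be retried on the next pass.
    pending = subprocess.run(
        ["gsutil", "ls", f"gs://{bucket}/pending/"],
        capture_output=True, text=True).stdout.split()
    for blob in pending:
        work = blob.replace("/pending/", "/processing/")
        subprocess.run(["gsutil", "mv", blob, work], check=True)
        ok = subprocess.run(
            ["bq", "load", "--source_format=NEWLINE_DELIMITED_JSON",
             f"{dataset}.requests", work]).returncode == 0
        # success -> done/, failure -> back to pending/
        dest = work.replace("/processing/", "/done/" if ok else "/pending/")
        subprocess.run(["gsutil", "mv", work, dest], check=True)
```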
I would add that filesystems are superior to data formats (XML, JSON, YAML, TOML) for many use cases such as configuration or just storing data.
- Hierarchy is dirs,
- Keys are file names,
- Values are the contents of the files,
- Other metadata goes in hidden files
It will work forever, and you can leverage ZFS, Git, rsync, or Syncthing much better. If you want, a fancy shell like Nushell will bring the experience pretty close to a database.
Most importantly, you don't need fancy editor plugins or to learn XPath, jq, or yq.
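A minimal reader for that layout, as a sketch (assuming the conventions above: dirs nest, file names are keys, contents are values, hidden files hold metadata):

```python
from pathlib import Path

def load_config(root: Path) -> dict:
    cfg = {}
    for entry in root.iterdir():
        if entry.name.startswith("."):
            continue  # hidden files hold metadata, not config values
        if entry.is_dir():
            cfg[entry.name] = load_config(entry)  # hierarchy -> nesting
        else:
            cfg[entry.name] = entry.read_text().rstrip("\n")
    return cfg

# e.g. load_config(Path("/etc/myapp"))
#   -> {"server": {"port": "8080", "host": "0.0.0.0"}, ...}
```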
1. For config, it spreads the config across a bunch of nested directories, making it hard to read and write without some sort of special tool that shows it all to you at once. Sure, you can edit 50 files from all sorts of directories in your text editor, but that's pretty painful.
2. For data storage, lots of small files will waste partial storage blocks on many file systems. Some do coalesce small files, but many don't.
3. For both, it's often going to be higher performance to read a single file from start to finish than a bunch of files. Most file systems will try to keep a file's blocks in mostly sequential order (defrag'd), whereas they don't typically do that for multiple files in different directories. SSDs make this mostly a non-issue these days, but you still have the overhead of the extra open, close, and read calls.
It really depends how comfortable you are using the shell and which one you use.
cat, tree, sed, grep, etc. will get you quite far, and one might argue that they are simpler to master than vim and various formats. Actually, mastering VSCode also takes a lot of effort.
> 2. For data storage, lots of small files will waste partial storage blocks on many file systems. Some do coalesce small files, but many don't.
> 3. For both, it's often going to be higher performance to read a single file from start to finish than a bunch of files. Most file systems will try to keep a file's blocks in mostly sequential order (defrag'd), whereas they don't typically do that for multiple files in different directories. SSDs make this mostly a non-issue these days, but you still have the overhead of the extra open, close, and read calls.
Agreed, but for most use cases here it really doesn't matter, and if I need to optimise storage I will need a database anyway.
And I sincerely believe that most micro-optimisations at the filesystem level are cancelled out by running most editors with data-format support enabled...
I'm being slightly hypocritical because I've made plenty of use of the filesystem as a configuration store. In code it's quite easy to stat one path relative to a directory, or open it and read it, so it's very tempting.
- hard links (only tar works for backup)
- small file size (or inodes run out before disk space)
http://root.rupy.se

It's very useful for globally distributed real-time data that doesn't need the P in CAP for writes.
(no new data can be created if one node is offline = you can login, but not register)
That obviously has a lot of interesting use cases, but my first assumption was that this could be used to quickly/easily search your filesystem with some prompt like "Play the video from last month where we went camping and saw a flock of turkeys". But that would require having an actual vector DB running on your system which you could use to quickly look up files using an embedding of your query, no?
For example, POSIX tar files have a defined file format that starts with a header struct: https://www.gnu.org/software/tar/manual/html_node/Standard.h...
You can see that at byte offset 257 is `char magic[6]`, which contains `TMAGIC`, which is the byte string "ustar\0". Thus, if a file has the bytes 'ustar\0' at offset 257 we can reasonably assume that it's a tar file. Almost every defined file type has some kind of string of 'magic' predefined bytes at a predefined location that lets a program know "yes, this is in fact a JPEG file" rather than just asserting "it says .jpg so let's try to interpret this bytestring and see what happens".
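In code, the check is as simple as it sounds (a sketch):

```python
# A file is (probably) a POSIX tar archive if the 6 bytes at
# offset 257 read "ustar" followed by a NUL.
def looks_like_tar(path: str) -> bool:
    with open(path, "rb") as f:
        f.seek(257)
        return f.read(6) == b"ustar\x00"
```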
As for how it's similar: I don't think it actually is; I think that's a misunderstanding. The metadata that this vector FS is storing is more than "this is a JPEG" or "this is a Word document", as I understand it, so comparing it to magic(5) is extremely reductionist. I could be mistaken, however.
It's just a tool that can read "magic bytes" to figure out what a file contains. Very different from what VectorVFS is.
I guess what I'm asking is: how does VectorVFS enable search besides iterating through all files and iteratively comparing file embeddings with the embedding of a search query? The project description says "efficient and semantically searchable" and "eliminating the need for external index files or services" but I can't think of any more efficient way to do a search without literally walking the entire filesystem tree to look for the file with the most similar vector.
Edit: reading the docs [1] confirmed this. The `vfs search TERM DIRECTORY` command:
> will automatically iterate over all files in the folder, look for supported files and then embed the file or load existing embeddings directly from the filesystem
[1]: https://vectorvfs.readthedocs.io/en/latest/usage.html#vfs-se...
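For reference, here's roughly what such a scan reduces to. A hedged numpy sketch, where `load_embedding` is a stand-in for however the per-file vectors are read back (the project stores them in the filesystem itself):

```python
import numpy as np
from pathlib import Path

def load_embedding(path: Path) -> np.ndarray:
    ...  # stand-in: read the stored vector for this file

def search(query_vec: np.ndarray, directory: Path, k: int = 5):
    scored = []
    for path in directory.rglob("*"):  # every file is touched: no index
        if not path.is_file():
            continue
        emb = load_embedding(path)
        cos = float(np.dot(query_vec, emb) /
                    (np.linalg.norm(query_vec) * np.linalg.norm(emb)))
        scored.append((cos, path))
    return sorted(scored, reverse=True)[:k]
```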
I'd be surprised if cloud storage services like OneDrive don't already compute some kind of vector embedding for every file you store. But an online web service isn't the same as being built into the core of the OS.
I invented it because I found searching conventional file systems that support extended attributes to be unbearably slow.
Microsoft saw the tech support nightmare this could generate, and abandoned the project.
It was also complex, ran poorly, and would have required developers to integrate their applications with it.
Microsoft had long solved the problem of blobs and metadata in ESE and SharePoint's use of MS SQL for binary + metadata storage.
I want to point out that this isn't suitable for the kinds of things you'd actually use a vector database for. There's no notion of a search index; it's always an O(N) linear search through all of your files: https://github.com/perone/vectorvfs/blob/main/vectorvfs/cli....
Still, fun idea :)
If you gotta gather the data from a lot of different inodes, it is a different story.