Unix "find" expressions compiled to bytecode

https://nullprogram.com/blog/2025/12/23/

125•rcarmo•1mo ago

Comments

tasty_freeze•1mo ago

That is a fun exercise, but I imagine the time to evaluate the conditional expression is a tiny fraction, just a percent or less, than the time it takes to make the file system calls.

CerryuDu•1mo ago

... not to mention the time it takes to load directory entries and inodes when the cache is cold.

nasretdinov•1mo ago

For many cases you don't even need to make stat() call to determine whether or not the file is a directory (d_type specifically can tell it: https://man7.org/linux/man-pages/man3/readdir.3.html). That's what allows find(1) to be so quick

loeg•1mo ago

You could imagine determining from the parsed expression whether or not stat'ing was required.

NFS has readdirplus, but I don't think it ever made its way into Linux/POSIX. (Some filesystems could efficiently return dirents + stat information.)

nasretdinov•1mo ago

> readdirplus

Well, it definitely does _something_, because on NFS the subsequent stat() calls after reading the directory names do indeed complete instantly :), at least in my testing.

loeg•1mo ago

I mean, readdirplus as a local filesystem API. Ultimately unix programs are just invoking getdents() (or equivalent) + stat() (or statx, whatever). Linux nfsclient probably caches the result of readdirplus for subsequent stat.

drob518•1mo ago

From the article:

> I was later surprised all the real world find implementations I examined use tree-walk interpreters instead.

I’m not sure why this would be surprising. The find utility is totally dominated by disk IOPS. The interpretation performance of find conditions is totally swamped by reading stuff from disk. So, keep it simple and just use a tree-walk interpreter.

chubot•1mo ago

Yeah that's basically what was discussed here: https://lobste.rs/s/xz6fwz/unix_find_expressions_compiled_by...

And then I pointed to this article on databases: https://notes.eatonphil.com/2023-09-21-how-do-databases-exec...

Even MySQL, Duck DB, and Cockroach DB apparently use tree-walking to evaluate expressions, not bytecode!

Probably for the same reason - many parts are dominated by I/O, so the work on optimization goes elsewhere

And MySQL is a super-mature codebase

drob518•1mo ago

I was just reading a paper about compiling SQL queries (actually about a fast compilation technique that allows for full compilation to machine code that is suitable for SQL and WASM): https://dl.acm.org/doi/pdf/10.1145/3485513

Sounds like many DBs do some level of compilation for complex queries. I suspect this is because SQL has primitives that actually compute things (e.g. aggregations, sorts, etc.). But find does basically none of that. Find is completely IO-bound.

hxtk•1mo ago

Virtually all databases compile queries in one way or another, but they vary in the nature of their approaches. SQLite for example uses bytecode, while Postgres and MySQL both compile it to a computation tree which basically takes the query AST and then substitutes in different table/index operations according to the query planner.

SQLite talks about the reasons for each variation here: https://sqlite.org/whybytecode.html

drob518•1mo ago

Thanks for the reference.

Someone•1mo ago

Is it truly simpler to do that? A separate “command line to byte codes” module would be way easier to test than one that also does the work, including making any necessary syscalls.

Also, decreasing CPU usage many not speed up find (much), but it would leave more time for running other processes.

drob518•1mo ago

If it was easier to interpret byte codes, nobody would use a tree-walk interpreter. There’s no performance reason to use a tree-walk interpreter. They all do it because it’s easy. You basically already have the expression in tree form, regardless of where you end up. So, stop processing the tree and just interpret it.

Someone•1mo ago

> If it was easier to interpret byte codes

I’m not claiming anything, certainly not that; I’m questioning whether the additional complexity of designing a bytecode VM is worth it because it makes testing easier.

maxbond•1mo ago

File operations are a good candidate for testing with side effects since they ship with every OS and are not very expensive in a tmpfs, but you don't have to let it perform side effects. You could pass the eval function a delegate which it calls methods on to perform side effects and pass in a mocked delegate during testing.

adrian_b•1mo ago

The assumption that "find" performance is dominated by disk IOPS is not generally valid.

For instance, I normally compile big software projects in RAM disks (Linux tmpfs), because I typically use computers with no less than 64 GB of DRAM.

Such big software projects may have very great numbers of files and subdirectories and their building scripts may use "find".

In such a case there are no SSD or HDD I/O operations, everything is done in the main memory, so the intrinsic performance of "find" may matter.

burnt-resistor•1mo ago

I recently wrote a "du" summarizer of additional stats in C because it's faster than du, find, or any sort of scripting language tree walker. The latter is orders of magnitude slower, but ultimately it's bounded by iteration of kernel vfs structures and any hard IOPS that are spent to fetch metadata from slower media.

For archiving, I also wrote a parallel walker and file hasher that only does one pass of data and stores results to a sqlite database. It's basically poor-man's IDS and bitrot detection.

hxtk•1mo ago

The latter sounds like a reimplementation of AIDE, which exists in major Linux distributions’ default package managers.

Did you ever compare what you wrote to that?

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

Haskell for all: Beyond agentic coding

SectorC: A C Compiler in 512 bytes (2023)

Speed up responses with fast mode

Software factories and the agentic moment

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Hoot: Scheme on WebAssembly

LLMs as the new high level language

Stories from 25 Years of Software Development

First Proof

Vocal Guide – belt sing without killing yourself

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

FDA intends to take action against non-FDA-approved GLP-1 drugs

Al Lowe on model trains, funny deaths and working with Disney

Vouch

Show HN: Axiomeer – An open marketplace for AI agents

Start all of your commands with a comma (2009)

Show HN: A luma dependent chroma compression algorithm (image compression)

The AI boom is causing shortages everywhere else

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

I write games in C (yes, C) (2016)

Selection rather than prediction

The F Word

Learning from context is harder than we thought

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

The silent death of good code

Where did all the starships go?

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Reinforcement Learning from Human Feedback

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

Haskell for all: Beyond agentic coding

SectorC: A C Compiler in 512 bytes (2023)

Speed up responses with fast mode

Software factories and the agentic moment

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Hoot: Scheme on WebAssembly

LLMs as the new high level language

Stories from 25 Years of Software Development

First Proof

Vocal Guide – belt sing without killing yourself

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

FDA intends to take action against non-FDA-approved GLP-1 drugs

Al Lowe on model trains, funny deaths and working with Disney

Vouch

Show HN: Axiomeer – An open marketplace for AI agents

Start all of your commands with a comma (2009)

Show HN: A luma dependent chroma compression algorithm (image compression)

The AI boom is causing shortages everywhere else

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

I write games in C (yes, C) (2016)

Selection rather than prediction

The F Word

Learning from context is harder than we thought

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

The silent death of good code

Where did all the starships go?

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Reinforcement Learning from Human Feedback

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Unix "find" expressions compiled to bytecode

Comments