Author here. Quick map of the finding for anyone skimming:
Bug 1 is in the hashing path. Node.hash, TextNode.hash, and IngestionCache all include metadata via MetadataMode.ALL, which ignores excluded_embed_metadata_keys. Any volatile field (mtime, atime, file size) flips the hash and forces a re-embed of byte-identical content.
Bug 2 is that default_file_metadata_func queries POSIX-only stat keys (mtime, atime, created). Whether a given fsspec backend emits those keys decides whether Bug 1 is firing on you today. I source-inspected every backend under the fsspec GitHub org and every built-in in filesystem_spec.
Active today (bug fires at day-level precision): local, gcsfs, sshfs + built-in sftp, smb, arrow/HDFS, memory.
Wrapper: alluxiofs delegates to its wrapped backend.
GCS is the outlier on the active side because gcsfs/core.py explicitly sets result["mtime"] = parse(object_metadata["updated"]) as a legacy compatibility alias. There is a TODO about removing it. The code is still there.
Once default_file_metadata_func gets its natural one-line fix to use fs.modified(path) instead of POSIX-specific keys, every masked backend activates at sub-second precision simultaneously.
Reproducers at github.com/stirelli/llamaindex-embedding-churn (five progressively real levels, level 3 uses real OpenAI API with billed tokens). Fix is PR #21462 against run-llama/llama_index, three lines plus a regression test covering both directions.
Happy to answer questions on the benchmark, the fsspec inspection, or the cost math.
tirelli•1h ago
Bug 1 is in the hashing path. Node.hash, TextNode.hash, and IngestionCache all include metadata via MetadataMode.ALL, which ignores excluded_embed_metadata_keys. Any volatile field (mtime, atime, file size) flips the hash and forces a re-embed of byte-identical content.
Bug 2 is that default_file_metadata_func queries POSIX-only stat keys (mtime, atime, created). Whether a given fsspec backend emits those keys decides whether Bug 1 is firing on you today. I source-inspected every backend under the fsspec GitHub org and every built-in in filesystem_spec.
Active today (bug fires at day-level precision): local, gcsfs, sshfs + built-in sftp, smb, arrow/HDFS, memory.
Masked today (bug dormant, waiting on Bug 2 getting fixed): s3fs, adlfs, ossfs, swiftspec, tosfs, gdrive-fsspec, dropboxdrivefs, ipfsspec, opendalfs, dbfs, http, webhdfs, ftp, github, gist, git.
Wrapper: alluxiofs delegates to its wrapped backend.
GCS is the outlier on the active side because gcsfs/core.py explicitly sets result["mtime"] = parse(object_metadata["updated"]) as a legacy compatibility alias. There is a TODO about removing it. The code is still there.
Once default_file_metadata_func gets its natural one-line fix to use fs.modified(path) instead of POSIX-specific keys, every masked backend activates at sub-second precision simultaneously.
Reproducers at github.com/stirelli/llamaindex-embedding-churn (five progressively real levels, level 3 uses real OpenAI API with billed tokens). Fix is PR #21462 against run-llama/llama_index, three lines plus a regression test covering both directions.
Happy to answer questions on the benchmark, the fsspec inspection, or the cost math.