A 13-month-old LlamaIndex bug re-embeds unchanged content

https://sebastiantirelli.com/writing/llamaindex-embedding-churn/

1•tirelli•1h ago

Comments

tirelli•1h ago

Author here. Quick map of the finding for anyone skimming:

Bug 1 is in the hashing path. Node.hash, TextNode.hash, and IngestionCache all include metadata via MetadataMode.ALL, which ignores excluded_embed_metadata_keys. Any volatile field (mtime, atime, file size) flips the hash and forces a re-embed of byte-identical content.
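
To make Bug 1 concrete, here is a minimal model of the hashing path. It is a sketch, not LlamaIndex's actual code: the `node_hash` helper, its `respect_excluded` flag, and the metadata key names are hypothetical stand-ins. It only shows that hashing all metadata lets a volatile field flip the hash while the text stays byte-identical, and that filtering excluded keys first is one possible way to avoid that (not necessarily what the PR does).

```python
import hashlib

def node_hash(text, metadata, excluded_keys=(), respect_excluded=False):
    """Hash text plus metadata. Hypothetical stand-in for a node hash."""
    if respect_excluded:
        # Drop keys the user excluded from the embedding input.
        items = {k: v for k, v in metadata.items() if k not in excluded_keys}
    else:
        # Models the "ALL" behaviour described above: every key is hashed.
        items = metadata
    meta_str = "\n".join(f"{k}: {v}" for k, v in sorted(items.items()))
    return hashlib.sha256(f"{text}\n{meta_str}".encode("utf-8")).hexdigest()

text = "unchanged document body"
excluded = ("last_modified_date",)  # illustrative key name
before = {"file_name": "a.txt", "last_modified_date": "2024-01-01"}
after = {"file_name": "a.txt", "last_modified_date": "2024-01-02"}

# Same text, different volatile field: the hash changes, so a cache misses.
churn = node_hash(text, before, excluded) != node_hash(text, after, excluded)

# Same inputs, but excluded keys are filtered before hashing: the hash is stable.
stable = node_hash(text, before, excluded, True) == node_hash(text, after, excluded, True)
```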

Bug 2 is that default_file_metadata_func queries POSIX-only stat keys (mtime, atime, created). Whether a given fsspec backend emits those keys decides whether Bug 1 is firing on you today. I source-inspected every backend under the fsspec GitHub org and every built-in in filesystem_spec.

Active today (bug fires at day-level precision): local, gcsfs, sshfs + built-in sftp, smb, arrow/HDFS, memory.

Masked today (bug dormant, waiting on Bug 2 getting fixed): s3fs, adlfs, ossfs, swiftspec, tosfs, gdrive-fsspec, dropboxdrivefs, ipfsspec, opendalfs, dbfs, http, webhdfs, ftp, github, gist, git.

Wrapper: alluxiofs delegates to its wrapped backend.
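
The active/masked split above comes down to whether a backend's info dict carries the POSIX-style keys at all. A rough sketch of that lookup, using plain dicts in place of real fsspec backends: the `volatile_fields` helper, the example dicts, and the `LastModified` key are illustrative assumptions, and the day-level date format mirrors the "day-level precision" mentioned above rather than any verified implementation detail.

```python
from datetime import datetime, timezone

POSIX_KEYS = ("mtime", "atime", "created")

def volatile_fields(info):
    """Return day-level date strings for whichever POSIX-style keys are present."""
    out = {}
    for key in POSIX_KEYS:
        value = info.get(key)
        if value is not None:
            stamp = datetime.fromtimestamp(value, tz=timezone.utc)
            out[key] = stamp.strftime("%Y-%m-%d")
    return out

# A backend that emits POSIX-style keys (illustrative values).
local_like = {"name": "/data/a.txt", "size": 12,
              "mtime": 1700000000.0, "created": 1690000000.0}

# A backend that reports modification time under a different key (illustrative).
object_store_like = {"name": "bucket/a.txt", "size": 12,
                     "LastModified": "2023-11-14T22:13:20Z"}

active = volatile_fields(local_like)         # volatile fields reach the metadata
masked = volatile_fields(object_store_like)  # nothing volatile, so Bug 1 stays dormant
```

In this model the "masked" backend produces no date fields at all, so there is nothing to flip the hash until the lookup starts reading the backend's own modification time.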

GCS is the outlier on the active side because gcsfs/core.py explicitly sets result["mtime"] = parse(object_metadata["updated"]) as a legacy compatibility alias. There is a TODO about removing it. The code is still there.

Once default_file_metadata_func gets the natural one-line fix, reading fs.modified(path) instead of POSIX-specific keys, every masked backend activates simultaneously, and at sub-second precision.
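
Why precision matters here, as a small sketch: assuming the timestamp ends up in metadata at whatever precision it is stored, a day-level string only changes when the date rolls over, while a full-precision string changes on every touch. The two formatting helpers are hypothetical and only illustrate the difference.

```python
from datetime import datetime, timezone

def day_level(ts):
    """Format a timestamp at day precision (hypothetical helper)."""
    return ts.strftime("%Y-%m-%d")

def sub_second(ts):
    """Format a timestamp at full precision (hypothetical helper)."""
    return ts.isoformat()

# Two modifications of the same file on the same day.
first = datetime(2024, 5, 1, 9, 0, 0, 123456, tzinfo=timezone.utc)
second = datetime(2024, 5, 1, 17, 30, 0, 654321, tzinfo=timezone.utc)

# Day-level metadata is identical, so the hash would not change that day.
same_day_collides = day_level(first) == day_level(second)

# Full-precision metadata differs, so every touch would change the hash.
sub_second_differs = sub_second(first) != sub_second(second)
```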

Reproducers are at github.com/stirelli/llamaindex-embedding-churn (five progressively more realistic levels; level 3 calls the real OpenAI API with billed tokens). The fix is PR #21462 against run-llama/llama_index: three lines plus a regression test covering both directions.

Happy to answer questions on the benchmark, the fsspec inspection, or the cost math.
