- 2:55 vs 7:55 wall clock (2.7x faster)
- 688MB vs 21.9GB peak RAM (32x less)
- Single core vs 4+ cores
- Duplicate counts match exactly (51,392 both ways)
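For readers unfamiliar with exact dedup, the core mechanism is just hashing each document and dropping repeats. This is a minimal sketch of that idea, not fastdedup's actual implementation (function and variable names here are illustrative):

```python
import hashlib

def exact_dedup(docs):
    """Keep the first occurrence of each document; drop byte-identical repeats."""
    seen = set()          # content hashes observed so far
    kept, dupes = [], 0
    for doc in docs:
        # Hash the full document; identical text always yields an identical digest.
        h = hashlib.sha256(doc.encode("utf-8")).digest()
        if h in seen:
            dupes += 1    # exact duplicate of an earlier document
        else:
            seen.add(h)
            kept.append(doc)
    return kept, dupes
```

Since only fixed-size digests are stored rather than documents, memory grows with the number of unique documents, which is consistent with the low peak RAM above.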
Fuzzy dedup (MinHash + LSH) vs datatrove:
- 36:44 vs 3h50m+: datatrove stage 1 alone ran for 3h50m before we killed it. Its bottleneck turned out to be spaCy word tokenization on every document before shingling; fastdedup shingles character n-grams directly, which is significantly cheaper.
- 23GB vs 1.1GB RAM: this is a real trade-off, not a win. datatrove streams to disk; fastdedup holds the LSH index in memory for speed.
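To make the character n-gram point concrete, here is a minimal pure-Python sketch of MinHash over character shingles. It is an illustration of the technique, not fastdedup's code; names, the shingle size `n=5`, and `num_perm=64` are assumptions for the example:

```python
import hashlib

def char_shingles(text, n=5):
    """Set of overlapping character n-grams; no word tokenizer needed."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_perm=64, n=5):
    """One min-hash per seeded hash function; equal slots estimate Jaccard overlap."""
    shingles = char_shingles(text, n)
    sig = []
    for seed in range(num_perm):
        # Seeded 64-bit hash of each shingle; keep the minimum per seed.
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles
        ))
    return sig

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates shingle-set Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

The LSH part (not shown) then splits each signature into bands and buckets documents by band hash, so only documents colliding in at least one bucket are compared; keeping those buckets in RAM rather than on disk is the speed/memory trade-off described above.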
Honest caveats:
- Fuzzy dedup needs ~23GB RAM at this scale: a cloud workload, not a laptop workload.
- datatrove is built for distributed execution; tasks=1 isn't its intended config, but it's how someone would run it locally.
- Tiered storage to spill the LSH index to disk is on the roadmap.
Demo: https://huggingface.co/spaces/wapplewhite4/fastdedup-demo
Repo: https://github.com/wapplewhite4/fastdedup

Happy to answer questions about implementation or methodology.