The usual tradeoff:

- Indexed search (Lucene, dtSearch): fast queries, but slow setup and storage overhead
- Direct search: no setup, but too slow for anything over ~10GB
My approach: multi-threaded direct search with intelligent caching. Result: 50GB searched in ~20s on a mid-range machine (Ryzen 5 5600, 16GB RAM, SSD).
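To make that concrete, here is a heavily simplified VB.NET sketch of the core loop, not the shipping code: one producer thread enumerates files while one worker per core scans raw bytes. The search string and root path are placeholders, and ReadAllBytes is for brevity (large files would be read in chunks).

    Imports System.Collections.Concurrent
    Imports System.IO
    Imports System.Linq
    Imports System.Text
    Imports System.Threading.Tasks

    Module SearchCore
        Sub Main()
            ' Bounded queue keeps the producer from racing far ahead of the workers.
            Dim queue As New BlockingCollection(Of String)(1024)
            Dim needle As Byte() = Encoding.UTF8.GetBytes("invoice 2023")
            Dim hits As New ConcurrentBag(Of String)()

            ' Producer: lazy enumeration, so traversal overlaps with searching.
            Dim producer = Task.Run(
                Sub()
                    For Each f In Directory.EnumerateFiles("D:\archive", "*", SearchOption.AllDirectories)
                        queue.Add(f)
                    Next
                    queue.CompleteAdding()
                End Sub)

            ' Consumers: one worker per core scans raw bytes, format-agnostic.
            Dim workers = Enumerable.Range(0, Environment.ProcessorCount).
                Select(Function(i) Task.Run(
                    Sub()
                        For Each filePath As String In queue.GetConsumingEnumerable()
                            If ContainsBytes(File.ReadAllBytes(filePath), needle) Then hits.Add(filePath)
                        Next
                    End Sub)).ToArray()

            producer.Wait()
            Task.WaitAll(workers)
            Console.WriteLine($"{hits.Count} files matched")
        End Sub

        ' Naive byte scan, enough to show the idea (error handling omitted).
        Function ContainsBytes(haystack As Byte(), needle As Byte()) As Boolean
            For i = 0 To haystack.Length - needle.Length
                Dim j = 0
                While j < needle.Length AndAlso haystack(i + j) = needle(j)
                    j += 1
                End While
                If j = needle.Length Then Return True
            Next
            Return False
        End Function
    End Module

The bounded queue is the important part: it lets traversal and scanning overlap without the producer flooding memory on huge directory trees.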
Technical details:

- Format-agnostic byte-by-byte text extraction (finds text regardless of file format)
- Producer-consumer pattern for directory traversal (the sketch above)
- PDF/OCR cache (0.2% storage overhead vs. 2-5% for full indexes; sketch further down)
- Recursive archive handling: ZIP in ZIP in email (sketched just below)
- Scales linearly with CPU cores
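The archive recursion can be sketched with System.IO.Compression (simplified again: email containers and non-ZIP formats are left out, and ContainsBytes is the helper from the sketch above):

    Imports System.Collections.Generic
    Imports System.IO
    Imports System.IO.Compression

    Module ArchiveWalker
        ' Usage: SearchArchive(File.OpenRead("a.zip"), needle, "a.zip", hits)
        ' Nested archives are unpacked to memory only, so ZIP in ZIP never
        ' touches disk; entry paths are joined with "!" purely for display.
        Sub SearchArchive(zipStream As Stream, needle As Byte(), displayPath As String, hits As List(Of String))
            Using archive As New ZipArchive(zipStream, ZipArchiveMode.Read)
                For Each entry In archive.Entries
                    If entry.Name = "" Then Continue For ' skip directory entries
                    Dim buffer As New MemoryStream()
                    Using entryStream = entry.Open()
                        entryStream.CopyTo(buffer) ' entry streams are not seekable
                    End Using
                    buffer.Position = 0
                    If entry.FullName.EndsWith(".zip", StringComparison.OrdinalIgnoreCase) Then
                        SearchArchive(buffer, needle, displayPath & "!" & entry.FullName, hits)
                    ElseIf ContainsBytes(buffer.ToArray(), needle) Then
                        hits.Add(displayPath & "!" & entry.FullName)
                    End If
                Next
            End Using
        End Sub
    End Module

Copying each entry into a MemoryStream first is deliberate: deflate entry streams are forward-only, and the recursion needs something seekable to open a nested archive.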
Benchmark vs. commercial tools on a 50GB real-world archive:

- Findit: 201 files found, 20s, no prep
- dtSearch ($199): 182 files, <1s search, but 15min indexing first
- FileLocator ($150): 132 files, 4min search + 51min indexing
- DocFetcher: 82 files, instant search, but 79min indexing
Full results: https://www.dateisuche.de/en/comparison6.html
Free to use indefinitely (WinRAR-style); a one-time $39 removes the nag screen.
Stack: VB.NET (codebase dates back to 1995), Tesseract for OCR. Windows only.
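The PDF/OCR cache mentioned above boils down to persisting extracted text keyed by file identity. A simplified sketch; the cache location is illustrative and RunTesseract merely stands in for the actual Tesseract invocation:

    Imports System.IO
    Imports System.Security.Cryptography
    Imports System.Text

    Module OcrCache
        Private ReadOnly CacheDir As String =
            Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData), "findit-ocr-cache")

        ' Runs OCR at most once per file version: the key covers path, size,
        ' and timestamp, so editing a file invalidates its cache entry.
        Function GetText(filePath As String) As String
            Dim info As New FileInfo(filePath)
            Dim key As String = $"{info.FullName}|{info.Length}|{info.LastWriteTimeUtc.Ticks}"
            Dim hash As Byte() = SHA256.Create().ComputeHash(Encoding.UTF8.GetBytes(key))
            Dim cacheFile As String = Path.Combine(CacheDir, BitConverter.ToString(hash).Replace("-", "") & ".txt")

            If File.Exists(cacheFile) Then Return File.ReadAllText(cacheFile)

            Dim extracted As String = RunTesseract(filePath) ' the expensive step
            Directory.CreateDirectory(CacheDir)
            File.WriteAllText(cacheFile, extracted) ' plain text stays tiny next to a full index
            Return extracted
        End Function

        ' Placeholder only; the real tool calls Tesseract here.
        Function RunTesseract(filePath As String) As String
            Throw New NotImplementedException()
        End Function
    End Module

Keying on size and last-write time means an edited file gets re-OCRed exactly once and every untouched file is a cache hit; since only plain extracted text is stored, the cache stays far smaller than a full-text index.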
Looking for feedback on the architecture, and for reports on how it performs on different hardware configs.