I’ve been building RAG systems for a while, and in my experience roughly 90% of retrieval failures aren't due to the LLM; they're due to the data. I got tired of debugging hallucinations only to find the retriever had pulled "Page 1 of 5" headers or five duplicate versions of an old policy.
I couldn't find a simple "pandas-profiling" equivalent for unstructured text, so I built this.
It runs locally (CLI) and helps you:
Detect semantic duplicates (using all-MiniLM-L6-v2) to save vector storage costs.
Flag PII (API keys, emails) before they get indexed.
Identify "coverage gaps" by comparing user queries against your docs.
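To make the duplicate and coverage-gap checks concrete, here is a rough sketch of the general idea, not the profiler's actual code. It assumes each chunk and each user query has already been embedded (for example with all-MiniLM-L6-v2) and works on plain lists of floats. The function names and both thresholds are hypothetical, picked only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_duplicates(doc_vecs, threshold=0.9):
    """Return (i, j, similarity) for every doc pair at or above threshold."""
    pairs = []
    for i in range(len(doc_vecs)):
        for j in range(i + 1, len(doc_vecs)):
            sim = cosine(doc_vecs[i], doc_vecs[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs

def find_coverage_gaps(query_vecs, doc_vecs, threshold=0.5):
    """Return indices of queries whose best-matching doc scores below threshold."""
    gaps = []
    for q_idx, q in enumerate(query_vecs):
        best = max((cosine(q, d) for d in doc_vecs), default=0.0)
        if best < threshold:
            gaps.append(q_idx)
    return gaps

# Toy 3-dimensional vectors standing in for real embeddings (assumption).
docs = [[1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.0, 1.0, 0.0]]
queries = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

duplicates = find_duplicates(docs)        # docs 0 and 1 are near-identical
gaps = find_coverage_gaps(queries, docs)  # query 1 matches nothing in the corpus
```

The pairwise loop is quadratic in the number of chunks, which is fine for a sketch but would likely need an approximate nearest-neighbor index on a large corpus.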
It outputs a standalone HTML report you can show to stakeholders.
Written in Python, open source (MIT). Feedback welcome!
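For the PII check mentioned above, a minimal regex-based sketch looks something like the following. This is an illustration under my own assumptions, not the tool's real rule set: the patterns, the `PII_PATTERNS` table, and the `flag_pii` helper are all hypothetical, and the API-key pattern only covers two common key shapes.

```python
import re

# Illustrative patterns only; a real scanner would need a broader rule set.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Assumed key shapes: an "sk-" prefix plus 20+ alphanumerics,
    # or "AKIA" plus 16 uppercase letters/digits.
    "api_key": re.compile(r"\b(?:sk-[A-Za-z0-9]{20,}|AKIA[A-Z0-9]{16})\b"),
}

def flag_pii(text):
    """Return a list of (label, matched_text) for every pattern hit in text."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group(0)))
    return findings

sample = "Contact ops@example.com, key sk-abcdefghijklmnopqrstuvwx"
findings = flag_pii(sample)
```

Running this over each chunk before indexing gives a list of hits you can review or redact.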
aashirpersonal•2h ago
https://github.com/aashirpersonal/rag-corpus-profiler