When you send real documents or customer data to LLMs, you face a painful tradeoff:
- Send raw text → privacy disaster.
- Redact with [REDACTED] → embeddings break, RAG retrieval fails, multi-turn chats become useless, and the model often refuses to answer questions about the redacted entities.
The practical solution is consistent pseudonymization: the same real entity always maps to the same token (e.g. “Tata Motors” → ORG_7 everywhere). This preserves semantic meaning for vector search and reasoning; the proxy then rehydrates the response on the way back, so the provider never sees actual names, numbers, or addresses.
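The core idea fits in a few lines. Here's a minimal sketch (hypothetical names, not Cloakpipe's actual API): a vault hands out one token per entity, reuses it on repeat sightings, and keeps the reverse map for rehydration.

```python
# Sketch of consistent pseudonymization: the same entity always gets the
# same token, and the reverse mapping enables rehydration later.
class PseudonymVault:
    def __init__(self):
        self.forward = {}   # real entity -> token
        self.reverse = {}   # token -> real entity
        self.counters = {}  # per-type counter: ORG_1, ORG_2, ...

    def tokenize(self, entity: str, entity_type: str) -> str:
        if entity not in self.forward:
            n = self.counters.get(entity_type, 0) + 1
            self.counters[entity_type] = n
            token = f"{entity_type}_{n}"
            self.forward[entity] = token
            self.reverse[token] = entity
        return self.forward[entity]

    def rehydrate(self, text: str) -> str:
        # Replace longest tokens first so ORG_12 isn't clobbered by ORG_1.
        for token in sorted(self.reverse, key=len, reverse=True):
            text = text.replace(token, self.reverse[token])
        return text

vault = PseudonymVault()
masked = "Revenue at " + vault.tokenize("Tata Motors", "ORG") + " rose 12%."
print(masked)                   # Revenue at ORG_1 rose 12%.
print(vault.rehydrate(masked))  # Revenue at Tata Motors rose 12%.
```

Because “Tata Motors” maps to ORG_1 in every chunk and every turn, embeddings of masked text stay mutually consistent, which is what keeps RAG retrieval working.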
I got fed up fighting this with Presidio + custom glue (truncated RAG chunks, declension in Indian languages, fuzzy merging for typos/siblings, LLM confusion, percentages breaking math). So I built Cloakpipe as a tiny single-binary Rust proxy.
It does:
- Multi-layer detection (regex + financial rules + optional GLiNER2 ONNX NER + custom TOML)
- Consistent reversible mapping in an AES-256-GCM encrypted vault (memory zeroized)
- Smart rehydration that survives truncated chunks like [[ADDRESS:A00
- Built-in fuzzy resolution for typos and similar names
- Numeric reasoning mode so percentages still work for calculations
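The truncated-chunk case is the one that bites people in practice: a RAG chunk boundary can cut a token in half, and naive string replacement then silently leaves garbage behind. A rough sketch of one way to handle it (the token format and names here are assumptions for illustration, not Cloakpipe's actual wire format): match token-shaped fragments, and fall back to a unique-prefix lookup when the closing delimiter is missing.

```python
import re

# Hypothetical vault; token format "[[TYPE:ID]]" is assumed for illustration.
VAULT = {"[[ADDRESS:A001]]": "12 MG Road, Bengaluru",
         "[[ORG:O007]]": "Tata Motors"}

# Matches both complete tokens and fragments cut off before the closing "]]".
TOKEN_RE = re.compile(r"\[\[[A-Z]+:[A-Z0-9]*(?:\]\])?")

def rehydrate(text: str) -> str:
    def sub(m):
        frag = m.group(0)
        if frag in VAULT:                 # exact, well-formed token
            return VAULT[frag]
        # Truncated fragment: unique-prefix match against the vault keys.
        hits = [v for k, v in VAULT.items() if k.startswith(frag)]
        return hits[0] if len(hits) == 1 else frag  # ambiguous -> leave as-is
    return TOKEN_RE.sub(sub, text)

print(rehydrate("Ship to [[ADDRESS:A00"))  # Ship to 12 MG Road, Bengaluru
```

Leaving ambiguous fragments untouched (rather than guessing) seems like the right default; a wrong substitution is worse than a visible token stub.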
Fully open source (MIT), zero Python dependencies, <5 ms overhead.
Repo: https://github.com/rohansx/cloakpipe
Demo & quick start: https://app.cloakpipe.co/demo
Would love feedback from anyone who has audited their RAG data flow or is struggling with the redaction-vs-semantics problem — especially in legal, fintech, or non-English workflows.
What approaches have you landed on?
ozgurozkan•2h ago
One dimension worth pressure-testing: the rehydration step. The proxy receives the LLM response and substitutes real entities back in. That rehydration layer is a potential exfiltration vector if the LLM can be made to include token patterns in its response that survive the substitution. We've run adversarial tests where an AI agent was instructed (via injected context) to embed entity tokens in its output in ways that leak the mapping.
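One mitigation for this class of attack (a hedged sketch, not what Cloakpipe or audn.ai actually does): restrict rehydration to tokens the proxy itself issued for the current request, and flag any token-shaped strings it never issued as possible injection attempts, rather than looking them up in the shared vault.

```python
import re

# Illustrative token shapes; real formats would differ.
TOKEN_RE = re.compile(r"(?:ORG|PERSON|ADDRESS)_\d+")

def safe_rehydrate(response: str, issued: dict) -> tuple:
    """Rehydrate only tokens issued for this request; collect the rest."""
    suspicious = []
    def sub(m):
        tok = m.group(0)
        if tok in issued:
            return issued[tok]
        suspicious.append(tok)  # token-shaped but never issued by us
        return tok              # leave it; never touch the global vault
    return TOKEN_RE.sub(sub, response), suspicious

issued = {"ORG_1": "Tata Motors"}
text, flags = safe_rehydrate("ORG_1 beat ORG_99 on margin.", issued)
print(text)   # Tata Motors beat ORG_99 on margin.
print(flags)  # ['ORG_99']
```

This doesn't cover obfuscated variants (tokens spelled out, split across words, or encoded), which is exactly where adversarial testing earns its keep.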
We do this kind of adversarial testing at audn.ai (https://audn.ai) — specifically data leak and PII exfiltration scenarios against RAG and agentic pipelines. Sensitive data leak and re-identification are two of the risk categories we cover explicitly.
For fintech/legal use cases especially, it would be worth running a red-team pass on the rehydration and vault-lookup logic. Happy to connect if that'd be useful.