It detects sensitive entities in requests, replaces them with consistent pseudonyms, forwards the sanitized request to the LLM provider, then rehydrates the response before returning it to your app.
“Consistent” means the same input always maps to the same token (e.g. "Tata Motors" → "ORG_7"). This preserves semantic structure so embeddings and retrieval still work, while ensuring the API provider never sees the real entity values.
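The idea can be sketched in a few lines (plain Python for illustration; CloakPipe itself is written in Rust, and the class, token format, and method names here are hypothetical):

```python
import re

class PseudonymVault:
    """Toy deterministic pseudonymizer: the same entity always maps to the same token."""

    def __init__(self):
        self.forward = {}   # e.g. "Tata Motors" -> "ORG_0"
        self.reverse = {}   # e.g. "ORG_0" -> "Tata Motors"
        self.counters = {}  # per-category counter for fresh tokens

    def pseudonymize(self, entity: str, category: str) -> str:
        # Reuse the existing token if we've seen this entity before.
        if entity not in self.forward:
            n = self.counters.get(category, 0)
            self.counters[category] = n + 1
            token = f"{category}_{n}"
            self.forward[entity] = token
            self.reverse[token] = entity
        return self.forward[entity]

    def rehydrate(self, text: str) -> str:
        # Swap tokens in a model response back to the real values.
        return re.sub(r"[A-Z]+_\d+",
                      lambda m: self.reverse.get(m.group(0), m.group(0)),
                      text)
```

Because the mapping is stable across requests, embeddings of sanitized text remain comparable to each other, which is what keeps vector search working.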
The motivation came from looking at typical RAG architectures. A standard pipeline leaks data in multiple places per query:
- Raw document text sent to embedding APIs
- Embeddings stored in cloud vector databases (recent work like Zero2Text shows they can be inverted)
- Query embeddings sent to providers
- Retrieved context sent to LLM generation APIs
Existing approaches tend to fall into three buckets:
- Redaction (`[REDACTED]`), which destroys semantic meaning and breaks retrieval
- NER-based detection pipelines, which add significant latency
- Stateless replacement, where tokens change between requests and break vector search
CloakPipe addresses this with deterministic pseudonymization backed by a local mapping vault.
Some implementation details:
- Written in Rust as a single binary
- <5ms overhead per request in testing
- AES-256-GCM encrypted mapping vault with zeroize memory safety
- OpenAI-compatible proxy endpoints (`/v1/chat/completions`, `/v1/embeddings`)
- Streaming response rehydration (handles tokens split across SSE chunks)
- Pattern detection for API keys, JWTs, emails, IPs, financial amounts, fiscal dates
- Custom detection rules via TOML config
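Streaming rehydration is the fiddly part: a pseudonym like `ORG_7` can arrive split across two SSE chunks (`…OR` then `G_7…`). One way to handle that, shown here as an illustrative Python sketch rather than the actual Rust code, is to hold back any chunk suffix that could still be the prefix of a token:

```python
import re

TOKEN_RE = re.compile(r"[A-Z]+_\d+")
# A suffix that might be an incomplete token: trailing capitals,
# optionally followed by "_" and some digits.
PARTIAL_RE = re.compile(r"[A-Z]+(_\d*)?$")

def stream_rehydrate(chunks, mapping):
    """Yield rehydrated text, buffering possible partial tokens across chunks."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        m = PARTIAL_RE.search(buf)
        # Hold back the possibly-incomplete tail until the next chunk arrives.
        safe, buf = (buf[:m.start()], buf[m.start():]) if m else (buf, "")
        yield TOKEN_RE.sub(lambda t: mapping.get(t.group(0), t.group(0)), safe)
    if buf:  # flush whatever remains at end of stream
        yield TOKEN_RE.sub(lambda t: mapping.get(t.group(0), t.group(0)), buf)
```

The trade-off is a small amount of held-back text per chunk; the real implementation also has to weave this through SSE event framing, which the sketch ignores.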
It's designed to be drop-in: point your client to the proxy by changing `OPENAI_BASE_URL`.
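In practice that looks something like this (the proxy address is an assumption; use whatever host and port your CloakPipe instance listens on):

```python
import os

# Point any OpenAI-compatible client at the proxy instead of api.openai.com.
# "http://localhost:8080/v1" is a placeholder, not CloakPipe's documented default.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080/v1"

# The official openai-python client (v1+) picks this up automatically, e.g.:
# from openai import OpenAI
# client = OpenAI()  # requests now flow through the sanitizing proxy
```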