A name fragment that's harmless in record #1 becomes identifying when it co-occurs with a location in record #47 and a timestamp in record #203. Static masking can't see that.
This project instead treats de-identification as a stateful control problem. The system maintains a per-subject exposure graph across time and modalities, computes rolling re-identification risk, and escalates masking strength dynamically, only when cumulative exposure justifies it.
The core idea: privacy protection as a feedback loop, not a preprocessing step.
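To make the feedback-loop idea concrete, here is a minimal sketch of cumulative exposure driving masking strength. All names (`ExposureState`, the risk formula, the thresholds) are illustrative assumptions, not the repo's actual API; the real system's risk model is richer than this toy saturating count.

```python
# Hypothetical sketch: per-subject exposure state drives masking strength.
# Names and thresholds are illustrative, not the project's real API.
from dataclasses import dataclass, field


@dataclass
class ExposureState:
    # Cumulative counts of quasi-identifier categories seen for one subject.
    counts: dict = field(default_factory=dict)

    def observe(self, identifier_kinds):
        for kind in identifier_kinds:
            self.counts[kind] = self.counts.get(kind, 0) + 1

    def risk(self):
        # Toy rolling risk: more distinct co-occurring identifier
        # categories -> higher linkage risk, saturating at 1.0.
        return min(1.0, len(self.counts) / 4)


def masking_level(state, low=0.3, high=0.7):
    # Escalate masking only when cumulative exposure justifies it.
    r = state.risk()
    if r < low:
        return "none"
    if r < high:
        return "partial"
    return "full"


state = ExposureState()
state.observe({"name"})                   # record #1: harmless alone
state.observe({"location", "timestamp"})  # later records add linkage surface
print(masking_level(state))               # escalates as exposure accumulates
```

The point of the loop: a name fragment alone stays below the escalation threshold, but once a location and a timestamp co-occur for the same subject, the same masker tightens for that subject only.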
A few things I found interesting building this:
- Cross-modal linkage (text + ASR + image proxy + waveform headers) creates non-obvious re-ID surfaces
- Pseudonym versioning on risk escalation lets you contain linkage continuity without global reprocessing
- The privacy–utility tradeoff is actually controllable if you model exposure state explicitly
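Pseudonym versioning can be sketched in a few lines. This is an assumed mechanism, not the repo's implementation: pseudonyms are deterministic within a version (so records still join locally), and bumping the version on risk escalation cuts linkage to past records without touching them.

```python
# Illustrative sketch of pseudonym versioning (assumed mechanics, not the
# project's actual implementation).
import hashlib


class PseudonymStore:
    def __init__(self):
        self.version = {}  # subject_id -> current pseudonym version

    def pseudonym(self, subject_id):
        v = self.version.get(subject_id, 0)
        # Deterministic within a version; unlinkable across versions.
        digest = hashlib.sha256(f"{subject_id}:{v}".encode()).hexdigest()
        return f"subj-{digest[:8]}"

    def escalate(self, subject_id):
        # Called when rolling risk crosses a threshold: bump the version.
        # Old records keep their old pseudonym; no global reprocessing.
        self.version[subject_id] = self.version.get(subject_id, 0) + 1


store = PseudonymStore()
before = store.pseudonym("patient-7")
store.escalate("patient-7")
after = store.pseudonym("patient-7")
assert before != after  # linkage continuity cut at the escalation point
```

The design choice this illustrates: escalation is an O(1) state update per subject, so containment does not require rewriting the already-released stream.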
All experiments run on synthetic streaming data (no real PHI). Reproducible from source. Colab demo included.
Repo: https://github.com/azithteja91/phi-exposure-guard
Happy to discuss the architecture, the RL policy design, or the tradeoffs vs. existing de-ID approaches.