Show HN: A local-first, reversible PII scrubber for AI workflows

https://medium.com/@tj.ruesch/a-local-first-reversible-pii-scrubber-for-ai-workflows-using-onnx-and-regex-e9850a7531fc
16•tjruesch•8h ago
Hi HN,

I’m one of the maintainers of Bridge Anonymization. We built this because the existing solutions for translating sensitive user content are insufficient for many of our privacy-conscious clients (governments, banks, healthcare, etc.).

We couldn't send PII to third-party APIs, but standard redaction destroyed the translation quality. If you scrub "John" to "[PERSON]", the translation engine loses gender context (often defaulting to masculine), which breaks grammatical agreement in languages like French or German.

So we built a reversible, local-first pipeline for Node.js/Bun. Here is how we implemented the tricky parts:

0. The Mapping

We use XML-like tags with IDs that uniquely identify the PII, e.g. `<PII type="PERSON" id="1">`. Translation models and the systems around them have worked with XML data structures since the dawn of Computer Aided Translation tools, so this improves compatibility with existing workflows and systems. A `PIIMap` is stored locally for rehydration after translation (AES-256-GCM-encrypted by default).
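A minimal sketch of how such a mapping could work, not the project's actual code. The names `Detection`, `anonymize`, `encryptMap`, and `decryptMap`, and the `iv | tag | ciphertext` blob layout, are assumptions made for illustration. Only the tag shape and the AES-256-GCM choice come from the post.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical shapes; the real library's types may differ.
interface Detection { start: number; end: number; type: string; }
type PIIMap = Map<string, string>; // tag id -> original text

// Replace each detected span with a tag and remember the original text.
function anonymize(text: string, detections: Detection[]): { masked: string; map: PIIMap } {
  const map: PIIMap = new Map();
  const sorted = [...detections].sort((a, b) => a.start - b.start);
  let masked = "";
  let cursor = 0;
  sorted.forEach((d, i) => {
    const id = String(i + 1);
    map.set(id, text.slice(d.start, d.end));
    masked += text.slice(cursor, d.start) + `<PII type="${d.type}" id="${id}"/>`;
    cursor = d.end;
  });
  masked += text.slice(cursor);
  return { masked, map };
}

// Encrypt the serialized map with AES-256-GCM.
// Assumed blob layout: 12-byte IV, 16-byte auth tag, then ciphertext.
function encryptMap(map: PIIMap, key: Buffer): Buffer {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const plaintext = Buffer.from(JSON.stringify([...map]), "utf8");
  const body = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), body]);
}

// Decrypt and verify the blob; final() throws if the data was tampered with.
function decryptMap(blob: Buffer, key: Buffer): PIIMap {
  const decipher = createDecipheriv("aes-256-gcm", key, blob.subarray(0, 12));
  decipher.setAuthTag(blob.subarray(12, 28));
  const plaintext = Buffer.concat([decipher.update(blob.subarray(28)), decipher.final()]);
  return new Map(JSON.parse(plaintext.toString("utf8")) as [string, string][]);
}
```

With a 32-byte key, `decryptMap(encryptMap(map, key), key)` should give back the same entries, so only the masked text has to leave the machine.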

1. Hybrid Detection Engine

Obviously neither Regex nor NER was enough on its own.

- Structured PII: We use strict Regex with validation checksums for things like IBANs (Mod-97) and Credit Cards (Luhn).
- Soft PII: For names and locations, we run a quantized `xlm-roberta` model via `onnxruntime-node` directly in the process. This lets us avoid a Python sidecar while keeping the package ‘lightweight’ (still ~280MB for the quantized model, but acceptable for desktop environments).
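For illustration, the two checksums mentioned for structured PII can be sketched as below. These are the standard Luhn and ISO 13616 Mod-97 algorithms. The function names and the length bounds in the pre-filter regexes are assumptions, not taken from the project.

```typescript
// Luhn check for card numbers: double every second digit from the right,
// subtract 9 from any doubled value above 9, and require sum % 10 === 0.
function luhnValid(input: string): boolean {
  const digits = input.replace(/[\s-]/g, "");
  if (!/^\d{12,19}$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// Mod-97 check for IBANs: move the first four characters to the end,
// map letters to 10..35, and require the resulting number mod 97 to be 1.
function ibanValid(input: string): boolean {
  const iban = input.replace(/\s/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (const ch of rearranged) {
    const value = ch >= "A" ? ch.charCodeAt(0) - 55 : Number(ch);
    // Letters contribute two decimal digits, digits contribute one.
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}
```

Running a checksum after the regex match filters out most random digit strings that merely look like a card number or IBAN, which keeps false positives down without involving the NER model.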

2. The "Hallucination" Guard (Fuzzy Rehydration)

LLMs often "mangle" the XML placeholders during translation (e.g., turning `<PII id="1"/>` into `< PII id = « 1 » >`). We implemented a Fuzzy Tag Matcher that uses flexible regex patterns to detect these artefacts. It identifies the tag even if attributes are reordered or quotes are changed, ensuring we can always map the token back to the original encrypted value.
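A rough sketch of what a tolerant matcher along these lines might look like. The regexes, the set of accepted quote characters, and the names `readAttr` and `rehydrate` are all assumptions for illustration. The project's real patterns are likely more thorough.

```typescript
// Quote characters a translation engine might substitute for ASCII quotes.
const QUOTE = `["'“”„«»‘’]`;

// Whole tag: tolerant of extra spaces, letter case, and an optional "/".
const TAG = /<\s*PII\b([^<>]*?)\/?\s*>/gi;

// Read one attribute value regardless of attribute order or quote style.
function readAttr(attrs: string, name: string): string | undefined {
  const re = new RegExp(`\\b${name}\\s*=\\s*${QUOTE}?\\s*([\\w-]+)\\s*${QUOTE}?`, "i");
  return re.exec(attrs)?.[1];
}

// Swap every recognizable tag back to its original value.
// Tags with a missing or unknown id are left untouched for manual review.
function rehydrate(translated: string, map: Map<string, string>): string {
  return translated.replace(TAG, (tag: string, attrs: string) => {
    const id = readAttr(attrs, "id");
    const original = id === undefined ? undefined : map.get(id);
    return original ?? tag;
  });
}
```

Looking attributes up by name instead of by position is what makes reordered attributes harmless here. A tag the model dropped entirely cannot be recovered this way and would need a separate check.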

3. Semantic Masking

We are currently working on "Semantic Masking": adding context to the PII tag (like `<PII type="PERSON" gender="female" id="1" />`) to preserve (gender) context for the translation. For now, we are relying on a lightweight lookup-table approach to avoid the overhead of a second ML model or the hassle of fine-tuning. So far this works nicely for most use cases.
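One way a lookup-table approach could be sketched, assuming a simple first-name table. The table contents and the names `GENDER_BY_FIRST_NAME` and `personTag` are hypothetical. Only the tag format comes from the post.

```typescript
type Gender = "female" | "male";

// Tiny illustrative table; a real one would be far larger and locale-aware.
const GENDER_BY_FIRST_NAME = new Map<string, Gender>([
  ["marie", "female"],
  ["anna", "female"],
  ["john", "male"],
  ["pierre", "male"],
]);

// Build a PERSON tag, adding a gender attribute only when the first name
// is found in the table. Unknown names get a plain tag with no guess.
function personTag(name: string, id: number): string {
  const first = (name.trim().split(/\s+/)[0] ?? "").toLowerCase();
  const gender = GENDER_BY_FIRST_NAME.get(first);
  const genderAttr = gender ? ` gender="${gender}"` : "";
  return `<PII type="PERSON"${genderAttr} id="${id}" />`;
}
```

Leaving the attribute out for unknown names avoids feeding the translation engine a wrong hint, at the cost of falling back to its default behaviour for those names.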

The code is MIT licensed. I’d love to hear how others are handling the "context loss" problem in privacy-preserving NLP pipelines! I think this could quite easily be generalized to other LLM applications as well.
