frontpage.

Made with ♥ by @iamnishanth

Open Source @Github

Show HN: One-click AI employee with its own cloud desktop

https://cloudbot-ai.com
3•fainir•32m ago•0 comments

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

https://github.com/sandys/kappal
8•sandGorgon•2d ago•2 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
250•isitcontent•17h ago•27 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
351•vecti•20h ago•157 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
320•eljojo•20h ago•196 comments

Show HN: MCP App to play backgammon with your LLM

https://github.com/sam-mfb/backgammon-mcp
3•sam256•1h ago•1 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
79•phreda4•17h ago•14 comments

Show HN: Smooth CLI – Token-efficient browser for AI agents

https://docs.smooth.sh/cli/overview
93•antves•1d ago•70 comments

Show HN: I'm 75, building an OSS Virtual Protest Protocol for digital activism

https://github.com/voice-of-japan/Virtual-Protest-Protocol/blob/main/README.md
5•sakanakana00•2h ago•1 comments

Show HN: I built Divvy to split restaurant bills from a photo

https://divvyai.app/
3•pieterdy•3h ago•0 comments

Show HN: BioTradingArena – Benchmark for LLMs to predict biotech stock movements

https://www.biotradingarena.com/hn
26•dchu17•22h ago•12 comments

Show HN: ARM64 Android Dev Kit

https://github.com/denuoweb/ARM64-ADK
17•denuoweb•2d ago•2 comments

Show HN: Artifact Keeper – Open-Source Artifactory/Nexus Alternative in Rust

https://github.com/artifact-keeper
152•bsgeraci•1d ago•64 comments

Show HN: Slack CLI for Agents

https://github.com/stablyai/agent-slack
49•nwparker•1d ago•11 comments

Show HN: I Hacked My Family's Meal Planning with an App

https://mealjar.app
2•melvinzammit•5h ago•0 comments

Show HN: Gigacode – Use OpenCode's UI with Claude Code/Codex/Amp

https://github.com/rivet-dev/sandbox-agent/tree/main/gigacode
19•NathanFlurry•1d ago•9 comments

Show HN: I built a free UCP checker – see if AI agents can find your store

https://ucphub.ai/ucp-store-check/
2•vladeta•5h ago•2 comments

Show HN: Compile-Time Vibe Coding

https://github.com/Michael-JB/vibecode
10•michaelchicory•7h ago•1 comments

Show HN: Micropolis/SimCity Clone in Emacs Lisp

https://github.com/vkazanov/elcity
172•vkazanov•2d ago•49 comments

Show HN: Daily-updated database of malicious browser extensions

https://github.com/toborrm9/malicious_extension_sentry
14•toborrm9•22h ago•7 comments

Show HN: Slop News – HN front page now, but it's all slop

https://dosaygo-studio.github.io/hn-front-page-2035/slop-news
16•keepamovin•8h ago•5 comments

Show HN: Horizons – OSS agent execution engine

https://github.com/synth-laboratories/Horizons
23•JoshPurtell•1d ago•5 comments

Show HN: Falcon's Eye (isometric NetHack) running in the browser via WebAssembly

https://rahuljaguste.github.io/Nethack_Falcons_Eye/
5•rahuljaguste•17h ago•1 comments

Show HN: Fitspire – a simple 5-minute workout app for busy people (iOS)

https://apps.apple.com/us/app/fitspire-5-minute-workout/id6758784938
2•devavinoth12•10h ago•0 comments

Show HN: Local task classifier and dispatcher on RTX 3080

https://github.com/resilientworkflowsentinel/resilient-workflow-sentinel
25•Shubham_Amb•1d ago•2 comments

Show HN: I built a RAG engine to search Singaporean laws

https://github.com/adityaprasad-sudo/Explore-Singapore
4•ambitious_potat•11h ago•4 comments

Show HN: Sem – Semantic diffs and patches for Git

https://ataraxy-labs.github.io/sem/
2•rs545837•12h ago•1 comments

Show HN: A password system with no database, no sync, and nothing to breach

https://bastion-enclave.vercel.app
12•KevinChasse•22h ago•17 comments

Show HN: GitClaw – An AI assistant that runs in GitHub Actions

https://github.com/SawyerHood/gitclaw
10•sawyerjhood•23h ago•0 comments

Show HN: FastLog: 1.4 GB/s text file analyzer with AVX2 SIMD

https://github.com/AGDNoob/FastLog
5•AGDNoob•13h ago•1 comments

Show HN: A local-first, reversible PII scrubber for AI workflows

https://medium.com/@tj.ruesch/a-local-first-reversible-pii-scrubber-for-ai-workflows-using-onnx-and-regex-e9850a7531fc
38•tjruesch•1mo ago
Hi HN,

I’m one of the maintainers of Bridge Anonymization. We built this because the existing solutions for translating sensitive user content are insufficient for many of our privacy-conscious clients (governments, banks, healthcare, etc.).

We couldn't send PII to third-party APIs, but standard redaction destroyed the translation quality. If you scrub "John" to "[PERSON]", the translation engine loses gender context (often defaulting to masculine), which breaks grammatical agreement in languages like French or German.

So we built a reversible, local-first pipeline for Node.js/Bun. Here is how we implemented the tricky parts:

0. The Mapping

We use XML-like tags with IDs that uniquely identify the PII, e.g. `<PII type="PERSON" id="1">`. Translation models and the systems around them have worked with XML data structures since the dawn of Computer Aided Translation tools, so this improves compatibility with existing workflows and systems. A `PIIMap` is stored locally for rehydration after translation (AES-256-GCM-encrypted by default).
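A minimal sketch of the masking step (the `maskText` name, the detection shape, and the plain-`Map` representation of the `PIIMap` are illustrative, not our exact API):

```javascript
// Replace each detected PII span with a typed, ID'd placeholder tag and
// record the original surface string in a local map for later rehydration.
function maskText(text, detections) {
  // detections: [{ start, end, type }] produced by the detection engine
  const piiMap = new Map(); // id -> original text (encrypted at rest in practice)
  let out = "";
  let cursor = 0;
  detections.forEach((d, i) => {
    const id = i + 1;
    piiMap.set(id, text.slice(d.start, d.end));
    out += text.slice(cursor, d.start) + `<PII type="${d.type}" id="${id}"/>`;
    cursor = d.end;
  });
  out += text.slice(cursor); // trailing text after the last detection
  return { masked: out, piiMap };
}
```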

1. Hybrid Detection Engine

Obviously neither Regex nor NER was enough on its own.

- Structured PII: We use strict Regex with validation checksums for things like IBANs (Mod-97) and credit cards (Luhn).
- Soft PII: For names and locations, we run a quantized `xlm-roberta` model via `onnxruntime-node` directly in the process. This lets us avoid a Python sidecar while keeping the package "lightweight" (still ~280MB for the quantized model, but acceptable for desktop environments).
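The checksum half can be sketched as standalone validators (illustrative reimplementations of the standard Luhn and ISO 7064 Mod-97 checks, not the library's code):

```javascript
// Luhn check for credit card numbers: double every second digit from the
// right, subtract 9 from results above 9, and require the sum to be ≡ 0 mod 10.
function luhnValid(number) {
  const digits = number.replace(/\D/g, "");
  if (digits.length === 0) return false;
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) { d *= 2; if (d > 9) d -= 9; }
    sum += d;
  }
  return sum % 10 === 0;
}

// Mod-97 check for IBANs: move the country code and check digits to the end,
// map letters A-Z to 10-35, and require the resulting number ≡ 1 mod 97.
function ibanValid(iban) {
  const s = iban.replace(/\s+/g, "").toUpperCase();
  const numeric = (s.slice(4) + s.slice(0, 4))
    .replace(/[A-Z]/g, (c) => String(c.charCodeAt(0) - 55));
  return /^\d+$/.test(numeric) && BigInt(numeric) % 97n === 1n;
}
```

Passing a checksum doesn't prove the string is a real account, but it cuts false positives from random digit runs dramatically.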

2. The "Hallucination" Guard (Fuzzy Rehydration)

LLMs often "mangle" the XML placeholders during translation (e.g., turning `<PII id="1"/>` into `< PII id = « 1 » >`). We implemented a Fuzzy Tag Matcher that uses flexible regex patterns to detect these artefacts. It identifies the tag even if attributes are reordered or quotes are changed, ensuring we can always map the token back to the original encrypted value.
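A minimal matcher in this spirit might look like the following (the library's actual patterns are more involved; this single regex is an assumption for illustration):

```javascript
// Tolerates stray whitespace, guillemets/changed quotes, reordered
// attributes, and an optional self-closing slash around the id attribute.
const FUZZY_PII = /<\s*PII\b[^>]*?\bid\s*=\s*["'«\s]*(\d+)["'»\s]*[^>]*?\/?\s*>/gi;

function rehydrate(text, piiMap) {
  // Map each recognized (possibly mangled) tag back to its original value;
  // leave the tag untouched if the id is unknown.
  return text.replace(FUZZY_PII, (match, id) => piiMap.get(Number(id)) ?? match);
}
```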

3. Semantic Masking

We are currently working on "Semantic Masking": adding attributes to the PII tag (like `<PII type="PERSON" gender="female" id="1" />`) to preserve gender context for the translation. For now, we rely on a lightweight lookup-table approach to avoid the overhead of a second ML model or the hassle of fine-tuning. So far this works nicely for most use cases.
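The lookup-table approach can be as simple as this sketch (the table contents, `enrichTag` name, and attribute set are assumptions, not our shipped implementation):

```javascript
// Tiny name -> gender lookup; real tables would be much larger and
// locale-aware. Unknown names simply get no gender attribute.
const NAME_GENDER = new Map([
  ["john", "male"],
  ["marie", "female"],
]);

function enrichTag(name, id) {
  const gender = NAME_GENDER.get(name.toLowerCase());
  const genderAttr = gender ? ` gender="${gender}"` : "";
  return `<PII type="PERSON"${genderAttr} id="${id}"/>`;
}
```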

The code is MIT licensed. I’d love to hear how others are handling the "context loss" problem in privacy-preserving NLP pipelines! I think this could quite easily be generalized to other LLM applications as well.

Comments

handfuloflight•1mo ago
This is an awesome share and development. Kudos!
welcome_dragon•1mo ago
Reversible as in you can re-identify? That sounds not secure
bigiain•1mo ago
The post discusses that:

Security First

Because the “PII Map” (the link between ID:1 and John Smith) effectively is the PII, we treat it as sensitive material.

The library includes a crypto module that forces AES-256-GCM encryption for the mapping table. The raw PII never leaves the local memory space, and the state object that persists between the masking and rehydration steps is encrypted at rest.

I've bookmarked this for inspiration for a medium/long term project I am considering building. I'd like to be able to take dumps of our production database and automatically (one way) anonymize it. Replacing all names with meaningless but semantically representative placeholders (gender matching where obvious - Alice, Bob, Mallory, Eve, Trent perhaps, and gender neutral like Jamie or Alex when suitable). Use similar techniques to rewrite email addresses (alice@example.org, bob@example.com, mallory@example.net) and addresses/placenames/whatever else can be pulled out with Named Entity Recognition. I suspect I'll in general be able to do a higher accuracy version of this, since I'll have an understanding of the database structure and we're already in the process of adding metadata about table and column data sensitivity. I will definitely be checking out the regexes and NER models used here.

tjruesch•1mo ago
That sounds interesting! I've been thinking about representative placeholders as well, but while they have their strengths, there are also downsides. We went with an XML tag partly because it clearly marks anonymized text as anonymized (for humans), so mix-ups don't happen. After reading your comment, I think it would also be really interesting to support custom metadata on the tags. Like if you have a username that you want to anonymize, but your database has additional (deterministic) information like the gender, we should add a callback for you as the user to add this information to the tag.
fluidcruft•1mo ago
My hope is it means it assigns coded identifiers and the key remains local. When the document returns, the identifiers can be restored. So the PII itself never leaves the premises.
tjruesch•1mo ago
That's exactly right. The PII stays local (and the PII-Tag-Map is encrypted).
minixalpha•1mo ago
I'd like to know if there's a tool that can automatically replace sensitive information before I paste content into ChatGPT, and then automatically restore the sensitive information when I copy the results from ChatGPT. The logic for both "replacement" and "restoration" should be handled locally on my computer.
dsp_person•1mo ago
I've been thinking about playing with something like this.

I'm curious to what limit you can randomly replace words and reverse it later.

Even with code. Like say take the structure of a big project, but randomly remap words in function names, and to some extent replace business logic with dummy code. Then use cloud LLMs for whatever purpose, and translate back.

tjruesch•1mo ago
try https://playground.rehydra.ai/
bob1029•1mo ago
This is interesting work. My approach so far has been to keep the PII as far away as possible from the LLM. Right now it's salted hashes if it's anything at all.

I would be tempted to try a pseudonymous approach where inbound PII is mapped to a set of consistent, "known good" fake identities as we transition in and out of the AI layer.

The key with PII is to avoid combining factors over time that produce a strong signal. This is a wide spectrum. Some scenarios will be slightly identifying just because they are rare. Zip+gender isn't a very strong signal. Zip+DOB+gender uniquely identifies a large number of people. You don't need to screw up with an email address or tax id. Account balance over time might eventually be sufficient to target one person.

mishrapravin441•1mo ago
This feels broadly useful beyond translation — e.g., prompt sanitization for support agents or RAG pipelines. Have you experimented with feeding the enriched tags directly into LLM prompts (vs MT engines) and how they behave?
tjruesch•1mo ago
I haven't done as much testing as I'd like to confidently answer this in general terms. In our own environment we have the benefit of defining the system prompt for translation, so we can introduce the logic of the tags to the LLM explicitly. That said, in our limited general-purpose testing we've seen that the flagship models definitely capture the logic of the tags and their semantic properties reliably without 'explanation'. I'm currently exploring a general purpose prompt sanitizer and potentially even a browser plugin for behind-the-scenes sanitization in ChatGPT and other end-user interfaces.
zcw100•1mo ago
PII redaction is an interesting problem, but what always concerns me is what gets lost in the marketing. This is always a best-effort redaction, and full redaction of PII can't be guaranteed. I wouldn't run HIPAA data through this, although I know of one company that is doing exactly that.
Ekaros•1mo ago
So I take it the tags are not unique across multiple requests or documents, but the same tags and ids are reused in each document as needed? Because if they were globally unique, the ids themselves would be PII.