Top Secret: Automatically filter sensitive information

75•thunderbong•1d ago

Comments

fine_tune•5h ago

I'm no ruby expert, so forgive my ignorance, but it looks like a small "NER model" packaged as a string convince wrapper named `filter` that tries to filter out "sensitive info" on input strings.

I assume the NER model is small enough to run on CPU at less than 1s~ per pass at the trade off of storage per instance (1s is fast enough in dev, in prod with long convos - that's a lot of inference time), generally a neat idea though.

Couple questions;

- NER doesn't often perform well in different domains, how accurate is the model?

- How do you actually allocate compute/storage for inferring on the NER model?

- Are you batching these `filter` calls or is it just sequential 1 by 1 calls

neilv•4h ago

When I had to implement "deidentification" for a kind of sensitive safety reporting, an LLM would've been a good way to augment the approaches I used.

Today, if I had to do it, I'd probably throw multiple computer approaches at it, including LLM-based one, and take the union of those as the computer result, and check it against a human result. (If computer and human agree, that's a good sign; if they disagree, see why before the document goes where it needs to be deidentified.)

(In some kinds of flight safety reporting, any kind of personnel can submit a report about any observation related to safety. It gets very seriously handled and analyzed. There are also multiple ways in which the reporting parties are protected. There are situations in which some artifacts need to have identifying information redacted.)

dwa3592•3h ago

Oh hey! Good to see this. I built something similar in python a while ago.

Check it out: https://github.com/deepanwadhwa/zink

The shield functionality fits directly in your LLM workflow.

sbpayne•2h ago

This is great but it does not “prevent”; it reduces the chances of. NER is not 100% performant. It is very good in many cases, but use with caution!

wombatpm•1h ago

There is an extension for PostGres, https://postgresql-anonymizer.readthedocs.io that allows you to mask data by user or group at the schema level with the options to return full mask, partial mask or dummy data.

A visual history of Visual C++ (2017)

Show HN: Creao – Vibe coding product for founders

Show HN: JavaScript-free (X)HTML Includes

The theory and practice of selling the Aga cooker (1935) [pdf]

Nitro: A tiny but flexible init system and process supervisor

The first Media over QUIC CDN: Cloudflare

Google says it dropped the energy cost of AI queries by 33x in one year

Shader Academy: Learn computer graphics by solving challenges

I Run a Full Linux Desktop in Docker Just Because I Can

Top Secret: Automatically filter sensitive information

Japan city drafts ordinance to cap smartphone use at 2 hours per day

Glyn: Type-safe PubSub and Registry for Gleam actors with distributed clustering

FFmpeg 8.0

Computer fraud laws used to prosecute leaking air crash footage to CNN

Popular Japanese smartphone games have introduced external payment systems

Developer gets 4 years for activating network "kill switch" to avenge his firing

Why is this hard?

My tips for using LLM agents to create software

Bluesky Goes Dark in Mississippi over Age Verification Law

From M1 MacBook to Arch Linux: A month-long experiment that became permanenent

Transcribe music in abc with syntax highlighting

Launch HN: BlankBio (YC S25) - Making RNA Programmable

LabPlot: Free, open source and cross-platform Data Visualization and Analysis

Leaving Gmail for Mailbox.org

The use of LLM assistants for kernel development

The issue of anti-cheat on Linux (2024)

U.S. government takes 10% stake in Intel

Mail Carriers Pause US Deliveries as Tariff Shift Sows Confusion

Closing the Nix gap: From environments to packaged applications for rust

It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019)

Top Secret: Automatically filter sensitive information

Comments

A visual history of Visual C++ (2017)

Show HN: Creao – Vibe coding product for founders

Show HN: JavaScript-free (X)HTML Includes

The theory and practice of selling the Aga cooker (1935) [pdf]

Nitro: A tiny but flexible init system and process supervisor

The first Media over QUIC CDN: Cloudflare

Google says it dropped the energy cost of AI queries by 33x in one year

Shader Academy: Learn computer graphics by solving challenges

I Run a Full Linux Desktop in Docker Just Because I Can

Top Secret: Automatically filter sensitive information

Japan city drafts ordinance to cap smartphone use at 2 hours per day

Glyn: Type-safe PubSub and Registry for Gleam actors with distributed clustering

FFmpeg 8.0

Computer fraud laws used to prosecute leaking air crash footage to CNN

Popular Japanese smartphone games have introduced external payment systems

Developer gets 4 years for activating network "kill switch" to avenge his firing

Why is this hard?

My tips for using LLM agents to create software

Bluesky Goes Dark in Mississippi over Age Verification Law

From M1 MacBook to Arch Linux: A month-long experiment that became permanenent

Transcribe music in abc with syntax highlighting

Launch HN: BlankBio (YC S25) - Making RNA Programmable

LabPlot: Free, open source and cross-platform Data Visualization and Analysis

Leaving Gmail for Mailbox.org

The use of LLM assistants for kernel development

The issue of anti-cheat on Linux (2024)

U.S. government takes 10% stake in Intel

Mail Carriers Pause US Deliveries as Tariff Shift Sows Confusion

Closing the Nix gap: From environments to packaged applications for rust

It’s not wrong that "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7 (2019)