GLiNER2-PII: 0.3B open-source PII model outperforms OpenAI's Privacy Filter

https://pioneer.ai/research/gliner2-pii-a-multilingual-model-for-personally-identifiable-information-extraction

2•neon_share1•1h ago

Comments

neon_share1•1h ago

Hi HackerNews,

We’re Ash and George from Fastino Labs, and today we’re releasing GLiNER2-PII, a 0.3B-parameter open-source encoder model for PII detection.

Removing personally identifiable information (PII) from documentation and data sources continues to be a challenge. Because PII looks different depending on the country, context, and document type, it’s difficult for most models to keep up.

GLiNER2-PII addresses this with a compact 0.3B-parameter encoder architecture that outperforms OpenAI's Privacy Filter and all existing GLiNER PII variants.

In addition to supporting zero-shot extraction of unseen entity types, it was fine-tuned on 42 fine-grained entity types across seven semantic categories:

- API Keys, Passwords and Credentials
- Person & Identity
- Contact & Location
- Government & Tax Identifiers
- Banking & Payment
- Digital Identity
- Sensitive Dates
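For anyone wondering how detected entities turn into redacted text, here is a minimal sketch. It does not call the model: the `(start, end, label)` span format, the `redact` helper, and the example labels are all hypothetical stand-ins for whatever a detector returns, and it assumes spans do not overlap.

```python
from typing import List, Tuple

# Hypothetical span format: (start offset, end offset exclusive, label).
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a bracketed label placeholder.

    Assumes spans do not overlap. Spans are applied right to left so that
    earlier offsets stay valid after each replacement.
    """
    out = text
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        out = out[:start] + f"[{label.upper()}]" + out[end:]
    return out


text = "Contact Jane Doe at jane@example.com."
# Hand-written spans standing in for detector output.
spans = [(8, 16, "person"), (20, 36, "email")]
redacted = redact(text, spans)
print(redacted)
```

Keeping the label in the placeholder preserves some context for downstream readers while removing the value itself.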

On the SPY benchmark, GLiNER2-PII achieves the highest span-level F1 (0.471) across legal and medical documents, outperforming OpenAI's Privacy Filter and all existing GLiNER PII variants. Notably, it maintains high recall (0.722 legal / 0.681 medical) while preserving competitive precision.
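For readers unfamiliar with span-level scoring, here is a rough sketch of how precision, recall, and F1 can be computed over entity spans. It assumes exact matching on `(start, end, label)`; the post does not say which matching rule the SPY evaluation uses, so treat this as an illustration of the metric rather than the benchmark's scoring code. The `span_prf` helper and the example spans are made up for the sketch.

```python
from typing import Set, Tuple

# Hypothetical span format: (start offset, end offset exclusive, label).
Span = Tuple[int, int, str]


def span_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match span-level precision, recall, and F1.

    A prediction counts as a true positive only if its start, end, and
    label all match a gold span (an assumption, not the benchmark's rule).
    """
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: 3 gold spans, 4 predictions, 2 of them exact matches.
gold = {(0, 8, "person"), (20, 36, "email"), (50, 61, "phone")}
pred = {(0, 8, "person"), (20, 36, "email"), (40, 45, "date"), (70, 75, "person")}
p, r, f = span_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

A model that over-predicts keeps recall high but pays for it in precision, which is why F1 is reported alongside recall.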

Training data was generated synthetically using our Pioneer Agent framework, producing multilingual annotated examples across document types, locales, and entity distributions.

GLiNER2-PII is part of the GLiNER family of models for named entity recognition, text classification, and structured extraction.

We are happy to release GLiNER2-PII to the open-source community under the Apache 2.0 license.

Model weights are available now on Hugging Face.

Model: https://huggingface.co/fastino/gliner2-privacy-filter-PII-mu...

Read the blog: https://pioneer.ai/research/gliner2-pii-a-multilingual-model...
