Cross-Lingual News Dedup at $100/Month – Embeddings, Pgvector, and UnionFind

https://yingjiezhao.com/en/articles/Cross-Lingual-News-Dedup-at-100-Dollar-a-Month/

2•ethan_zhao•1h ago

Comments

ethan_zhao•1h ago

Author here. I built this for 3mins.news, an AI news aggregator covering 180+ sources in 17 languages. The trickiest part was figuring out that articles in different languages about the same event share zero tokens — MinHash/LSH gives you Jaccard similarity of 0.
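To make the zero-overlap point concrete, here is a minimal sketch of the Jaccard similarity that MinHash estimates, computed over character 3-shingles. The shingle size, the helper names, and the example headlines are illustrative assumptions, not taken from the article.

```python
def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of character k-shingles of a lowercased string."""
    t = text.lower()
    return {t[i:i + k] for i in range(len(t) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; this is what MinHash approximates."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical headlines about the same event.
en = "EU approves AI regulation"
en_variant = "EU approves new AI regulation"
zh = "欧盟批准人工智能法规"

same_language = jaccard(shingles(en), shingles(en_variant))  # greater than 0
cross_language = jaccard(shingles(en), shingles(zh))         # exactly 0.0
```

The two English headlines share most of their shingles, while the English and Chinese headlines share none, so any method built on token or shingle overlap has no signal to work with across scripts.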

Happy to answer questions about the pgvector setup, Cloudflare Workers constraints, or the clustering algorithm tuning.

yugoru•1h ago

It's harder than it first appears. Even with good embeddings, semantic similarity across languages often breaks when articles include local context or idioms. Curious whether you found a threshold strategy that works reliably across languages, or if it still needs manual tuning.

ethan_zhao•58m ago

Good question. The short answer: a single global threshold (cosine similarity ≥ 0.7) works surprisingly well for news, but it's not because embeddings handle idioms perfectly — it's because news articles are structurally constrained.

News articles about the same event tend to share named entities (people, places, organizations), numbers, and factual structure even across languages. "EU approves AI regulation" is a factual statement that embeds similarly regardless of language. This is very different from, say, opinion pieces or cultural commentary where idioms and local framing would diverge more.

That said, similarity alone isn't enough. The real reliability comes from non-semantic constraints layered on top:

- Time gap ≤ 18 hours between article and story — prevents "same topic, different month" false merges

- Story age ≤ 36 hours — old stories stop absorbing new articles

- Two-pass design — matching against refined story embeddings (average of recent articles) is more stable than raw article-to-article comparison
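A rough sketch of how the three constraints above could combine with the 0.7 threshold. This is not the production code: the `Story` class, the field names, the window of 5 recent articles, measuring the time gap against the story's latest article, and measuring story age from its creation time are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime, timedelta

SIM_THRESHOLD = 0.7                  # cosine similarity >= 0.7
MAX_TIME_GAP = timedelta(hours=18)   # article vs. story (assumed: story's latest article)
MAX_STORY_AGE = timedelta(hours=36)  # assumed: measured from story creation


@dataclass
class Story:
    created_at: datetime
    last_article_at: datetime
    embeddings: list[list[float]] = field(default_factory=list)

    def centroid(self, recent: int = 5) -> list[float]:
        """Average of the most recent article embeddings (window size is a guess)."""
        vecs = self.embeddings[-recent:]
        return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def best_story(vec: list[float], published_at: datetime, stories: list[Story]):
    """Return the most similar eligible story, or None if nothing qualifies."""
    best, best_sim = None, SIM_THRESHOLD
    for story in stories:
        if published_at - story.created_at > MAX_STORY_AGE:
            continue  # old stories stop absorbing new articles
        if abs(published_at - story.last_article_at) > MAX_TIME_GAP:
            continue  # blocks "same topic, different month" merges
        sim = cosine(vec, story.centroid())
        if sim >= best_sim:
            best, best_sim = story, sim
    return best
```

The time filters run before the similarity check, so a stale story is never considered no matter how close its embedding is.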

Where it does break: regional stories with heavy local context. A Japanese domestic politics article and an English wire service summary of the same event sometimes land just below threshold because the framing is so different. I accept some missed merges there rather than lowering the threshold and getting false positives.

No per-language thresholds so far — the embedding model (Qwen3) seems to normalize well across the languages I cover. But I wouldn't be surprised if that changes when adding languages with less training data representation.
