We’ve just released SemHash v0.3.0, a major rework of our open-source text pre-processing library. We’ve added two new features: outlier filtering and representative sampling. The core API has been reworked so that all of these features can be used together intuitively. The new features reuse the approximate nearest neighbors index that we already built for semantic deduplication, so they run very quickly once the index has been built on your dataset. The core package can now be used for:

- Semantic Deduplication: Remove semantic duplicates from your dataset. This can prevent train/test set overlap in classification tasks, or prevent duplicate samples in RAG/semantic search.

- Outlier Filtering: Surface and filter the most anomalous samples from your dataset. This can help with automated removal of low quality data, or data that should not be in your dataset.

- Representative Sampling: Select the most central and diverse examples using Maximal Marginal Relevance. This can help you quickly explore and understand a dataset, or even build a small, diverse, high quality dataset, for example for LLM finetuning.

We’ve designed these features in the same way as our semantic deduplication: CPU friendly, lightweight, and explainable.
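To make the three operations concrete, here is a minimal pure-Python sketch of the underlying ideas. It is not SemHash's API or implementation: the function names, thresholds, and toy vectors are all hypothetical, a brute-force similarity matrix stands in for the approximate nearest neighbors index, and "centrality" is assumed here to mean average similarity to the rest of the dataset.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_matrix(vectors):
    """Brute-force stand-in for an ANN index: all pairwise similarities."""
    return [[cosine(a, b) for b in vectors] for a in vectors]

def deduplicate(sim, threshold=0.9):
    """Keep an item only if no already-kept item is at least `threshold` similar."""
    kept = []
    for i in range(len(sim)):
        if all(sim[i][j] < threshold for j in kept):
            kept.append(i)
    return kept

def outlier_scores(sim):
    """Score each item as 1 - similarity to its nearest other item (higher = more anomalous)."""
    n = len(sim)
    return [1.0 - max(sim[i][j] for j in range(n) if j != i) for i in range(n)]

def mmr_select(sim, k, lam=0.5):
    """Maximal Marginal Relevance: trade off centrality against redundancy."""
    n = len(sim)
    # Assumption: relevance is the mean similarity to every item in the dataset.
    relevance = [sum(row) / n for row in sim]
    selected = []
    remaining = list(range(n))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy embeddings: two near-duplicate pairs and one unrelated item.
vectors = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.05, 0.99, 0.0],
    [0.0, 0.0, 1.0],
]
sim = similarity_matrix(vectors)
print("kept after deduplication:", deduplicate(sim))
print("outlier scores:", outlier_scores(sim))
print("representatives:", mmr_select(sim, k=2))
```

Because every decision traces back to a pairwise similarity, each removed or selected sample can be explained by pointing at the neighbor that triggered it.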

We hope these features help you create cleaner datasets, or simply understand your data better. We’re curious to hear your feedback, and whether there are any other features you think would improve SemHash further!

Show HN: SemHash – Semantic Text Deduplication, Outlier Filtering and Sampling