frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

Show HN: SemHash – Semantic Text Deduplication, Outlier Filtering and Sampling

https://github.com/MinishLab/semhash
7•Tananon•9mo ago
We’ve just released SemHash v0.3.0, a major rework of our open-source text pre-processing library. We’ve added two new functionalities: outlier filtering & representative sampling. The core API has been reworked to make sure all of these features can be used together in an intuitive way. Our new features use the existing approximate nearest neighbors index that we already used for semantic deduplication, so they can be ran very quickly after building the index on your dataset. The core package can now be used for:

- Semantic Deduplication: Remove semantic duplicates from your dataset. This can prevent train/test set overlap in classification tasks, or prevent duplicate samples in RAG/semantic search.

- Outlier Filtering: Surface and filter the most anomalous samples from your dataset. This can help with automated removal of low quality data, or data that should not be in your dataset.

- Representative Sampling: Select the most central and diverse examples using Maximal Marginal Relevance. This can help you quickly explore and understand a dataset, or even build a small, diverse, high quality dataset, for example for LLM finetuning.

We’ve designed these features in the same way as our semantic deduplication: CPU friendly, lightweight, and explainable.

We hope these features help you create cleaner datasets, or simply understand your data better. We’re curious to hear your feedback, and whether there are any other features you think would improve SemHash further!

Show HN: SafeClaw – a way to manage multiple Claude Code instances in containers

https://github.com/ykdojo/safeclaw
1•ykdojo•2m ago•0 comments

The Future of the Global Open-Source AI Ecosystem: From DeepSeek to AI+

https://huggingface.co/blog/huggingface/one-year-since-the-deepseek-moment-blog-3
1•gmays•2m ago•0 comments

The Evolution of the Interface

https://www.asktog.com/columns/038MacUITrends.html
1•dhruv3006•4m ago•0 comments

Azure: Virtual network routing appliance overview

https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-routing-appliance-overview
1•mariuz•4m ago•0 comments

Seedance2 – multi-shot AI video generation

https://www.genstory.app/story-template/seedance2-ai-story-generator
1•RyanMu•8m ago•1 comments

Πfs – The Data-Free Filesystem

https://github.com/philipl/pifs
1•ravenical•11m ago•0 comments

Go-busybox: A sandboxable port of busybox for AI agents

https://github.com/rcarmo/go-busybox
2•rcarmo•12m ago•0 comments

Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery [pdf]

https://research.nvidia.com/labs/nemotron/files/NVFP4-QAD-Report.pdf
1•gmays•13m ago•0 comments

xAI Merger Poses Bigger Threat to OpenAI, Anthropic

https://www.bloomberg.com/news/newsletters/2026-02-03/musk-s-xai-merger-poses-bigger-threat-to-op...
1•andsoitis•13m ago•0 comments

Atlas Airborne (Boston Dynamics and RAI Institute) [video]

https://www.youtube.com/watch?v=UNorxwlZlFk
1•lysace•14m ago•0 comments

Zen Tools

http://postmake.io/zen-list
1•Malfunction92•16m ago•0 comments

Is the Detachment in the Room? – Agents, Cruelty, and Empathy

https://hailey.at/posts/3mear2n7v3k2r
1•carnevalem•17m ago•0 comments

The purpose of Continuous Integration is to fail

https://blog.nix-ci.com/post/2026-02-05_the-purpose-of-ci-is-to-fail
1•zdw•19m ago•0 comments

Apfelstrudel: Live coding music environment with AI agent chat

https://github.com/rcarmo/apfelstrudel
1•rcarmo•20m ago•0 comments

What Is Stoicism?

https://stoacentral.com/guides/what-is-stoicism
3•0xmattf•20m ago•0 comments

What happens when a neighborhood is built around a farm

https://grist.org/cities/what-happens-when-a-neighborhood-is-built-around-a-farm/
1•Brajeshwar•20m ago•0 comments

Every major galaxy is speeding away from the Milky Way, except one

https://www.livescience.com/space/cosmology/every-major-galaxy-is-speeding-away-from-the-milky-wa...
2•Brajeshwar•20m ago•0 comments

Extreme Inequality Presages the Revolt Against It

https://www.noemamag.com/extreme-inequality-presages-the-revolt-against-it/
2•Brajeshwar•21m ago•0 comments

There's no such thing as "tech" (Ten years later)

1•dtjb•22m ago•0 comments

What Really Killed Flash Player: A Six-Year Campaign of Deliberate Platform Work

https://medium.com/@aglaforge/what-really-killed-flash-player-a-six-year-campaign-of-deliberate-p...
1•jbegley•22m ago•0 comments

Ask HN: Anyone orchestrating multiple AI coding agents in parallel?

1•buildingwdavid•23m ago•0 comments

Show HN: Knowledge-Bank

https://github.com/gabrywu-public/knowledge-bank
1•gabrywu•29m ago•0 comments

Show HN: The Codeverse Hub Linux

https://github.com/TheCodeVerseHub/CodeVerseLinuxDistro
3•sinisterMage•30m ago•2 comments

Take a trip to Japan's Dododo Land, the most irritating place on Earth

https://soranews24.com/2026/02/07/take-a-trip-to-japans-dododo-land-the-most-irritating-place-on-...
2•zdw•30m ago•0 comments

British drivers over 70 to face eye tests every three years

https://www.bbc.com/news/articles/c205nxy0p31o
43•bookofjoe•30m ago•15 comments

BookTalk: A Reading Companion That Captures Your Voice

https://github.com/bramses/BookTalk
1•_bramses•31m ago•0 comments

Is AI "good" yet? – tracking HN's sentiment on AI coding

https://www.is-ai-good-yet.com/#home
3•ilyaizen•32m ago•1 comments

Show HN: Amdb – Tree-sitter based memory for AI agents (Rust)

https://github.com/BETAER-08/amdb
1•try_betaer•33m ago•0 comments

OpenClaw Partners with VirusTotal for Skill Security

https://openclaw.ai/blog/virustotal-partnership
2•anhxuan•33m ago•0 comments

Show HN: Seedance 2.0 Release

https://seedancy2.com/
2•funnycoding•34m ago•0 comments