- Semantic Deduplication: Remove semantic duplicates from your dataset. This can prevent train/test set overlap in classification tasks, or prevent duplicate samples in RAG/semantic search.
- Outlier Filtering: Surface and filter the most anomalous samples from your dataset. This can help automate the removal of low-quality data, or of data that does not belong in your dataset.
- Representative Sampling: Select the most central and diverse examples using Maximal Marginal Relevance. This can help you quickly explore and understand a dataset, or build a small, diverse, high-quality dataset, for example for LLM fine-tuning.
We’ve designed these features in the same way as our semantic deduplication: CPU-friendly, lightweight, and explainable.
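To give a feel for the idea behind Maximal Marginal Relevance, here is a minimal, dependency-free sketch. It is not SemHash's implementation: the function names (`cosine`, `mmr_select`), the `lambda_param` trade-off weight, and the choice to score "centrality" as cosine similarity to the dataset centroid are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(vectors, k, lambda_param=0.5):
    """Greedily pick up to k indices, trading off centrality against redundancy.

    Assumption for this sketch: "relevance" is similarity to the centroid
    of all vectors; a real system may define it differently.
    """
    if not vectors or k <= 0:
        return []

    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    relevance = [cosine(v, centroid) for v in vectors]

    selected = []
    candidates = set(range(len(vectors)))
    while candidates and len(selected) < k:
        best_index, best_score = None, float("-inf")
        for i in sorted(candidates):  # sorted for deterministic tie-breaking
            # Redundancy: similarity to the closest already-selected item.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            score = lambda_param * relevance[i] - (1 - lambda_param) * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        candidates.remove(best_index)
    return selected


# Two near-identical vectors and one that points elsewhere.
embeddings = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
balanced = mmr_select(embeddings, k=2, lambda_param=0.5)
centrality_only = mmr_select(embeddings, k=2, lambda_param=1.0)
```

With `lambda_param=1.0` the selection ranks purely by centrality, so near-duplicates can be picked together; lowering it penalizes candidates that resemble items already chosen, which is what makes the resulting sample diverse.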
We hope these features help you create cleaner datasets, or simply understand your data better. We’re curious to hear your feedback, and whether there are any other features you think would improve SemHash further!