Show HN: Chonky – neural semantic text chunking goes multilingual

https://huggingface.co/mirth/chonky_mmbert_small_multilingual_1

43•hessdalenlight•3mo ago

TLDR: I’m expanding the family of text-splitting Chonky models with a new multilingual model.

You can learn more about this neural approach in a previous post: https://news.ycombinator.com/item?id=43652968

Since the release of the first DistilBERT-based model, I’ve released two more models based on ModernBERT. All of these models were pre-trained and fine-tuned primarily on English texts.

But recently mmBERT (https://huggingface.co/blog/mmbert) was released. This model was pre-trained on a massive dataset that covers 1833 languages, so I had the idea of fine-tuning a new multilingual Chonky model.

I’ve expanded the training dataset (which previously contained the bookcorpus and minipile datasets) with the Project Gutenberg dataset, which contains books in some widespread languages.

To make the model more robust on real-world data, I removed the punctuation from the last word of each training chunk with a probability of 0.15 (I didn’t run an ablation for this technique, though).
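A minimal sketch of what such an augmentation step could look like, not the actual training code. The function name `drop_final_punctuation` and the use of Unicode punctuation categories (so that marks like "。" are handled too) are assumptions made for illustration; only the probability of 0.15 comes from the post.

```python
import random
import unicodedata


def drop_final_punctuation(chunk, p=0.15, rng=None):
    """With probability p, strip trailing punctuation from the chunk's last word.

    Hypothetical helper: the post only states that punctuation is removed
    from the last word with probability 0.15, not how it is implemented.
    """
    rng = rng or random.Random()
    if rng.random() >= p:
        return chunk  # leave most chunks untouched

    # Ignore trailing whitespace, then walk back over punctuation characters.
    end = len(chunk.rstrip())
    while end > 0 and unicodedata.category(chunk[end - 1]).startswith("P"):
        end -= 1
    return chunk[:end]


if __name__ == "__main__":
    rng = random.Random(0)
    for text in ["First paragraph ends here.", "Is this a question?"]:
        print(repr(drop_final_punctuation(text, p=1.0, rng=rng)))
```

Checking the Unicode category instead of an ASCII punctuation list is one way to keep the step language-agnostic, which seems relevant for a multilingual model, but that choice is mine rather than something the post describes.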

The hard part is evaluation. Real-world data is typically OCR'ed markdown, call transcripts, meeting notes, etc., not clean book paragraphs. I couldn’t find labeled datasets of that kind, so I used what I had: the already mentioned bookcorpus and Project Gutenberg validation sets, Paul Graham's essays, and concatenated 20_newsgroups.
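For anyone who wants to run a similar evaluation on their own texts, here is a rough sketch of one way to score a splitter against known paragraph breaks. The post does not say which metrics were used, so the boundary precision/recall/F1 below, and both function names, are assumptions for illustration only.

```python
def boundaries_from_paragraphs(paragraphs):
    """Join paragraphs with single spaces and record where each one ends.

    Returns the flattened text plus the character offsets of the gold
    boundaries (the end of every paragraph except the last).
    """
    offsets = []
    pos = 0
    for para in paragraphs[:-1]:
        pos += len(para)
        offsets.append(pos)
        pos += 1  # account for the joining space
    return " ".join(paragraphs), offsets


def boundary_scores(gold, predicted):
    """Exact-match precision, recall and F1 over boundary offsets."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    text, gold = boundaries_from_paragraphs(["ab", "cde", "f"])
    print(text, gold)
    print(boundary_scores(gold, [2, 5]))
```

Exact-offset matching is strict. A tolerance window around each gold boundary may be fairer for noisy text such as transcripts.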

I also tried to fine-tune the bigger mmBERT model (mmbert-base), but unfortunately it didn’t go well: the metrics are weirdly lower than those of the small model.

Please give it a try. I'd appreciate any feedback.

The new multilingual model: https://huggingface.co/mirth/chonky_mmbert_small_multilingua...

All the Chonky models: https://huggingface.co/mirth

Chonky wrapper library: https://github.com/mirth/chonky

Comments

kamranjon•3mo ago

This is interesting! I once trained a T5 model by removing newlines from Wikipedia text and it worked surprisingly well. At the time, context length was the biggest issue.

Another issue, not so easy to solve, was conversational dialogue data, which wasn’t well represented in the training data.

I’ve always wanted to come back to working on the problem, because I think it’s very interesting, and we will have a bunch of unstructured text as a result of STT models like Whisper, which do a great job of transcribing/translating but generally don’t format anything.

nvdnadj92•3mo ago

In case you need conversational data for the experiment you want to try, I developed an open-source CLI tool [1] that creates transcripts from voice chats on Discord. Feel free to try it out!

[1] https://github.com/naveedn/audio-transcriber

CjHuber•3mo ago

Took me a minute to realize this is not about Chonkie. I would be interested in how this compares to Chonkie's semantic chunking approach.

jimmySixDOF•3mo ago

You can read the labels: this one (-y) uses ModernBERT and even has an eval comparison to the other (-ie) in its GitHub repo, so you can see the improvement as tested. That said, if you want vanilla rule-based chunking for whatever reason your data needs, then (-ie) is still good.

TZubiri•3mo ago

That example looks terribly useless. Maybe there's an actually useful application you had in mind? I don't know, say:

Chonk("Hey I forgot my password, this is Tom from X Company") = ("Hey", "I forgot my password", "this is Tom from X Company")

Even then it doesn't quite look helpful.

freakynit•3mo ago

This is absolutely useless. Tried a few examples yesterday using hf demo. Fcking retarded af.

It literally split the text in the middle of related passages while keeping unrelated passages together, even though the embedding limit was far off.

I genuinely wanted this to work. I mean this. But nop. This shit did not work at all.

RAG is still fcked because of chunking issues. GraphRAG doesn't work correctly either unless you are willing to throw a lot of money at it during ingestion.
