frontpage.

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
256•isitcontent•19h ago•27 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
355•vecti•21h ago•161 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
329•eljojo•21h ago•199 comments

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

https://github.com/sandys/kappal
13•sandGorgon•2d ago•3 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
79•phreda4•18h ago•14 comments

Show HN: Smooth CLI – Token-efficient browser for AI agents

https://docs.smooth.sh/cli/overview
94•antves•2d ago•70 comments

Show HN: MCP App to play backgammon with your LLM

https://github.com/sam-mfb/backgammon-mcp
3•sam256•3h ago•1 comment

Show HN: XAPIs.dev – Twitter API Alternative at 90% Lower Cost

https://xapis.dev
3•nmfccodes•58m ago•1 comment

Show HN: I'm 75, building an OSS Virtual Protest Protocol for digital activism

https://github.com/voice-of-japan/Virtual-Protest-Protocol/blob/main/README.md
6•sakanakana00•4h ago•1 comment

Show HN: I built Divvy to split restaurant bills from a photo

https://divvyai.app/
3•pieterdy•4h ago•1 comment

Show HN: Slack CLI for Agents

https://github.com/stablyai/agent-slack
52•nwparker•1d ago•11 comments

Show HN: BioTradingArena – Benchmark for LLMs to predict biotech stock movements

https://www.biotradingarena.com/hn
26•dchu17•23h ago•12 comments

Show HN: Artifact Keeper – Open-Source Artifactory/Nexus Alternative in Rust

https://github.com/artifact-keeper
152•bsgeraci•1d ago•64 comments

Show HN: ARM64 Android Dev Kit

https://github.com/denuoweb/ARM64-ADK
17•denuoweb•2d ago•2 comments

Show HN: Gigacode – Use OpenCode's UI with Claude Code/Codex/Amp

https://github.com/rivet-dev/sandbox-agent/tree/main/gigacode
19•NathanFlurry•1d ago•9 comments

Show HN: I Hacked My Family's Meal Planning with an App

https://mealjar.app
2•melvinzammit•6h ago•0 comments

Show HN: I built a free UCP checker – see if AI agents can find your store

https://ucphub.ai/ucp-store-check/
2•vladeta•6h ago•2 comments

Show HN: Compile-Time Vibe Coding

https://github.com/Michael-JB/vibecode
10•michaelchicory•8h ago•1 comment

Show HN: Micropolis/SimCity Clone in Emacs Lisp

https://github.com/vkazanov/elcity
173•vkazanov•2d ago•49 comments

Show HN: Slop News – HN front page now, but it's all slop

https://dosaygo-studio.github.io/hn-front-page-2035/slop-news
17•keepamovin•9h ago•5 comments

Show HN: Falcon's Eye (isometric NetHack) running in the browser via WebAssembly

https://rahuljaguste.github.io/Nethack_Falcons_Eye/
6•rahuljaguste•18h ago•1 comment

Show HN: Daily-updated database of malicious browser extensions

https://github.com/toborrm9/malicious_extension_sentry
14•toborrm9•1d ago•8 comments

Show HN: Horizons – OSS agent execution engine

https://github.com/synth-laboratories/Horizons
23•JoshPurtell•1d ago•5 comments

Show HN: Local task classifier and dispatcher on RTX 3080

https://github.com/resilientworkflowsentinel/resilient-workflow-sentinel
25•Shubham_Amb•1d ago•2 comments

Show HN: Fitspire – a simple 5-minute workout app for busy people (iOS)

https://apps.apple.com/us/app/fitspire-5-minute-workout/id6758784938
2•devavinoth12•11h ago•0 comments

Show HN: I built a RAG engine to search Singaporean laws

https://github.com/adityaprasad-sudo/Explore-Singapore
4•ambitious_potat•12h ago•4 comments

Show HN: Sem – Semantic diffs and patches for Git

https://ataraxy-labs.github.io/sem/
2•rs545837•13h ago•1 comment

Show HN: A password system with no database, no sync, and nothing to breach

https://bastion-enclave.vercel.app
12•KevinChasse•1d ago•16 comments

Show HN: Craftplan – I built my wife a production management tool for her bakery

https://github.com/puemos/craftplan
568•deofoo•5d ago•166 comments

Show HN: GitClaw – An AI assistant that runs in GitHub Actions

https://github.com/SawyerHood/gitclaw
10•sawyerjhood•1d ago•0 comments

Show HN: Chonky – a neural approach for text semantic chunking

https://github.com/mirth/chonky
169•hessdalenlight•10mo ago
TLDR: I’ve made a transformer model and a wrapper library that segments text into meaningful semantic chunks.

Current text-splitting approaches rely on heuristics (although one can use a neural embedder to group semantically related sentences).

I propose a fully neural approach to semantic chunking.

I took the base DistilBERT model and trained it on BookCorpus to split concatenated text paragraphs back into the original paragraphs. Basically it's a token classification task. Fine-tuning took a day and a half on 2x 1080 Ti GPUs.
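
For illustration, here is roughly how such training examples can be constructed; the labeling scheme below is an assumption for the sketch, not necessarily the exact one used for Chonky:

    # Sketch: turning paragraphs into token-classification training data.
    # Label 1 = "this token ends a paragraph", 0 = everything else.
    # The labeling scheme is illustrative; Chonky's may differ.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def make_example(paragraphs):
        text = " ".join(paragraphs)
        enc = tokenizer(text, return_offsets_mapping=True, truncation=True)

        # Character positions (in the joined text) where each paragraph ends.
        para_ends, pos = set(), 0
        for p in paragraphs:
            para_ends.add(pos + len(p) - 1)
            pos += len(p) + 1  # +1 for the joining space

        # A token is positive if its last character ends a paragraph.
        labels = [1 if (end - 1) in para_ends else 0
                  for _, end in enc["offset_mapping"]]
        return enc["input_ids"], labels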

The library could be used as a text splitter module in a RAG system, or for splitting transcripts, for example.

The usage pattern that I see is the following: strip all the markup tags to produce pure text and feed this text into the model.
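
A minimal sketch of that pattern; the ParagraphSplitter name and call signature here are assumptions for illustration, so check the chonky README for the actual interface:

    # Strip markup to pure text, then feed it to the splitter.
    # The ParagraphSplitter import/constructor is an assumed interface,
    # not necessarily chonky's exact API; see the project README.
    from bs4 import BeautifulSoup
    from chonky import ParagraphSplitter

    html = open("page.html", encoding="utf-8").read()
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")

    splitter = ParagraphSplitter(device="cpu")
    for chunk in splitter(text):
        print("---")
        print(chunk)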

The problem is that, although in theory this should improve overall RAG pipeline performance, I didn't manage to measure it properly. Other limitations: the model only supports English for now, and the output text is lowercased.

Please give it a try. I'd appreciate any feedback.

The Python library: https://github.com/mirth/chonky

The transformer model: https://huggingface.co/mirth/chonky_distilbert_base_uncased_...

Comments

jaggirs•10mo ago
Did you evaluate it on a RAG benchmark?
hessdalenlight•10mo ago
No, I haven't yet. I'd be grateful if you could suggest such a benchmark.
jaggirs•10mo ago
Not sure, haven't done so myself, but I think you could maybe use MTEB. Or otherwise an LLM benchmark on large inputs (and compare your chunking with naive chunking).
suddenlybananas•10mo ago
I feel you could improve your README.md considerably just by showing the actual output of the little snippet you show.
HeavyStorm•10mo ago
Came here to write exactly that. The author includes a long sentence in the sample, so it should also show us the output.
hessdalenlight•10mo ago
Just fixed it.
mentalgear•10mo ago
I applaud the FOSS initiative, but as with anything ML: benchmarks, please, so we can see what test cases are covered and how well they align with a project's needs.
petesergeant•10mo ago
Love that people are trying to improve chunkers, but just some examples of how it chunked some input text in the README would go a long way here!
mathis-l•10mo ago
You might want to take a look at https://github.com/segment-any-text/wtpsplit

It uses a similar approach, but the focus is on sentence/paragraph segmentation generally rather than specifically on RAG. It also has some benchmarks. Might be a good source of inspiration for where to take chonky next.

vunderba•10mo ago
This is the library that I use, mainly on very noisy IRC chat transcripts, and it works pretty well. OP, I'd love to see a paragraph-matching comparison benchmark against wtpsplit to see how well Chonky stacks up.
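
One simple shape such a comparison could take (purely illustrative, not an established benchmark): score each splitter's predicted break positions against the gold paragraph breaks.

    # Precision/recall/F1 of predicted paragraph breaks (e.g. from chonky or
    # wtpsplit) against gold breaks. Assumes both splitters return chunks that
    # concatenate back to the same text.
    def boundary_positions(chunks):
        ends, pos = set(), 0
        for chunk in chunks[:-1]:
            pos += len(chunk)
            ends.add(pos)
        return ends

    def boundary_scores(gold_chunks, pred_chunks, tolerance=0):
        gold = boundary_positions(gold_chunks)
        pred = boundary_positions(pred_chunks)
        tp = sum(any(abs(p - g) <= tolerance for g in gold) for p in pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1
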
oezi•10mo ago
Just to understand: the model is trained to put paragraph breaks into text, and the training dataset is books (in contrast to, for instance, scientific articles or advertising flyers).

It shouldn't break sentences at commas, right?

hessdalenlight•10mo ago
No, it shouldn't, but since it's a neural net there is a small chance.
sushidev•10mo ago
So I could use this to index, e.g., a fiction book in a vector DB, right? And the semantic chunking will possibly provide better results at query time for RAG, did I understand that correctly?
hessdalenlight•10mo ago
Yes and yes you are correct!
acstorage•10mo ago
You mention that fine-tuning took half a day; have you thought about ways to reduce that time?
hessdalenlight•10mo ago
Actually a day and a half :). I'm all for it, but unfortunately I have pretty old hardware.
dmos62•10mo ago
Pretty cool. What use case did you have for this? Text with paragraph breaks missing seems fairly exotic.
cckolon•10mo ago
This would be useful when chunking PDFs scanned with OCR. I've done that before and paragraph breaks were detected pretty inconsistently.
cmenge•10mo ago
> I took the base distilbert model

I read "the base Dilbert model", all sorts of weird ideas going through my head, concluded I should re-read and made the same mistake again XD

Guess I better take a break and go for a walk now...

michaelmarkell•10mo ago
It seems to me like chunking (or some higher order version of it like chunking into knowledge graphs) is the highest leverage thing someone can work on right now if trying to improve intelligence of AI systems like code completion, PDF understanding etc. I’m surprised more people aren’t working on this.
serjester•10mo ago
Chunking is less important in the long context era with most people just pulling in top 20 K. You obviously don’t want to butcher it, but you’ve got a lot of room for error.
lmeyerov•10mo ago
Yeah exactly

We still want chunking in practice to avoid LLM confusion and undifferentiated embeddings, and to handle large datasets at lower cost and higher volume. Large context means we can now tolerate multi-paragraph/page chunks, so it's more like chunking by coherent section.

In theory we can do an entire chapter/book, but those other concerns come in, so I only see more niche tools or talk-to-your-PDF apps do that.

At the same time, embedding is often a significant cost in the above scenarios, so I'm curious about the semantic chunking overhead.

michaelmarkell•10mo ago
In our use case we have many gigabytes of PDFs that contain some qualitative data but also many pages of inline PDF tables. In an ideal world we'd be "compressing" those embedded tables into some text that says "there's a table here with these columns, if you want to analyze it you can use this <tool>, but basically the table is talking about X, here are the relevant stats like mean, sum, cardinality."

In the naive chunking approach, we would grab random sections of line items from these tables because they happen to reference some similar text to the search query, but there’s no guarantee the data pulled into context is complete.

DeveloperErrata•10mo ago
Trueish - for orgs that can't use API models for regulatory or security reasons, or that just need really efficient high throughput models, setting up your own infra for long context models can still be pretty complicated and expensive. Careful chunking and thoughtful design of the RAG system often still matters a lot in that context.
J_Shelby_J•9mo ago
“Performance is less important in an era of multi-core CPUs.”
J_Shelby_J•9mo ago
That makes me feel better about spending so much time implementing this balanced text chunker last year. https://github.com/ShelbyJenkins/llm_utils

It splits an input text into equal sized chunks using DFS and parallelization (rayon) to do so relatively quickly.

However, the goal for me is to use an LLM to split text by topic. I'm thinking I will implement it as a SaaS API on top of the OSS version. Do you think it's a viable business? You send a library of text, and receive a library of single-topic context chunks as output.

olavfosse•10mo ago
Does it work on other languages?
andai•10mo ago
Training a splitter based on existing paragraph conventions is really cool. Actually, that's a task I run into frequently (trying to turn a YouTube auto-transcript blob of text into readable sentences). LLMs tend to rewrite the text a bit too much instead of just adding punctuation.

As for RAG, I haven't noticed LLMs struggling with poorly structured text (e.g. the YouTube wall of text blob can just be fed directly into LLMs), though I haven't measured this.

In fact my own "webgrep" (convert top 10 search results into text and run grep on them, optionally followed by LLM summary) works on the byte level (gave up chunking words, sentences and paragraphs entirely): I just shove the 1kb before and after the match into the context. This works fine because LLMs just ignore the "mutilated" word parts at the beginning and end.

The only downside of this approach is that if I was the LLM, I would probably be unhappy with my job!

As for semantic chunking (in the context of, maximize the relevance of stuff that goes into the LLM, or indeed as a semantic search for the user), I haven't solved it yet, but I can share one amusing experiment: to find the relevant part of the text (having already returned a mostly-relevant big chunk of text), chop off one sentence at a time and re-run the similarity check! So you "distil" the text down to that which is most relevant (according to the embedding model) to the user query.

This is very slow and stupid, especially in real-time (though kinda fun to watch), but kinda works for the "approximately one sentence answers my question" scenario. A much cheaper approximation here would just be to embed at the sentence level as well as the page/paragraph level.
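
For what it's worth, one way to read that "chop off one sentence at a time" idea as code (the sentence-transformers model below is just an assumed stand-in for whatever embedder is in use):

    # Greedy "distillation": repeatedly drop the sentence whose removal most
    # improves embedding similarity to the query. Slow, as noted above.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

    def sim(a, b):
        va, vb = model.encode([a, b])
        return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

    def distill(sentences, query, min_sentences=1):
        kept = list(sentences)
        while len(kept) > min_sentences:
            current = sim(" ".join(kept), query)
            candidates = [(sim(" ".join(kept[:i] + kept[i + 1:]), query), i)
                          for i in range(len(kept))]
            best_score, best_i = max(candidates)
            if best_score < current:
                break  # every removal hurts relevance; stop
            kept.pop(best_i)
        return kept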

fareesh•10mo ago
The non-English space in these fields is so far behind in terms of accuracy and reliability, it's crazy.
legel•10mo ago
Very cool!

The training objective is clever.

The 50+ filters at Ecodash.ai for 90,000 plants came from a custom RAG model on top of 800,000 raw web pages. Because LLMs are expensive, chunking and semantic search for figuring out what to feed into the LLM for inference are a key part of the pipeline nobody talks about. I think what I did was: run all text through the cheapest OpenAI embeddings API… then, I recall that nearest-neighbor vector search wasn’t enough to catch all relevant information, for a given query to be answered by an LLM. So, I remember generating a large number of diverse queries that mean the same thing (e.g. “plant prefers full sun”, “plant thrives in direct sunlight”, “… requires at least 6 hours of light per day”, …), then doing nearest-neighbor vector search on all queries, and using the statistics to choose what to semantically feed into RAG.
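
A compact sketch of that multi-query step (the embedding model name and the brute-force cosine search are assumptions for illustration; the original pipeline's details aren't spelled out here):

    # Embed several paraphrases of the same question, run nearest-neighbor
    # search for each, and keep chunks retrieved consistently across queries.
    # chunk_vecs is assumed to be the precomputed embed(chunk_texts) matrix.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-small",
                                        input=texts)
        return np.array([d.embedding for d in resp.data])

    def multi_query_search(queries, chunk_texts, chunk_vecs, top_k=20, min_votes=2):
        votes = {}
        for q in embed(queries):
            scores = (chunk_vecs @ q) / (
                np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
            for idx in np.argsort(-scores)[:top_k]:
                votes[idx] = votes.get(idx, 0) + 1
        # Keep chunks retrieved by at least `min_votes` of the paraphrases.
        return [chunk_texts[i] for i, v in votes.items() if v >= min_votes]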

throwaway7783•9mo ago
Have you tried the BM25 + vector search + reranking pipeline for this?
searchguy•9mo ago
Hey, thanks for unpacking what you did at ecodash.ai.

Did you manually curate the queries that you did LLM query expansion on (generating a large number of diverse queries), or did you simply use the query log?

rekovacs•10mo ago
Really amazing and impressive work!
kamranjon•10mo ago
Interesting! I worked previously for a company that did automatic generation of short video clips from long videos. I fine-tuned a T5 model by taking many Wikipedia articles, removing the newline characters, and training it to insert them.

The idea was that paragraphs are naturally how we segment distinct thoughts in text, and would translate well to segmenting long video clips. It actually worked pretty well! It was able to predict the paragraph breaks in many texts that it wasn’t trained on at all.

The problems at the time were around context length and dialog style formatting.

I wanted to try and approach the problem in a less brute force way by maybe using sentence embedding and calculating the probability of a sentence being a “paragraph ending” sentence - which would likely result in a much smaller model.

Anyway this is really cool! I’m excited to dive in further to what you’ve done!

rybosome•9mo ago
Interesting idea - is the chunking deterministic? It would have to be, to be useful, but I'm wondering how that interacts with the neural net.