Show HN: Chonky – a neural text semantic chunking goes multilingual

https://huggingface.co/mirth/chonky_mmbert_small_multilingual_1
18•hessdalenlight•18h ago
TLDR: I’m expanding the family of text-splitting Chonky models with a new multilingual model.

You can learn more about this neural approach in a previous post: https://news.ycombinator.com/item?id=43652968

Since the release of the first distilbert-based model I’ve released two more models based on ModernBERT. All of these models were pre-trained and fine-tuned primarily on English texts.

Recently, though, mmBERT (https://huggingface.co/blog/mmbert) was released. It was pre-trained on a massive dataset covering 1833 languages, so I had the idea of fine-tuning a new multilingual Chonky model.

I’ve expanded the training dataset (which previously contained the bookcorpus and minipile datasets) with the Project Gutenberg dataset, which contains books in some widespread languages.

To make the model more robust to real-world data, I removed the punctuation from the last word of each training chunk with a probability of 0.15 (I did not run an ablation for this technique, though).
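The punctuation-dropping step could look roughly like the sketch below. This is only an illustration, not the actual training code: the function name `drop_trailing_punctuation` is hypothetical, and it assumes "punctuation for last word" means the punctuation characters at the very end of the chunk.

```python
import random
import string


def drop_trailing_punctuation(chunk: str, rng: random.Random, p: float = 0.15) -> str:
    """With probability p, strip trailing punctuation from the chunk's last word.

    Hypothetical helper: the exact rule used in training is an assumption.
    """
    if rng.random() >= p:
        return chunk  # leave most chunks untouched
    # Remove trailing whitespace first, then any run of ASCII punctuation.
    return chunk.rstrip().rstrip(string.punctuation)


rng = random.Random(0)  # fixed seed so the augmentation is reproducible
chunks = ["First paragraph ends here.", "Is this the second one?", "No punctuation"]
augmented = [drop_trailing_punctuation(c, rng) for c in chunks]
```

Note that `string.punctuation` only covers ASCII; a multilingual dataset would probably need a broader definition of punctuation (for example, one based on Unicode categories).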

The hard part is evaluation. Real-world data is typically OCR'ed markdown, call transcripts, meeting notes, etc., not clean book paragraphs. I couldn't find labeled datasets of that kind, so I used what I had: the already mentioned bookcorpus and Project Gutenberg validation sets, Paul Graham essays, and concatenated 20_newsgroups.
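One way to turn concatenated documents into a labeled evaluation set is to treat every join point as a gold split and score predicted splits against it. The sketch below assumes exact character-offset matching and precision/recall/F1 as metrics; both are assumptions for illustration (the metrics actually used are not stated here), and the function names and the `predicted` set are hypothetical.

```python
from typing import List, Set, Tuple


def concatenate_with_boundaries(docs: List[str], sep: str = " ") -> Tuple[str, Set[int]]:
    """Join documents and return the text plus the offsets where each document ends."""
    parts: List[str] = []
    boundaries: Set[int] = set()
    offset = 0
    for i, doc in enumerate(docs):
        parts.append(doc)
        offset += len(doc)
        if i < len(docs) - 1:
            boundaries.add(offset)  # gold split point right after this document
            parts.append(sep)
            offset += len(sep)
    return "".join(parts), boundaries


def boundary_f1(gold: Set[int], predicted: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exactly matching split offsets."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


docs = ["First post about cars.", "Second post about space.", "Third post about hockey."]
text, gold = concatenate_with_boundaries(docs)
predicted = {22, 30}  # hypothetical model output: one correct split, one spurious
precision, recall, f1 = boundary_f1(gold, predicted)
```

Exact-offset matching is strict; a tolerance window around each gold boundary would be a reasonable alternative for noisy text.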

I also tried to fine-tune the bigger mmBERT model (mmbert-base), but unfortunately it didn’t go well: its metrics are oddly lower than the small model's.

Please give it a try. I'd appreciate any feedback.

The new multilingual model: https://huggingface.co/mirth/chonky_mmbert_small_multilingua...

All the Chonky models: https://huggingface.co/mirth

Chonky wrapper library: https://github.com/mirth/chonky

Comments

kamranjon•1h ago
This is interesting! I once trained a t5 model by removing newlines from Wikipedia text and it worked surprisingly well. At the time, the context length was the biggest issue.

Another issue, not so easy to solve, was conversational dialogue data, which wasn’t well represented in the training data.

I’ve always wanted to come back to this problem, because I think it’s very interesting, and we will have a lot of unstructured text from STT models like Whisper, which do a great job of transcribing and translating but generally don’t format anything.
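For reference, the newline-removal setup mentioned above can be sketched as building (source, target) pairs for a sequence-to-sequence model: the source is the text with line breaks flattened, the target keeps them. This is an assumed formulation for illustration only; the function name `make_seq2seq_pair` is hypothetical and the details of the original t5 experiment are not given.

```python
from typing import Tuple


def make_seq2seq_pair(article: str) -> Tuple[str, str]:
    """Build one training pair from text whose paragraphs are separated by newlines."""
    # Normalize whitespace inside each paragraph and drop empty lines.
    paragraphs = [" ".join(line.split()) for line in article.split("\n") if line.strip()]
    source = " ".join(paragraphs)   # newlines removed: what the model sees
    target = "\n".join(paragraphs)  # newlines kept: what the model should produce
    return source, target


article = "Alpha is first.\nBeta is second.\n\nGamma  is third."
source, target = make_seq2seq_pair(article)
```

Because the target repeats the whole input, the usable document length is bounded by the model's context window, which is consistent with context length being the main limitation.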