Show HN: Chonky – neural semantic text chunking goes multilingual

https://huggingface.co/mirth/chonky_mmbert_small_multilingual_1
9•hessdalenlight•11h ago
TLDR: I’m expanding the family of text-splitting Chonky models with a new multilingual model.

You can learn more about this neural approach in a previous post: https://news.ycombinator.com/item?id=43652968

Since releasing the first DistilBERT-based model, I’ve released two more models based on ModernBERT. All of these models were pre-trained and fine-tuned primarily on English text.

Recently, though, mmBERT (https://huggingface.co/blog/mmbert) was released. This model was pre-trained on a massive dataset covering 1833 languages. So I had the idea of fine-tuning a new multilingual Chonky model.

I’ve expanded the training dataset (which previously contained the bookcorpus and minipile datasets) with the Project Gutenberg dataset, which contains books in some widespread languages.

To make the model more robust to real-world data, I removed the punctuation from the last word of each training chunk with a probability of 0.15 (I did not run an ablation for this technique, though).
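
As an illustration of the punctuation-dropping augmentation described above, here is a minimal Python sketch. The post does not show its training code, so the function name `drop_trailing_punctuation`, the use of `string.punctuation` (ASCII only), and the choice to also trim trailing whitespace are all assumptions rather than the author's implementation.

```python
import random
import string
from typing import Optional


def drop_trailing_punctuation(
    chunk: str, p: float = 0.15, rng: Optional[random.Random] = None
) -> str:
    """With probability p, strip punctuation from the end of the chunk's last word.

    Hypothetical helper: the post only states the idea and the 0.15 probability.
    """
    source = rng if rng is not None else random
    # random() returns a float in [0.0, 1.0), so the chunk is altered
    # with probability p and returned untouched otherwise.
    if source.random() >= p:
        return chunk
    # Trim trailing whitespace first, then any trailing ASCII punctuation.
    return chunk.rstrip().rstrip(string.punctuation)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs behave the same
    chunks = ["It was a dark night.", "Who goes there?", "Nobody answered."]
    for original in chunks:
        print(repr(drop_trailing_punctuation(original, p=0.15, rng=rng)))
```

Multilingual text would likely need a broader punctuation set than `string.punctuation` (for example, via Unicode categories); that is left out here to keep the sketch short.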

The hard part is evaluation. Real-world data is typically OCR'ed markdown, call transcripts, meeting notes, etc., not clean book paragraphs. I didn’t find any labeled datasets of that kind. So I used what I had: the already mentioned bookcorpus and Project Gutenberg validation sets, Paul Graham essays, and concatenated 20_newsgroups.
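
One way to read "concatenated 20_newsgroups" is that separate documents are joined into one long text and the join points serve as ground-truth chunk boundaries. The sketch below shows that idea together with a simple boundary precision/recall/F1 score. The post names neither the metric nor the construction procedure, so the helpers `concat_with_boundaries` and `boundary_prf`, the exact-offset matching, and the newline separator are assumptions for illustration only.

```python
from typing import List, Set, Tuple


def concat_with_boundaries(docs: List[str], sep: str = "\n") -> Tuple[str, Set[int]]:
    """Join documents; return the text and the character offsets where each new document starts."""
    parts: List[str] = []
    boundaries: Set[int] = set()
    pos = 0
    for i, doc in enumerate(docs):
        if i > 0:
            parts.append(sep)
            pos += len(sep)
            boundaries.add(pos)  # the next document starts right after the separator
        parts.append(doc)
        pos += len(doc)
    return "".join(parts), boundaries


def boundary_prf(predicted: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over boundary offsets (assumed metric)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    text, gold = concat_with_boundaries(["first post", "second post", "third post"])
    predicted = {11, 30}  # made-up splitter output, not real model predictions
    print(sorted(gold), boundary_prf(predicted, gold))
```

Exact-offset matching is strict; a real evaluation might allow a small tolerance window around each gold boundary.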

I also tried to fine-tune the bigger mmBERT model (mmbert-base), but unfortunately it didn’t go well: the metrics are strangely lower than those of the small model.

Please give it a try. I'd appreciate any feedback.

The new multilingual model: https://huggingface.co/mirth/chonky_mmbert_small_multilingua...

All the Chonky models: https://huggingface.co/mirth

Chonky wrapper library: https://github.com/mirth/chonky

Baker – language-agnostic project scaffolder with hooks (Rust)

https://github.com/aliev/baker
1•aliev•12m ago•0 comments

Dylan (programming language)

https://en.wikipedia.org/wiki/Dylan_(programming_language)
2•mitchbob•14m ago•0 comments

NewPipe Is Turning 10

https://newpipe.net/blog/pinned/announcement/newpipe-turns-10/
3•birdculture•15m ago•1 comments

WorkOS

https://workos.com
1•Bogdanp•15m ago•0 comments

Did Cheating Accusations Have Anything to Do with Death of Daniel Naroditsky?

https://www.nytimes.com/2025/10/25/style/chess-cheating-death.html
2•georgecmu•17m ago•1 comments

Show HN: Lightweight Directory Bookmarks for the Terminal

https://github.com/tomertouitoumail-ops/cd-bookmark
1•twilto•18m ago•0 comments

D2: Diagram Scripting Language

https://d2lang.com/tour/intro/
2•benzguo•22m ago•0 comments

Writing reliable and maintainable metaprograms in pure C99

https://github.com/hirrolot/metalang99
2•beckford•33m ago•1 comments

We Put a Distributed Database in the Browser – and Made a Game of It

https://tigerbeetle.com/blog/2023-07-11-we-put-a-distributed-database-in-the-browser/
1•ibobev•38m ago•0 comments

Haiku 4.5 Playing Text Adventures

https://entropicthoughts.com/haiku-4-5-playing-text-adventures
1•ibobev•38m ago•0 comments

LibreWolf – Ratfactor

https://ratfactor.com/cards/librewolf
3•ibobev•38m ago•1 comments

Context Switches

https://matklad.github.io/2025/09/19/context-switches.html
1•SchwKatze•39m ago•0 comments

SymSpell C99: Building the Fastest Spell Checker in Pure C

https://suman-pokhrel.com.np/symspell-c99.html
1•ashvardanian•46m ago•0 comments

Matilda, Mars and Markup: The Curious Case of Mrs. Agnes Zevens

https://vinayprabhu.substack.com/p/matilda-mars-and-markup-the-curious
1•VinayUPrabhu•50m ago•0 comments

Nio shocks with fully active suspension – but how does it work?

https://www.autocar.co.uk/car-news/technology/nio-shocks-fully-active-suspension-how-does-it-work
4•breve•52m ago•0 comments

I translated my book for $7 using OpenAI

https://andrewpwheeler.com/2025/10/25/i-translated-my-book-for-7-using-openai/
2•apwheele•54m ago•0 comments

Windows Server 2016 in Termux

https://old.reddit.com/r/termux/comments/1of8zxc/windows_server_2016_in_termux/
1•sipofwater•56m ago•2 comments

Misinformation About the End of Life Is Harming Organ Donation

https://undark.org/2025/10/23/opinion-misinformation-organ-donation/
5•EA-3167•56m ago•0 comments

The roots of software development in the textile industry (2024)

https://asawicki.info/news_1776_what_does_software_have_to_do_with_the_linen_industry
2•Bogdanp•1h ago•1 comments

FAL Flashpack: High-throughput tensor loading for PyTorch

https://github.com/fal-ai/flashpack
1•dvrp•1h ago•0 comments

Show HN: Dictly – Local, real‑time voice‑to‑text for macOS (sub‑100ms, no cloud)

https://dictly.app/
1•JannikJung•1h ago•0 comments

Terahertz Tech Sets Stage for "Wireless Wired" Chips

https://spectrum.ieee.org/terahertz-chip-room-temperature
2•FromTheArchives•1h ago•0 comments

The Lotus Evija's mad torque tech is the holy grail of EV engineering

https://www.autocar.co.uk/car-news/new-cars/lotus-evijas-mad-torque-tech-holy-grail-ev-engineering
8•breve•1h ago•0 comments

Some "Silicon Valley" moments I experienced in my 31 year career as a programmer

https://old.reddit.com/r/bayarea/comments/1ofdap5/some_silicon_valley_moments_i_experienced_in_my/
2•bobbiechen•1h ago•1 comments

Safety of SARS-CoV-2 DNA-encoded monoclonal antibodies in healthy adults

https://www.nature.com/articles/s41591-025-03969-0
1•bookofjoe•1h ago•0 comments

Bacterial RNA promotes proteostasis in C. elegans

https://www.nature.com/articles/s41467-025-63987-x
2•PaulHoule•1h ago•0 comments

Last in Clojure (2024)

https://grishaev.me/clojure-last/
25•1659447091•1h ago•0 comments

ReasoningBank Explained: How AI Agents Are Finally Learning to Remember

https://rewire.it/blog/reasoningbank-explained-how-ai-agents-are-finally-learning-to-remember/
2•timini•1h ago•0 comments

Show HN: Sempress – 2× better compression for numeric data

https://sempress.net
2•jalyper•1h ago•1 comments

Hurricane Melissa poised to become catastrophic major hurricane, head to Jamaica

https://yaleclimateconnections.org/2025/10/hurricane-melissa-poised-to-rapidly-intensify-as-it-he...
5•WarOnPrivacy•1h ago•1 comments