Show HN: Chonky – a neural text semantic chunking goes multilingual

https://huggingface.co/mirth/chonky_mmbert_small_multilingual_1
9•hessdalenlight•10h ago
TLDR: I’m expanding the family of text-splitting Chonky models with a new multilingual model.

You can learn more about this neural approach in a previous post: https://news.ycombinator.com/item?id=43652968

Since the release of the first DistilBERT-based model, I’ve released two more models based on ModernBERT. All of these models were pre-trained and fine-tuned primarily on English texts.

But recently mmBERT (https://huggingface.co/blog/mmbert) was released. This model was pre-trained on a massive dataset that contains 1833 languages. So I had the idea of fine-tuning a new multilingual Chonky model.

I’ve expanded the training dataset (which previously contained the bookcorpus and minipile datasets) with the Project Gutenberg dataset, which contains books in some widespread languages.

To make the model more robust on real-world data, I removed the punctuation from the last word of every training chunk with a probability of 0.15 (no ablation was done for this technique, though).
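The punctuation-dropping step can be sketched roughly as below. This is only an illustration, not the actual training code: the function name `strip_trailing_punctuation`, the use of `string.punctuation` (ASCII only), and the choice to also trim trailing whitespace are all assumptions.

```python
import random
import string
from typing import Optional


def strip_trailing_punctuation(
    chunk: str,
    p: float = 0.15,
    rng: Optional[random.Random] = None,
) -> str:
    """With probability p, drop punctuation at the end of the chunk's last word.

    Hypothetical helper: the real preprocessing may differ (for example in
    how it treats non-ASCII punctuation, which string.punctuation ignores).
    """
    rng = rng or random  # fall back to the module-level generator
    if rng.random() >= p:
        return chunk  # leave most chunks untouched
    # Trim trailing whitespace first, then any run of ASCII punctuation.
    return chunk.rstrip().rstrip(string.punctuation)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    chunks = ["First paragraph ends here.", "Is this a question?", "No punctuation"]
    for c in chunks:
        print(repr(strip_trailing_punctuation(c, p=0.15, rng=rng)))
```

Applying it per chunk before tokenization would mean the model sometimes sees a chunk boundary with no sentence-final mark, which is closer to what transcripts and OCR output look like.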

The hard part is evaluation. Real-world data is typically OCR'ed markdown, call transcripts, meeting notes, etc., not clean book paragraphs. I didn’t find any labeled datasets of that kind. So I used what I had: the already mentioned bookcorpus and Project Gutenberg validation sets, Paul Graham essays, and concatenated 20_newsgroups.
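One way to read "concatenated 20_newsgroups" is that separate documents are joined into one long text and the splitter is scored on whether it recovers the seams. The sketch below shows that idea under stated assumptions: the post does not say which metric was used, so exact-offset precision/recall/F1 is a guess, and `concatenate_with_boundaries` and `boundary_scores` are hypothetical helper names.

```python
from typing import List, Set, Tuple


def concatenate_with_boundaries(docs: List[str], sep: str = " ") -> Tuple[str, Set[int]]:
    """Join docs into one text; return it with the character offsets
    at which the second and later documents start (the gold boundaries)."""
    boundaries: Set[int] = set()
    offset = 0
    for i, doc in enumerate(docs):
        if i > 0:
            offset += len(sep)      # skip the separator
            boundaries.add(offset)  # a new document starts here
        offset += len(doc)
    return sep.join(docs), boundaries


def boundary_scores(gold: Set[int], predicted: Set[int]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over boundary offsets."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    text, gold = concatenate_with_boundaries(["ab", "cd", "e"])
    # Pretend a splitter predicted chunk starts at offsets 3 and 5.
    print(text, sorted(gold), boundary_scores(gold, {3, 5}))
```

Exact-offset matching is strict. A real evaluation might instead score at the token level or allow a small tolerance around each boundary.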

I also tried to fine-tune the bigger mmBERT model (mmbert-base), but unfortunately it didn’t go well: its metrics are weirdly lower than those of the small model.

Please give it a try. I'd appreciate any feedback.

The new multilingual model: https://huggingface.co/mirth/chonky_mmbert_small_multilingua...

All the Chonky models: https://huggingface.co/mirth

Chonky wrapper library: https://github.com/mirth/chonky

California invests in battery energy storage, leaving rolling blackouts behind

https://www.latimes.com/environment/story/2025-10-17/california-made-it-through-another-summer-wi...
91•JumpCrisscross•2h ago•31 comments

The Journey Before main()

https://amit.prasad.me/blog/before-main
99•amitprasad•3h ago•29 comments

Show HN: Diagram as code tool with draggable customizations

https://github.com/RohanAdwankar/oxdraw
66•RohanAdwankar•2h ago•14 comments

How programs get run: ELF binaries (2015)

https://lwn.net/Articles/631631/
26•st_goliath•1h ago•0 comments

Show HN: Shadcn/UI theme editor – Design and share Shadcn themes

https://shadcnthemer.com
54•miketromba•2h ago•15 comments

An Update on TinyKVM

https://fwsgonzo.medium.com/an-update-on-tinykvm-7a38518e57e9
27•ingve•1h ago•3 comments

An Efficient Implementation of SELF (1989) [pdf]

https://courses.cs.washington.edu/courses/cse501/15sp/papers/chambers.pdf
17•todsacerdoti•1h ago•3 comments

Agent Lightning: Train agents with RL (no code changes needed)

https://github.com/microsoft/agent-lightning
26•bakigul•2h ago•3 comments

ARM Memory Tagging: how it improves C/C++ memory safety (2018) [pdf]

https://llvm.org/devmtg/2018-10/slides/Serebryany-Stepanov-Tsyrklevich-Memory-Tagging-Slides-LLVM...
27•fanf2•2h ago•5 comments

In memory of the Christmas Island shrew

https://news.mongabay.com/2025/10/in-memory-of-the-christmas-island-shrew/
30•hexhowells•2h ago•5 comments

Rock Tumbler Instructions

https://rocktumbler.com/tips/rock-tumbler-instructions/
130•debo_•6h ago•68 comments

Honda's ASIMO (2021)

https://www.robotsgottalents.com/post/asimo
23•nothrowaways•2h ago•4 comments

Belittled Magazine: Thirty years after the Sokal affair

https://thebaffler.com/salvos/belittled-magazine-robbins
9•Hooke•1h ago•0 comments

Testing out BLE beacons with BeaconDB

https://blog.matthewbrunelle.com/testing-out-ble-beacons-with-beacondb/
23•zdw•2h ago•4 comments

"Learn APL" Notes

https://luksamuk.codes/pages/learn-apl.html
17•todsacerdoti•2h ago•5 comments

AI, Wikipedia, and uncorrected machine translations of vulnerable languages

https://www.technologyreview.com/2025/09/25/1124005/ai-wikipedia-vulnerable-languages-doom-spiral/
29•kawera•2h ago•13 comments

WebDAV isn't dead yet

https://blog.feld.me/posts/2025/09/webdav-isnt-dead-yet/
61•toomuchtodo•1d ago•23 comments

Show HN: LLM Rescuer – Fixing the billion dollar mistake in Ruby

https://github.com/barodeur/llm_rescuer
33•barodeur•1d ago•4 comments

Passwords and Power Drills

https://google.github.io/building-secure-and-reliable-systems/raw/ch01.html#on_passwords_and_powe...
34•harporoeder•4d ago•7 comments

Load-time relocation of shared libraries (2011)

https://eli.thegreenplace.net/2011/08/25/load-time-relocation-of-shared-libraries/
15•saltypal•2h ago•0 comments

Project Amplify: Powered footwear for running and walking

https://about.nike.com/en/newsroom/releases/nike-project-amplify-official-images
28•justinmayer•2h ago•17 comments

The Cooperative National Geologic Map

https://ngmdb.usgs.gov/nationalgeology/
6•rob•2d ago•0 comments

Tarmageddon: RCE vulnerability highlights challenges of open source abandonware

https://edera.dev/stories/tarmageddon
43•vsgherzi•3d ago•13 comments

ProEnergy repurposes jet engines to power data centers

https://www.datacenterdynamics.com/en/news/proenergy-offers-repurposed-jet-engines-to-data-cent/
16•JumpCrisscross•2h ago•16 comments

Making a micro Linux distro (2023)

https://popovicu.com/posts/making-a-micro-linux-distro/
139•turrini•9h ago•25 comments

Jacqueline – A minimal i386 kernel written in Pascal (2019)

https://github.com/danirod/jacqueline
62•peter_d_sherman•3d ago•15 comments

Global key-value metadata storage for Scryer Prolog

https://github.com/jjtolton/environment.pl
9•triska•2h ago•0 comments

The future of Python web services looks GIL-free

https://blog.baro.dev/p/the-future-of-python-web-services-looks-gil-free
166•gi0baro-dev•6d ago•67 comments

Unlocking free WiFi on British Airways

https://www.saxrag.com/tech/reversing/2025/06/01/BAWiFi.html
570•vinhnx•1d ago•135 comments

Torchcomms: A modern PyTorch communications API

https://pytorch.org/blog/torchcomms/
7•paladin314159•2h ago•1 comments