Show HN: Chonky – a neural text semantic chunking goes multilingual

https://huggingface.co/mirth/chonky_mmbert_small_multilingual_1
43•hessdalenlight•3mo ago
TLDR: I’m expanding the family of text-splitting Chonky models with a new multilingual model.

You can learn more about this neural approach in a previous post: https://news.ycombinator.com/item?id=43652968

Since the release of the first DistilBERT-based model, I’ve released two more models based on ModernBERT. All these models were pre-trained and fine-tuned primarily on English texts.

But recently mmBERT (https://huggingface.co/blog/mmbert) was released. This model was pre-trained on a massive dataset that contains 1833 languages, so I had the idea of fine-tuning a new multilingual Chonky model.

I’ve expanded the training dataset (which previously contained the bookcorpus and minipile datasets) with the Project Gutenberg dataset, which contains books in some widespread languages.

To make the model more robust to real-world data, I removed the punctuation from the last word of each training chunk with a probability of 0.15 (I did not run an ablation for this technique, though).
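A minimal sketch of the punctuation-dropping augmentation described above. The function name `drop_final_punctuation` and the choice to strip every trailing ASCII punctuation character are assumptions for illustration, not the actual training code.

```python
import random
import string
from typing import Optional


def drop_final_punctuation(
    chunk: str, p: float = 0.15, rng: Optional[random.Random] = None
) -> str:
    """With probability p, strip trailing punctuation from the chunk's last word.

    Hypothetical helper: the real training pipeline may define
    "punctuation" differently (for example, non-ASCII marks).
    """
    source = rng if rng is not None else random
    # Keep the chunk unchanged with probability 1 - p.
    if source.random() >= p:
        return chunk
    # Remove trailing whitespace first, then any run of ASCII punctuation.
    return chunk.rstrip().rstrip(string.punctuation)


if __name__ == "__main__":
    rng = random.Random(0)
    chunks = ["First paragraph ends here.", "Is this the second one?"]
    print([drop_final_punctuation(c, p=0.15, rng=rng) for c in chunks])
```

Passing a seeded `random.Random` keeps the augmentation reproducible across runs.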

The hard part is evaluation. Real-world data is typically OCR'ed markdown, call transcripts, meeting notes, etc., not clean book paragraphs. I didn’t find any labeled datasets of that kind, so I used what I had: the already mentioned bookcorpus and Project Gutenberg validation sets, Paul Graham essays, and concatenated 20_newsgroups.
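The post does not say which metrics were used. As one plausible way to score a splitter against labeled text, here is a sketch that compares predicted split positions with gold paragraph boundaries using precision, recall, and F1. The function name `boundary_prf` and the exact-match criterion are assumptions.

```python
from typing import Iterable, Tuple


def boundary_prf(
    predicted: Iterable[int], gold: Iterable[int]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over exact-match boundary positions.

    Positions can be character or token offsets, as long as both
    lists use the same convention. Hypothetical metric, not
    necessarily the one used for the Chonky models.
    """
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Two of three predicted boundaries match the two gold boundaries.
    print(boundary_prf([11, 24, 40], [11, 24]))
```

Exact matching is strict. A tolerance window around each gold boundary would be more forgiving for noisy data such as transcripts.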

I also tried to fine-tune the bigger mmBERT model (mmbert-base), but unfortunately it didn’t go well: the metrics are weirdly lower than those of the small model.

Please give it a try. I'd appreciate any feedback.

The new multilingual model: https://huggingface.co/mirth/chonky_mmbert_small_multilingua...

All the Chonky models: https://huggingface.co/mirth

Chonky wrapper library: https://github.com/mirth/chonky

Comments

kamranjon•3mo ago
This is interesting! I once trained a T5 model by removing newlines from Wikipedia text, and it worked surprisingly well. At the time, the context length was the biggest issue.

Another issue, not so easy to solve, was conversational dialogue-type data, which wasn’t well represented in the training data.

I’ve always wanted to come back to working on the problem again, because I think it’s very interesting and we will have a bunch of unstructured text as a result of STT models like whisper that do a great job of transcribing/translating but generally don’t format anything.
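As an illustration of the newline-removal idea in the comment above (not the commenter's actual code), the sketch below flattens a text into one string and records where the paragraph breaks used to be, which gives supervised targets for a splitter. The function name `make_boundary_example` and the character-offset label format are assumptions.

```python
from typing import List, Tuple


def make_boundary_example(text: str) -> Tuple[str, List[int]]:
    """Join paragraphs with single spaces and record the original break points.

    Returns the flattened text plus, for every paragraph except the last,
    the character offset in the flattened text just after that paragraph.
    Hypothetical helper for illustration only.
    """
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    flat = " ".join(paragraphs)
    boundaries: List[int] = []
    offset = 0
    for paragraph in paragraphs[:-1]:
        offset += len(paragraph)
        boundaries.append(offset)  # a split belongs right here
        offset += 1  # account for the joining space
    return flat, boundaries


if __name__ == "__main__":
    flat, boundaries = make_boundary_example("First para.\nSecond para.\n\nThird.")
    print(flat)
    print(boundaries)
```

A model can then be trained to predict those offsets from the flattened text alone.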

nvdnadj92•3mo ago
In case you need conversational data for the experiment you want to try, I developed an open-source CLI tool [1] that creates transcripts from voice chats on Discord. Feel free to try it out!

[1] https://github.com/naveedn/audio-transcriber

CjHuber•3mo ago
Took me a minute to realize this is not about Chonkie. I would be interested in how this compares to that project's semantic chunking approach.

jimmySixDOF•3mo ago
If you read the labels, this one (-y) uses ModernBERT and even has an eval comparison to the other (-ie) in its GitHub, so you can see the improvement as tested. Although if you want to do vanilla rules-based chunking for whatever reason your data needs, then (-ie) is still good.

TZubiri•3mo ago
That example looks terribly useless. Maybe there's an actually useful application you had in mind? I don't know, say:

Chonk("Hey I forgot my password, this is Tom from X Company") = ("Hey", "I forgot my password", "this is Tom from X Company")

Even then it doesn't quite look helpful.

freakynit•3mo ago
This is absolutely useless. Tried a few examples yesterday using hf demo. Fcking retarded af.

It literally split the text in the middle of related passages while keeping unrelated passages together, even though the embedding limit was far off.

I genuinely wanted this to work. I mean this. But nope. This shit did not work at all.

RAG is still fcked because of chunking issues. GraphRAG doesn't work correctly either unless you are willing to throw a lot of money at it during ingestion.