Show HN: Fast and Quality Code Chunking with Chonkie

1•snyy•1y ago
Hi HN,

We’re Chonkie (https://github.com/chonkie-inc/chonkie) — we build open source tools that help split documents into meaningful chunks for use with AI models.

When you use LLMs over large documents or codebases, you often need to break them into smaller parts to fit the model’s context window. Our chunkers split along the document’s own structure, so each chunk keeps its meaning and only the most relevant pieces are passed to the model. This helps reduce hallucinations and improves accuracy.

Today we’re launching our Code Chunker — a fast, structure-aware way to break down source code into high-quality, token-aware chunks.

How it works:

(See the code: https://github.com/chonkie-inc/chonkie/blob/main/src/chonkie...)

Code Chunker uses tree-sitter (https://tree-sitter.github.io/tree-sitter/) to parse your code into an abstract syntax tree (AST). It then recursively merges and groups nodes in a way that respects both code structure and token limits.

It supports all languages that tree-sitter supports, and is designed to preserve formatting and semantics. A large function or class definition is never cut at an arbitrary point in the middle of a block. When one is too big for your configured token budget, we recurse into its child nodes in the AST and split only at node boundaries, which produces clean, coherent chunks that fit the budget.
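To make the recursive idea concrete, here is a minimal sketch, not the actual Chonkie implementation. It makes several assumptions: it uses Python's standard-library `ast` module as a stand-in for tree-sitter (so it only handles Python source), it counts tokens by splitting on whitespace instead of using a real tokenizer, and the names `chunk_code`, `count_tokens`, `_start_line`, and `_collect_starts` are hypothetical.

```python
import ast


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace-separated words.
    return len(text.split())


def _start_line(node: ast.AST) -> int:
    # Include decorators so they stay attached to their function or class.
    decorators = getattr(node, "decorator_list", [])
    return min([node.lineno] + [d.lineno for d in decorators])


def _collect_starts(node, lines, max_tokens, starts):
    """Record the line where a piece begins; recurse into oversized nodes."""
    first = _start_line(node)
    starts.add(first)
    text = "".join(lines[first - 1:node.end_lineno])
    body = getattr(node, "body", None)
    if count_tokens(text) > max_tokens and isinstance(body, list):
        # Too big for the budget: split at the child statements instead.
        for child in body:
            _collect_starts(child, lines, max_tokens, starts)


def chunk_code(source: str, max_tokens: int) -> list[str]:
    """Split Python source at statement boundaries, then merge greedily."""
    lines = source.splitlines(keepends=True)
    if not lines:
        return []
    starts = {1}
    for node in ast.parse(source).body:
        _collect_starts(node, lines, max_tokens, starts)

    # Pieces are consecutive line ranges, so joining them restores the source.
    bounds = sorted(starts) + [len(lines) + 1]
    pieces = ["".join(lines[a - 1:b - 1]) for a, b in zip(bounds, bounds[1:])]

    # Greedily merge neighbouring pieces while they fit the token budget.
    chunks, current = [], ""
    for piece in pieces:
        if current and count_tokens(current + piece) > max_tokens:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks
```

Calling `chunk_code(source, max_tokens=512)` returns a list of strings whose concatenation is the original source. The sketch has known gaps: a single statement larger than the budget is emitted as one oversized chunk, and comments above a definition stay with the preceding piece because Python's `ast` drops comments, which tree-sitter does not.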

What it’s useful for:

  - Embedding-based code search

  - RAG (retrieval-augmented generation) over codebases

  - Long-context analysis of code

  - Preparing repos for fine-tuning or pretraining

Try it out:

  - Open source package: https://docs.chonkie.ai/chunkers/code-chunker

  - Hosted playground (free with account): https://cloud.chonkie.ai

Happy Chonking!
