Show HN: Searchable compression for JSON – ~99% page skip and sub-ms lookups

https://github.com/kodomonocch1/see_proto

13•kodomonocch1•2h ago

Problem JSON/NDJSON is everywhere in data platforms, but compression usually breaks searchability. You either keep queryable raw stores (high I/O/egress) or compress into gz/zstd blobs (cheap to store, painful to probe). The “cloud tax” shows up as wasted reads.

What I built (SEE — Semantic Entropy Encoding) A schema-aware, searchable compression codec for JSON that keeps exists/pos lookups fast while still compressing. Internals: structure-aware delta + dictionaries, a PageDir + mini-index to jump to relevant pages, and a tuned Bloom filter that skips ~99% of pages. AutoPage (131/262 KiB) balances seek vs throughput.

Benchmarks (apples-to-apples, FULL) - size ratio: str ≈ 0.168–0.170, combined ≈ 0.194–0.196 - Bloom density ≈ 0.30; skip: present ≈ 0.99, absent ≈ 0.992 - lookup (ms): present p50/p95/p99 ≈ 0.18/0.28/0.37; absent ≈ 1.16–1.88/1.36–2.11/1.58–2.41 Numbers are stable on a commodity desktop (i7-13700K/96GB/Windows).

Try it in 10 minutes (no build) 1) pip install see_proto 2) python samples/quick_demo.py It prints size ratios, Bloom density, skip %, and lookup p50/p95/p99 on a packaged sample.

Why not “just zstd”? We sometimes lose pure size vs zstd alone. The win is searchable compression: Bloom + PageDir avoids touching most pages, so selective probes pay less I/O/egress and finish faster. On large log scans this often wins on TCO even with similar raw ratios.

Link (README + quick demo + one-pager) https://github.com/kodomonocch1/see_proto

Comments

kodomonocch1•2h ago

Happy to answer design details (page layout, Bloom tuning, codec selection, failure modes). Minimal Python examples for exists(key) and positions(key) are in the repo. If anyone needs deeper materials (reproducible FULL benches, wheel artifacts, and design notes) we have an NDA-gated VDR; I can share the form on request.

duanhjlt•2h ago

Congrats on the release. The SEE approach—schema-aware delta, dictionaries, PageDir, and tuned Bloom filters—seems thoughtfully engineered. The tradeoff versus pure zstd makes sense if selective probes dominate TCO. I’ll try the quick demo; curious about failure modes and Bloom tuning across varied schemas.

esafak•1h ago

It looks like you want to make money off this file format? That seems difficult. You would need to build a product around it first. I suppose some kind of a search or observability company could get funded if you have a demo. But be warned that running a company involves a lot more than developing a secret sauce.

The easiest thing is to popularize it and get a well-paying job from your fame. Make some friends and start your company together.

zahlman•1h ago

It doesn't exactly inspire confidence observing that the .see "archive" included in the zip distribution apparently gets further compressed by more than 2:1 within the zip archive....

throwuxiytayq•1h ago

“Millisecond lookups” sounds funny when you work in game dev. Anyway, interesting idea, thanks for sharing. Where the code at, though?

stuartjohnson12•1h ago

From OP's Github: "I am a 20-year-old university student living in Japan. Although I'm a liberal arts major, I aspire to become an engineer."

Just FYI - this is most likely vibe coding that a sycophantic AI has persuaded OP is cutting edge research.

Andrej Karpathy – AGI is still a decade away

Live Stream from the Namib Desert

Scientists discover intercellular nanotubular communication system in brain

Ruby core team takes ownership of RubyGems and Bundler

EVs are depreciating faster than gas-powered cars

Meow.camera

I built an F5 QKview scanner for CISA ED 26-01

Migrating from AWS to Hetzner

The Rapper 50 Cent, Adjusted for Inflation

AI has a cargo cult problem

Resizeable Bar Support on the Raspberry Pi

4Chan Lawyer publishes Ofcom correspondence

You did no fact checking, and I must scream

Dead or Alive creator Tomonobu Itagaki, 58 passes away

OpenAI Needs $400B In The Next 12 Months

Cartridge Chaos: The Official Nintendo Region Converter and More

Let's write a macro in Rust

MIT physicists improve the precision of atomic clocks

How I bypassed Amazon's Kindle web DRM

Ask HN: How to stop an AWS bot sending 2B requests/month?

Trap the Critters with Paint

Read your way through Hà Nội

Show HN: OnlyJPG – Client-Side PNG/HEIC/AVIF/PDF/etc to JPG

Email bombs exploit lax authentication in Zendesk

Stinkbug Leg Organ Hosts Symbiotic Fungi That Protect Eggs from Parasitic Wasps

Next steps for BPF support in the GNU toolchain

Amazon-backed, nuclear facility for Washington state

Metropolis 1998 lets you design every building in an isometric, pixel-art city (2024)

New computer model helps reveal how the brain both adapts and misfires

Your data model is your destiny