Show HN: Searchable compression for JSON – ~99% page skip and sub-ms lookups

https://github.com/kodomonocch1/see_proto

15•kodomonocch1•3mo ago

Problem JSON/NDJSON is everywhere in data platforms, but compression usually breaks searchability. You either keep queryable raw stores (high I/O/egress) or compress into gz/zstd blobs (cheap to store, painful to probe). The “cloud tax” shows up as wasted reads.

What I built (SEE — Semantic Entropy Encoding) A schema-aware, searchable compression codec for JSON that keeps exists/pos lookups fast while still compressing. Internals: structure-aware delta + dictionaries, a PageDir + mini-index to jump to relevant pages, and a tuned Bloom filter that skips ~99% of pages. AutoPage (131/262 KiB) balances seek vs throughput.

Benchmarks (apples-to-apples, FULL) - size ratio: str ≈ 0.168–0.170, combined ≈ 0.194–0.196 - Bloom density ≈ 0.30; skip: present ≈ 0.99, absent ≈ 0.992 - lookup (ms): present p50/p95/p99 ≈ 0.18/0.28/0.37; absent ≈ 1.16–1.88/1.36–2.11/1.58–2.41 Numbers are stable on a commodity desktop (i7-13700K/96GB/Windows).

Try it in 10 minutes (no build) 1) pip install see_proto 2) python samples/quick_demo.py It prints size ratios, Bloom density, skip %, and lookup p50/p95/p99 on a packaged sample.

Why not “just zstd”? We sometimes lose pure size vs zstd alone. The win is searchable compression: Bloom + PageDir avoids touching most pages, so selective probes pay less I/O/egress and finish faster. On large log scans this often wins on TCO even with similar raw ratios.

Link (README + quick demo + one-pager) https://github.com/kodomonocch1/see_proto

Comments

kodomonocch1•3mo ago

Happy to answer design details (page layout, Bloom tuning, codec selection, failure modes). Minimal Python examples for exists(key) and positions(key) are in the repo. If anyone needs deeper materials (reproducible FULL benches, wheel artifacts, and design notes) we have an NDA-gated VDR; I can share the form on request.

duanhjlt•3mo ago

Congrats on the release. The SEE approach—schema-aware delta, dictionaries, PageDir, and tuned Bloom filters—seems thoughtfully engineered. The tradeoff versus pure zstd makes sense if selective probes dominate TCO. I’ll try the quick demo; curious about failure modes and Bloom tuning across varied schemas.

esafak•3mo ago

It looks like you want to make money off this file format? That seems difficult. You would need to build a product around it first. I suppose some kind of a search or observability company could get funded if you have a demo. But be warned that running a company involves a lot more than developing a secret sauce.

The easiest thing is to popularize it and get a well-paying job from your fame. Make some friends and start your company together.

zahlman•3mo ago

It doesn't exactly inspire confidence observing that the .see "archive" included in the zip distribution apparently gets further compressed by more than 2:1 within the zip archive....

throwuxiytayq•3mo ago

“Millisecond lookups” sounds funny when you work in game dev. Anyway, interesting idea, thanks for sharing. Where the code at, though?

stuartjohnson12•3mo ago

From OP's Github: "I am a 20-year-old university student living in Japan. Although I'm a liberal arts major, I aspire to become an engineer."

Just FYI - this is most likely vibe coding that a sycophantic AI has persuaded OP is cutting edge research.

India's Sarvan AI LLM launches Indic-language focused models

Show HN: CryptoClaw – open-source AI agent with built-in wallet and DeFi skills

ShowHN: Make OpenClaw Respond in Scarlett Johansson’s AI Voice from the Film Her

CReact Version 0.3.0 Released

Show HN: CReact – AI Powered AWS Website Generator

The rocky 1960s origins of online dating (2025)

Show HN: Agent-fetch – Sandboxed HTTP client with SSRF protection for AI agents

Why there is no official statement from Substack about the data leak

Effects of Zepbound on Stool Quality

Show HN: Seedance 2.0 – The Most Powerful AI Video Generator

Ask HN: Do we need "metadata in source code" syntax that LLMs will never delete?

Pentagon cutting ties w/ "woke" Harvard, ending military training & fellowships

Can Quantum-Mechanical Description of Physical Reality Be Considered Complete? [pdf]

Kessler Syndrome Has Started [video]

Complex Heterodynes Explained

EVs Are a Failed Experiment

MemAlign: Building Better LLM Judges from Human Feedback with Scalable Memory

CCC (Claude's C Compiler) on Compiler Explorer

Homeland Security Spying on Reddit Users

Actors with Tokio (2021)

Can graph neural networks for biology realistically run on edge devices?

Deeper into the shareing of one air conditioner for 2 rooms

Weatherman introduces fruit-based authentication system to combat deep fakes

Why Embedded Models Must Hallucinate: A Boundary Theory (RCC)

A Curated List of ML System Design Case Studies

Pony Alpha: New free 200K context model for coding, reasoning and roleplay

Show HN: Tunbot – Discord bot for temporary Cloudflare tunnels behind CGNAT

Open Problems in Mechanistic Interpretability

Bye Bye Humanity: The Potential AMOC Collapse

Dexter: Claude-Code-Style Agent for Financial Statements and Valuation