
A all CLIs tokens and context reducer by 97%

https://www.squeezr.es/
1•sergioramosv•55s ago•1 comments

How we feel about AI (2025)

https://goauthentik.io/blog/2025-12-10-how-we-really-feel-about-ai/
1•walterbell•4m ago•0 comments

Show HN: Gecit – DPI bypass using eBPF sock_ops, no proxy or VPN

https://github.com/boratanrikulu/gecit
1•boratanrikulu•5m ago•0 comments

How to Get Better at Guitar

https://www.jakeworth.com/posts/how-to-get-better-at-guitar/
1•jwworth•6m ago•0 comments

Iran internet blackout now longest nation-scale shutdown on record

https://mastodon.social/@netblocks/116350984373909468
1•ukblewis•9m ago•0 comments

Show HN: Stablemount, a response to EmDash, a prototype for a future CMS

https://github.com/jhyolm/stablemount
1•jhyolm•9m ago•1 comments

Watch 'S4 – The Bob Lazar Story' online: Here's where to watch the UFO doc

https://www.tomsguide.com/entertainment/streaming/watch-s4-the-bob-lazar-story-online
1•evo_9•11m ago•0 comments

Show HN: YardSard – Inventory Management

https://apps.apple.com/us/app/yardsard/id6759114903
2•prithsr•16m ago•0 comments

Show HN: Imladri – Cryptographic enforcement and semantic monitoring for your AI

https://imladri.com/
2•osama872•18m ago•0 comments

AST vs. Bytecode: Interpreters in the Age of Meta-Compilation [pdf]

https://stefan-marr.de/downloads/oopsla23-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of...
3•tosh•19m ago•0 comments

Codex is switching to API pricing based usage for all users

https://help.openai.com/en/articles/20001106-codex-rate-card
5•ccmcarey•22m ago•1 comments

Francis Li

https://furclick.top/
2•menshowlee•28m ago•0 comments

Show HN: Regression-dog – A 20-line skill that reviews your code for regressions

https://github.com/imaman/skills/tree/main/skills/regression-dog
2•itay-maman•31m ago•0 comments

OpenRockets Archive New Submission(Autoscript)

https://archive.openrockets.com/Litha2024-main/
2•openrockets•32m ago•1 comments

Open source voice cloning TTS models worth trying

https://firethering.com/open-source-tts-voice-cloning/
2•steveharing1•32m ago•0 comments

Claude AI powered trading bot turns $1 into $3.3M on Polymarket

https://finbold.com/claude-ai-powered-trading-bot-turns-1-into-3-3-million-on-polymarket/
3•madaxe_again•34m ago•0 comments

Microsoft terms say Copilot is for entertainment purposes only, not serious use

https://www.tomshardware.com/tech-industry/artificial-intelligence/microsoft-says-copilot-is-for-...
37•jatins•35m ago•3 comments

We are facing the most significant days and weeks in world history since 1945

https://www.taxresearch.org.uk/Blog/2026/04/05/we-are-facing-most-significant-days-and-weeks-in-w...
2•only_in_america•35m ago•1 comments

iCloud appears to be down for some users

https://www.reddit.com/r/iCloud/s/GAahHRBNPX
2•FinnKuhn•40m ago•3 comments

Computational Physics (2nd Edition)

https://websites.umich.edu/~mejn/cp2/
3•teleforce•42m ago•0 comments

Open Source Elixir Personel Health Management

https://github.com/joestein/health-pilot
3•buoewe•43m ago•0 comments

Agile Development Is Dead Reckoning

https://paolog.net/posts/dead-reckoning-agile/
1•paologi•49m ago•0 comments

Inference Arena – new benchmark of local inference and training

http://kvark.github.io/ai/performance/2026/04/04/inference-arena.html
3•kvark•50m ago•1 comments

Show HN: Identa – CLI to calibrate prompts across local LLMs

3•srodriguezp•50m ago•0 comments

The Disposable Tools Manifesto

https://blog.vtemian.com/post/disposable-tools-manifesto/
3•vtemian•51m ago•0 comments

Dh

1•tegjm•51m ago•0 comments

StackOverflow: Retiring the Beta Site

https://meta.stackoverflow.com/questions/438628/retiring-the-beta-site
17•stefankuehnel•53m ago•6 comments

Background Jobs in Go with Asynq and Valkey

https://josephgoksu.com/blog/background-jobs-in-go-asynq-valkey/
2•ssfak•53m ago•0 comments

Developers using LLM APIs, what are your biggest frustrations?

https://form.jotform.com/260943627372058
1•Algo-bro•54m ago•0 comments

Program analysis using random interpretation (2005) [pdf]

https://sigplan.org/Awards/Dissertation/2005_gulwani.pdf
1•azhenley•55m ago•0 comments

Show HN: 1B Embeddings

3•INVARIAN•4h ago
We built a vector search engine based on Quantized Tensor Train (QTT) decomposition. Instead of approximate nearest neighbor (ANN) indices like HNSW or IVF, we factorize the entire dataset into a compressed tensor format and serve exact cosine similarity queries directly from the compressed representation. The headline: 1 billion vectors on a single H100, 38ms query, 100% recall, 66 GB serving.

Recall improves with scale at fp16: 96% at 400M → 98% at 500M → 99% at 600M → 100% at 1B. This is the opposite of ANN indices, where recall degrades with scale. More data helps the decomposition converge.

Every number below is measured, not projected. Full benchmark suite across 4 GPUs at 3 precision tiers. H100 80GB, 384-dim embeddings, rank=32.

fp16 (Scale tier) — H100 80GB:

  100M:  5.87ms p50,  6.6 GB serving, 100% recall, 46.5x compression
  500M: 20.54ms p50, 33.0 GB serving,  98% recall, 46.5x compression
    1B: 38.51ms p50, 66.0 GB serving, 100% recall, 46.5x compression
fp32 (Production tier) — H100 80GB:

  100M: 18.96ms p50, 13.2 GB serving, 100% recall, 23.3x compression
  300M: 46.29ms p50, 39.6 GB serving, 100% recall, 23.3x compression
  500M: 76.53ms p50, 66.0 GB serving, 100% recall, 23.3x compression

fp64 (Exact tier) — H100 80GB:

   10M:  2.68ms p50,  2.6 GB serving, 100% recall, 11.6x compression
  100M: 20.40ms p50, 26.4 GB serving, 100% recall, 11.6x compression
  200M: 40.40ms p50, 51.6 GB serving, 100% recall, 11.6x compression

Hardware portability — same codebase, different GPUs:

  P4000   8 GB: fp16  50M, 26.0ms p50, 100% recall, $0.07/hr
  A100   40 GB: fp16 200M,  3.1ms p50,  98% recall, $0.70/hr
  H100   80 GB: fp16 500M,  3.1ms p50,  98% recall, $2.09/hr
  B200  192 GB: fp16 500M,  1.4ms p50,  99% recall, est. ~$5/hr

Recall is hardware-invariant: same math, same results, P4000 through H100. The 2B run on B200 is in progress.

How it works: the dataset X (N×D) is factored as X ≈ Z · V_T, where Z is (N×r) and V_T is (r×D), with r=32. A query is served by two small GEMVs: scores = Z · (V_T · q). Bytes per entry: 2r bytes at fp16 = 64 bytes, regardless of embedding dimension. A 1536-dim OpenAI ada-002 embedding compresses 23.6x at fp32 with zero recall loss.
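The query path above can be sketched in a few lines of NumPy. This is an illustrative toy, not the author's implementation: `build_factors` and `query_topk` are hypothetical names, plain truncated SVD stands in for the streaming build, and the synthetic data is deliberately near-low-rank so the rank-32 approximation is accurate, mirroring the post's observation that real embedding sets compress well.

```python
import numpy as np

def build_factors(X, r=32):
    """Truncated SVD: X ~= Z @ V_T, with Z = U_r * S_r (N x r) and V_T (r x D)."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :r] * S[:r]          # per-vector rank-r coefficients
    V_T = Vt[:r, :]               # shared basis, stored once
    return Z, V_T

def query_topk(Z, V_T, q, k=10):
    """scores = Z @ (V_T @ q): one (r,D) GEMV, then one (N,r) GEMV."""
    scores = Z @ (V_T @ q)        # approximate dot products against all N vectors
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

rng = np.random.default_rng(0)
# Synthetic near-low-rank "embeddings": rank-32 structure plus small noise.
X = rng.standard_normal((10_000, 32)) @ rng.standard_normal((32, 384))
X += 0.01 * rng.standard_normal(X.shape)
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit norm, so dot = cosine

Z, V_T = build_factors(X, r=32)
hits = query_topk(Z, V_T, X[123], k=10)          # query with a known vector
print(hits[0])
```

Note the shapes: per query, `V_T @ q` costs r×D and `Z @ (...)` costs N×r, so the full N×D scan is never materialized, and only Z (N×r) plus the small V_T need to live in memory at serve time.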

Compression is dimension-independent:

  384-dim  MiniLM:     11.6x, 100% recall
  768-dim  E5-large:   11.8x, 100% recall
  1024-dim Cohere v3:  15.8x, 100% recall
  1536-dim ada-002:    23.6x, 100% recall
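The ratios above can be sanity-checked from the stated r=32. Assuming the baseline is raw fp64 vectors and ignoring the shared V_T basis and metadata (which is presumably why the measured ratios land slightly below these ideals), per-entry storage is just the rank-32 coefficient row:

```python
def ideal_compression(dim, coeff_bytes, rank=32, raw_bytes=8):
    """Raw fp64 vector bytes divided by rank-r coefficient bytes per entry."""
    return (dim * raw_bytes) / (rank * coeff_bytes)

print(ideal_compression(384, 2))   # fp16 coefficients: 48.0x (measured: 46.5x)
print(ideal_compression(384, 4))   # fp32 coefficients: 24.0x (measured: 23.3x)
print(ideal_compression(384, 8))   # fp64 coefficients: 12.0x (measured: 11.6x)
```

Because the per-entry cost is fixed at rank × coeff_bytes, the ratio grows linearly with embedding dimension, which is why the 1536-dim ada-002 numbers come out roughly 4x better than the 384-dim ones at the same precision.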
Operational details (H100, 10M vectors):

  QPS: 317 single client, 183 at 100 concurrent
  Cold start: 8.88s from snapshot to first query
  24h soak: 2.9M queries, 8.6M inserts, zero data corruption
  Insert-under-query: 885 inserts/s concurrent with 101 QPS
  All artifacts (JSON + logs) available

Build uses streaming randomized SVD — peak VRAM equals serving size, not dataset size. The 2B run on B200 uses streaming coefficient regeneration, so the 512 GB coefficient matrix is never fully allocated in RAM.

Contact: brad@holonomx.com

Comments

INVARIAN•3h ago
B200 2B fp16 — complete.

  Metric           Value
  N                2,000,000,000
  GPU              NVIDIA B200 (191.5 GB)
  Build            1794.2s (Pass 1: 637s, Pass 2: 1157s)
  Serving          132.00 GB (Z=[2B, 32] fp16)
  Compression      46.5× fp64, 23.3× fp32
  Query p50        60.89 ms
  Query p99        62.58 ms
  R@10 mean        98.0%
  R@10 min         90.0%
  VRAM serving     131 GB
  VRAM query peak  142 GB
  CPU RAM post     9 GB
  Total wall       2693.6s (~45 min)

INVARIAN•3h ago
2.5B: Incoming