The Problem:
Everyone is using HNSW (graph indexes) for vector search. It works well on servers, but it brings index build time, per-edge memory overhead, and random access patterns that kill performance on consumer hardware.
The Project:
QingMing is a header-only C++ engine that implements exact brute-force search. Instead of pruning the search space, I optimized the memory access pattern to saturate the GDDR6/GDDR7 bandwidth of consumer GPUs.
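The core idea is simple enough to sketch. Here is a hypothetical CPU-side version of the exact top-k scan (names like `flat_db` and `top_k` are illustrative, not the library's actual API): every vector is visited, so recall is exact by construction, and speed comes from the purely sequential traversal of a flat row-major array rather than from pruning.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Exact k-nearest-neighbor scan over a flat, row-major float array.
// No index, no build step: the database is just n * dim contiguous floats.
std::vector<std::pair<float, std::size_t>>
top_k(const std::vector<float>& flat_db, std::size_t dim,
      const std::vector<float>& query, std::size_t k) {
    const std::size_t n = flat_db.size() / dim;
    std::vector<std::pair<float, std::size_t>> dists;
    dists.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float* v = &flat_db[i * dim];   // strictly sequential reads
        float d = 0.0f;
        for (std::size_t j = 0; j < dim; ++j) {
            const float diff = v[j] - query[j];
            d += diff * diff;                 // squared L2 distance
        }
        dists.emplace_back(d, i);
    }
    // Only the k best results need full ordering.
    std::partial_sort(dists.begin(), dists.begin() + k, dists.end());
    dists.resize(k);
    return dists;
}
```

The inner loop is trivially vectorizable (NEON/AVX on CPU, one thread per vector on GPU), which is why the whole approach lives or dies on memory bandwidth.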
Benchmarks (Consumer Hardware):
Desktop (NVIDIA RTX 5090D - 24GB)
---------------------------------
Dataset: SIFT-1M (128-dim)
Recall: 99.2% @ 1 (FP32 variance), 100% @ 10
Throughput: 9,354 QPS (Batch=10k)
Latency: ~5.5ms (P99)
Build Time: 0 seconds
Desktop (AMD Radeon 7900 XTX - 24GB)
------------------------------------
Dataset: SIFT-1M (128-dim)
Recall: 99.2% @ 1, 100% @ 10
Throughput: 6,275 QPS (Batch=10k)
Latency: ~11.2ms (P99)
Note: Running via HIP/ROCm 6.2 on Ubuntu
Mobile (Snapdragon 8 Gen 5)
---------------------------
Scenario: 100k Vectors (128d) for personal knowledge base
Latency: ~8ms per query
Endurance: Ran 10k consecutive queries with ZERO thermal throttling
(due to L3/System Cache residency optimization)
Why use this?
1. Local RAG: Run high-quality retrieval on your gaming PC or phone.
2. Simplicity: No hyperparameters to tune (ef_search, M, nprobe).
3. Deterministic: No approximation errors for critical data.
Happy to answer questions about the NEON/CUDA/HIP memory coalescing details!
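For the curious, the coalescing trick boils down to layout: on a GPU, adjacent threads should load adjacent addresses, so the database is stored dimension-major (transposed) and the distance loop runs over dimensions on the outside. A rough CPU-side sketch of that traversal (assumed layout, not the actual kernel):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Squared L2 distances against a dimension-major (transposed) database.
// On a GPU, "thread i" handles vector i; at each dimension step d the
// warp reads db_t[d * n + i] for consecutive i — contiguous addresses,
// so the loads coalesce into full-width memory transactions.
std::vector<float> distances_transposed(const std::vector<float>& db_t,
                                        std::size_t n, std::size_t dim,
                                        const std::vector<float>& query) {
    std::vector<float> dist(n, 0.0f);
    for (std::size_t d = 0; d < dim; ++d) {    // outer loop: dimensions
        const float q = query[d];
        const float* col = &db_t[d * n];       // one contiguous slab of the transpose
        for (std::size_t i = 0; i < n; ++i) {  // inner loop: vectors (~ threads)
            const float diff = col[i] - q;
            dist[i] += diff * diff;
        }
    }
    return dist;
}
```

Row-major storage would make each thread stride by `dim` floats between loads; the transpose turns the same arithmetic into unit-stride access across the warp.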