The Problem:
Everyone is using HNSW (graph indexes) for vector search. It works well on servers, but it brings index build time, per-edge memory overhead, and random access patterns that kill performance on consumer hardware.
The Project:
QingMing is a header-only C++ engine that implements exact brute-force search. Instead of pruning the search space, I optimized the memory access pattern to saturate the GDDR6/GDDR7 bandwidth of consumer GPUs.
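The core idea is simple enough to sketch. Here is a hypothetical CPU-side version of the exact top-k scan (names like `flat_db` and `top_k` are illustrative, not the library's actual API): every vector is visited, so recall is exact by construction, and speed comes from the purely sequential traversal of a flat row-major array rather than from pruning.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Exact k-nearest-neighbor scan over a flat, row-major float array.
// No index, no build step: the database is just n * dim contiguous floats.
std::vector<std::pair<float, std::size_t>>
top_k(const std::vector<float>& flat_db, std::size_t dim,
      const std::vector<float>& query, std::size_t k) {
    const std::size_t n = flat_db.size() / dim;
    std::vector<std::pair<float, std::size_t>> dists;
    dists.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float* v = &flat_db[i * dim];   // strictly sequential reads
        float d = 0.0f;
        for (std::size_t j = 0; j < dim; ++j) {
            const float diff = v[j] - query[j];
            d += diff * diff;                 // squared L2 distance
        }
        dists.emplace_back(d, i);
    }
    // Only the k best results need full ordering.
    std::partial_sort(dists.begin(), dists.begin() + k, dists.end());
    dists.resize(k);
    return dists;
}
```

The inner loop is trivially vectorizable (NEON/AVX on CPU, one thread per vector on GPU), which is why the whole approach lives or dies on memory bandwidth.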
Benchmarks (Consumer Hardware):
Desktop (NVIDIA RTX 5090D - 24GB)
---------------------------------
Dataset: SIFT-1M (128-dim)
Recall: 99.2% @ 1 (FP32 variance), 100% @ 10
Throughput: 9,354 QPS (Batch=10k)
Latency: ~5.5ms (P99)
Build Time: 0 seconds
Desktop (AMD Radeon 7900 XTX - 24GB)
------------------------------------
Dataset: SIFT-1M (128-dim)
Recall: 99.2% @ 1, 100% @ 10
Throughput: 6,275 QPS (Batch=10k)
Latency: ~11.2ms (P99)
Note: Running via HIP/ROCm 6.2 on Ubuntu
Mobile (Snapdragon 8 Gen 5)
---------------------------
Scenario: 100k Vectors (128d) for personal knowledge base
Latency: ~8ms per query
Endurance: Ran 10k consecutive queries with ZERO thermal throttling
(due to L3/System Cache residency optimization)
Why use this?
1. Local RAG: Run high-quality retrieval on your gaming PC or phone.
2. Simplicity: No hyperparameters to tune (ef_search, M, nprobe).
3. Deterministic: No approximation errors for critical data.
Happy to answer questions about the NEON/CUDA/HIP memory coalescing details!
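For the curious, the coalescing trick boils down to layout: on a GPU, adjacent threads should load adjacent addresses, so the database is stored dimension-major (transposed) and the distance loop runs over dimensions on the outside. A rough CPU-side sketch of that traversal (assumed layout, not the actual kernel):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Squared L2 distances against a dimension-major (transposed) database.
// On a GPU, "thread i" handles vector i; at each dimension step d the
// warp reads db_t[d * n + i] for consecutive i — contiguous addresses,
// so the loads coalesce into full-width memory transactions.
std::vector<float> distances_transposed(const std::vector<float>& db_t,
                                        std::size_t n, std::size_t dim,
                                        const std::vector<float>& query) {
    std::vector<float> dist(n, 0.0f);
    for (std::size_t d = 0; d < dim; ++d) {    // outer loop: dimensions
        const float q = query[d];
        const float* col = &db_t[d * n];       // one contiguous slab of the transpose
        for (std::size_t i = 0; i < n; ++i) {  // inner loop: vectors (~ threads)
            const float diff = col[i] - q;
            dist[i] += diff * diff;
        }
    }
    return dist;
}
```

Row-major storage would make each thread stride by `dim` floats between loads; the transpose turns the same arithmetic into unit-stride access across the warp.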