Demo: https://mxp.co/r/nga
Stack: SigLIP (768-dim embeddings), Ray on 2× L4 GPUs, Qdrant. ~2 hours to process, <100ms queries.
Why SigLIP over CLIP: sigmoid loss scores each image-text pair independently instead of softmax-normalizing over the batch, so embeddings live in a global semantic space and similarity scores stay consistent at scale rather than being batch-relative.
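For concreteness, a minimal sketch of producing those 768-dim embeddings with the Hugging Face `transformers` SigLIP port (the checkpoint name is an assumption; any siglip-base variant outputs 768 dims):

```python
# Minimal sketch, assuming Hugging Face transformers' SigLIP support.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

CKPT = "google/siglip-base-patch16-224"  # assumption: base SigLIP is 768-dim
model = AutoModel.from_pretrained(CKPT).eval()
processor = AutoProcessor.from_pretrained(CKPT)

@torch.no_grad()
def embed_text(text: str) -> torch.Tensor:
    # SigLIP was trained on padded fixed-length text, hence padding="max_length".
    inputs = processor(text=[text], padding="max_length", return_tensors="pt")
    emb = model.get_text_features(**inputs)  # (1, 768)
    return torch.nn.functional.normalize(emb, dim=-1)[0]

@torch.no_grad()
def embed_image(img: Image.Image) -> torch.Tensor:
    inputs = processor(images=[img], return_tensors="pt")
    emb = model.get_image_features(**inputs)  # (1, 768)
    return torch.nn.functional.normalize(emb, dim=-1)[0]
```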
The interesting part is the retriever. One stage, three optional inputs (sketched below):

- text → encode → kNN
- image → encode → kNN
- document_id → lookup stored embedding → kNN
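A sketch of that stage against Qdrant, under stated assumptions: the collection name, the `embed_*` helpers (SigLIP sketch above), and `rrf()` (sketched below) are mine, not Mixpeek's actual config:

```python
# Sketch of the single-stage retriever; "frames" is a hypothetical collection.
from qdrant_client import QdrantClient

client = QdrantClient("localhost", port=6333)
COLLECTION = "frames"

def knn(vector: list[float], limit: int = 50) -> list:
    hits = client.search(collection_name=COLLECTION, query_vector=vector, limit=limit)
    return [hit.id for hit in hits]  # ranked ids, best first

def retrieve(text=None, image=None, document_id=None, limit=50):
    ranked_lists = []
    if text is not None:
        ranked_lists.append(knn(embed_text(text).tolist(), limit))
    if image is not None:
        ranked_lists.append(knn(embed_image(image).tolist(), limit))
    if document_id is not None:
        # Look up the stored embedding rather than re-encoding anything.
        point = client.retrieve(collection_name=COLLECTION, ids=[document_id],
                                with_vectors=True)[0]
        ranked_lists.append(knn(point.vector, limit))
    if len(ranked_lists) == 1:
        return ranked_lists[0]        # single input: plain kNN, no fusion
    return rrf(ranked_lists)[:limit]  # multiple inputs: fuse by rank
```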
Pass any combination. If multiple, fuse with reciprocal rank fusion (RRF). No score normalization needed—RRF only cares about rank position.
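RRF itself is a few lines; k=60 is the conventional constant from the original RRF paper (Cormack et al., 2009):

```python
from collections import defaultdict

def rrf(ranked_lists: list[list], k: int = 60) -> list:
    # score(d) = sum over lists of 1 / (k + rank_of_d).
    # Only rank positions matter, so raw scores never need normalizing.
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```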
Killer query: pass a document_id + text like "but wearing blue." RRF combines structural similarity with the text constraint.
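Using the sketches above, that call would look like:

```python
# Hypothetical document id; fuses "similar to this frame" with the text constraint.
results = retrieve(document_id="a1b2c3d4", text="but wearing blue")
```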
Blog with full config: https://mixpeek.com/blog/visual-search-rrf/