We work a lot with multimodal embeddings: semantic search and image-to-image retrieval over massive CCTV datasets. We have 200M+ CLIP vectors indexed in vector DBs.
On the other hand, SigLIP smokes CLIP: roughly 5-10% better recall@1 on our test datasets. But re-embedding? Weeks of GPU time over 200M+ images is hugely expensive.
So we made vector Rosetta: a 50M-param adapter that translates CLIP embeddings to SigLIP embeddings purely in embedding space. 41x faster than re-embedding, zero images touched.
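For anyone curious what "adapter purely in embedding space" can look like, here's a minimal sketch in PyTorch. The dims (768-d CLIP ViT-L/14 in, 1152-d SigLIP out), width, depth, and cosine loss are illustrative assumptions, not our exact recipe, and `EmbeddingAdapter` is a made-up name:

```python
import torch
import torch.nn as nn

class EmbeddingAdapter(nn.Module):
    """Maps source-model embeddings into a target model's embedding space.

    Dims are illustrative: e.g. CLIP ViT-L/14 outputs 768-d vectors,
    SigLIP-style models output 1152-d. Width/depth are placeholders.
    """
    def __init__(self, src_dim=768, tgt_dim=1152, hidden=2048, depth=4):
        super().__init__()
        layers = [nn.Linear(src_dim, hidden), nn.GELU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.GELU()]
        layers.append(nn.Linear(hidden, tgt_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Work on the unit sphere: CLIP and SigLIP retrieval both use
        # cosine similarity, so normalize on the way in and out.
        x = nn.functional.normalize(x, dim=-1)
        return nn.functional.normalize(self.net(x), dim=-1)

def cosine_loss(pred, target):
    # One plausible training objective: pull translated vectors toward
    # the true target-model embeddings of the same images, computed
    # once on a small paired subset.
    return (1 - nn.functional.cosine_similarity(pred, target, dim=-1)).mean()
```

Once trained, something like this runs over stored vectors only: no image decode, no vision-tower forward pass, which is where the speedup over re-embedding comes from.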
Numbers:
- 90.9% cosine similarity preserved
- Rank@1: 94.3% (10K pool), 84.4% (100K pool)
- COCO photos: 90.1%; WikiArt: 85.7%
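Rank@1 here means: given a translated vector, is its nearest neighbor in a pool of true SigLIP embeddings the embedding of the same image? A sketch of that check, assuming paired, L2-normalized tensors (pool construction details simplified):

```python
import torch

def rank_at_1(translated: torch.Tensor, true_siglip: torch.Tensor) -> float:
    """Fraction of translated vectors whose nearest true-SigLIP neighbor
    (by cosine similarity) belongs to the same image.

    translated:  (N, D) adapter outputs, L2-normalized
    true_siglip: (N, D) ground-truth SigLIP embeddings, L2-normalized,
                 where row i is the same image as translated[i]
    """
    sims = translated @ true_siglip.T           # (N, N) cosine similarities
    nearest = sims.argmax(dim=1)                # best match for each query
    correct = nearest == torch.arange(len(translated))
    return correct.float().mean().item()
```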
We've added the link to the model; we thought it might be useful to other people.