All this tells me is that: the “multiple embeddings” are probably mostly overlapping and the marginal value of each additional one is probably low, if you can represent them with a single embedding.
I don’t otherwise see how you can keep comparable performance without breaking information theory.
This is the point of the paper. Specifically, that single embedding vectors are sparse enough that you can compact more data from additional vectors together to improve retrieval performance.
Essentially, full multi vector comparison is challenging performance wise. Tools and performance for single vectors are much better. To compromise, cluster into k chunks and concatenate. Then you can do k-vector comparison at once with single-vector tooling and performance.
Ultimately the fixed length vector comes from having a fixed number of partitions, so this is kind of just k-means style clustering of the token level embeddings.
Presumably a dynamic clustering of the tokens could be even better, though that would leave you with a variable number of embeddings per document.
trengrj•7h ago
When looking at multi-vector / ColBERT style approaches, the embedding per token approach can massively increase costs. You might go from a single 768 dimension vector to 128 x 130 = 16,640 dimensions. Even with better results from a multi-vector model this can make it unfeasible for many use-cases.
Muvera, converts the multiple vectors into a single fixed dimension (usually net smaller) vector that can be used by any ANN index. As you now have a single vector you can use all your existing ANN algorithms and stack other quantization techniques for memory savings. In my opinion it is a much better approach than PLAID because it doesn't require specific index structures or clustering assumptions and can achieve lower latency.