Are these sorts of similarity searches useful for classifying text?
OutOfHere•39m ago
It depends entirely on the quality and suitability of the embedding vector you provide. Even with a long embedding vector from a recent model, my estimate is that the classification will be better than random but not very accurate. You would typically do better by asking a large model directly for a classification. The good thing is that it is often easy to create a small human-labeled dataset and estimate the confusion matrix for each approach.
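A minimal sketch of that comparison, assuming you already have gold labels and the predictions from each approach as plain lists. The labels and predictions below are made-up placeholder data, not measurements, and `pred_a` / `pred_b` are hypothetical names standing in for the two approaches.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count (gold_label, predicted_label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))

def accuracy(gold, predicted):
    """Fraction of items where the prediction matches the gold label."""
    matrix = confusion_matrix(gold, predicted)
    correct = sum(n for (g, p), n in matrix.items() if g == p)
    return correct / len(gold) if gold else 0.0

# Placeholder data for illustration only.
gold   = ["spam", "spam", "ham", "ham", "ham", "spam"]
pred_a = ["spam", "ham", "ham", "ham", "spam", "spam"]  # e.g. embedding-based
pred_b = ["spam", "spam", "ham", "ham", "ham", "ham"]   # e.g. direct model call

for name, pred in [("a", pred_a), ("b", pred_b)]:
    print(name, accuracy(gold, pred), dict(confusion_matrix(gold, pred)))
```

With only a handful of labeled examples the estimates are noisy, so treat the numbers as a rough guide rather than a verdict.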
CuriouslyC•27m ago
Embeddings are good at partitioning document stores at a coarse-grained level, and they can be very useful for documents where there's a lot of keyword overlap and the semantic differentiation is distributed. They're definitely not a good primary recall mechanism, and they often don't even pull their weight for their cost in hybrid setups, so it's worth doing evals for your specific use case.
esafak•17m ago
You could assign the cluster based on what the k nearest neighbors are, if there is a clear majority. The quality will depend on the suitability of your embeddings.
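A rough sketch of that idea, assuming cosine similarity over precomputed embeddings and a vote threshold for "clear majority". The function name `knn_label`, the `min_share` parameter, and the toy 2-d vectors are all hypothetical; real embeddings would come from whatever model you use.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_label(query, labeled, k=5, min_share=0.6):
    """Return the majority label among the k nearest neighbors,
    or None when no label reaches min_share of the votes."""
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)
    top = [label for _, label in ranked[:k]]
    if not top:
        return None
    label, votes = Counter(top).most_common(1)[0]
    return label if votes / len(top) >= min_share else None

# Toy labeled vectors for illustration only.
labeled = [
    ([1.0, 0.0], "sports"),
    ([0.9, 0.1], "sports"),
    ([0.8, 0.2], "sports"),
    ([0.0, 1.0], "politics"),
    ([0.1, 0.9], "politics"),
]

print(knn_label([1.0, 0.05], labeled, k=3))                 # all three neighbors agree
print(knn_label([1.0, 0.05], labeled, k=4, min_share=0.8))  # 3 of 4 is below the threshold
```

Returning `None` instead of forcing a label leaves the ambiguous cases for a fallback, such as a human or a larger model.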
_pdp_•18m ago
I thought you need memory for these things and CPU is not the bottleneck?
clemlesne•2h ago