A common pattern I kept seeing is to split the problem into two stages:
1. Retrieve a small set of relevant candidates
2. Re-rank them using a model
Instead of running model inference across all items, I built a small prototype around this idea.
The flow looks like this:
- Store embeddings in a vector database (ChromaDB)
- Retrieve the Top-K most similar items/users based on vector similarity
- Run a TensorFlow.js model to re-rank the candidates
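The two stages above can be sketched in plain JavaScript. This is a minimal stand-in, not the actual prototype: the in-memory array replaces ChromaDB, and the scoring function replaces the TensorFlow.js model, but the shape of the flow (vector retrieval first, re-ranking on the small candidate set second) is the same.

```javascript
// Cosine similarity between two embedding vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Stage 1: retrieve the Top-K most similar items from the store
// (a real system would ask the vector database for this instead of scanning).
function retrieveTopK(queryEmbedding, store, k) {
  return store
    .map(item => ({ ...item, sim: cosine(queryEmbedding, item.embedding) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, k);
}

// Stage 2: re-rank only the K candidates with a model score
// (here a stand-in function; in the prototype, a TF.js model's prediction).
function rerank(candidates, scoreFn) {
  return candidates
    .map(c => ({ ...c, score: scoreFn(c) }))
    .sort((x, y) => y.score - x.score);
}

// Tiny example corpus with 2-d embeddings.
const store = [
  { id: "a", embedding: [1, 0] },
  { id: "b", embedding: [0.9, 0.1] },
  { id: "c", embedding: [0, 1] },
];

const candidates = retrieveTopK([1, 0], store, 2);
const ranked = rerank(candidates, c => c.sim);
console.log(ranked.map(r => r.id)); // [ 'a', 'b' ]
```

The point of the split is visible even here: the model only ever sees K candidates, so inference cost stops depending on corpus size.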
The goal is to reduce the search space before applying inference, which seems necessary when latency and scale matter.
What I found interesting is that once you move to this approach, a lot of the complexity shifts from the model itself to the retrieval layer:
- choosing K
- filtering candidates
- embedding quality
- latency vs recall trade-offs
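One way to make the K choice empirical rather than a guess (a sketch, assuming you can afford a brute-force full scan offline on a sample of queries): measure the retrieval stage's recall@K against the full-scan ranking, then pick the smallest K that meets your recall target.

```javascript
// Fraction of the "true" top results (from an offline full scan)
// that the Top-K vector retrieval actually surfaced.
function recallAtK(trueTopIds, retrievedIds) {
  const retrieved = new Set(retrievedIds);
  const hits = trueTopIds.filter(id => retrieved.has(id)).length;
  return hits / trueTopIds.length;
}

// Example: full scan says a, b, c matter; K=3 retrieval returned a, b, d.
console.log(recallAtK(["a", "b", "c"], ["a", "b", "d"])); // ≈ 0.667
```

Sweeping K and plotting recall against retrieval latency makes the trade-off concrete instead of a gut call.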
Curious how others approach this in real systems:
- How do you decide on K?
- Do you rely purely on vector similarity or add heuristics?
- How do you handle re-ranking at scale?
Project: https://github.com/ftonato/recommendation-system-chromadb-tf...