Ask HN: How to serve inference as we do with containers, with cached tokens
1•elesbao•2h ago
I've been reading and experimenting with vLLM, but it seems each day there are more and more articles and AI-generated long-form posts about every part of the stack. I have a few GPUs and work for a private education group. I want to run models internally and distribute access to a research team; I don't want one (or more) GPUs per user, nor do I want to train models. Currently I'm doing well with a local Qwen on my own single server, but I can't wrap my head around which part to tackle. Right now I'm looking at KV caches and building on top of vLLM, but I wanted something simple and secure that won't leak data from one session to another.
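For reference, my current single-server setup is roughly the stock vLLM OpenAI-compatible server (the model name here is illustrative, and flag names should be double-checked against your vLLM version's docs):

```shell
# Sketch of a single-server vLLM deployment (model name illustrative).
# --api-key adds shared-secret auth for the team;
# --enable-prefix-caching turns on automatic KV-cache prefix reuse
# across requests on this server.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --api-key "$VLLM_API_KEY" \
  --enable-prefix-caching
```

Researchers then hit it with any OpenAI-compatible client pointed at this host, which is the part I'd like to distribute without giving each user their own GPU.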