EDIT: -- OK, it's legit; here is an example of it put to use by the makers of the Dolphin open-source series of fine-tunes:
> Here I implement, in nano-vllm, efficient sample-K logit extraction, as described in "Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs" by Anshumann et al. Sampling occurs on the GPU; the non-sampled logits do not get copied out of GPU memory. I tried to implement this in @vllm_project, but it was a bit too heavy for me to figure out.
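(For the curious, a rough PyTorch sketch of the idea as I read that quote -- not the paper's or nano-vllm's actual code: sample K token ids per position on the GPU, gather just those K logits, and copy only that small slice to the host.)

    import torch

    def sample_k_logits(logits: torch.Tensor, k: int = 64):
        """logits: [batch, vocab] tensor living on the GPU."""
        probs = torch.softmax(logits, dim=-1)
        # Sample K token ids per row from the model's own distribution
        # (with replacement here, purely as a stand-in sampler).
        ids = torch.multinomial(probs, num_samples=k, replacement=True)  # [batch, k]
        vals = torch.gather(logits, dim=-1, index=ids)                   # [batch, k]
        # Only the K sampled (id, logit) pairs cross to the host; the full
        # [batch, vocab] tensor never leaves GPU memory.
        return ids.cpu(), vals.cpu()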
vllm is optimized to serve many requests at once.
If you were to fine-tune a model and wanted to serve it to many users, you would use vllm, not llama.cpp.
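For example, the typical offline batched-generation path in vllm looks roughly like this (the checkpoint path is a placeholder for your own fine-tune):

    from vllm import LLM, SamplingParams

    # Placeholder path to a fine-tuned checkpoint.
    llm = LLM(model="./my-finetuned-model")
    params = SamplingParams(temperature=0.7, max_tokens=128)
    # vllm batches concurrent requests together on the GPU.
    outputs = llm.generate(["Hello!", "What is continuous batching?"], params)
    for out in outputs:
        print(out.outputs[0].text)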
But having the CUDA packages four times in different layers is questionable! [3]
Then again, as a college mate of mine used to say, "Don't change it. It works."
--
[1]: https://hub.docker.com/r/vllm/vllm-openai/tags
[2]: https://github.com/vllm-project/vllm/issues/13306
[3]: These kinds of workarounds tend to end up accumulating and never get reviewed back:
- https://github.com/vllm-project/vllm/commit/b07d741661570ef1...
- https://github.com/vllm-project/vllm/commit/68d37809b9b52f4d... (this one in particular probably accounts for +3 GB)
msephton•7mo ago