I built this to understand how vLLM works internally.
Comments
zahlman•1mo ago
I'm not familiar with the thing you're recreating (I gather it's something to do with getting better responses out of LLMs by manipulating the context or something like that?) but I appreciate that you haven't, like so many others, dropped ten paragraphs of Markdown-formatted press release (without bothering to check whether the formatting even works here) on us echoing a bunch of marketing-speak in a README.
ubermenchh•1mo ago
Haha, I just wanted my repo to be out here.
If someone finds it interesting they can always just check the repo.
And you're close, it's about getting faster responses from the model by manipulating the request queues and memory.
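Roughly, the memory side is vLLM-style paged KV caching: each request's cache is carved into fixed-size blocks that map into a shared pool, so nothing is reserved for the full max sequence length up front. A minimal sketch of the bookkeeping (hypothetical names, not the actual classes from the repo):

    # Paged KV-cache bookkeeping: logical blocks per request -> physical blocks in a shared pool.
    class BlockManager:
        def __init__(self, num_blocks: int, block_size: int = 16):
            self.block_size = block_size
            self.free_blocks = list(range(num_blocks))        # physical block ids still available
            self.block_tables: dict[int, list[int]] = {}      # request id -> its physical blocks

        def allocate(self, request_id: int, num_tokens: int) -> None:
            # Grow the request's block table just enough to cover num_tokens.
            table = self.block_tables.setdefault(request_id, [])
            needed = -(-num_tokens // self.block_size) - len(table)   # ceiling division
            for _ in range(needed):
                table.append(self.free_blocks.pop())          # raises if the pool is exhausted

        def free(self, request_id: int) -> None:
            # Return the request's blocks to the pool once it finishes.
            self.free_blocks.extend(self.block_tables.pop(request_id, []))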
dmarwicke•1mo ago
does this do continuous batching or just static? couldn't tell from the code
ubermenchh•1mo ago
Yes, it does continuous batching along with paged attention and prefix caching.
I am also going to be adding some more inference techniques.
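To spell out the continuous-vs-static distinction, here is a toy sketch of a continuous-batching scheduler loop (hypothetical names, not the repo's actual scheduler): finished sequences leave the batch immediately and waiting requests join mid-flight, instead of the whole batch draining before new work starts.

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class Request:
        prompt_tokens: list[int]
        max_new_tokens: int
        generated: list[int] = field(default_factory=list)

        def is_finished(self) -> bool:
            return len(self.generated) >= self.max_new_tokens

    class Scheduler:
        def __init__(self, max_batch_size: int = 8):
            self.waiting: deque[Request] = deque()
            self.running: list[Request] = []
            self.max_batch_size = max_batch_size

        def add(self, req: Request) -> None:
            self.waiting.append(req)

        def step(self) -> list[Request]:
            # Continuous batching: drop finished requests and admit waiting ones
            # on every decode step, rather than waiting for the whole batch to drain.
            self.running = [r for r in self.running if not r.is_finished()]
            while self.waiting and len(self.running) < self.max_batch_size:
                self.running.append(self.waiting.popleft())
            return self.running  # the model then runs one decode step over this batch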