Nano-Vllm: Lightweight vLLM implementation built from scratch

https://github.com/GeeeekExplorer/nano-vllm

125•simonpure•7mo ago

Comments

unwind•7mo ago

Meta: the Title Casing in the title is pretty obnoxious; "Vllm" is exactly the inverse, casing-wise, of how the project spells its name.

msephton•7mo ago

FWIW, OP has a small window of time to correct the casing after posting.

futurecliff•7mo ago

How did you do it? Which part of the vLLM refactoring allowed you to get such gains?

zackify•7mo ago

Will this end up getting an OpenAI-compatible web server, or is that out of scope?

jimmySixDOF•7mo ago

A little sparse on the documentation side; I can't tell at a glance whether there is 1:1 hyperparameter tunability or whether this is an opinionated, single-path, locked, soft-FPGA, eval-hacking kind of thing.

EDIT: OK, it's legit. Here is an example of it put to use by the makers of the Dolphin open-source series of fine-tunes:

> Here I implement, in nano-vllm, efficient sample-K logit extraction, as described in "Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs" by Anshumann et al. Sampling occurs on the GPU; the non-sampled logits do not get copied out of GPU space. I tried to implement this in @vllm_project, but it was a bit too heavy for me to figure out.

https://github.com/GeeeekExplorer/nano-vllm/pull/34
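For readers unfamiliar with the idea, here is a minimal CPU-only sketch of what "sample-K logit extraction" could look like. It is not the code from the PR: the function names are hypothetical, it uses pure Python rather than GPU tensors, and the count/K weighting is an assumption about how the sampled tokens are turned into sparse distillation targets.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_k_sparse_targets(logits, k, rng):
    """Hypothetical helper: draw k token ids (with replacement) from the
    teacher distribution and keep only the sampled ids.

    Returns {token_id: weight}, where weight = (times sampled) / k.
    Tokens that were never sampled are not stored at all, which is the
    point: only a small sparse record leaves the sampling step.
    """
    probs = softmax(logits)
    draws = rng.choices(range(len(probs)), weights=probs, k=k)
    counts = Counter(draws)
    return {tok: c / k for tok, c in sorted(counts.items())}


if __name__ == "__main__":
    rng = random.Random(0)
    teacher_logits = [2.0, 1.0, 0.1, -1.0]  # toy 4-token vocabulary
    print(sample_k_sparse_targets(teacher_logits, k=8, rng=rng))
```

With a real vocabulary of tens of thousands of tokens and a small K, the stored record per position is at most K entries instead of the full logit vector.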

baalimago•7mo ago

So... It's a language model..? As in, not "large"? I'm a bit unsure of the magnitudes here, but surely "nano" and "large" cancel out.

IanCal•7mo ago

No, vLLM is a thing for serving language models: https://github.com/vllm-project/vllm

barrenko•7mo ago

Is it more like llama.cpp then? I don't have access to the good hardware.

jasonjmcghee•7mo ago

llama.cpp is optimized to serve one request at a time.

vllm is optimized to serve many requests at one time.

If you were to fine-tune a model and wanted to serve it to many users, you would use vllm, not llama.cpp.
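To make the "many requests at one time" point concrete, here is a toy simulation, not taken from vLLM or llama.cpp. It assumes each forward pass advances every running request by one token and that a finished request's slot is refilled immediately (a simplified form of continuous batching). It ignores prefill, memory limits, and the fact that a batched step costs more than a single-request step; the function names are hypothetical.

```python
from collections import deque


def steps_sequential(lengths):
    """One request at a time: each generated token costs one forward pass."""
    return sum(lengths)


def steps_continuous_batching(lengths, max_batch):
    """Toy scheduler: up to max_batch requests share each forward pass.

    lengths holds the number of tokens each request still has to generate.
    Returns the number of forward passes until all requests are finished.
    """
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill free slots from the waiting queue before every step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One forward pass generates one token for every running request.
        running = [remaining - 1 for remaining in running]
        # Drop finished requests so their slots can be reused.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    requests = [3, 5, 2, 4]
    print(steps_sequential(requests))              # 14
    print(steps_continuous_batching(requests, 2))  # 9
    print(steps_continuous_batching(requests, 4))  # 5
```

The number of forward passes drops as the batch grows, which is why a batching server pays off once there are many concurrent users.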

jasonjmcghee•7mo ago

Here's a super relevant comment from another post: https://news.ycombinator.com/item?id=44366418

barrenko•7mo ago

Appreciate it!

fractorial•7mo ago

Did anyone else click in excitedly after misreading ‘Vllm’ as ‘LLVM’?

omneity•7mo ago

This is an incredible achievement for a solo developer. The dev is from the DeepSeek team, by the way.

Imustaskforhelp•7mo ago

That is crazy! This is so cool ngl.

tt726259•7mo ago

After seeing the Docker image for vllm jump by 5 GB (to 10 GB!) over the past five months, I grew suspicious of vllm's development practices [1]. It's not easy, for sure, to deal with all those flaky Python modules [2].

But having the CUDA packages four times in different layers is questionable! [3]

Then again, as a college mate of mine used to say, "Don't change it. It works."

[1]: https://hub.docker.com/r/vllm/vllm-openai/tags

[2]: https://github.com/vllm-project/vllm/issues/13306

[3]: These kinds of workarounds tend to accumulate and never get revisited:

- https://github.com/vllm-project/vllm/commit/b07d741661570ef1...

- https://github.com/vllm-project/vllm/commit/68d37809b9b52f4d... (this one in particular probably accounts for +3 GB)

mountainriver•7mo ago

Love this project. We need more simplifications like this in the current ML environment.
