frontpage.

Nano-vLLM: How a vLLM-style inference engine works

https://neutree.ai/blog/nano-vllm-part-1
64•yz-yu•2h ago

Comments

jbarrow•55m ago
~The whole thing feels AI written, generated from the codebase.~

*edited in response to the author’s comments, my apologies.

For instance, it goes into (nano)vLLM internals and doesn’t mention PagedAttention once (one of the core ideas that vLLM is based on)[1].

Also mentions that Part 2 will cover dense vs MoEs, which is weird because nanovllm hardcodes a dense Qwen3 into the source.

Here are better (imo) explainers about how vLLM works:

- https://hamzaelshafie.bearblog.dev/paged-attention-from-firs...

- https://www.aleksagordic.com/blog/vllm

- https://huggingface.co/blog/continuous_batching

Aleksa’s blog is a bit in the weeds for my taste but it’s really worth working through.

A lot of the magic of vLLM happens in the PagedAttention kernels, which are really succinctly implemented in nanovllm. And the codebase is great and readable by itself!

—

1. https://arxiv.org/abs/2309.06180
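For anyone unfamiliar with the idea, the core of PagedAttention [1] is easy to sketch: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical block indices to physical block ids, so memory is allocated on demand rather than reserved up front for the maximum sequence length. A toy illustration (not vLLM's or nanovllm's actual code):

```python
# Toy sketch of paged KV-cache allocation: fixed-size blocks plus a
# per-sequence block table. Illustrative only, not real vLLM code.

BLOCK_SIZE = 4  # tokens per KV block (kept small here for illustration)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids

    def allocate(self):
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only every BLOCK_SIZE tokens,
        # so waste is bounded by one partially filled block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=8)
seq = Sequence(alloc)
for _ in range(9):               # 9 tokens -> ceil(9/4) = 3 blocks
    seq.append_token()
print(len(seq.block_table))      # 3
```

The point of the indirection is that a sequence's blocks need not be contiguous in physical memory, which is what lets the scheduler pack many sequences into one cache.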

lukax•16m ago
Not really in the PagedAttention kernels. Paged attention was integrated into FlashAttention so that FlashAttention kernels can be used both for prefill and decoding with paged KV. The only paged attention specific kernels are for copying KV blocks (device to device, device to host and host to device). At least for FA2 and FA3, vLLM maintained a fork of FA with paged attention patches.
yz-yu•7m ago
Hi jbarrow, thanks for your feedback and the links you shared—they're great reading for me (and likely for others too).

That said, I need to clarify: the content was not written by AI, and certainly not generated from the codebase in one shot. If there's some agent + prompt that can produce what I wrote, I'd love to learn it—it would've saved me two weekends :)

Before addressing your questions further, some context: I'm a developer with no ML background but plenty of Cloud Infra experience. I'm currently building an open-source AI Infra project, which is why I studied nano-vllm. So my writing reflects some gaps in ML knowledge.

To your specific points:

> it goes into (nano)vLLM internals and doesn't mention PagedAttention once

I didn't find any explicit "paged attention" naming in nano-vllm. After reading the first article you linked—specifically the "Paged KV Caching" section—I believe the block management logic and CPU/GPU block mapping it describes are exactly what I covered in both posts. It may not be the full picture of paged attention, but I interpreted what I saw in the code and captured the core idea. I think that's a reasonable outcome.
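To make the CPU/GPU block mapping concrete, here's a toy sketch of the swap-out step, with plain Python dicts standing in for device and host memory (illustrative only, not nano-vllm's code): when GPU blocks run out, a preempted sequence's KV blocks are copied to host slots and its block table is remapped so it can be swapped back in later.

```python
# Toy model of KV block swap-out (dicts stand in for GPU/CPU memory).
gpu_blocks = {0: "kv-a", 1: "kv-b", 2: "kv-c"}  # physical id -> block data
cpu_blocks = {}

def swap_out(block_table):
    """Move a preempted sequence's blocks device -> host, remap its table."""
    mapping = {}
    for gid in block_table:
        cid = len(cpu_blocks)              # next free host slot
        cpu_blocks[cid] = gpu_blocks.pop(gid)
        mapping[gid] = cid
    return [mapping[g] for g in block_table]

seq_table = [0, 2]                 # this sequence holds GPU blocks 0 and 2
seq_table = swap_out(seq_table)
print(seq_table, sorted(gpu_blocks))   # [0, 1] [1]
```

In a real engine the "copy" is a device-to-host memcpy kernel and the freed GPU blocks go back to the allocator for other sequences.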

> Part 2 will cover dense vs MoE's, which is weird because nanovllm hardcodes a dense Qwen3 into the source

This reflects my learning approach and background. Same as point 1—I may not have realized the block design was the famous PagedAttention implementation, so I didn't name it as such. For point 2, seeing a dense Qwen3 naturally made me wonder how it differs from the xx-B-A-yy-B MoE models I'd seen on Hugging Face—specifically, what changes in the decoder layers. That curiosity led me to learn about MoE and write it up for others with the same questions.
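As a toy illustration of that decoder-layer difference (hypothetical code, not Qwen3's): a dense FFN pushes every token through the same MLP, while an MoE layer routes each token to a small subset of expert MLPs, so total parameters grow with the expert count while the parameters active per token stay small—which is exactly the xx-B total vs A-yy-B active split in those model names.

```python
# Toy contrast between a dense FFN and an MoE FFN. Scalar "weights" stand
# in for full MLPs, and the router is a hand-coded rule, purely to keep
# the sketch deterministic; real models use learned routers.

def dense_ffn(x, w):
    # every token uses the single FFN's weights
    return [xi * w for xi in x]

def moe_ffn(x, expert_weights):
    # a "router" picks one expert per token; here we route by sign
    out = []
    for xi in x:
        eid = 0 if xi >= 0 else 1      # stand-in for learned routing
        out.append(xi * expert_weights[eid])
    return out

x = [1.0, -2.0, 3.0]
print(dense_ffn(x, 2.0))                       # [2.0, -4.0, 6.0]
print(moe_ffn(x, expert_weights=[2.0, 10.0]))  # [2.0, -20.0, 6.0]
```

The MoE layer above holds two experts' worth of weights but applies only one per token; scale the experts up and the total/active parameter gap is the whole story.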

---

I completely understand that in this era, people care more about whether what they're reading is AI-generated—no one wants to waste time on low-effort slop with no human involvement.

But as I explained above—and as my hand-drawn Excalidraw diagrams show (I haven't seen an LLM produce diagrams with logic that satisfies me)—this is the result of learning shaped by my own knowledge background and preferences.


My fast zero-allocation webserver using OxCaml

https://anil.recoil.org/notes/oxcaml-httpz
53•noelwelsh•4h ago•12 comments

Defeating a 40-year-old copy protection dongle

https://dmitrybrant.com/2026/02/01/defeating-a-40-year-old-copy-protection-dongle
700•zdw•17h ago•215 comments

Termux

https://github.com/termux/termux-app
201•tosh•4h ago•98 comments

Geologists may have solved mystery of Green River's 'uphill' route

https://phys.org/news/2026-01-geologists-mystery-green-river-uphill.html
4•defrost•1h ago•0 comments

Hypergrowth isn’t always easy

https://tailscale.com/blog/hypergrowth-isnt-always-easy
36•usrme•2d ago•17 comments

MaliciousCorgi: AI Extensions send your code to China

https://www.koi.ai/blog/maliciouscorgi-the-cute-looking-ai-extensions-leaking-code-from-1-5-milli...
41•tatersolid•2h ago•32 comments

My iPhone 16 Pro Max produces garbage output when running MLX LLMs

https://journal.rafaelcosta.me/my-thousand-dollar-iphone-cant-do-math/
364•rafaelcosta•18h ago•173 comments

Claude Code is suddenly everywhere inside Microsoft

https://www.theverge.com/tech/865689/microsoft-claude-code-anthropic-partnership-notepad
91•Anon84•3h ago•101 comments

4x faster network file sync with rclone (vs rsync) (2025)

https://www.jeffgeerling.com/blog/2025/4x-faster-network-file-sync-rclone-vs-rsync/
14•indigodaddy•3d ago•1 comment

Apple's MacBook Pro DFU port documentation is wrong

https://lapcatsoftware.com/articles/2026/2/1.html
150•zdw•11h ago•59 comments

Show HN: Apate API mocking/prototyping server and Rust unit test library

https://github.com/rustrum/apate
21•rumatoest•1d ago•8 comments

Show HN: Wikipedia as a doomscrollable social media feed

https://xikipedia.org
316•rebane2001•15h ago•114 comments

Ratchets in software development (2021)

https://qntm.org/ratchet
77•nvader•3d ago•24 comments

Show HN: NanoClaw – “Clawdbot” in 500 lines of TS with Apple container isolation

https://github.com/gavrielc/nanoclaw
446•jimminyx•16h ago•166 comments

Library of Juggling

https://libraryofjuggling.com/
48•tontony•7h ago•6 comments

Best Gas Masks

https://www.theverge.com/policy/868571/best-gas-masks
259•cdrnsf•3d ago•57 comments

Ian's Shoelace Site

https://www.fieggen.com/shoelace/
283•righthand•20h ago•46 comments

Adventure Game Studio: OSS software for creating adventure games

https://www.adventuregamestudio.co.uk/
362•doener•1d ago•77 comments

Apple I Advertisement (1976)

http://apple1.chez.com/Apple1project/Gallery/Gallery.htm
256•janandonly•21h ago•140 comments

Actors: A Model of Concurrent Computation [pdf] (1985)

https://apps.dtic.mil/sti/tr/pdf/ADA157917.pdf
113•kioku•14h ago•55 comments

Contracts in Nix

https://sraka.xyz/posts/contracts.html
80•todsacerdoti•1d ago•16 comments

EU launches government satcom program in sovereignty push

https://spacenews.com/eu-launches-government-satcom-program-in-sovereignty-push/
100•benkan•6h ago•46 comments

Board Games in Ancient Fiction: Egypt, Iran, Greece

https://reference-global.com/article/10.2478/bgs-2022-0016
29•bryanrasmussen•3d ago•10 comments

Rev up the viral factories

https://www.science.org/content/blog-post/rev-viral-factories
32•etiam•3d ago•1 comment

Leaked Chats Expose the Daily Life of a Scam Compound's Enslaved Workforce

https://www.wired.com/story/the-red-bull-leaks/
206•smurda•10h ago•111 comments

Microsoft is walking back Windows 11's AI overload

https://www.windowscentral.com/microsoft/windows-11/microsoft-is-reevaluating-its-ai-efforts-on-w...
123•jsheard•3h ago•170 comments

Building Your Own Efficient uint128 in C++

https://solidean.com/blog/2026/building-your-own-u128/
101•PaulHoule•18h ago•43 comments

Efficient String Compression for Modern Database Systems

https://cedardb.com/blog/string_compression/
142•jandrewrogers•2d ago•41 comments

Two kinds of AI users are emerging

https://martinalderson.com/posts/two-kinds-of-ai-users-are-emerging/
276•martinald•15h ago•255 comments