
Tiny C Compiler

https://bellard.org/tcc/
132•guerrilla•4h ago•58 comments

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

https://github.com/localgpt-app/localgpt
14•yi_wang•1h ago•1 comment

SectorC: A C Compiler in 512 bytes

https://xorvoid.com/sectorc.html
218•valyala•8h ago•41 comments

Speed up responses with fast mode

https://code.claude.com/docs/en/fast-mode
126•surprisetalk•8h ago•132 comments

Software factories and the agentic moment

https://factory.strongdm.ai/
150•mellosouls•11h ago•309 comments

Brookhaven Lab's RHIC concludes 25-year run with final collisions

https://www.hpcwire.com/off-the-wire/brookhaven-labs-rhic-concludes-25-year-run-with-final-collis...
49•gnufx•7h ago•51 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
893•klaussilveira•1d ago•271 comments

Show HN: Craftplan – Elixir-based micro-ERP for small-scale manufacturers

https://puemos.github.io/craftplan/
13•deofoo•4d ago•1 comment

Stories from 25 Years of Software Development

https://susam.net/twenty-five-years-of-computing.html
143•vinhnx•11h ago•16 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
170•AlexeyBrin•14h ago•30 comments

FDA intends to take action against non-FDA-approved GLP-1 drugs

https://www.fda.gov/news-events/press-announcements/fda-intends-take-action-against-non-fda-appro...
80•randycupertino•4h ago•146 comments

First Proof

https://arxiv.org/abs/2602.05192
109•samasblack•11h ago•69 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
277•jesperordrup•18h ago•89 comments

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

https://github.com/Momciloo/fun-with-clip-path
61•momciloo•8h ago•11 comments

Show HN: A luma dependent chroma compression algorithm (image compression)

https://www.bitsnbites.eu/a-spatial-domain-variable-block-size-luma-dependent-chroma-compression-...
31•mbitsnbites•3d ago•2 comments

Al Lowe on model trains, funny deaths and working with Disney

https://spillhistorie.no/2026/02/06/interview-with-sierra-veteran-al-lowe/
90•thelok•10h ago•18 comments

The F Word

http://muratbuffalo.blogspot.com/2026/02/friction.html
103•zdw•3d ago•52 comments

Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
557•theblazehen•3d ago•206 comments

Eigen: Building a Workspace

https://reindernijhoff.net/2025/10/eigen-building-a-workspace/
8•todsacerdoti•4d ago•2 comments

Selection rather than prediction

https://voratiq.com/blog/selection-rather-than-prediction/
28•languid-photic•4d ago•8 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
262•1vuio0pswjnm7•15h ago•423 comments

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

https://www.windowscentral.com/microsoft/windows-11/windows-locked-me-out-of-notepad-is-the-thin-...
103•josephcsible•6h ago•125 comments

I write games in C (yes, C) (2016)

https://jonathanwhiting.com/writing/blog/games_in_c/
175•valyala•8h ago•165 comments

Reinforcement Learning from Human Feedback

https://rlhfbook.com/
114•onurkanbkrc•13h ago•5 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
132•speckx•4d ago•208 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
140•videotopia•4d ago•46 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
221•limoce•4d ago•124 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
296•isitcontent•1d ago•39 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
578•todsacerdoti•1d ago•279 comments

72M Points of Interest

https://tech.marksblogg.com/overture-places-pois.html
50•marklit•5d ago•10 comments

Life of an inference request (vLLM V1): How LLMs are served efficiently at scale

https://www.ubicloud.com/blog/life-of-an-inference-request-vllm-v1
175•samaysharma•7mo ago

Comments

0xjunhao•7mo ago
Hi, I'm the author of this post. Writing it was a great learning experience. I gained a lot of insight into vLLM. If you have any feedback or questions, feel free to drop a comment below!
criemen•7mo ago
Thanks for writing the article!

I didn't quite get this part:

Note that during the prefill phase, all prompt tokens from a request can be processed in one batch. This is possible because the query (Q) tensors, calculated from the tokens immediately before them, are available for each prompt token position.

I know that in practice prefill is much faster than inference. Would watching the 2h video from Karpathy help me understand why?

criemen•7mo ago
And on the topic of prefill: do you know what the role of GPUs is in prefill vs. in inference?
animan•7mo ago
Prefill is part of Inference. It's the first major step where you calculate all the keys and values for the input tokens.

Decode is the next major step where you start generating output tokens one at a time.

Both run on GPUs but have slightly different workloads:

1. Prefill does relatively little I/O from VRAM (HBM) and is compute heavy.
2. Decode is light on compute but has to read the keys and values computed in the prefill stage from VRAM for every output token.
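
A rough back-of-envelope of why the two workloads differ (my own illustrative numbers, not figures from the article or this thread):

    # Compare arithmetic intensity of prefill vs. decode for a ~7B-parameter model.
    # All numbers are rules of thumb, not measurements.
    params = 7e9                 # model parameters
    bytes_per_param = 2          # fp16/bf16 weights
    weight_bytes = params * bytes_per_param

    prompt_tokens = 2048         # processed together in one prefill pass
    flops_per_token = 2 * params # ~2 FLOPs per parameter per token (matmul rule of thumb)

    # Prefill: one pass over the weights serves all prompt tokens at once.
    prefill_intensity = (flops_per_token * prompt_tokens) / weight_bytes

    # Decode: one pass over the weights produces a single new token (batch size 1).
    decode_intensity = flops_per_token / weight_bytes

    print(f"prefill FLOPs per weight byte: {prefill_intensity:.0f}")  # ~2048 -> compute-bound
    print(f"decode  FLOPs per weight byte: {decode_intensity:.0f}")   # ~1    -> bandwidth-bound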

dist-epoch•7mo ago
Doesn't decode also need to stream in the whole of the model weights, making it very I/O heavy?
0xjunhao•7mo ago
Yes, decoding is very I/O heavy. It has to stream in the whole of the model weights from HBM for every token decoded. However, that cost can be shared between the requests in the same batch. So if the system has more GPU RAM to hold larger batches, the I/O cost per request can be lowered.
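
A tiny sketch of that amortization argument, with made-up hardware numbers (nothing here is measured):

    # Every decode step streams the full weights from HBM once, regardless of batch size,
    # so the per-request share of that I/O shrinks as the batch grows.
    weight_bytes = 14e9          # e.g. a 7B model in fp16
    hbm_bandwidth = 2e12         # bytes/s, roughly a modern datacenter GPU

    for batch_size in (1, 8, 64):
        step_time_s = weight_bytes / hbm_bandwidth        # time to read the weights once
        per_request_ms = step_time_s / batch_size * 1e3   # ignoring KV-cache traffic
        print(f"batch={batch_size:3d}: ~{per_request_ms:.2f} ms of weight I/O per token per request")
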
animan•7mo ago
That snippet is trying to say that you can calculate KV for all the input tokens at once, and you don't need to loop over them since you have them all available.

For decode, by contrast, you need to generate each token sequentially.
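
A toy illustration of that difference (my own sketch, not vLLM code):

    # Prefill fills the KV cache for the whole prompt in one pass;
    # decode extends it one token at a time.
    import numpy as np

    d = 64                                   # head dimension
    Wk, Wv = np.random.randn(d, d), np.random.randn(d, d)

    def prefill(prompt_embeddings):          # shape: (prompt_len, d)
        # All positions are known up front, so K and V come from one matmul each.
        return prompt_embeddings @ Wk, prompt_embeddings @ Wv

    def decode_step(new_token_embedding, k_cache, v_cache):
        # Only one new position per step; append its K/V to the cache.
        k_cache = np.vstack([k_cache, new_token_embedding @ Wk])
        v_cache = np.vstack([v_cache, new_token_embedding @ Wv])
        return k_cache, v_cache

    k_cache, v_cache = prefill(np.random.randn(128, d))   # 128 prompt tokens at once
    for _ in range(16):                                    # 16 output tokens, one by one
        k_cache, v_cache = decode_step(np.random.randn(1, d), k_cache, v_cache)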

longbeachbass•7mo ago
Thanks for this! Learnt a lot.

Curious how we ensure that the same model instance gets requests from the same client/user, since conversations are stateful and the model needs context from previous turns of the conversation.

Is this happening at the load balancer layer?

cyanf•7mo ago
It's either sticky sessions or a load balancer that keeps track of prior sequences and routes to the instance with the largest match. https://docs.sglang.ai/router/router.html
hhh•7mo ago
They’re not stateful; you submit the entire history with every call. Caching of prompts, etc., makes it important for performance to have sticky sessions or something similar at the load balancer layer.
0xjunhao•7mo ago
Yes, typically users send the newest user message and the full conversation history. These combined become the prompt.

Our API endpoint will try to route requests that have the same prefix to the same vLLM instance (similar to longest prefix matching in networking), and hopefully there are still some KV cache entries for part of the prompt there.
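
A hypothetical sketch of what that routing could look like (the actual endpoint isn't described in the post; the helper names here are made up):

    # Send a request to the instance whose recently served prompts share the longest prefix,
    # so its KV cache is most likely to be reusable.
    def common_prefix_len(a: str, b: str) -> int:
        n = min(len(a), len(b))
        i = 0
        while i < n and a[i] == b[i]:
            i += 1
        return i

    def pick_instance(prompt, instances, recent_prompts):
        # recent_prompts: instance -> list of prompts it served recently
        def best_match(inst):
            return max((common_prefix_len(prompt, p) for p in recent_prompts.get(inst, [])), default=0)
        return max(instances, key=best_match)

    instances = ["vllm-0", "vllm-1"]
    recent = {"vllm-0": ["You are a helpful assistant.\nUser: hi"],
              "vllm-1": ["Translate to French:"]}
    print(pick_instance("You are a helpful assistant.\nUser: hi\nUser: follow-up", instances, recent))
    # -> "vllm-0", whose KV cache likely still holds the shared prefix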

3abiton•7mo ago
Great write up! It would be interesting to see how the features covered here compare to other frameworks.
zackangelo•7mo ago
In your forward pass section you place a lot of emphasis on FlashAttention, but it might be worth mentioning PagedAttention as well (the paper written by the vLLM authors, which I believe was the genesis of the project). PA-style block tables are now supported in most fused attention kernels, but vLLM originally came up with the idea, and it's the main reason vLLM has such high throughput!
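
A minimal sketch of the block-table idea (conceptual only, not vLLM's actual data structures):

    # Logical KV-cache positions of a sequence map to fixed-size physical blocks,
    # so a sequence's cache need not be contiguous in GPU memory.
    BLOCK_SIZE = 16                  # tokens per KV-cache block

    class BlockTable:
        def __init__(self, free_blocks):
            self.free_blocks = free_blocks   # pool of physical block ids
            self.blocks = []                 # logical block index -> physical block id

        def append_token(self, logical_pos):
            if logical_pos % BLOCK_SIZE == 0:          # crossed into a new logical block
                self.blocks.append(self.free_blocks.pop())
            block_id = self.blocks[logical_pos // BLOCK_SIZE]
            return block_id, logical_pos % BLOCK_SIZE  # where this token's K/V lives

    table = BlockTable(free_blocks=list(range(1000)))
    for pos in range(40):                              # a 40-token sequence spans 3 blocks
        block_id, offset = table.append_token(pos)
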
0xjunhao•7mo ago
Thank you! We have incorporated your suggestion.
mhlakhani•7mo ago
Thanks for writing this up! I learnt a bunch from it. I noticed this didn't discuss additional layers of caching - I can see how it would fit in, but is prompt caching out of scope for this system?
gdiamos•7mo ago
Great write up. We use vLLM's KV cache and continuous batching as a foundation for requests in ScalarLM, and we add further batching optimizations in a centralized queue and through explicit batching support in our client.

https://www.scalarlm.com

There is more perf you can squeeze out of vLLM.

r0b05•7mo ago
Great write up!

Does batching add data from multiple requests into the same context, potentially worsening perplexity? If so, are we trading off perplexity for lower operating costs?

ethan_smith•7mo ago
Batching in vLLM doesn't combine prompts into the same context - it processes separate requests in parallel while sharing compute resources, so there's no perplexity tradeoff, just efficiency gains.
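
Roughly how this looks with vLLM's offline API (paraphrasing the quickstart from memory; check the vLLM docs for the exact current interface):

    from vllm import LLM, SamplingParams

    prompts = [
        "The capital of France is",
        "Write a haiku about GPUs:",
    ]
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, max_tokens=32)

    # The prompts are batched for throughput, but each one is an independent
    # sequence with its own KV cache; they never share a context window.
    for out in llm.generate(prompts, params):
        print(out.prompt, "->", out.outputs[0].text)
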
zettabomb•7mo ago
It's worth noting that the reason this works is that basically every LLM architecture currently in use is severely limited by memory bandwidth, not compute. So it's trivial to run several requests at a time while waiting for the next weights to arrive from VRAM.
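
A crude roofline-style bound that shows the gap (illustrative numbers, not benchmarks):

    # At batch size 1, decode speed is capped by how fast the weights can be
    # streamed from VRAM, far below what the compute units could sustain.
    weight_bytes = 14e9        # ~7B params in fp16
    hbm_bandwidth = 2e12       # bytes/s, rough figure for a modern datacenter GPU
    flops_per_token = 2 * 7e9  # ~2 FLOPs per parameter per token
    gpu_flops = 1e15           # dense fp16 compute, rough figure

    bandwidth_limit = hbm_bandwidth / weight_bytes   # tokens/s if only I/O mattered
    compute_limit = gpu_flops / flops_per_token      # tokens/s if only compute mattered
    print(f"bandwidth-bound: ~{bandwidth_limit:.0f} tok/s, compute-bound: ~{compute_limit:.0f} tok/s")
    # The bandwidth bound (~143 tok/s here) sits far below the compute bound,
    # which is why the idle compute can be spent on other requests in the batch.
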
StochasticLi•7mo ago
I would like to know what inference speeds they are achieving exactly on what hardware. I skimmed and searched the article and didn't find that info.
geoffbp•7mo ago
Thanks, good read!