news newest ask show jobs

Made with ♥ by @iamnishanth

Open Source @Github

Open in hackernews

Life of an inference request (vLLM V1): How LLMs are served efficiently at scale

https://www.ubicloud.com/blog/life-of-an-inference-request-vllm-v1

73•samaysharma•5h ago

Comments

0xjunhao•2h ago

Hi, I'm the author of this post. Writing it was a great learning experience. I gained a lot of insight into vLLM. If you have any feedback or questions, feel free to drop a comment below!

criemen•2h ago

Thanks for writing the article!

I didn't quite get

Note that during the prefill phase, all prompt tokens from a request can be processed in one batch. This is possible because the query (Q) tensors, calculated from the tokens immediately before them, are available for each prompt token position.

I know that in practice prefill is much faster than inference. Would watching the 2h video from Karpathy help me understand why?

criemen•2h ago

And on the topic of prefill: Do you know what the role of GPUs is vs. in inference?

animan•9m ago

Prefill is part of Inference. It's the first major step where you calculate all the keys and values for the input tokens.

Decode is the next major step where you start generating output tokens one at a time.

Both run on GPUs but have slightly different workloads

1. Prefill has very little I/o from VRAM to HBM and more compute 2. Decode is light on compute but have to I/o the keys and values computed in the prefill stage for every output token

animan•8m ago

That snippet is trying to say that you can calculate KV for all the input tokens at once, and you don't need to loop over them since you have them all available.

Instead for decode, you need to sequentially generate each token.

JavaScript Trademark Update

https://deno.com/blog/deno-v-oracle4

425•thebeardisred•5h ago•143 comments

MCP: An (Accidentally) Universal Plugin System

https://worksonmymachine.substack.com/p/mcp-an-accidentally-universal-plugin

467•Stwerner•9h ago•207 comments

Refurb weekend: Gremlin Blasto arcade board

http://oldvcr.blogspot.com/2025/06/refurb-weekend-gremlin-blasto-arcade.html

5•todsacerdoti•17m ago•0 comments

US Defense Department will stop providing satellite weather data

https://text.npr.org/nx-s1-5446120

39•drewr•26m ago•3 comments

BusyBeaver(6) Is Quite Large

https://scottaaronson.blog/?p=8972

168•bdr•7h ago•119 comments

Life of an inference request (vLLM V1): How LLMs are served efficiently at scale

https://www.ubicloud.com/blog/life-of-an-inference-request-vllm-v1

73•samaysharma•5h ago•6 comments

Solving `UK Passport Application` with Haskell

https://jameshaydon.github.io/passport/

12•jameshh•1h ago•2 comments

2025 ARRL Field Day

https://www.arrl.org/field-day

66•rookderby•5h ago•22 comments

Community Is Motivation on Tap

https://alanwu.xyz/posts/community/

9•lunw•3d ago•0 comments

We ran a Unix-like OS Xv6 on our home-built CPU with a home-built C compiler (2020)

https://fuel.edby.coffee/posts/how-we-ported-xv6-os-to-a-home-built-cpu-with-a-home-built-c-compiler/

218•AlexeyBrin•12h ago•18 comments

Addictions Are Being Engineered

https://masonyarbrough.substack.com/p/engineered-addictions

340•echollama•9h ago•218 comments

Show HN: Vet – A tool for safely running remote shell scripts

https://getvet.sh

38•a10r•4h ago•8 comments

Unheard works by Erik Satie to premiere 100 years after his death

https://www.theguardian.com/music/2025/jun/26/unheard-works-by-erik-satie-to-premiere-100-years-after-his-death

176•gripewater•14h ago•44 comments

Memory Safe Languages: Reducing Vulnerabilities in Modern Software Development [pdf]

https://media.defense.gov/2025/Jun/23/2003742198/-1/-1/0/CSI_MEMORY_SAFE_LANGUAGES_REDUCING_VULNERABILITIES_IN_MODERN_SOFTWARE_DEVELOPMENT.PDF

41•todsacerdoti•6h ago•3 comments

Show HN: AGL a toy language that compiles to Go

https://github.com/alaingilbert/agl

30•alain_gilbert•3d ago•8 comments

The Great Illusion: When We Believed BeOS Would Save the World

https://www.desktoponfire.com/haiku_inc/782/the-great-illusion-when-we-believed-beos-would-save-the-world-and-maybe-it-was-right/

26•naves•4h ago•28 comments

Show HN: I'm an airline pilot – I built interactive graphs/globes of my flights

https://jameshard.ing/pilot

1412•jamesharding•1d ago•189 comments

Sirius: A GPU-native SQL engine

https://github.com/sirius-db/sirius

74•qianli_cs•10h ago•8 comments

Parsing JSON in Forty Lines of Awk

https://akr.am/blog/posts/parsing-json-in-forty-lines-of-awk

73•thefilmore•8h ago•32 comments

A literary magazine accessible only via telnet

9•edent•3d ago•6 comments

NovaCustom – Framework Laptop alternative focusing on privacy

https://novacustom.com/

27•CHEF-KOCH•6h ago•36 comments

Finding Peter Putnam

https://nautil.us/finding-peter-putnam-1218035/

63•dnetesn•13h ago•59 comments

An Indoor Beehive in My Bedroom Wall

https://www.keepingbackyardbees.com/an-indoor-beehive-zbwz1810zsau/

30•gscott•7h ago•4 comments

The Book Cover Trend of Text on Old Paintings

https://www.nytimes.com/2025/06/21/books/review/book-cover-trends.html

10•zdw•3d ago•5 comments

The Death of the Middle-Class Musician

https://thewalrus.ca/the-death-of-the-middle-class-musician/

25•pseudolus•2h ago•16 comments

ZeQLplus: Terminal SQLite Database Browser

https://github.com/ZetloStudio/ZeQLplus

51•amadeuspagel•11h ago•11 comments

Why the moon shimmers with shiny glass beads

https://phys.org/news/2025-06-moon-shimmers-shiny-glass-beads.html

13•PaulHoule•4d ago•2 comments

Lago (Open-Source Usage Based Billing) is hiring for ten roles

https://www.ycombinator.com/companies/lago/jobs

1•AnhTho_FR•12h ago

Evaluating Long-Context Question and Answer Systems

https://eugeneyan.com/writing/qa-evals/

11•swyx•3d ago•0 comments

IDF officers ordered to fire at unarmed crowds near Gaza food distribution sites

https://www.haaretz.com/israel-news/2025-06-27/ty-article-magazine/.premium/idf-soldiers-ordered-to-shoot-deliberately-at-unarmed-gazans-waiting-for-humanitarian-aid/00000197-ad8e-de01-a39f-ffbe33780000

1047•ahmetcadirci25•16h ago•766 comments