frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

Ask HN: How the same LLM "instance" serve multiple clients?

1•BiraIgnacio•9mo ago
I've been playing with running LLMs locally and only then realized I have no idea how to scale it (I don't really know how LLMs work internally).

I'm assuming context is everything but if the same LLM process can serve multiple clients, aren't there risks of mixing contexts? Does anyone have any ideas?

Comments

sherdil2022•9mo ago
Let me ChatGPT for you:

Good question. Let’s break it down carefully.

When you hear about a single LLM instance serving multiple clients at the same time, it usually works like this: • The LLM instance is stateless: Each client sends a request (prompt + settings), the model processes that one request independently, and returns the response. The LLM doesn’t “remember” between requests unless you explicitly include conversation history in the prompt. • Concurrency is handled by infrastructure: Even though the LLM is “one model,” it can handle many incoming requests because the backend (server) wraps the model with techniques like: • Asynchronous request handling (e.g., using async/await patterns) • Batching: multiple prompts are packed together into a single forward pass through the model (very common in high-traffic servers) • Parallelism: the server could have multiple workers/replicas of the model (copies or shared GPUs) running side-by-side. • Queueing: if too many clients at once, requests are queued and processed in order. • Memory isolation: Each request is kept separate in memory. No client’s data leaks into another client’s conversation unless you (the app developer) accidentally introduce a bug.

So:

It’s not that one model is “locked” into serving only one person at a time. It’s more like the model is a very fast function being called many times in parallel.

⸻

Study confirms experience beats youthful enthusiasm

https://www.theregister.com/2026/02/07/boomers_vs_zoomers_workplace/
1•Willingham•1m ago•0 comments

The Big Hunger by Walter J Miller, Jr. (1952)

https://lauriepenny.substack.com/p/the-big-hunger
1•shervinafshar•2m ago•0 comments

The Genus Amanita

https://www.mushroomexpert.com/amanita.html
1•rolph•7m ago•0 comments

We have broken SHA-1 in practice

https://shattered.io/
1•mooreds•8m ago•1 comments

Ask HN: Was my first management job bad, or is this what management is like?

1•Buttons840•9m ago•0 comments

Ask HN: How to Reduce Time Spent Crimping?

1•pinkmuffinere•10m ago•0 comments

KV Cache Transform Coding for Compact Storage in LLM Inference

https://arxiv.org/abs/2511.01815
1•walterbell•15m ago•0 comments

A quantitative, multimodal wearable bioelectronic device for stress assessment

https://www.nature.com/articles/s41467-025-67747-9
1•PaulHoule•17m ago•0 comments

Why Big Tech Is Throwing Cash into India in Quest for AI Supremacy

https://www.wsj.com/world/india/why-big-tech-is-throwing-cash-into-india-in-quest-for-ai-supremac...
1•saikatsg•17m ago•0 comments

How to shoot yourself in the foot – 2026 edition

https://github.com/aweussom/HowToShootYourselfInTheFoot
1•aweussom•17m ago•0 comments

Eight More Months of Agents

https://crawshaw.io/blog/eight-more-months-of-agents
3•archb•19m ago•0 comments

From Human Thought to Machine Coordination

https://www.psychologytoday.com/us/blog/the-digital-self/202602/from-human-thought-to-machine-coo...
1•walterbell•19m ago•0 comments

The new X API pricing must be a joke

https://developer.x.com/
1•danver0•20m ago•0 comments

Show HN: RMA Dashboard fast SAST results for monorepos (SARIF and triage)

https://rma-dashboard.bukhari-kibuka7.workers.dev/
1•bumahkib7•21m ago•0 comments

Show HN: Source code graphRAG for Java/Kotlin development based on jQAssistant

https://github.com/2015xli/jqassistant-graph-rag
1•artigent•26m ago•0 comments

Python Only Has One Real Competitor

https://mccue.dev/pages/2-6-26-python-competitor
3•dragandj•27m ago•0 comments

Tmux to Zellij (and Back)

https://www.mauriciopoppe.com/notes/tmux-to-zellij/
1•maurizzzio•28m ago•1 comments

Ask HN: How are you using specialized agents to accelerate your work?

1•otterley•29m ago•0 comments

Passing user_id through 6 services? OTel Baggage fixes this

https://signoz.io/blog/otel-baggage/
1•pranay01•30m ago•0 comments

DavMail Pop/IMAP/SMTP/Caldav/Carddav/LDAP Exchange Gateway

https://davmail.sourceforge.net/
1•todsacerdoti•31m ago•0 comments

Visual data modelling in the browser (open source)

https://github.com/sqlmodel/sqlmodel
1•Sean766•33m ago•0 comments

Show HN: Tharos – CLI to find and autofix security bugs using local LLMs

https://github.com/chinonsochikelue/tharos
1•fluantix•33m ago•0 comments

Oddly Simple GUI Programs

https://simonsafar.com/2024/win32_lights/
1•MaximilianEmel•33m ago•0 comments

The New Playbook for Leaders [pdf]

https://www.ibli.com/IBLI%20OnePagers%20The%20Plays%20Summarized.pdf
1•mooreds•34m ago•1 comments

Interactive Unboxing of J Dilla's Donuts

https://donuts20.vercel.app
1•sngahane•35m ago•0 comments

OneCourt helps blind and low-vision fans to track Super Bowl live

https://www.dezeen.com/2026/02/06/onecourt-tactile-device-super-bowl-blind-low-vision-fans/
1•gaws•37m ago•0 comments

Rudolf Vrba

https://en.wikipedia.org/wiki/Rudolf_Vrba
1•mooreds•37m ago•0 comments

Autism Incidence in Girls and Boys May Be Nearly Equal, Study Suggests

https://www.medpagetoday.com/neurology/autism/119747
1•paulpauper•38m ago•0 comments

Wellness Hotels Discovery Application

https://aurio.place/
1•cherrylinedev•39m ago•1 comments

NASA delays moon rocket launch by a month after fuel leaks during test

https://www.theguardian.com/science/2026/feb/03/nasa-delays-moon-rocket-launch-month-fuel-leaks-a...
1•mooreds•40m ago•0 comments