frontpage.

Show HN: MCP to get latest dependency package and tool versions

https://github.com/MShekow/package-version-check-mcp
1•mshekow•3m ago•0 comments

The better you get at something, the harder it becomes to do

https://seekingtrust.substack.com/p/improving-at-writing-made-me-almost
2•FinnLobsien•4m ago•0 comments

Show HN: WP Float – Archive WordPress blogs to free static hosting

https://wpfloat.netlify.app/
1•zizoulegrande•6m ago•0 comments

Show HN: I Hacked My Family's Meal Planning with an App

https://mealjar.app
1•melvinzammit•6m ago•0 comments

Sony BMG copy protection rootkit scandal

https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootkit_scandal
1•basilikum•9m ago•0 comments

The Future of Systems

https://novlabs.ai/mission/
2•tekbog•9m ago•1 comments

NASA now allowing astronauts to bring their smartphones on space missions

https://twitter.com/NASAAdmin/status/2019259382962307393
2•gbugniot•14m ago•0 comments

Claude Code Is the Inflection Point

https://newsletter.semianalysis.com/p/claude-code-is-the-inflection-point
3•throwaw12•15m ago•1 comments

Show HN: MicroClaw – Agentic AI Assistant for Telegram, Built in Rust

https://github.com/microclaw/microclaw
1•everettjf•16m ago•2 comments

Show HN: Omni-BLAS – 4x faster matrix multiplication via Monte Carlo sampling

https://github.com/AleatorAI/OMNI-BLAS
1•LowSpecEng•16m ago•1 comments

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

https://codemanship.wordpress.com/2026/01/05/the-ai-ready-software-developer-conclusion-same-game...
1•lifeisstillgood•18m ago•0 comments

AI Agent Automates Google Stock Analysis from Financial Reports

https://pardusai.org/view/54c6646b9e273bbe103b76256a91a7f30da624062a8a6eeb16febfe403efd078
1•JasonHEIN•22m ago•0 comments

Voxtral Realtime 4B Pure C Implementation

https://github.com/antirez/voxtral.c
2•andreabat•24m ago•1 comments

I Was Trapped in Chinese Mafia Crypto Slavery [video]

https://www.youtube.com/watch?v=zOcNaWmmn0A
2•mgh2•30m ago•0 comments

U.S. CBP Reported Employee Arrests (FY2020 – FYTD)

https://www.cbp.gov/newsroom/stats/reported-employee-arrests
1•ludicrousdispla•32m ago•0 comments

Show HN: I built a free UCP checker – see if AI agents can find your store

https://ucphub.ai/ucp-store-check/
2•vladeta•37m ago•1 comments

Show HN: SVGV – A Real-Time Vector Video Format for Budget Hardware

https://github.com/thealidev/VectorVision-SVGV
1•thealidev•39m ago•0 comments

Study of 150 developers shows AI generated code no harder to maintain long term

https://www.youtube.com/watch?v=b9EbCb5A408
1•lifeisstillgood•39m ago•0 comments

Spotify now requires premium accounts for developer mode API access

https://www.neowin.net/news/spotify-now-requires-premium-accounts-for-developer-mode-api-access/
1•bundie•42m ago•0 comments

When Albert Einstein Moved to Princeton

https://twitter.com/Math_files/status/2020017485815456224
1•keepamovin•43m ago•0 comments

Agents.md as a Dark Signal

https://joshmock.com/post/2026-agents-md-as-a-dark-signal/
2•birdculture•45m ago•0 comments

System time, clocks, and their syncing in macOS

https://eclecticlight.co/2025/05/21/system-time-clocks-and-their-syncing-in-macos/
1•fanf2•46m ago•0 comments

McCLIM and 7GUIs – Part 1: The Counter

https://turtleware.eu/posts/McCLIM-and-7GUIs---Part-1-The-Counter.html
2•ramenbytes•49m ago•0 comments

So whats the next word, then? Almost-no-math intro to transformer models

https://matthias-kainer.de/blog/posts/so-whats-the-next-word-then-/
1•oesimania•50m ago•0 comments

Ed Zitron: The Hater's Guide to Microsoft

https://bsky.app/profile/edzitron.com/post/3me7ibeym2c2n
2•vintagedave•53m ago•1 comments

UK infants ill after drinking contaminated baby formula of Nestle and Danone

https://www.bbc.com/news/articles/c931rxnwn3lo
1•__natty__•54m ago•0 comments

Show HN: Android-based audio player for seniors – Homer Audio Player

https://homeraudioplayer.app
3•cinusek•54m ago•2 comments

Starter Template for Ory Kratos

https://github.com/Samuelk0nrad/docker-ory
1•samuel_0xK•56m ago•0 comments

LLMs are powerful, but enterprises are deterministic by nature

2•prateekdalal•59m ago•0 comments

Make your iPad 3 a touchscreen for your computer

https://github.com/lemonjesus/ipad-touch-screen
2•0y•1h ago•1 comments

Lossless LLM 3x Throughput Increase by LMCache

https://github.com/LMCache/LMCache
154•lihanc111•7mo ago

Comments

lihanc111•7mo ago
Our team built this open-source project, LMCache, to reduce repetitive computation in LLM inference so that systems can serve more people (3x more throughput in chat applications). It has been used in IBM's open-source LLM inference stack.

In LLM serving, the input is computed into intermediate states called the KV cache, which is then used to generate answers. This data is relatively large (~1-2 GB for a long context) and is often evicted when GPU memory runs out. When a user then asks a follow-up question, the server has to recompute the same KV cache. LMCache is designed to combat that by efficiently offloading these KV caches to DRAM and disk and loading them back when needed.
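
Conceptually (this is only an illustrative sketch, not our actual API), you can picture the offloading layer as a tiered store keyed by a hash of the token prefix, spilling entries from DRAM to disk instead of dropping them:

    # Illustrative sketch only -- not LMCache's real implementation or API.
    # KV caches are keyed by a hash of the token prefix; when the in-memory
    # tier fills up, entries spill to disk instead of being recomputed later.
    import hashlib, pickle, pathlib

    class TieredKVCacheStore:
        def __init__(self, spill_dir="/tmp/kvcache", dram_limit=8):
            self.dram = {}                       # prefix hash -> KV tensors
            self.dram_limit = dram_limit
            self.spill_dir = pathlib.Path(spill_dir)
            self.spill_dir.mkdir(parents=True, exist_ok=True)

        @staticmethod
        def _key(token_ids):
            return hashlib.sha256(str(token_ids).encode()).hexdigest()

        def put(self, token_ids, kv):
            if len(self.dram) >= self.dram_limit:
                # spill the oldest DRAM entry to disk (simple FIFO policy)
                old_key = next(iter(self.dram))
                (self.spill_dir / old_key).write_bytes(pickle.dumps(self.dram.pop(old_key)))
            self.dram[self._key(token_ids)] = kv

        def get(self, token_ids):
            key = self._key(token_ids)
            if key in self.dram:                 # DRAM hit
                return self.dram[key]
            path = self.spill_dir / key
            if path.exists():                    # disk hit
                return pickle.loads(path.read_bytes())
            return None                          # miss -> do a full prefill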

Ask us anything!

dist-epoch•7mo ago
How is it possible to do non-prefix KV cache? I was under the impression that the V for one token potentially depends on the V of all previous ones.
da-x•7mo ago
Yes, there's KV cache 'blending'; see [1].

Future versions of LMCache are aiming to support this.

[1] CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion - https://arxiv.org/abs/2405.16444

pama•7mo ago
Is your aim targeting inference at scale, or specialized/new/simpler inference pipelines? SGLang and vLLM have disaggregated prefill and decoding serving (e.g. https://docs.vllm.ai/examples/online_serving/disaggregated_s... or https://github.com/sgl-project/sglang/issues/3554 and https://github.com/sgl-project/sglang/issues/4655) — could your solution enable a model-agnostic cache store/server, or is that orthogonal to what you are trying to achieve?
nativeit•7mo ago
Has it been used in IBM's inference stack, or used with IBM's inference stack? In other words, has this been merged into IBM's own repositories, or has someone just tested it using them?
lihanc111•7mo ago
It is in IBM's llm-d open-source stack.
behnamoh•7mo ago
> Our team

So this is something that might in the future turn into a commercial product? Something like LangChain and thousands of open source projects that started as "open source" but then ended up implementing proprietary features for a cost.

Tokumei-no-hito•7mo ago
I don't see anything wrong with that approach, do you?
behnamoh•7mo ago
Give it time and you'll come to my conclusion.
0xjunhao•7mo ago
Hi, I had a quick question. Would it be correct to say the following?

1. For long inputs and short outputs, inference can be an arbitrary number of times faster, as it avoids repeated KV computation.

2. Conversely, for short inputs and long outputs, it might be slightly slower, since loading and storing the KV cache are on the critical path of the execution.

lihanc111•7mo ago
That is almost true for both. Although in the second case, you can just skip storing the cache when there is little improvement.
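
A back-of-envelope example with made-up numbers for the first case:

    # Made-up numbers, just to illustrate the first case: a full KV-cache hit
    # removes the prefill from the critical path entirely.
    prefill_s, decode_s = 2.0, 1.0                 # e.g. prefill a 16k-token context vs. generate ~100 tokens
    speedup = (prefill_s + decode_s) / decode_s    # request time without cache / with cache
    print(f"{speedup:.1f}x")                       # 3.0x, and it grows as inputs get longer and outputs shorter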
iLoveOncall•7mo ago
Is this any different than prompt caching?
smcleod•7mo ago
Have you considered integrating it with the likes of llama.cpp?
m3kw9•7mo ago
How would it work if a user wants to do 1 of n tries?
kcorbitt•7mo ago
Looks cool! With vLLM v1, prefix caching is enabled by default and seems quite performant. Is the advantage of LMCache the fact that you can offload to CPU and disk as well? How much is throughput/latency affected if you need to pull a large KV cache from disk/CPU instead of GPU RAM?

Also, how realistic would it be to share the KV cache across vLLM nodes within a data center? It would be really nice to be able to freely distribute requests to a pool of vLLM workers without worrying about prefix-aware routing, but maybe that isn't the right approach because moving the KV cache around would be too slow?

guywhocodes•7mo ago
This is exactly what llm-d is
ekianjo•7mo ago
Wasn't this already implemented in llama.cpp?
sgammon•7mo ago
Hey LMCache team! Saw you guys at OSS N.A. but wasn’t able to set aside time to say hello. We’d love to chat about collaborating. Is there an email we can reach out to?
lihanc111•7mo ago
Please send to contact@lmcache.ai
refulgentis•7mo ago
Word to the wise:

"Lossless 3x Throughput Increase" == "Cache all inputs and output across everyone, in RAM and on disk, and if you assume the next request is covered by cache, its 3x faster!"

I'm more surprised it's only advertised as 3x under those conditions: my llama.cpp wrapper does the same -- caching in RAM while running locally seems fine to me -- and when input is cached, TTFT is ~instantaneous, modulo any add'l prompt you add.

I suppose it creates a little more distance, in that, instead of infinity times faster for latency, we measure throughput, and then our speedup can be adjusted as desired by adjusting output length, and thus we can pick a more reasonable-sounding metric like 3x. (Though the GitHub README still frames it in terms of latency / TTFT.)

varispeed•7mo ago
Sometimes I think the entire engineering profession collectively underwent a lobotomy. Techniques like caching partial computation results to avoid repeating expensive work were so basic a few decades ago that no one would have bothered to dignify them with a paper, let alone brand them with a fancy acronym and announce them like the second coming of Turing. Now we get breathless blog posts and community calls over the mind-blowing discovery that storing KV caches of repeated text speeds things up. Next we'll get a paper on using hash tables to look things up faster. Meanwhile, actual difficult problems in large-scale distributed inference and model interpretability get hand-waved so we can posture about reinventing memoisation. Tech never fails to take the obvious, put a bow on it, and sell it back to us as groundbreaking.
vlovich123•7mo ago
Partial caching as a concept doesn't matter. The hard part is figuring out how to make it work for cross-attention, which sets up a data dependency for every entry on every preceding entry. So prefix caching of the KV cache is brain-dead easy. Computing a KV cache for random bits of text and then combining unrelated text in a way that makes the LLM still work coherently and correctly? That, to me, seems much harder.

It seems to me like you’re easily hand waving away a hard problem in a different part of the stack you’re less familiar with.

varispeed•7mo ago
Let's be honest: it's fundamentally about analysing memory access patterns, spotting reuse opportunities, and orchestrating data flows. That's classic systems engineering. Useful, yes. Rocket science, no. The real joke is how the profession has sunk so low that anything beyond a trivial for-loop becomes grounds for whitepapers, corporate branding, and breathless conference talks. In the past, we'd have quietly shipped this and moved on. Frankly, I'm surprised they haven't patented it yet.
vlovich123•7mo ago
Caching and reuse broadly, yes. Getting cross-attention to work mathematically correctly by stitching together the precomputed KV caches for snippets of text is not that, unless you've redefined what classical systems engineering is.

Again, the novelty is in getting cross-attention to work correctly despite the fact that you're stitching together arbitrary caches. It's akin to taking snippets of compressed portions of random compressed files and reconstructing a new, correct plain text. That's obviously not possible, but clearly this has been accomplished with the KV cache for arbitrary models (i.e. not trained for it), despite the KV cache working like decompression, where all the preceding bytes have to be computed correctly for the subsequent token to be correct.

varispeed•7mo ago
I get the argument, but let's be blunt: every serious cache system deals with consistency, partial reuse, and correctness. That’s standard engineering - regardless of how much intimidating jargon you layer over it. Useful, sure. But watching the industry throw a circus around basic cache management, complete with papers and corporate branding, is exactly why so much of modern tech feels like a hype-driven clown show rather than a disciplined craft.
bGl2YW5j•7mo ago
I’m with you. It’s a bit shocking.
vlovich123•7mo ago
I really don't understand what you're saying. This isn't about consistency of the data. If you don't figure out a mathematically valid way to combine the precomputed values of snippets of text, then the LLM just doesn't work properly. Prefix cache management, which is just normal systems engineering, is not all this is doing. Stitching together cache fragments such that the LLM is actually still reasoning correctly about the text is hard. Have you read the paper?
imtringued•7mo ago
You're bragging about the easy parts and rolling your eyes at the hard parts.

Meanwhile the AI engineers are doing the exact opposite. Bragging about the hard parts and rolling their eyes at the easy parts.

notjoemama•7mo ago
I've noticed this too. I wonder if it is a difference in experience levels. It feels odd seeing excitement at rediscovering a (what you and I think of as) well-known solution. To be fair, I was that kid at one time too. Still, it feels a bit like these simpler things ought to be taught at university so new grads can focus more on solving domain problems.

I suppose, combine this with pressure from public or private investment, and the way to get ahead is to package anything into a prospect of revenue generation. I'm sure that's part of it too. Everything has to monetize because some business school graduate hasn't "made it" until they have a yacht like their ivy league friends.

Eh, probably comes across as curmudgeonly or "who moved my cheese". But if there is an area that can improve this longstanding problem in tech, my guess is teaching the right skills and concepts at the collegiate level. And that's not a simple thing either.

Edit > Reading a bit more, this focuses on chat applications and seems to be a decent caching implementation tailored to that domain, which I'm guessing will allow AT&T and Verizon to save money on their gobsmackingly horrible AI chat bots in their mobile apps. As an individual, it's unclear how this benefits me, though. I don't think it does. ME: asks chat bot a question about insurance coverage. CHATBOT: immediately serves a canned response in no time about how that's covered in my individual insurance plan, which I can read more about on their website (pro-tip: no, I can't, those details are actually never on the website)

nativeit•7mo ago
It seems odd to me that so many of these projects are being launched by people who have only just discovered and/or joined HN. I'm worried this is just becoming LinkedIn for AI opportunists.
parpfish•7mo ago
I've got a side project that I may (someday) do a Show HN with. However, I'd probably make a new account for that, because the project is connected to my real name/portfolio and I don't want that connected with my pseudonymous comments here.
nativeit•7mo ago
I considered that, but then why would anyone obfuscate this really very reasonable scenario by choosing another ostensibly pseudonymous username?
fsmv•7mo ago
[deleted]
parpfish•7mo ago
I imagine that this is a common problem and it could be another cool “unlockable” on HN, like the downvotes at 500 karma.

Once you get X karma or account age > Y years, you can make one anonymous submission each quarter that comes from a non-user but still gets some sort of "verified" badge that proves it comes from a legit user.

refulgentis•7mo ago
You nailed it IMHO.

I quit my job at Google 2 years ago to do LLM stuff, was looking forward to having HN around, but discussions re: LLMs here are a minefield.

Why?

Everyone knows at least a little, and everyone has a strong opinion on it given its impact. People sharing stuff sell it way high, and as with any new thing where people are selling, there are a lot of skeptics. Then throw in the human bias towards disliking what seems like snark / complaining, and stuff with substance gets downvotes.

The signal-to-noise ratio is continually decreasing.

Let's dig into why this one is weird:

My work does inference using either a 3P provider, which does the caching, or llama.cpp, in which I do the caching. (Basically, picture it as a super expensive step that you can skip by keeping a Map<input string, GPU state>.)

So I log into HN and see this and say to myself: 3x throughput increase?! This is either really clever or salesmanship; no way an optimization like that has been sitting around on the ground.

So I read the GitHub, see it's just "write everyone's inputs and outputs to disk; you can then use them to cobble together what the GPU state would be for an incoming request!", and write a mostly-polite comment below flagging "hey, this means writing everything to disk".

Then I start replying to you... but then I throw away the comment, because I'm inviting drive-by downvotes. I.e. the minefield described up top: if you look like you're being mean, you'll eat downvotes, especially on a weekend.

And to your average reader, maybe I just don't understand vLLM and am taking it out on good hackers just pushing code.

Then, when I go back, I immediately see a comment from someone who does use vLLM noting it already does caching.

Sigh.

nativeit•7mo ago
Thanks for sharing. You certainly aren't alone in your sentiments. I am seeing similar trends in arXiv submissions, as it seems it has become something of a means to inflate the value of one's own product(s) with a veneer of academic rigor. There seems to be an S.O.P. emerging for AI tools that follows many of the same trends as the less-than-reputable blockchain/crypto projects.
Twirrim•7mo ago
> I am seeing similar trends in arXiv submissions, as it seems it has become something of a means to inflate the value of one's own product(s) with a veneer of academic rigor

Unfortunately this isn't new. Almost as long as people have been publishing papers, people have been using them this way. arXiv, arguably, makes it even worse because the papers haven't even gone through the pretense of peer review, which does serve to filter out at least some of them.

nativeit•7mo ago
Very true. The strategy just preys on a long-established logical fallacy: for as long as humans have been vulnerable to appeals to authority, we will continue to fall for attempts at "research washing". I know I am subconsciously influenced by the aesthetic of a research paper, regardless of its source or status.
pama•7mo ago
I had related questions and checked out the project a bit deeper, though I haven't tested it seriously yet. The project started over a year ago based on relevant papers, before vLLM or SGLang had decent solutions; it might still add performance in some workflows, though I haven't tested it, and some of the published measurements in the project are now stale. Caching the LLM KV cache to disk or external memory servers can be very helpful at scale. Cache management and figuring out cache invalidation are hard anyway, and I am not sure at what level a tight integration with inference servers or specialized inference pipelines helps versus a loose coupling that could advance each component separately. It would be nice if there were decent protocols used by all inference engines to help this decoupling.
hardwaresofton•7mo ago
> Then I start replying to you... but then I throw away the comment, because I'm inviting drive-by downvotes. I.e. the minefield described up top: if you look like you're being mean, you'll eat downvotes, especially on a weekend.

Don't self-censor for this reason -- "downvotes aren't real" in that they don't actually matter. Being afraid of getting downvoted is a silly way to live, and I also fall into the trap but try to avoid it.

If you're worried about coming off as mean, it's probably worth rephrasing!

nativeit•7mo ago
I'll just be unambiguous about this:

> Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.

https://news.ycombinator.com/newsguidelines.html

Aurornis•7mo ago
A couple months ago another project claimed to have sped up llama.cpp (IIRC) on the front page of HN, from another green name account.

It gathered hundreds of GitHub stars and was on the front page all day. When some of us finally had time to look at the code, we discovered they hadn't invented anything new at all. They took some existing command-line options for llama.cpp and changed the wording slightly to make them appear novel.

The strangest part was that everyone who pointed it out was downvoted at first. The first comment to catch it was even flagged away! You couldn't see it unless you had showdead turned on.

At first glance I don't see this repo as being in the same category, though the "3X throughput increase" claim very clearly depends on the level of caching for subsequent responses, and the "lossless" claim doesn't hold up, as analyzed by another top-level comment.

I think AI self-promoters have realized how easy it is to game Hacker News and GitHub stars if you use the right wording. You can make some big claims that are hard to examine in the quick turnaround times of a Hacker News front page cycle.

bGl2YW5j•7mo ago
Same. Maintain skepticism.
cchance•7mo ago
I mean a lot of people don't comment on HN, and just use it as a site for cool links lol, so you wouldn't see them posting often
wg0•7mo ago
Seems like snake oil to me. I mean, it lacks a clear explanation of how exactly it works, if at all.
ahmedhawas123•7mo ago
Like this a lot, and thanks for making it open source. Does this support Ollama today? I only saw vLLM.
jbentley1•7mo ago
Is this the same as the prompt caching that other APIs (Anthropic, OpenAI, etc.) have had, just open source and for vLLM?
alyxya•7mo ago
I skimmed over a couple of the papers referenced to get an idea of what optimizations LMCache is doing.

* KV cache compression - compressing the bytes of the KV cache, taking advantage of patterns in the KV cache and with dynamic levels of compression

* KV cache blending - concatenating the KV caches of multiple reused prompts with minimal KV cache recomputation for use cases like RAG, where it's more performant than the standard lossless KV cache prefix optimization, and gives better results than naively concatenating the KV caches for the reused prompts

These optimizations are pretty cool and different from the standard KV cache optimizations. The title saying "lossless" seems misleading, though.

tucnak•7mo ago
"Blending," or translating arbitrary substrings to prefixes, is a real curious one, & likely become a prerequisite for running dataset-scale LLM inferences at scale.

See https://arxiv.org/abs/2405.16444v3

> To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when the context is reused as the prefix of another LLM input. However, the reused text chunks are not always the input prefix, which makes precomputed KV caches not directly usable since they ignore the text’s cross-attention with the preceding texts. Thus, the benefits of reusing KV caches remain largely unrealized.

> This paper tackles just one challenge: when an LLM input contains multiple text chunks, how to quickly combine their precomputed KV caches in order to achieve the same generation quality as the expensive full prefill (i.e., without reusing KV cache)? [..] We present a scheme that reuses the pre-computed KV caches, regardless prefix or not, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache.
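
To make the mechanics concrete, here's a toy sketch (invented shapes and deviation heuristic, not the CacheBlend implementation): concatenate the per-chunk caches and flag only the highest-deviation positions for recomputation.

    # Toy sketch -- not the CacheBlend implementation; shapes and the
    # "deviation" heuristic are invented purely for illustration.
    import numpy as np

    def stitch_kv_caches(chunk_caches, deviation_scores, recompute_fraction=0.15):
        """chunk_caches: list of (K, V) arrays, each of shape (chunk_len, d).
        deviation_scores: per-token estimates of how badly each cached entry
        ignores its cross-chunk context (higher = more in need of recompute)."""
        K = np.concatenate([k for k, _ in chunk_caches], axis=0)
        V = np.concatenate([v for _, v in chunk_caches], axis=0)
        scores = np.concatenate(deviation_scores)
        n = max(1, int(recompute_fraction * len(scores)))
        recompute_idx = np.sort(np.argsort(scores)[-n:])   # worst offenders only
        return K, V, recompute_idx       # caller re-runs prefill just for these positions

    # Three cached 4-token chunks; recompute roughly 15% of the 12 stitched positions.
    chunks = [(np.random.rand(4, 8), np.random.rand(4, 8)) for _ in range(3)]
    devs = [np.random.rand(4) for _ in range(3)]
    K, V, idx = stitch_kv_caches(chunks, devs)
    print(K.shape, V.shape, idx)         # (12, 8) (12, 8) [indices to recompute]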

I had recently touched on the benefits of compute-in-network for KV cache management (https://news.ycombinator.com/item?id=44371227), largely making arguments contra BlueField. The CacheBlend authors note that the delay from recomputing some tokens can be hidden by pipelining it with KV loads. Note that the various systolic array/NoC architectures are well-suited for accelerating string-matching tasks. A compute-in-network FPGA could therefore manage the entire process: identify viable chunks by indexing and matching the hot substrings, prefetch the corresponding KV caches from network storage, and stitch up a new prefix before passing it to the primary inference hardware. It may well be one of those weird cases where hard-coding the algorithm is possible in theory but intractable in practice, because the optimal paths would be highly dependent on topology.

Nobody wants one-trick hardware.

In view of the Xilinx acquisition, AMD's death in the AI space appears to be greatly exaggerated!

3abiton•7mo ago
I am also curious about varying the quantization of the KV cache. It seems quantizing values yields better results than doing so to keys.
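
For a sense of what that looks like mechanically, a naive sketch (not CacheGen's codec, and the key-vs-value asymmetry only shows up on real caches):

    # Naive symmetric quantization round-trip on a stand-in cache tensor,
    # purely to illustrate what lossy KV-cache compression means mechanically.
    # (Real schemes, e.g. CacheGen/KIVI, work per-channel/per-token; with real
    # caches, keys often need more bits than values because of outlier channels.)
    import numpy as np

    def quantize_roundtrip(x, bits):
        scale = np.abs(x).max() / (2 ** (bits - 1) - 1)   # map range onto signed ints
        return np.round(x / scale) * scale                # quantize, then dequantize

    cache = np.random.default_rng(0).standard_normal((1024, 128))
    for bits in (8, 4, 2):
        err = np.abs(quantize_roundtrip(cache, bits) - cache).mean()
        print(f"int{bits}: mean abs reconstruction error = {err:.4f}")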
PoignardAzur•7mo ago
KV cache blending sounds like it would be super useful for Copilot-style code completion models.

You could cache the contents of each file, the edits made so far, the project README, recent commits, etc, separately, and blend them dynamically depending on what the user is doing.

tom910•7mo ago
Where can I find more detailed explanations about how it works? A simple key/value solution based on the hash of the prompt will not work because almost every request will have a unique hash. How can I solve this problem and maintain quality?
hasanar1f•7mo ago
Is LMCache entirely lossless? Cuz the KV cache streaming in the CacheGen paper was not lossless. Or is there any way to control the loss in LMCache?