Lossless LLM 3x Throughput Increase by LMCache

2025-06-24 16:18 | github.com

Redis for LLMs.

Comments

  • By lihanc111 2025-06-24 16:18 | 4 replies

    Our team has built this open source project, LMCache, to reduce repetitive computation in LLM inference so that systems can serve more people (3x more throughput in chat applications). It has been used in IBM's open source LLM inference stack.

    In LLM serving, the input is first computed into intermediate states called the KV cache, which are then used to generate answers. This data is relatively large (~1-2 GB for a long context) and is often evicted when GPU memory runs low. When that happens and a user asks a follow-up question, the software has to recompute the same KV cache from scratch. LMCache is designed to combat that by efficiently offloading these KV caches to DRAM and disk and loading them back when needed.
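
    To make the idea concrete, here is a rough sketch of a prefix-keyed, two-tier (DRAM + disk) KV store. This is illustrative only, not LMCache's actual API: the tier names, paths, and helper functions are all made up.

        # Illustrative only: map a hash of the token prefix to its per-layer
        # (K, V) tensors, kept in CPU DRAM and mirrored to disk instead of
        # being recomputed after GPU eviction.
        import hashlib
        import pickle
        from pathlib import Path

        import torch

        CACHE_DIR = Path("/tmp/kv_cache")    # hypothetical on-disk tier
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        dram_tier = {}                       # hypothetical in-memory (DRAM) tier

        def prefix_key(token_ids):
            # Identical conversation histories hash to the same key.
            return hashlib.sha256(str(token_ids).encode()).hexdigest()

        def offload(token_ids, kv):
            # kv: list of per-layer (K, V) GPU tensors to move off the GPU.
            key = prefix_key(token_ids)
            cpu_kv = [(k.cpu(), v.cpu()) for k, v in kv]
            dram_tier[key] = cpu_kv
            with open(CACHE_DIR / key, "wb") as f:
                pickle.dump(cpu_kv, f)

        def load(token_ids, device="cuda"):
            # Returns the cached KV for this prefix, or None -> do a full prefill.
            key = prefix_key(token_ids)
            cpu_kv = dram_tier.get(key)
            if cpu_kv is None and (CACHE_DIR / key).exists():
                with open(CACHE_DIR / key, "rb") as f:
                    cpu_kv = pickle.load(f)
            if cpu_kv is None:
                return None
            return [(k.to(device), v.to(device)) for k, v in cpu_kv]

    On a cache hit, the engine only has to prefill the new tokens of the follow-up question instead of recomputing the whole conversation history.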

    Ask us anything!

    • By pama 2025-06-28 15:23

      Is your aim targeting inference at scale or specialized/new/simpler inference pipelines? SGLang and vLLM have disaggregated prefill and decode serving (e.g. https://docs.vllm.ai/examples/online_serving/disaggregated_s... or https://github.com/sgl-project/sglang/issues/3554 and https://github.com/sgl-project/sglang/issues/4655) — could your solution enable a model-agnostic cache store/server, or is that orthogonal to what you are trying to achieve?

    • By behnamoh 2025-06-28 16:14 | 1 reply

      > Our team

      So is this something that might in the future turn into a commercial product? Something like LangChain and the thousands of open source projects that started as "open source" but then ended up implementing proprietary features for a fee.

    • By nativeit 2025-06-28 16:03 | 1 reply

      Has it been used in IBM's inference stack, or used with IBM's inference stack? In other words, has this been merged into IBM's own repositories, or has someone just tested it using them?

      • By lihanc111 2025-06-29 04:04

        It is in IBM's llm-d open source stack

    • By dist-epoch 2025-06-28 13:11 | 1 reply

      How is it possible to do non-prefix KV cache? I was under the impression that the V for one token potentially depends on the V of all previous ones.

      • By da-x 2025-06-28 13:16

        Yes, there's KV cache 'Blending', see [1].

        Future versions of LMCache are aiming to support this.

        [1] CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion - https://arxiv.org/abs/2405.16444

  • By alyxya 2025-06-28 18:30 | 3 replies

    I skimmed over a couple of the papers referenced to get an idea of what optimizations LMCache is doing.

    * KV cache compression - compressing the bytes of the KV cache, exploiting patterns in the cached tensors and applying dynamic levels of compression

    * KV cache blending - concatenating the KV caches of multiple reused text chunks with minimal recomputation, for use cases like RAG; it's faster than the standard lossless prefix-caching optimization and gives better results than naively concatenating the KV caches of the reused chunks

    These optimizations are pretty cool and different from the standard KV cache optimizations. The title saying lossless seems misleading, though.

    • By PoignardAzur 2025-06-29 14:27

      KV cache blending sounds like it would be super useful for Copilot-style code completion models.

      You could cache the contents of each file, the edits made so far, the project README, recent commits, etc, separately, and blend them dynamically depending on what the user is doing.

    • By 3abiton 2025-06-29 08:08

      I am also curious about varying the quantization of the KV cache. It seems quantizing the values yields better results than doing the same to the keys.
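
      For intuition, here is a minimal sketch of per-token int8 quantization of a cached K or V tensor. It is a generic illustration of lossy KV-cache quantization, not LMCache's actual compression codec, and the function names are made up.

          # Generic illustration of lossy KV-cache quantization: symmetric
          # int8 with one scale per token, ~4x smaller than fp32.
          import torch

          def quantize_int8(x):
              # x: [num_tokens, num_heads, head_dim]
              scale = x.abs().amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8) / 127.0
              q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
              return q, scale

          def dequantize_int8(q, scale):
              return q.float() * scale

          # Keys and values can be treated differently (e.g. more bits, or a
          # per-channel rather than per-token scale, for whichever is more
          # sensitive in a given model).
          k = torch.randn(1024, 32, 128)
          k_q, k_scale = quantize_int8(k)
          print((k - dequantize_int8(k_q, k_scale)).abs().max())  # reconstruction error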

    • By tucnak 2025-06-28 19:53

      "Blending," or translating arbitrary substrings to prefixes, is a real curious one, & likely become a prerequisite for running dataset-scale LLM inferences at scale.

      See https://arxiv.org/abs/2405.16444v3

      > To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when the context is reused as the prefix of another LLM input. However, the reused text chunks are not always the input prefix, which makes precomputed KV caches not directly usable since they ignore the text’s cross-attention with the preceding texts. Thus, the benefits of reusing KV caches remain largely unrealized.

      > This paper tackles just one challenge: when an LLM input contains multiple text chunks, how to quickly combine their precomputed KV caches in order to achieve the same generation quality as the expensive full prefill (i.e., without reusing KV cache)? [..] We present a scheme that reuses the pre-computed KV caches, regardless prefix or not, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache.
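
      A self-contained toy sketch of that selective-recompute step, as I read it (not the paper's actual implementation; every function and tensor name below is made up):

          # Toy version of the CacheBlend idea: concatenate per-chunk KV caches
          # that were precomputed in isolation, then recompute only the tokens
          # whose cached entries are estimated to deviate most, approximately
          # restoring cross-chunk attention.
          import torch

          def blend_kv(chunk_ks, chunk_vs, deviation, recompute_fn, ratio=0.15):
              # chunk_ks/chunk_vs: per-chunk [tokens, heads, head_dim] tensors.
              # deviation: per-token score of how "wrong" the reused KV is
              #            (the paper estimates this from the first layers).
              # recompute_fn(indices) -> (k, v) recomputed with full attention.
              k = torch.cat(chunk_ks, dim=0)
              v = torch.cat(chunk_vs, dim=0)
              n = max(1, int(ratio * k.shape[0]))
              idx = torch.topk(deviation, n).indices   # the few tokens to fix up
              new_k, new_v = recompute_fn(idx)         # the only real prefill work
              k[idx], v[idx] = new_k, new_v            # patch the blended cache
              return k, v

          # Toy usage with random tensors standing in for two cached chunks.
          ka, va = torch.randn(100, 8, 64), torch.randn(100, 8, 64)
          kb, vb = torch.randn(80, 8, 64), torch.randn(80, 8, 64)
          scores = torch.rand(180)

          def dummy_recompute(idx):
              return torch.randn(len(idx), 8, 64), torch.randn(len(idx), 8, 64)

          k, v = blend_kv([ka, kb], [va, vb], scores, dummy_recompute)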

      I had recently touched on the benefits of compute-in-network for KV cache management (https://news.ycombinator.com/item?id=44371227), largely making arguments contra Bluefield. The CacheBlend authors note that the delay from recomputing some tokens can be hidden by pipelining it with KV loads. Note that the various systolic array/NoC architectures are well-suited for accelerating string matching tasks. A compute-in-network FPGA could therefore manage the entire process: identify viable chunks by indexing and matching the hot substrings, prefetch the corresponding KV caches from network storage, and stitch up a new prefix before passing it to the primary inference hardware. It may well be one of those weird cases where hard-coding the algorithm is possible in theory but intractable in practice, because the optimal paths would be highly dependent on topology.

      Nobody wants one-trick hardware.

      In view of the Xilinx acquisition, AMD's death in the AI space appears to be greatly exaggerated!

  • By nativeit 2025-06-28 15:46 | 6 replies

    It seems odd to me that so many of these projects are being launched by people who have only just discovered and/or joined HN. I'm worried this is just becoming LinkedIn for AI opportunists.

    • By parpfish 2025-06-28 15:55 | 2 replies

      I’ve got a side project that I may (someday) do a show HN with. However, I’d probably make a new account for that because the project is connected to my real name/portfolio and I don’t want that connected with my pseudonymous comments here

      • By nativeit 2025-06-28 16:00 | 1 reply

        I considered that, but then why would anyone obfuscate this really very reasonable scenario by choosing another ostensibly pseudonymous username?

      • By fsmv 2025-06-28 16:15 | 1 reply

        [deleted]

        • By parpfish 2025-06-28 16:44

          I imagine that this is a common problem and it could be another cool “unlockable” on HN, like the downvotes at 500 karma.

          Once you get X karma or an account age > Y years, you can make one anonymous submission each quarter that comes from a non-user but still gets some sort of "verified" badge proving it comes from a legit user.

    • By Aurornis 2025-06-28 21:47

      A couple months ago another project claimed to have sped up llama.cpp (IIRC) on the front page of HN, from another green name account.

      It gathered hundreds of GitHub stars and was on the front page all day. When some of us finally had time to look at the code we discovered they didn't invent anything new at all. They took some existing command line options for llama.cpp and then changed the wording slightly to make them appear novel.

      The strangest part was that everyone who pointed it out was downvoted at first. The first comment to catch it was even flagged away! You couldn't see it unless you had showdead turned on.

      At first glance I don't see this repo as being in the same category, though the "3X throughput increase" claim is very clearly dependent on the level of caching for subsequent responses and the "lossless" claim doesn't hold up as analyzed by another top-level comment.

      I think AI self-promoters have realized how easy it is to game Hacker News and GitHub stars if you use the right wording. You can make some big claims that are hard to examine in the quick turnaround times of a Hacker News front page cycle.

    • By nativeit 2025-06-28 16:34

      I'll just be unambiguous about this:

      > Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.

      https://news.ycombinator.com/newsguidelines.html

    • By bGl2YW5j 2025-06-29 01:20

      Same. Maintain skepticism.

    • By cchance 2025-06-29 01:39

      I mean a lot of people don't comment on HN, and just use it as a site for cool links lol, so you wouldn't see them posting often

    • By refulgentis 2025-06-28 16:04 | 3 replies

      You nailed it IMHO.

      I quit my job at Google 2 years ago to do LLM stuff, was looking forward to having HN around, but discussions re: LLMs here are a minefield.

      Why?

      Everyone knows at least a little, and everyone has a strong opinion on it given the impact of it. People sharing stuff sell it way high, and as with any new thing where people are selling, there's a lot of skeptics. Then, throw in human bias towards disliking what seems like snark / complaining, so stuff with substance gets downvotes.

      The SNR is continually decreasing.

      Let's dig into why this one is weird:

      My work does inference using either 3P providers, which do caching, or llama.cpp, in which I do the caching. (Basically, picture it as a super expensive step that you can skip by keeping a Map<input string, gpu state>.)

      So I log into HN and see this and say to myself: a 3x throughput increase? This is either really clever or salesmanship; no way an optimization like that has been sitting around on the ground.

      So I read the GitHub, see it's just "write everyone's inputs and outputs to disk, then use them to cobble together what the GPU state would be for an incoming request!", and write a mostly-polite comment below flagging "hey, this means writing everything to disk".

      Then I start replying to you... but then I throw away the comment, because I'm inviting drive-by downvotes. I.e. the minefield described up top: if you look like you're being mean, you'll eat downvotes, especially on a weekend.

      And to your average reader, maybe I just don't understand vLLM, and am taking it out on good hackers just pushing code.

      Then, when I go back, I immediately see a comment from someone who does use vLLM noting it already does caching.

      Sigh.

      • By pama 2025-06-28 17:30

        I had related questions and checked out the project a bit deeper, though I haven't tested it seriously yet. The project started work over a year ago based on relevant papers, before vLLM or SGLang had decent solutions; it might still add performance in some workflows, though I haven't tested it and some of the published measurements in the project are now stale. Caching the LLM KV cache to disk or external memory servers can be very helpful at scale. Cache management and figuring out cache invalidation are hard anyway, and I am not sure at what level a tight integration with inference servers or specialized inference pipelines helps vs a loose coupling that could advance each component separately. It would be nice if there were decent protocols used by all inference engines to help this decoupling.

      • By nativeit 2025-06-28 16:53 | 1 reply

        Thanks for sharing. You certainly aren't alone in your sentiments. I am seeing similar trends in arXiv submissions, as it seems it has become something of a means to inflate the value of one's own product(s) with a veneer of academic rigor. There seems to be a S.O.P. emerging for AI tools that follows many of the same trends as the less-than-reputable blockchain/crypto projects.

        • By Twirrim 2025-06-28 23:00

          > I am seeing similar trends in arXiv submissions, as it seems it has become something of a means to inflate the value of one's own product(s) with a veneer of academic rigor

          Unfortunately this isn't new. For almost as long as people have been publishing papers, people have been using them this way. arXiv arguably makes it even worse, because the papers haven't even gone through the pretense of peer review, which does serve to filter out at least some of them.

      • By hardwaresofton 2025-06-29 06:50

        > Then I start replying to you... but then I throw away the comment, because I'm inviting drive-by downvotes. I.e. the minefield described up top: if you look like you're being mean, you'll eat downvotes, especially on a weekend.

        Don't self-censor for this reason -- "downvotes aren't real" in that they don't actually matter. Being afraid of getting downvoted is a silly way to live, and I also fall into the trap but try to avoid it.

        If you're worried about coming off as mean, it's probably worth rephrasing!
