Prompt caching for cheaper LLM tokens

(ngrok.com)

306 points | by samwho 53 days ago

13 comments

est 50 days ago
This is a surprising good read of how LLM works in general.
[-]
- samwho 50 days ago
  It’s funny, I didn’t set out for that to be the case. When I pitched the idea internally, I wanted to scratch my own itch (what on earth is a cached token?) and produce a good post. But then I realised I had to go deeper and deeper to get to my answer and accidentally made a very long explainer.
  [-]
  - yomismoaqui 50 days ago
    Thanks for the post, it's near perfect in focus, detail and how it's written.
    EDIT: You have some minor typos in the post (psuedocode)
Havoc 50 days ago
Does anyone know whether the cache is segregated by user/API key for the big providers?
Was looking at modifying outgoing requests via proxy and wondering whether that's harming caching. Common coding tools presumably have a shared prompt across all their installs so universal cache would save a lot
[-]
- moebrowne 50 days ago
  For ChatGPT:
  > Prompt caches are not shared between organizations. Only members of the same organization can access caches of identical prompts.
  https://platform.openai.com/docs/guides/prompt-caching#frequ...
  [-]
  - maxloh 50 days ago
    I don't find it really viable. There are so many ways to express the same question, and context does matter: the same prompt becomes irrelevant if the previous prompts or LLM responses differ.
    With the cache limited to the same organization, the chances of it actually being reused would be extremely low.
    [-]
    - qeternity 50 days ago
      In a chat setting you hit the cache every time you add a new prompt: all historical question/answer pairs are part of the context and don’t need to be prefilled again.
      On the API side imagine you are doing document processing and have a 50k token instruction prompt that you reuse for every document.
      It’s extremely viable and used all the time.
      [-]
      - jonhohle 50 days ago
        I’m shocked that this hasn’t been a thing from the start. That seems like table stakes for automating repetitive tasks.
        [-]
        qeternity 50 days ago
        It has been a thing. In a single request, this same cache is reused for each forward pass.
        It took a while for companies to start metering it and charging accordingly.
        Also companies invested in hierarchical caches that allow longer term and cross cluster caching.
    - IanCal 50 days ago
      It gets used massively in a conversation, also anything that has a lot of explain actions in the system prompt means you have a large matching prefix.
    - babelfish 50 days ago
      Think of it as a very useful prefix match. If all of your threads start with the same system prompt, you will reap benefits from prompt caching.
- samwho 50 days ago
  I was wondering about this when I was reading around the topic. I can’t personally think of a reason you would need to segregate, though it wouldn’t surprise me if they do for some sort of compliance reasons. I’m not sure though, would love to hear something first-party.
  [-]
  - weird-eye-issue 50 days ago
    They absolutely are segregated
    With OpenAI at least you can specify the cache key and they even have this in the docs:
    Use the prompt_cache_key parameter consistently across requests that share common prefixes. Select a granularity that keeps each unique prefix-prompt_cache_key combination below 15 requests per minute to avoid cache overflow.
    [-]
    - ambicapter 49 days ago
      > Select a granularity that keeps each unique prefix-prompt_cache_key combination below 15 requests per minute to avoid cache overflow.
      Why below a certain number? Usually in caches a high number of requests keeps the cached bit from expiring or being replaced, no?
      [-]
      - weird-eye-issue 46 days ago
        It needs to go to the same machine and machines can only handle so many requests
    - psadri 50 days ago
      Does anyone actually compute / use this key feature? Or do you rely on implicit caching? I wish HN had a comment with a poll feature.
      [-]
      - weird-eye-issue 49 days ago
        It would be important to use for relatively high traffic use cases
        Let's say you have a chatbot with hundreds of active users, their requests could get routed to different machines which would mean the implicit caching wouldn't work
        If you set the cache key to a user id then it would be more likely each user's chat could get cached on subsequent requests
  - samwho 50 days ago
    The only thing that comes to mind is some kind of timing attack. Send loads of requests specific to a company you’re trying to spy on and if it comes back cached you know someone has sent that prompt recently. Expensive attack, though, with a large search space.
    [-]
    - gwern 50 days ago
      No, the search space is tiny: you can just attack 1 BPE at a time! Stuff like password guessing is almost trivial when you get to do a timing attack on each successive character. So that lets you quickly exfiltrate arbitrary numbers of prompts, especially if you have any idea what you are looking for. (Note that a lot of prompts are already public information, or you can already exfiltrate prompts quite easily from services and start attacking from there...)
      [-]
      - reitzensteinm 50 days ago
        Hill climbing a password would only be possible if intermediate KV cache entries were stored. To hillclimb "hunter2", you're going to try "a", "b", "c", etc, until you notice that "h" comes back faster. Then you try "ha", "hb" and so on.
        But that's only going to work if the cache looks like: "h", "hu", "hun", ..., "hunter2"
        If just "hunter2" is in the cache, you won't get any signal until you stumble on exactly that password. And that's before getting into the block size granularity of the caches discussed elsewhere in this thread.
        That's not to say timing attacks aren't possible. I haven't looked at Claude Code's prompt generation, but there's no intrinsic reason why you couldn't do things like figure out what open source code and research papers your competitors are loading into context.
        Sharing caches between orgs would be an incredible misstep.
        [-]
        jgeralnik 50 days ago
        Right, you can’t actually guess a letter (byte) at a time but you can guess a token at a time (I believe the vocabulary is 200000 possible tokens in gpt 5) So you could send each of the 200000 possible tokens, see which is cached, and then send 200000 more tokens to find the next cached token Certainly less efficient but well within the realm of a feasible attack
        [-]
        reitzensteinm 50 days ago
        It's a good call out re: tokens vs letters, but I think you might have misunderstood my point - you can't do it a token at a time unless the intermediate KV cache is stored after each token is generated.
        This won't be the case in any non toy implementation, as it would be unneccessary and slow.
        [-]
        jgeralnik 49 days ago
        Ah, fair enough. Anthropic caches at a block level (basically a single message) so for non-trivial messages this is really less of a concern, although I definitely understand why they still scope cache to a single tenant
      - IanCal 50 days ago
        Do any providers do this level of granularity? Anthropic require explicit cache markers, for example.
        [-]
        jgeralnik 50 days ago
        Anthropic requires explicit cache markers but will “look backwards” some amount, so you don’t need to fall on the exact split to get cached tokens
    - gunalx 50 days ago
      I habe come across turning on caching means the llm has a faint memory of what was in the cache, even to unrelated queries. If this is the case its fully unreasonable to share the cache, because of possibility of information leakage.
      [-]
      - weird-eye-issue 50 days ago
        This is absolutely 100% incorrect.
      - samwho 50 days ago
        How would information leak, though? There’s no difference in the probability distribution the model outputs when caching vs not caching.
        [-]
        sroussey 49 days ago
        the probability distribution the model outputs is identical under identical conditions.
        A local model running alone on your machine will 100% always return the exact same thing and the internal state will be exactly the same and you can checkpoint or cache that to avoid rerunning to that point.
        But… conditions can be different, and batching requests tends to affect other items in flight. I believe Thinking Machines had an article about how to make a request deterministic again without performance going to complete crap.
        I tend to think of things this way (completely not what happens though): what if you were to cache based on a tensor as the key? To generate a reasonably sized key what is an acceptable loss of precision to retrieve the same cache knowing that there is inherent jitter in the numbers of the tensor?
        And then the ever so slight leak of information. But also multiplied since there are internal kv caches for tokens and blah blah blah.
  - dustfinger 50 days ago
    I wonder if there is valuable information that can be learned by studying a companies prompts? There may be reasons why some companies want their prompts private.
    [-]
    - dustfinger 50 days ago
      I realize cache segregation is mainly about security/compliance and tenant isolation, not protecting secret prompts. Still, if someone obtained access to a company’s prompt templates/system prompts, analyzing them could reveal:
      - Product logic / decision rules, such as: when to refund, how to triage tickets
      - Internal taxonomies, schemas, or tool interfaces
      - Safety and policy guardrails (which adversaries could try to route around)
      - Brand voice, strategy, or proprietary workflows
      That is just off the top of my head.
duggan 50 days ago
It was a real facepalm moment when I realised we were busting the cache on every request by including date time near the top of the main prompt.
Even just moving it to the bottom helped move a lot of our usage into cache.
Probably went from something like 30-50% cached tokens to 50-70%.
willvarfar 50 days ago
A really clear explanation!
So if I were running a provider I would be caching popular prefixes for questions across all users. There must be so many questions that start 'what is' or 'who was' etc?
Also, can subsequences in the prompt be cached and reused? Or is it only prefixes? I mean, can you cache popular phrases that might appear in the middle of the prompt and reuse that somehow rather than needing to iterate through them token by token? E.g. must be lots of times that "and then tell me what" appears in the middle of a prompt?
[-]
- GeneralMayhem 50 days ago
  Really only prefixes, without a significant loss in accuracy. The point is that because later tokens can't influence earlier ones, the post-attention embeddings for those first tokens can't change. But the post-attention embeddings for "and then tell me what" would be wildly different for every prompt, because the embeddings for those tokens are affected by what came earlier.
  My favorite not-super-accurate mental model of what's going on with attention is that the model is sort of compressing the whole preceding context into each token. So the word "tell" would include a representation not just of the concept of telling, but also of what it is that's supposed to be told. That's explicitly what you don't want to cache.
  > So if I were running a provider I would be caching popular prefixes for questions across all users
  Unless you're injecting user context before the question. You can have a pre baked cache with the base system prompt, but not beyond that. Imagine that the prompt always starts with "SYSTEM: You are ChatGPT, a helpful assistant. The time is 6:51 ET on December 19, 2025. The user's name is John Smith. USER: Hi, I was wondering..." You can't cache the "Hi, I was wondering" part because it comes after a high-entropy component (timestamp and user name).
- samwho 50 days ago
  With KV caching as it’s described there it has to be a prefix match. OpenAI state in their docs they don’t cache anything below 1024 tokens long, and I’m sure I read somewhere that they only cache in 1024 token blocks (so 1024, 2048, 3072, etc) but I can’t find it now.
  There’s been some research into how to cache chunks in the middle, but I don’t think any of the providers are doing it yet because it needs the prompt to be structured in a very specific way.
  [-]
  - moebrowne 50 days ago
    https://platform.openai.com/docs/guides/prompt-caching#requi...
    > Caching is available for prompts containing 1024 tokens or more.
    No mention of caching being in blocks of 1024 tokens thereafter.
    [-]
    - IanCal 50 days ago
      At launch it was described as being in blocks of 128
      https://openai.com/index/api-prompt-caching/
WillAdams 50 days ago
When will Microsoft do this sort of thing?
It's a pain having to tell Copilot "Open in pages mode" each time it's launched, and then after processing a batch of files run into:
https://old.reddit.com/r/Copilot/comments/1po2cuf/daily_limi...
holbrad 50 days ago
I gave the table of inputs and outputs to both Gemini 3.0 flash and GPT 5.2 instant and they were stumped.
https://t3.chat/share/j2tnfwwful https://t3.chat/share/k1xhgisrw1
[-]
- samwho 50 days ago
  When I was writing this, GPT 5.1 was the latest and it got it right away. It’s the sequence of prime numbers fwiw :)
- andruby 50 days ago
  What is the function supposed to be? It’s not Celsius to Farenheit. (2C=35F, 206C=406F, …)
dangoodmanUT 50 days ago
But why is this posted on ngrok?
[-]
- toobulkeh 50 days ago
  They have an AI router they just released.
  ngrok.ai
who-shot-jr 50 days ago
What a fantastic article! How did you create the animations?
[-]
- samwho 50 days ago
  Thank you! <3
  These are all built with React and CSS animations (or the Web Animations API where I needed it). I’m not very good at React so the code is a real mess. 2 of the components also use threejs for the 3D bits.
  For the stuff on my personal site, which simonw graciously linked to in another reply, you can see all the code behind my work at https://github.com/samwho/visualisations
- simonw 50 days ago
  Sam has a long history of building beautiful visual explanations like this - I didn't realize he works for ngrok now, here's his previous independent collection: https://samwho.dev/
  [-]
  - samwho 50 days ago
    Simon, you’re too kind. Thank you. <3
aitchnyu 50 days ago
Took me a minute to see it is same Ngrok which provided freemium tunnels to localhost. How did they adapt to the AI revolution?
[-]
- samwho 50 days ago
  It is the same ngrok!
  The product has grown a lot since the mid 2010s. Still got free localhost tunnelling, but we also have a whole bunch of production-grade API gateway tooling and, as of recently, AI gateway stuff too.
tomhow 50 days ago
[under-the-rug stub]
[see https://news.ycombinator.com/item?id=45988611 for explanation]
[-]
- wesammikhail 50 days ago
  Amazing article. I was under the misapprehension that temp and other output parameters actually do affect caching. Turns out I was wrong and this explains why beautifully.
  Great work. Learned a lot!
  [-]
  - samwho 50 days ago
    Yay, glad I could help! The sampling process is so interesting on its own that I really want to do a piece on it as well.
    [-]
    - wesammikhail 50 days ago
      Looking forward to it!
  - stingraycharles 50 days ago
    I had a “somebody is wrong on the internet!!” discussion about exactly this a few weeks ago, and they proclaimed to be a professor in AI.
    Where do people get the idea from that temperature affects caching in any way? Temperature is about next token prediction / output, not input.
    [-]
    - wesammikhail 50 days ago
      Because in my mind, as a person not working directly on this kind of stuff, I figured that caching was done similar to any resource caching in a webserver environment.
      It´s a semantics issue where the word caching is overloaded depending on context. For people that are not familiar with the inner workings of llm models, this can cause understandable confusion.
    - semi-extrinsic 50 days ago
      Being wrong about details like this is exactly what I would expect from a professor. They are mainly grant writers and PhD herders, often they are good at presenting as well, but they mostly only have gut feelings about technical details of stuff invented after they became a professor.
- walterbell 50 days ago
  Excellent HN-esque innovation in moderation: immediate improvement in S/N ratio, unobtrusive UX, gentle feedback to humans, semantic signal to machines.
  How was the term "rug" chosen, e.g. in the historical context of newspaper folds?
- coderintherye 50 days ago
  Really well done article.
  I'd note, when I gave the input/output screenshot to ChatGPT 5.2 it failed on it (with lots of colorful chain of thought), though Gemini got it right away.
  [-]
  - samwho 50 days ago
    Huh, when I was writing the article it was GPT-5.1 and I remember it got it no problem.
- simedw 53 days ago
  Thanks for sharing; you clearly spent a lot of time making this easy to digest. I especially like the tokens-to-embedding visualisation.
  I recently had some trouble converting a HF transformer I trained with PyTorch to Core ML. I just couldn’t get the KV cache to work, which made it unusably slow after 50 tokens…
  [-]
  - samwho 52 days ago
    Thank you so much <3
    Yes, I recently wrote https://github.com/samwho/llmwalk and had a similar experience with cache vs no cache. It’s so impactful.
    [-]
    - mrgaro 50 days ago
      Hopefully you can write the teased next article about how Feedforward and Output layers work. The article was super helpful for me to get better understanding on how LLM GPTs work!
      [-]
      - samwho 50 days ago
        Yeah! It’s planned for sure. It won’t be the direct next one, though. I’m taking a detour into another aspect of LLMs first.
        I’m really glad you liked it, and seriously the resources I link at the end are fantastic.
- ThePyCoder 50 days ago
  What an excellent write-up. Thank you!
  [-]
  - samwho 50 days ago
    Thank you so much <3
NooneAtAll3 50 days ago
Blog starts loading and then gives "Something Went Wrong. D is not a function" error displayed
[-]
- samwho 50 days ago
  Could you tell me what browser/OS/device you’re using? A few people have said this and I haven’t been able to reproduce it.
  [-]
  - NooneAtAll3 48 days ago
    librewolf, fork of firefox, latest version
    f12 menu lists this:
    Loading failed for the <script> with source “https://global.ketchcdn.com/web/v2/config/ngrok/ngrok_ketch_...”. prompt-caching:1:356 Response { status: 404, type: "default", url: "", redirected: false, ok: false, statusText: "Not Found", headers: Headers(1), body: ReadableStream, bodyUsed: false }
    React Router caught the following error during render entry.client-BTJ7ChVH.js:8:64676 Response { status: 404, type: "default", url: "", redirected: false, ok: false, statusText: "Not Found", headers: Headers(1), body: ReadableStream, bodyUsed: false }
    Uncaught Error: Minified React error #520; visit https://react.dev/errors/520 for the full message or use the non-minified dev environment for full errors and additional helpful warnings. chunk-G3INQAYP-D7BZozYw.js:4:2490 Rm https://frontend-blog-ngrok.vercel.app/assets/entry.client-B... mu https://frontend-blog-ngrok.vercel.app/assets/entry.client-B... Lm https://frontend-blog-ngrok.vercel.app/assets/entry.client-B... t1 https://frontend-blog-ngrok.vercel.app/assets/entry.client-B... A1 https://frontend-blog-ngrok.vercel.app/assets/entry.client-B... Ba https://frontend-blog-ngrok.vercel.app/assets/entry.client-B... Caused by: Response { … }
- belter 50 days ago
  You should upgrade IE6. It has been out of support for a while...
Youden 52 days ago
Link seems to be broken: content briefly loads then is replaced with "Something Went Wrong" then "D is not a function". Stays broken with adblock disabled.
[-]
- samwho 50 days ago
  Another person had this problem as well and we couldn’t figure out what causes it. We suspect something to do with WebGL support. What browser/device are you using? Does it still break if you disable all extensions? I’d love to fix this.
  [-]
  - bkor 50 days ago
    It gives "D is not a function". This on Firefox 146. Various extensions including Ublock Origin but that doesn't seem to cause it. Also doesn't work in a private window.