Why AI Providers Are Saving You Money with Prompt Caching

Prompt caching is a technique used by large language model (LLM) providers to temporarily store the mathematical representation of a user's input so that it doesn't have to be recalculated if the same input is sent again. When you send a long document or a complex set of instructions to an AI, the system converts that text into a massive grid of numbers. If you ask a follow-up question about that same document, prompt caching allows the AI to reuse the grid it already built, rather than starting from scratch. This drastically reduces the computational work required, which translates to faster response times and significantly lower costs for the user.

It is the computational equivalent of reading a textbook once and keeping the information fresh in your mind, rather than re-reading the entire book from cover to cover every single time someone asks you a question about chapter one. By holding onto that mathematical representation for a short period—usually 5 to 10 minutes—subsequent requests that share the same starting text can skip the heavy lifting.

The Economics of Reusing Math

The impact of this technique on the economics of AI development is staggering. When an AI reads a prompt, it performs a computationally heavy process to build its internal memory. Because this initial reading phase is so expensive, skipping it saves the API providers a massive amount of GPU time. And in a rare win for the consumer, the major providers are passing those savings directly down to developers.

To appreciate the scale of these savings, it helps to understand how LLM pricing works. Providers typically charge by the "token," which is roughly equivalent to a word or a piece of a word. There are two types of tokens: input tokens (the prompt you send) and output tokens (the response the AI generates). Historically, input tokens were cheaper than output tokens, but if you were sending a massive prompt—say, a 100,000-token codebase—the cost of those input tokens would quickly dominate your bill. Every single time you asked a question about that codebase, you paid for all 100,000 tokens again.

As of early 2026, the cost reductions are dramatic. When a request hits the cache, the cost of those input tokens drops by up to 90% compared to processing them fresh. For example, if you are using a high-end model where input tokens normally cost $15 per million, cached tokens might cost just $1.50 per million (Anthropic, 2026).
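To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python. The rates ($15 per million input tokens fresh, $1.50 cached) and the 100,000-token prompt are the illustrative figures from above, not a real price sheet:

```python
# Back-of-the-envelope savings for repeatedly querying a 100,000-token prompt.
PROMPT_TOKENS = 100_000
FRESH_RATE = 15.00 / 1_000_000   # dollars per input token, uncached
CACHED_RATE = 1.50 / 1_000_000   # 90% discount on cache hits

def session_cost(queries: int, cached: bool) -> float:
    """Input-token cost of asking `queries` questions about the same prompt."""
    # The first request always pays full price to build the cache.
    first = PROMPT_TOKENS * FRESH_RATE
    rate = CACHED_RATE if cached else FRESH_RATE
    rest = (queries - 1) * PROMPT_TOKENS * rate
    return first + rest

print(f"20 queries, no cache:   ${session_cost(20, cached=False):.2f}")  # $30.00
print(f"20 queries, with cache: ${session_cost(20, cached=True):.2f}")   # $4.35
```

Over a 20-question session, the cached version costs roughly a seventh of the uncached one, and the gap widens the longer the session runs.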

This completely changes the math for building AI applications. Previously, developers had to be incredibly stingy with their system prompts. If you wanted to build a coding assistant, you might hesitate to include the entire codebase in the prompt because the per-query cost would be astronomical. With prompt caching, you can load a massive amount of context—entire books, massive code repositories, or exhaustive rule sets—into the prompt once, and then query against it repeatedly for pennies on the dollar.

The latency improvements are equally important. Because the model doesn't have to spend time crunching the prefix, the "time to first token" (the delay before the AI starts typing its response) drops significantly. For very long prompts, this can mean the difference between waiting 20 seconds for a response and waiting 2 seconds (Rose, 2025).

Inside the KV Cache

To understand why this technique saves so much money, you have to look at how these models actually read text. When you hit "send" on a prompt, the AI doesn't just glance at the words and start typing. It performs a computationally heavy process called the "prefill" phase. During this phase, the model calculates the relationships between every single word in your prompt and every other word, creating a massive internal map of context known as the Key-Value (KV) cache.

This KV cache is the model's short-term memory, and understanding it requires a quick look under the hood of the transformer architecture. When an LLM reads text, it uses an "attention mechanism" to figure out which words are most important to each other. For every single token (a word or piece of a word) in your prompt, the model calculates a "Key" vector and a "Value" vector. The Key is essentially a label describing what that token is, and the Value is the actual content or meaning of that token. As the model generates a response, it constantly looks back at these Keys and Values to maintain context. It's what allows the AI to remember the first sentence of your prompt by the time it reaches the last sentence.

Calculating this cache is incredibly expensive. It requires a massive amount of GPU processing power to perform the matrix multiplications necessary to generate those Keys and Values for thousands of tokens. If you upload a 100-page PDF and ask a question, the model grinds through the entire document, performing billions of calculations to build the KV cache. If you then ask a second question about that same PDF, a system without prompt caching would throw away the first cache and recalculate the entire 100-page document all over again—repeating the exact same math to generate the exact same Keys and Values. Prompt caching simply stops this wasteful cycle by keeping the KV cache in memory.
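The saving can be seen in a toy version of the math. The sketch below uses a single random key/value projection with no attention softmax or multiple heads, so it is a drastically simplified model of one transformer layer, but it shows the core trick: when a new token arrives, the cached Keys and Values for the prefix are byte-for-byte what a full recompute would produce, so only the new token needs projecting.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                            # toy embedding size
W_k = rng.normal(size=(d, d))    # key projection matrix
W_v = rng.normal(size=(d, d))    # value projection matrix

prefix = rng.normal(size=(5, d))     # 5 tokens already processed
new_tok = rng.normal(size=(1, d))    # one new token arrives

# Without caching: recompute K and V for the entire sequence.
full = np.vstack([prefix, new_tok])
K_full, V_full = full @ W_k, full @ W_v

# With caching: keep the prefix's K/V and project only the new token.
K_cache, V_cache = prefix @ W_k, prefix @ W_v           # computed once, stored
K_inc = np.vstack([K_cache, new_tok @ W_k])             # append, don't redo
V_inc = np.vstack([V_cache, new_tok @ W_v])

print(np.allclose(K_full, K_inc) and np.allclose(V_full, V_inc))  # True
```

Scaled up from 5 tokens to 100,000, that "append, don't redo" step is where the GPU time goes.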

Caching Multi-Turn Conversations

While caching a massive static document is straightforward, things get tricky when you are dealing with a multi-turn conversation, like a chatbot. In a chat interface, the prompt grows with every single message. First, you send the system instructions and the user's first message. Then, the AI replies. Then, the user sends a second message. By the third turn, the prompt includes the system instructions, the first message, the first reply, and the second message.

Handled naively, this growth can break the cache on every turn. Appending new messages to the end of the prompt is harmless, because the existing prefix stays intact. What breaks the cache is anything that alters the beginning of the prompt between requests (trimming the oldest messages to fit the context window, for example, or injecting per-request data into the system instructions), because shifting the prefix by even one token forces a full recompute.

To solve this, developers have to use a technique called "sliding window caching" or rely on the automatic caching features provided by the API. When a user sends their third message, the system should recognize that the first 90% of the prompt (the system instructions and the first two messages) is identical to the prompt sent a few seconds ago. It can then load that 90% from the cache and only compute the prefill for the newest message. This is why you often see a slight delay on the very first message of a chat, but subsequent messages feel much snappier.
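The prefix-reuse property is easy to see if you flatten the conversation the way the model does. The `serialize` helper below is a hypothetical stand-in for a provider's real chat template, but the structural point holds for any template that appends turns in order:

```python
SYSTEM = "You are a helpful support agent. (imagine long instructions here)"

def serialize(system: str, messages: list[dict]) -> str:
    """Flatten a chat into the single text sequence the model reads.
    A simplified stand-in for a provider's real chat template."""
    parts = [f"[system] {system}"]
    parts += [f"[{m['role']}] {m['content']}" for m in messages]
    return "\n".join(parts)

turn1 = [{"role": "user", "content": "My order is late."}]
turn2 = turn1 + [
    {"role": "assistant", "content": "Sorry to hear that! What's the order ID?"},
    {"role": "user", "content": "It's 4417."},
]

p1, p2 = serialize(SYSTEM, turn1), serialize(SYSTEM, turn2)

# Turn 2's prompt is turn 1's prompt with new text appended at the END,
# so everything up to len(p1) can be served straight from the KV cache.
print(p2.startswith(p1))  # True
```

Because each turn's serialized prompt is a strict extension of the previous one, only the newest messages ever need the expensive prefill.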

Provider Implementations Compared

While the underlying concept of reusing the KV cache is the same across the industry, the major AI providers have implemented prompt caching in slightly different ways. The table below summarizes the key differences before we dig into the details.

Comparing Prompt Caching Implementations

| Provider | Implementation Style | Cost Structure | Default Cache Lifetime |
| --- | --- | --- | --- |
| OpenAI | Automatic (invisible) | Free to write, discounted to read | 5-10 minutes (up to 24h extended) |
| Anthropic | Automatic or explicit breakpoints | Premium to write, 90% discount to read | 5 minutes (up to 1h extended) |
| Google Gemini | Explicit API object creation | Billed by token count and storage duration | 1 hour (customizable TTL) |

The simplest experience is probably OpenAI's. If you send a prompt longer than 1,024 tokens to a supported model like gpt-4o, the system automatically checks if the initial portion of that prompt matches anything in its recent memory. There are no code changes required and no extra fees to write to the cache. The cache lives in volatile GPU memory for 5 to 10 minutes of inactivity, though OpenAI also offers an "extended" retention policy that can keep the cache alive for up to 24 hours by offloading it to local storage (OpenAI, 2026).
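Even though OpenAI's caching is invisible on the request side, it is observable on the response side: the usage block reports how many prompt tokens were served from cache. The sketch below uses a hand-built response dict with invented token counts; the `prompt_tokens_details.cached_tokens` field name follows OpenAI's documented usage schema:

```python
# A hand-written stand-in for a Chat Completions API response (invented numbers).
response = {
    "usage": {
        "prompt_tokens": 102_400,
        "completion_tokens": 240,
        "prompt_tokens_details": {"cached_tokens": 101_376},
    }
}

cached = response["usage"]["prompt_tokens_details"]["cached_tokens"]
hit_rate = cached / response["usage"]["prompt_tokens"]
print(f"cache hit rate: {hit_rate:.0%}")  # cache hit rate: 99%
```

Logging this ratio per request is the simplest way to confirm that a prompt restructuring actually improved cache behavior.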

Anthropic offers a bit more control. In addition to automatic caching, they allow developers to place explicit "cache breakpoints" within their prompts, telling the Claude API exactly which blocks of text should be cached. Anthropic charges a slight premium (usually 1.25x the base rate) to write data to the cache, but then offers a 90% discount when that cached data is read. Their default cache lifetime is 5 minutes, extendable to a full hour for an additional fee (Anthropic, 2026).
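A cache breakpoint is set by attaching a `cache_control` field to a content block. The sketch below builds the request body as a plain dict rather than calling the SDK, and the model name and manual text are illustrative placeholders; the `{"type": "ephemeral"}` marker is the documented way to say "cache everything up to and including this block":

```python
LONG_INSTRUCTION_MANUAL = "(imagine a 5,000-word static instruction manual here)"

# Sketch of an Anthropic Messages API request body with an explicit
# cache breakpoint on the big static system block.
request_body = {
    "model": "claude-sonnet-4-5",   # illustrative model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_INSTRUCTION_MANUAL,
            # Everything up to and including this block gets cached:
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Where do I find my order ID?"}
    ],
}
```

Only the `messages` list changes between requests; the cached system block is paid for once at the 1.25x write rate and then read back at the 90% discount.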

Google's approach for its Gemini models, called Context Caching, is the most explicit of the three. Developers must actively create a cache object via the API, upload their content, and set a specific Time-To-Live (TTL). Google then bills based on the number of tokens cached and the duration they are stored. This is ideal for scenarios where a massive, static dataset needs to be queried over a long period, since the developer has total control over the cache's lifespan (Google, 2026).
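The explicitness shows up in the request shape. Below is a sketch of the JSON body a developer would POST to Gemini's `cachedContents` endpoint to create a cache object; the field names follow the public REST surface, while the model name, dataset, and TTL are illustrative:

```python
ENTIRE_RULEBOOK = "(imagine a massive static dataset here)"

# Sketch of the create-cache request body for Gemini context caching
# (POST .../v1beta/cachedContents).
cache_request = {
    "model": "models/gemini-2.0-flash",   # illustrative model name
    "contents": [
        {"role": "user", "parts": [{"text": ENTIRE_RULEBOOK}]}
    ],
    "ttl": "3600s",   # developer-chosen lifetime: one hour
}
```

Subsequent generation requests would then reference the returned cache resource name instead of resending the rulebook, and billing accrues for however long the TTL keeps those tokens in storage.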

Structuring Prompts for the Cache

To actually get these discounts, developers have to structure their prompts carefully. Prompt caching only works on exact prefix matches. The system reads the prompt from the very beginning, token by token. The moment it encounters a token that differs from the cached version, the cache hit ends, and the model has to compute the rest of the prompt from scratch.

This strict requirement for exact prefix matching is the most common stumbling block for developers new to the concept. You cannot simply throw a bunch of text into a prompt in any random order and expect the cache to catch it. The order matters immensely. If you have a massive block of text that you want to reuse, it must appear at the very beginning of the prompt, and it must be identical down to the last space and punctuation mark.

This means the golden rule of prompt caching is: Static content goes first, dynamic content goes last.

If you are building a customer service chatbot, you should put your massive, 5,000-word system instruction manual at the very top of the prompt. Below that, you put the history of the conversation. And at the very bottom, you put the user's latest message. Because the 5,000-word manual never changes, it will hit the cache every single time. If you were to put the user's name or a dynamic timestamp at the very top of the prompt, the prefix would change with every request, and you would get zero cache hits.
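The ordering rule can be sketched in a few lines. The manual text and prompt format below are stand-ins, and `time.time()` plays the role of any per-request dynamic value:

```python
import time

MANUAL = "You are a support agent. (imagine 5,000 words of static policy here)"

def bad_prompt(user_msg: str) -> str:
    # Dynamic timestamp at the TOP changes the prefix on every request:
    # zero cache hits, full-price prefill every time.
    return f"Current time: {time.time()}\n{MANUAL}\nUser: {user_msg}"

def good_prompt(user_msg: str) -> str:
    # Static manual first, dynamic data last: the cached prefix survives.
    return f"{MANUAL}\nCurrent time: {time.time()}\nUser: {user_msg}"

a, b = good_prompt("hi"), good_prompt("hello")
print(a[:len(MANUAL)] == b[:len(MANUAL)])  # True: shared prefix intact
```

Both versions contain exactly the same information; only the ordering decides whether the 5,000-word block is billed once or on every request.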

This also applies to tools and images. If you are providing the model with a list of functions it can call, that list needs to be identical and in the exact same order across requests to benefit from caching.

Security Risks of Shared Memory

Whenever a system shares memory across different requests, security researchers start looking for vulnerabilities. Prompt caching is no exception. Because the KV cache is stored on the provider's servers and reused, there is a theoretical risk of "timing side-channel attacks."

Here is how it works: If an attacker knows that a specific, highly sensitive document (like an unreleased earnings report) might be in the cache, they could send a prompt containing the first few paragraphs of that document. If the API responds incredibly fast, the attacker knows their prompt hit the cache. This confirms that someone else recently queried that exact document, leaking the fact that the document exists and is actively being discussed (Gu et al., 2024).

To mitigate this, providers strictly isolate caches. OpenAI, for example, explicitly states that prompt caches are not shared between organizations. Only members of the same organization can access caches of identical prompts, preventing an external attacker from probing a company's cache status (OpenAI, 2026).

Where the Cache Falls Short

As powerful as prompt caching is, it is not a silver bullet for every AI application. The most obvious limitation is the Time-To-Live (TTL). Because GPU memory is incredibly expensive and highly constrained, providers cannot keep your KV cache around forever. If a user uploads a document, asks a question, and then goes to lunch, the cache will almost certainly be evicted by the time they get back. When they ask their next question, they will have to pay the full price for the prefill phase again.

This makes prompt caching highly effective for rapid-fire, interactive sessions, but less useful for asynchronous or long-running tasks where requests are spaced hours apart.

Furthermore, the cache is highly sensitive to even the smallest changes. If you have a system prompt that includes a dynamic variable—like inserting the current time or a randomly generated session ID at the very top of the prompt—you will destroy your cache hit rate. The entire prefix must be byte-for-byte identical. This forces developers to rethink how they inject dynamic data, pushing all variables to the very end of the prompt where they won't disrupt the cached prefix.

Rethinking Application Architecture

Prompt caching is more than just a discount code; it is a fundamental shift in how AI applications are architected. Before this technique became widespread, the prevailing wisdom was to use Retrieval-Augmented Generation (RAG) for everything. If you had a large dataset, you would chop it up, store it in a vector database, and only retrieve the few paragraphs most relevant to the user's query to save on token costs.

RAG is incredibly powerful, but it is also complex. It requires maintaining a separate database, managing embedding models, and tuning search algorithms to ensure the right snippets of text are retrieved. And even when it works perfectly, RAG still has a fundamental limitation: the LLM only gets to see the fragments of text you retrieve, not the entire document. It might miss subtle connections or overarching themes that are only apparent when reading the whole text.

With prompt caching, the math changes. For datasets under a few million tokens, it is often cheaper, faster, and more accurate to simply dump the entire dataset into the prompt cache and let the model read the whole thing. The model gets the full context rather than fragmented snippets, and the developer doesn't have to build and maintain a complex vector search pipeline.

It is a rare case in technology where the brute-force approach—just giving the AI everything—has suddenly become the most elegant and cost-effective solution.