A KV cache is a temporary storage system used by large language models to hold the mathematical representations of words they have already read or generated, allowing them to produce new text without having to reread the entire conversation from scratch every single time. When an AI generates a response, it does so one word at a time. Without a KV cache, the model would have to recalculate the relationships between every single word in the entire prompt and its own previous output just to figure out what the next word should be. By saving these intermediate calculations—specifically the "Key" and "Value" vectors—the model only has to perform the heavy math on the newest word, drastically speeding up the generation process and reducing the computational power required.
This mechanism is the unsung hero of modern AI performance. It is the reason a chatbot can spit out a 500-word essay in seconds rather than minutes. However, this speed comes at a steep cost in memory. As conversations get longer, the KV cache grows larger, eventually becoming the primary bottleneck that dictates how many users a single server can handle at once.
The Math Behind the Memory
To understand why the KV cache is so critical, we have to look at the engine that powers almost all modern language models: the transformer architecture. Introduced in the landmark 2017 paper "Attention Is All You Need" by researchers at Google, the transformer relies on a mechanism called "self-attention" (Vaswani et al., 2017).
Self-attention is how a model figures out which words in a sentence are most important to each other. It is a mathematical way of determining context. For example, in the sentence "The bank of the river," the word "bank" means something very different than it does in the sentence "The bank on the corner." The model uses self-attention to look at the surrounding words ("river" or "corner") to figure out the correct meaning.
To do this, when the model reads a word, it creates three distinct mathematical vectors for it: a Query (Q), a Key (K), and a Value (V). These vectors are essentially lists of numbers that capture different aspects of the word's meaning and its role in the sentence.
You can think of this like a filing system in a massive library. The Query is what the model is currently looking for—it represents the current word's search for context. The Key is the label on the outside of a file folder, describing what is inside—it represents what a word has to offer to other words. The Value is the actual document inside the folder—it represents the core meaning of the word that will be used if a match is found.
When the model wants to understand the context of a specific word, it takes that word's Query and compares it against the Keys of all the other words in the sentence. This comparison is done using a mathematical operation called a dot product, which calculates how similar the Query and the Key are. If a Query and a Key match well (resulting in a high score), the model pulls out the corresponding Value and uses it to shape its understanding of the original word.
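That comparison step can be sketched in a few lines of NumPy. This is a single toy attention head with illustrative dimensions, not a real model: each Query is dotted against every Key, the scores are softmax-normalized, and the output is a score-weighted blend of the Values.

```python
import numpy as np

def self_attention(Q, K, V):
    """Toy single-head self-attention: score each Query against every
    Key via dot products, softmax the scores, then mix the Values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # dot-product similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the Keys
    return weights @ V                                  # blend Values by match score

rng = np.random.default_rng(0)
seq_len, d = 4, 8                                       # 4 tokens, 8-dim vectors
Q, K, V = (rng.standard_normal((seq_len, d)) for _ in range(3))
out = self_attention(Q, K, V)
print(out.shape)                                        # (4, 8): one context vector per token
```

In a real transformer, Q, K, and V come from learned projection matrices applied to each token's embedding; the random vectors here just stand in for those.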
During the initial reading phase—often called the prefill phase—the model calculates the Q, K, and V vectors for every single word in your prompt all at once. Because the entire prompt is available upfront, the model can process all the words simultaneously. This is a massive, parallel mathematical operation that modern Graphics Processing Units (GPUs) are incredibly good at executing quickly. The model reads your entire prompt, calculates all the relationships, and builds its initial understanding of the context in one massive burst of computation.
The Autoregressive Trap
The problem arises when the model actually starts talking. Language models are "autoregressive," meaning they generate text one token (a word or piece of a word) at a time, and each new token is based on all the tokens that came before it. Unlike the prefill phase, where everything is processed in parallel, the generation phase—often called the decode phase—is strictly sequential. The model cannot generate the third word until it has generated the second word.
Imagine the model has just generated the word "The" and needs to figure out the next word. It creates a Query for "The" and compares it against the Keys of all the words in your prompt. It then generates the next word, let's say "cat." Now it needs to generate the third word. It creates a Query for "cat" and compares it against the Keys of the prompt and the Key for the newly generated word "The."
If the model didn't have a memory system, it would have to recalculate the Keys and Values for the entire prompt, plus the word "The," just to generate the third word. To generate the fourth word, it would have to recalculate the Keys and Values for the prompt, "The," and "cat." Every time it adds a new word, the amount of historical context it has to process grows.
This means the total amount of math required grows quadratically with the length of the text. If you ask a model to summarize a 10,000-word document, and it generates a 500-word response, a system without a cache would have to recalculate the Keys and Values for those 10,000 words 500 separate times. The model would spend nearly all of its time recalculating things it already knew, grinding the generation process to a halt.
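The gap is easy to quantify with a back-of-the-envelope count (treating one "token-computation" as processing one token's Keys and Values for one generation step; the token counts mirror the example above):

```python
# Without a cache, generating step t must re-process the prompt plus all
# previously generated tokens; with a cache, each step processes one token.
prompt, output = 10_000, 500

no_cache = sum(prompt + t for t in range(output))   # re-reads history every step
with_cache = output                                 # one new token per step

print(no_cache)                 # 5124750 token-computations
print(with_cache)               # 500
print(no_cache / with_cache)    # 10249.5x more work without the cache
```

The exact ratio depends on prompt and output lengths, but the shape of the problem is always the same: re-reading history every step turns a linear job into a quadratic one.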
The KV cache solves this by simply saving the Keys and Values after they are calculated the first time (Raschka, 2025). During the prefill phase, all the Keys and Values for the prompt are calculated and stored in the cache. When the model generates "The," it calculates the Q, K, and V for "The," uses the Q to search the cached Keys, and then saves the new K and V for "The" into the cache.
When it moves on to generate "cat," it only calculates the Q, K, and V for "cat." It doesn't recalculate anything for the prompt or for "The." It simply pulls those historical Keys and Values directly from the cache. This turns a quadratic computational nightmare into a linear, manageable process. The model only ever does the heavy math for the newest word, relying on its cached memory for everything else.
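The prefill-then-decode loop described above can be sketched as follows. The projection function here is a stand-in for the model's learned Q/K/V projections, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

def project_qkv(token_vec):
    """Stand-in for the model's learned Q/K/V projections (toy math here)."""
    return token_vec, token_vec * 0.5, token_vec * 2.0

# Prefill: compute and cache the K and V for every prompt token once.
prompt = rng.standard_normal((5, d))
k_cache, v_cache = [], []
for tok in prompt:
    _, k, v = project_qkv(tok)
    k_cache.append(k)
    v_cache.append(v)

# Decode: each new token computes only its own Q/K/V. History comes from
# the cache, and the new K/V are appended for future steps.
for step in range(3):
    new_token = rng.standard_normal(d)
    q, k, v = project_qkv(new_token)
    scores = np.array([q @ kc for kc in k_cache])       # attend over cached Keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = sum(w * vc for w, vc in zip(weights, v_cache))
    k_cache.append(k)                                   # grow the cache by one entry
    v_cache.append(v)

print(len(k_cache))   # 5 prompt entries + 3 generated = 8
```

Note that nothing in the cache is ever recomputed; each decode step only appends one new Key and one new Value.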
The Memory Wall
While the KV cache brilliantly solves the computational problem, it creates a massive physical memory problem. These Key and Value vectors are not small pieces of data. They are large arrays of high-precision floating-point numbers. For every single token in a conversation, the model has to store hundreds or thousands of these numbers. And they take up physical space on the GPU's highly specialized, incredibly expensive memory chips (High Bandwidth Memory, or HBM).
As the context window—the amount of text the model can remember at one time—grows, the size of the KV cache explodes. A few years ago, models could only remember about 2,000 tokens. Today, models routinely handle 128,000 tokens or more, allowing users to upload entire books or massive code repositories.
This capability comes at a staggering infrastructure cost. For a large model like Llama 3 70B, storing the KV cache for a single user with a 128,000-token context window requires about 40 gigabytes of GPU memory (NVIDIA, 2025). To put that in perspective, a top-tier enterprise GPU like the NVIDIA H100 typically has 80 gigabytes of memory. That means a single user asking a question about a long document consumes half the memory of a $30,000 piece of hardware just for their temporary cache.
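That 40-gigabyte figure falls out of simple arithmetic. The sketch below uses Llama 3 70B's published configuration (80 layers, 8 KV heads, 128-dimensional heads) with 16-bit storage; other models just plug different numbers into the same formula:

```python
# Back-of-the-envelope KV cache size for a Llama-3-70B-like model.
# Formula: 2 (one K and one V) x layers x kv_heads x head_dim x bytes_per_value.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_value = 2                 # fp16
context_tokens = 128_000

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total_gb = per_token * context_tokens / 1e9

print(per_token)                    # 327680 bytes per token
print(round(total_gb, 1))           # 41.9 GB for one 128k-token user
```

At roughly a third of a megabyte per token, the cache dwarfs the text itself: the 128,000-token prompt is under a megabyte of raw text, but its cache fills half an H100.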
If a company wants to serve thousands of users simultaneously, the GPU memory will run out long before the actual processing cores hit their limit. The system becomes "memory-bound." The powerful compute cores of the GPU end up sitting idle, waiting for data to be moved around, because there simply isn't enough room to store everyone's KV cache at the same time. This memory bottleneck is the primary reason why running large language models at scale is so incredibly expensive.
Architectural Fixes
Because the KV cache is such a severe bottleneck, AI researchers have spent the last few years redesigning the attention mechanism itself to make the cache smaller from the ground up.
The original transformer design used a structure called "Multi-Head Attention" (MHA). In MHA, the model splits its attention into multiple independent "heads." You can think of this like having a team of researchers reading the same document, where one researcher is looking for grammatical structure, another is looking for emotional tone, and a third is tracking the timeline of events. This multi-headed approach is what gives transformers their incredible ability to understand complex nuances. However, in MHA, each of these heads gets its own unique set of Keys and Values. If a model has 32 attention heads, it has to store 32 separate sets of Keys and Values for every single word. This multiplies the size of the KV cache enormously.
The first major attempt to fix this memory explosion was Multi-Query Attention (MQA). In MQA, the model still has multiple Query heads—so the "researchers" are still asking their different questions—but they all share a single, universal set of Keys and Values. Instead of 32 sets of files, there is only one master set of files that all the researchers have to use. This drastically shrinks the size of the KV cache, often by a factor of 10 or more, allowing the server to handle far more users at once. However, the tradeoff is quality. Forcing all the Query heads to share one set of Keys and Values can degrade the model's reasoning ability, as the shared representations might not capture all the specific nuances each head is looking for.
The current industry standard is a clever compromise called Grouped-Query Attention (GQA), introduced by researchers at Google in 2023 (Ainslie et al., 2023). GQA strikes a balance between the massive memory footprint of MHA and the potential quality loss of MQA. It divides the Query heads into a few distinct groups—say, 8 groups of 4 heads each. Each group shares a set of Keys and Values.
This approach cuts the KV cache size down significantly compared to standard MHA, freeing up valuable GPU memory, but it maintains much higher output quality than MQA because the model still has multiple distinct sets of representations to draw from. GQA has proven so effective at balancing speed, memory efficiency, and accuracy that it is now the default architecture for most modern open-weight models, including Meta's Llama 3 and IBM's Granite series (IBM, 2024).
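With the 32-head example from above, the cache savings are just head-count arithmetic (a toy comparison; real models also vary in head dimension and layer count, and the 8-group split is illustrative):

```python
# Relative per-token KV cache size for the three attention variants,
# assuming 32 query heads. Fewer K/V head sets means a smaller cache.
query_heads = 32
kv_heads = {
    "MHA": 32,   # every query head keeps its own K/V set
    "GQA": 8,    # 8 groups of 4 query heads share a K/V set
    "MQA": 1,    # all query heads share one K/V set
}

for name, h in kv_heads.items():
    shrink = kv_heads["MHA"] // h
    print(f"{name}: {h} K/V set(s) per layer, cache {shrink}x smaller than MHA")
```

So GQA with 8 groups cuts the cache by 4x versus MHA while keeping eight distinct Key/Value representations, which is why it has become the default compromise.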
Engineering the Cache
Beyond changing the fundamental architecture of the models, software engineers have developed sophisticated infrastructure techniques to manage the KV cache more efficiently in production environments.
One of the most important breakthroughs in recent years is a technique called PagedAttention. In older serving systems, the server would reserve a massive, contiguous block of memory for a user's KV cache the moment they started a conversation. The system had to guess the maximum possible length of the chat and reserve memory for that worst-case scenario. If the system reserved space for 8,000 tokens, but the user only asked a short question that took 100 tokens, the remaining 7,900 slots of memory sat completely empty and wasted. This problem, known as internal fragmentation, meant that up to 80% of a GPU's memory could be wasted on empty reservations.
PagedAttention, developed by researchers at UC Berkeley as part of the vLLM project, solves this by borrowing a concept from traditional computer operating systems. Instead of reserving a massive block upfront, it breaks the KV cache into small, fixed-size blocks (called pages). The system allocates these pages dynamically, one by one, only when the model actually generates new tokens and needs the space. This nearly eliminates wasted memory, dropping fragmentation from 80% down to under 4%, and allows servers to handle significantly more concurrent users on the same hardware (Kwon et al., 2023).
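The core allocation idea can be sketched with a toy allocator. Page size and pool size here are illustrative, and vLLM's real implementation also rewrites the attention kernels to read from non-contiguous pages; this only shows the bookkeeping:

```python
class PagedKVCache:
    """Toy paged allocator: pages are claimed one at a time as a
    sequence grows, instead of reserving the worst case upfront."""

    def __init__(self, total_pages, page_tokens=16):
        self.page_tokens = page_tokens
        self.free_pages = list(range(total_pages))  # shared pool of page ids
        self.pages = {}      # sequence id -> list of page ids it owns
        self.lengths = {}    # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        """Allocate a new page only when the current one is full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.page_tokens == 0:               # current page full (or none yet)
            if not self.free_pages:
                raise MemoryError("KV cache pool exhausted")
            self.pages.setdefault(seq_id, []).append(self.free_pages.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's pages to the shared pool."""
        self.free_pages.extend(self.pages.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(total_pages=64)
for _ in range(100):                  # a 100-token conversation...
    cache.append_token("user-a")
print(len(cache.pages["user-a"]))     # ...occupies only ceil(100/16) = 7 pages
```

The worst-case waste per sequence is now less than one page (here, 15 token slots) rather than thousands, which is where the drop from ~80% to under 4% fragmentation comes from.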
Another powerful approach is KV cache quantization. Just as developers compress the actual weights of the neural network to save space, they can compress the numbers stored in the KV cache itself. By default, these numbers are usually stored as 16-bit floating-point values. By mathematically compressing them down to 8-bit or even 4-bit integers, the physical size of the cache can be cut in half or more. While this does introduce a small amount of mathematical noise, modern quantization techniques are sophisticated enough that the hit to the model's overall accuracy is often negligible, making it a highly attractive tradeoff for the massive memory savings.
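A minimal per-tensor int8 sketch shows the mechanics; production systems use finer-grained schemes (per-channel or per-group scales) to keep the noise even lower:

```python
import numpy as np

# Compress a cached K/V block from 16-bit floats to 8-bit integers:
# store one scale factor plus int8 values, halving the memory footprint.
rng = np.random.default_rng(0)
kv_block = rng.standard_normal((1024, 128)).astype(np.float16)

scale = np.abs(kv_block).max() / 127.0                    # map the range onto [-127, 127]
quantized = np.clip(np.round(kv_block / scale), -127, 127).astype(np.int8)
dequantized = quantized.astype(np.float16) * scale        # reconstruct for use in attention

print(kv_block.nbytes // quantized.nbytes)                # 2x smaller
print(float(np.abs(kv_block - dequantized).max()))        # small rounding noise
```

The reconstruction error is bounded by half the scale factor, which for typical activation ranges is small enough that output quality barely moves.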
Finally, when GPU memory is completely exhausted, systems can employ KV cache offloading. This involves moving older or less frequently accessed parts of the cache out of the expensive, highly constrained GPU memory and into the cheaper, much more abundant CPU memory (standard system RAM). When the model needs to look back at that older context to answer a question, the data is streamed back into the GPU, typically over the PCIe bus or, on some systems, faster CPU-GPU interconnects like NVIDIA's NVLink-C2C. While this data transfer is slower than keeping everything in GPU memory natively, it prevents the system from crashing with an Out-Of-Memory error and allows for massive context windows that would otherwise be physically impossible to support.
The Future of AI Memory
The management of the KV cache is arguably the most critical engineering challenge in deploying AI today. As models get smarter and users demand the ability to upload entire books, massive codebases, and hours of video into their prompts, the battle to keep the memory footprint manageable will only intensify.
We are already seeing the next wave of innovations designed to tackle this problem. Researchers are experimenting with "sparse attention" mechanisms, where the model learns to selectively forget less important words in the cache, keeping only the most critical context. Others are exploring entirely new architectures, like state-space models (such as Mamba), which attempt to compress historical context into a fixed-size memory state rather than letting it grow linearly with every word.
Until those new architectures prove they can match the reasoning power of the transformer, the KV cache remains the unavoidable toll booth on the road to fast AI. Platforms like Sandgarden are increasingly focused on abstracting these complex infrastructure challenges, allowing developers to deploy models with advanced caching optimizations without having to manually manage GPU memory allocation or paging algorithms. Ultimately, the companies that figure out how to manage the KV cache most efficiently will be the ones that can offer the fastest, smartest, and most affordable AI tools to the world.