Speculative decoding is a technique used to make artificial intelligence models generate text much faster. It works by pairing a massive, slow AI model with a tiny, fast "draft" model. The tiny model quickly guesses the next several words, and the massive model then checks all of those guesses at once. If the guesses are right, the system just generated several words in the time it normally takes to generate one.
This approach solves one of the most frustrating problems in generative AI: the fact that large language models are incredibly capable but inherently slow. When a standard model generates text, it produces one word at a time. Each new word requires the model to read its entire massive set of internal data—called weights, which are essentially the mathematical rules it learned during training—from memory. For a state-of-the-art model, this means moving hundreds of gigabytes of data just to output a single word. It is a sequential, bottlenecked process that leaves the actual computational power of modern hardware sitting idle while waiting for data to arrive from memory.
This is the fundamental physics problem of AI inference: it is not bound by compute; it is bound by memory bandwidth. The processors are starving for data, and the memory cannot feed them fast enough. Speculative decoding was developed specifically to break this bottleneck. By changing how we ask the model to generate text, we can achieve massive speedups without changing the model's architecture or degrading the quality of its output. It is one of the most elegant and impactful engineering solutions in modern artificial intelligence, fundamentally altering the economics of deploying large language models at scale.
The Memory Bandwidth Bottleneck
To understand why speculative decoding works, we first have to look at the hardware. Modern GPUs are highly parallel machines capable of performing hundreds of trillions of mathematical operations every second. However, their memory bandwidth—the speed at which they can move data from memory to the processing cores—is usually only a few terabytes per second (Google Research, 2024). This creates a massive imbalance between how fast the processor can calculate and how fast it can be fed the numbers it needs to calculate with.
When a large language model generates a token (which is just a piece of a word, like a syllable), it performs relatively few mathematical operations compared to the massive amount of weight data it has to read. For a model with 70 billion parameters stored at 16-bit precision (two bytes per parameter), generating a single word requires moving roughly 140 gigabytes of data from the GPU's memory into its compute cores. Because standard text generation is strictly sequential—meaning the model cannot even begin calculating the third word until it has completely finished calculating the second word—the GPU finishes its math almost instantly and then sits idle. It spends the vast majority of its time simply waiting for the next batch of weights to load from memory so it can generate the next token.
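This arithmetic is easy to check. The sketch below uses illustrative numbers, not measurements: a 70-billion-parameter model in 16-bit precision and an assumed memory bandwidth of roughly 3.35 terabytes per second, in the range of a current HBM-equipped GPU. It computes the hard lower bound on per-token latency that weight movement alone imposes:

```python
# Back-of-the-envelope bound on decode speed for a memory-bound model.
# All numbers are illustrative assumptions, not measurements.
params = 70e9             # parameters in the model
bytes_per_param = 2       # fp16/bf16 weights
bandwidth = 3.35e12       # bytes/second the memory system can deliver

weight_bytes = params * bytes_per_param   # ~140 GB read per token
min_latency = weight_bytes / bandwidth    # lower bound, seconds per token
max_tokens_per_s = 1 / min_latency

print(f"{min_latency * 1000:.1f} ms/token, at most {max_tokens_per_s:.0f} tok/s")
```

Even with every other cost set to zero, this configuration cannot exceed roughly 24 tokens per second when decoding one token at a time, which is why reclaiming the idle compute matters so much.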
This means that during normal operation, the vast majority of a GPU's computational capacity is completely wasted. It is like having a factory with a thousand incredibly fast workers, but only one conveyor belt delivering parts. The workers spend most of their shift standing around waiting. Speculative decoding reclaims this wasted compute by giving the GPU more work to do in parallel. If the workers are going to be standing around anyway while the conveyor belt slowly delivers the parts for the next step, we might as well have them work on the parts for the next five steps simultaneously.
The Draft and Verify Mechanism
The core mechanism of speculative decoding relies on a simple observation: not all tokens are equally difficult to predict. If a model generates the phrase "The square root of," the next token is almost certainly going to be a number. If it generates the phrase "In conclusion, we can see," the next word is highly likely to be "that." A massive, trillion-parameter model is not required to figure that out; a tiny, highly efficient model could guess it just as easily. The heavy lifting of the massive model is only truly needed for the complex, nuanced, or highly specific tokens that carry the core meaning of the sentence.
Speculative decoding capitalizes on this by pairing the massive, slow target model with a tiny, fast draft model. Instead of asking the target model to generate tokens one by one, the system asks the draft model to quickly guess the next several tokens—usually between three and twelve, depending on the specific configuration.
Because the draft model is so small—often just a few billion parameters compared to the target model's hundreds of billions—it can generate these guesses in a fraction of the time it takes the target model to generate a single token. It acts as a scout, running ahead and mapping out the most likely path. Once the draft model has produced its sequence of guesses, the entire sequence is passed to the target model.
Here is where the magic happens: the target model can verify all of those guessed tokens simultaneously in a single forward pass (one complete cycle of pushing data through the neural network). Because the target model is already loading its massive weights from memory into the compute cores, verifying five tokens takes almost exactly the same amount of time as generating one token. The memory bandwidth cost has already been paid. The target model evaluates the draft tokens, accepts the ones that match what it would have generated, and rejects the rest.
If the draft model guessed correctly, the system just generated five tokens in the time it normally takes to generate one. If the draft model guessed wrong on the third token, the target model accepts the first two, discards the third and everything after it, generates the correct third token itself, and the process starts over from that new point. Even in the absolute worst-case scenario where every single draft token is rejected, the system still successfully generated one token, meaning performance never drops below the baseline speed of standard decoding (Leviathan et al., 2023). The only cost is the negligible compute time spent running the tiny draft model.
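The draft, verify, and correct loop described above can be sketched in a few lines. This is a toy simulation, not a real inference stack: the "target" and "draft" models are stand-in deterministic functions over a tiny integer vocabulary, and exact-match (greedy) verification stands in for full rejection sampling. The shape of the loop, though, mirrors the real algorithm:

```python
import random

def target_next(ctx):
    # Toy "large" model: a deterministic next-token rule over an 11-token vocab.
    return (sum(ctx) * 7 + 3) % 11

def draft_next(ctx):
    # Toy "small" model: agrees with the target about 80% of the time.
    guess = target_next(ctx)
    return guess if random.random() < 0.8 else (guess + 1) % 11

def speculative_generate(prompt, n_tokens, k=5):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft: the small model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Verify: one target forward pass scores every draft position
        #    in parallel (simulated here by a simple loop).
        accepted = []
        for tok in draft:
            if target_next(out + accepted) == tok:
                accepted.append(tok)   # match: keep the draft token
            else:
                break                  # first mismatch: discard the rest
        # 3. The same pass yields the target's own next token for free,
        #    so even a total rejection still advances by one token.
        accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_tokens]
```

Because every accepted token is exactly what the target would have produced, the output of `speculative_generate` is identical to plain one-token-at-a-time decoding; only the number of target passes changes.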
The Mathematics of Acceptance
The success of speculative decoding hinges entirely on the acceptance rate, which is the percentage of draft tokens that the target model approves.
When the target model verifies the draft tokens, it does not just look for exact matches. It uses a technique called rejection sampling to compare the mathematical probabilities of the draft model against its own probabilities. A proposed token is accepted with probability equal to the ratio of the target's probability to the draft's probability, capped at 100%. For example, if the draft model proposes the word "apple" with a 90% probability, and the target model evaluates that position and determines "apple" should have an 85% probability, the acceptance probability is 85/90 (about 94%), so the token is almost always kept. If the draft model proposes "apple" with a 90% probability but the target model determines it should only have a 5% probability, the acceptance probability drops to roughly 6%, and the token is almost always rejected. When a token is rejected, the target model samples a replacement from the leftover probability mass, which is exactly what preserves its original output distribution.
This mathematical guarantee ensures that the final output of speculative decoding is statistically identical to what the target model would have produced on its own (Chen et al., 2023). It is a completely lossless optimization. You are not trading quality for speed; you are getting the exact same high-quality output, just much faster.
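The accept-or-resample rule is short enough to write down directly. The sketch below uses made-up three-token distributions; it implements the standard speculative sampling rule (accept a drafted token with probability min(1, p_target / p_draft); on rejection, resample from the normalized leftover distribution max(0, p_target - p_draft)), and the test of losslessness is that the empirical output frequencies match the target distribution, not the draft's:

```python
import random

def speculative_sample(p_target, p_draft, rng):
    # Draw a token from the draft distribution, then accept it with
    # probability min(1, p_target / p_draft). On rejection, resample
    # from the "leftover" mass max(0, p_target - p_draft). The result
    # is distributed exactly according to p_target.
    tok = rng.choices(range(len(p_draft)), weights=p_draft)[0]
    if rng.random() < min(1.0, p_target[tok] / p_draft[tok]):
        return tok
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    return rng.choices(range(len(residual)), weights=residual)[0]
```

Sampling this way many times with a target distribution of [0.7, 0.2, 0.1] and a quite different draft distribution of [0.4, 0.4, 0.2] reproduces the target's 70/20/10 split, which is the formal sense in which the optimization is lossless.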
In real-world production environments, achieving a high acceptance rate is the primary engineering challenge. If the draft model is too small or poorly aligned with the target model, it will generate garbage tokens that are constantly rejected. This wastes compute and provides no speedup, as the target model has to discard the draft and generate the token itself anyway. Conversely, if the draft model is too large, it takes too long to generate the drafts, negating the latency benefits. The time spent waiting for the draft model to finish its work ends up being longer than the time saved by parallel verification.
Engineering teams typically aim for an acceptance rate between 60% and 80%. At these rates, when drafting five tokens at a time, systems routinely see end-to-end latency reductions of 2x to 3x (BentoML, 2025). This requires careful tuning and often involves training custom draft models on the specific type of data the application will process, ensuring the draft model's vocabulary and phrasing closely match the target model's behavior in that specific domain.
When Speculative Decoding Fails
While speculative decoding is a powerful tool, it is not a universal solution for all AI workloads. There are specific scenarios where it provides massive benefits, and others where it actually degrades performance.
The technique shines in environments where the output is highly predictable. Code generation, data extraction, and structured JSON formatting are perfect use cases because the syntax is rigid and easy for a small draft model to guess.
However, speculative decoding struggles in highly creative tasks. If you ask a model to generate a highly creative story or brainstorm novel ideas, the number of possible correct next words explodes, making it nearly impossible for the draft model to consistently guess the right ones. The acceptance rate plummets, and the system ends up doing unnecessary work verifying rejected drafts.
Furthermore, speculative decoding is primarily designed to make a single request run faster by utilizing spare compute capacity. If a server is already under heavy load, processing hundreds of concurrent requests at the same time, the GPU's compute capacity is already fully utilized. In these heavy-load scenarios, adding the extra work of running a draft model can actually slow the system down. Speculative decoding is best deployed in environments where fast response times for individual users are the top priority, rather than maximizing the total number of users a server can handle at once (vLLM, 2026).
The Evolution of Draft Mechanisms
The classic draft-target approach requires hosting two separate models in memory, which introduces complexity in deployment and orchestration. You have to manage two sets of weights, ensure they are perfectly aligned, and handle the communication overhead between them. To solve this, researchers have developed several advanced variants of speculative decoding that eliminate the need for a standalone draft model entirely.
EAGLE (Extrapolation Algorithm for Greater Language-Model Efficiency) is one of the most widely deployed variants in production frameworks today. Instead of using a separate model, EAGLE trains a lightweight neural head that attaches directly to the target model. It uses the rich feature representations from the target model's own hidden states (the intermediate mathematical calculations the model makes before outputting a final word) to predict the next tokens. Because it has access to the target model's internal "thoughts," it is much better at guessing what the target model will do next. This approach significantly improves the acceptance rate while eliminating the memory overhead of hosting a second model (NVIDIA, 2025).
Medusa takes a similar approach by adding multiple decoding heads to the target model, allowing it to predict several future tokens simultaneously without any separate drafting mechanism. Instead of generating a single straight line of guesses, it uses a tree-based mechanism to verify multiple potential token paths in parallel. It essentially maps out several different ways the sentence could go and verifies all of them at once, further increasing the chances of finding an accepted sequence (Together AI, 2024).
For highly repetitive tasks, developers can even use N-gram speculation, which requires no neural network at all for drafting. It simply looks at the user's prompt, finds repeating string patterns, and proposes those strings as draft tokens. If a user asks a model to summarize a specific document, the N-gram speculator will guess that the model is going to quote directly from that document. It simply copies the text from the prompt and offers it as a draft. If the model does indeed quote the document, the tokens are accepted instantly, resulting in massive speedups for retrieval-augmented generation (RAG) workloads where the model is constantly referencing provided text.
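A minimal prompt-lookup drafter of this kind is only a few lines. In the sketch below (where plain word lists stand in for a real tokenizer's output), the drafter matches the last n generated tokens against the prompt and, on a hit, proposes the k tokens that followed them there:

```python
def ngram_draft(prompt, generated, n=3, k=5):
    # Propose draft tokens by pattern matching, with no neural network:
    # if the last n generated tokens also appear in the prompt, guess
    # that the model will keep copying what followed them there.
    tail = generated[-n:]
    if len(tail) < n:
        return []
    for i in range(len(prompt) - n + 1):
        if prompt[i:i + n] == tail:
            return prompt[i + n:i + n + k]
    return []

prompt = "the quick brown fox jumps over the lazy dog".split()
generated = "it begins the quick brown".split()
print(ngram_draft(prompt, generated))  # proposes: fox jumps over the lazy
```

When the model really is quoting the source text, every proposed token is accepted and verified in a single pass; when it is not, the drafter simply returns nothing and the system falls back to ordinary decoding.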
Speculative Decoding in Reasoning Models
The recent rise of reasoning models, such as DeepSeek-R1, has created a massive new opportunity for speculative decoding. These models spend a significant amount of time generating internal "thinking" tokens before producing a final answer. A complex math problem might require the model to generate thousands of tokens of internal logic, breaking down the problem step-by-step, before it ever outputs the final number to the user.
Because this internal reasoning process follows strict logical steps and highly structured formatting, the tokens are incredibly predictable. The model is often just restating the rules of the problem, carrying over numbers from the previous step, or using standard transitional phrases like "Therefore, we can conclude that." This makes reasoning models the perfect candidates for speculative acceleration. The draft model does not need to be a genius to guess these transitional phrases; it just needs to recognize the pattern.
By training custom speculators specifically on the reasoning traces of these models, engineering teams have achieved speedups of nearly 3x, drastically reducing the cost and latency of complex logical tasks (Together AI, 2025). This is particularly important for reasoning models, as their massive token output makes them incredibly expensive and slow to run without optimization.
In fact, the architecture of many modern reasoning models now includes Multi-Token Prediction (MTP) natively. Instead of bolting a draft model on after the fact, the models are trained from the ground up to predict multiple tokens at once, seamlessly integrating speculative decoding into their core functionality. This native integration allows the model to share internal states between the drafting and verification phases, making the entire process even more efficient and further pushing the boundaries of how fast AI inference can be.
The Architecture of Speed
Speculative decoding represents a fundamental shift in how we approach AI inference. Instead of just building faster hardware or smaller models, it attacks the physics of the computation itself, finding clever ways to parallelize a strictly sequential process.
For engineering teams building production applications, understanding when and how to deploy these techniques is critical. It requires balancing the predictability of the workload, the concurrency of the server, and the alignment of the draft mechanisms. If you are building a high-throughput, high-temperature creative writing application, speculative decoding might not be the right tool. But if you are building a low-latency coding assistant, a structured data extraction pipeline, or a complex reasoning agent, it is absolutely essential. Platforms like Sandgarden make it easy to prototype and deploy these kinds of advanced AI applications, removing the infrastructure overhead so you can focus on optimizing the actual inference logic. When implemented correctly, speculative decoding offers that rare prize in computer science: a massive increase in speed with absolutely zero degradation in quality. It is not a compromise; it is simply a smarter way to use the hardware we already have.


