Continuous batching is a scheduling technique for artificial intelligence models that allows new user requests to join an active processing group the moment a slot becomes available, rather than waiting for the entire group to finish. By re-evaluating the queue at every single step of text generation, this approach keeps the computer chips running the model as close to fully utilized as possible, dramatically increasing the number of users a system can serve simultaneously.
If you have ever stood in line for a roller coaster, you know the frustration of watching a car dispatch with empty seats just because a group of five didn't want to split up. The ride operators are doing their best, but the system is inherently inefficient. For a long time, the servers running large language models operated exactly like that roller coaster. They would gather a group of requests, send them through the model, and wait until every single request was completely finished before loading the next group.
This worked fine when models were just classifying images or translating single sentences. But modern AI models generate text one piece at a time, and some responses are much longer than others. If one user asks for a two-word answer and another asks for a five-paragraph essay, the server would force the first user's completed request to sit idle on the computer chip, taking up valuable space, until the essay was finished.
Engineers realized this was a massive waste of expensive hardware. The solution they developed fundamentally changed how AI is deployed at scale.
The Two Phases of AI Thought
To understand why continuous batching was such a breakthrough, we first need to look at how an AI model actually generates text. When you send a prompt to a model like ChatGPT or Claude, the system doesn't just read it and instantly spit out a full response. The process is split into two very distinct phases: prefill and decode.
The prefill phase is the heavy lifting. The model takes your entire prompt—whether it's a single sentence or a fifty-page document—and processes it all at once in a single massive calculation. This phase requires an enormous amount of raw computational power. The computer chips, known as GPUs, love this kind of work because they are designed to perform thousands of math operations simultaneously. During prefill, the GPU is working incredibly hard, but it finishes the job relatively quickly. This initial burst of computation is essential because it establishes the context the model needs to understand what you are asking. It is like reading an entire book before being asked to write a summary; the reading takes intense focus, but once it is done, the knowledge is locked in. The prefill phase is entirely compute-bound, meaning the speed at which it finishes is limited only by how fast the GPU can crunch the numbers.
The decode phase is where the actual writing happens. The model generates the first word of its response, then looks at everything it has written so far, and generates the second word. It repeats this process over and over until the response is complete. This step-by-step generation is known as autoregressive generation, and it is the fundamental bottleneck of modern AI. Because the model cannot generate the third word until it knows what the second word is, it is forced to work sequentially. It cannot use its massive parallel processing power to generate the entire sentence at once.
Unlike the prefill phase, decoding doesn't require much math. Instead, it requires the GPU to constantly fetch data from its memory banks to remember what it just wrote. This means the decode phase is memory-bound. The GPU's calculator is mostly sitting around twiddling its thumbs, waiting for the memory chips to hand it the next piece of data. It is a frustrating reality for engineers: they have purchased the most powerful calculators on the planet, only to watch them sit idle while waiting for the filing cabinet to deliver the next folder. This idle time is the exact problem that continuous batching was invented to solve. If the calculator is going to sit around waiting for memory anyway, it might as well be waiting for the memory of ten different users at the same time.
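The decode loop described above can be sketched in a few lines. Here `fake_next_token` is an invented stand-in for a real forward pass; the shape of the loop, one token per step with each step depending on the last, is the point:

```python
EOS = -1  # sentinel end-of-sequence token for this toy example

def fake_next_token(tokens):
    # Stand-in for one full forward pass: the model looks at everything
    # written so far and returns exactly one new token.
    return len(tokens)

def decode(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = fake_next_token(tokens)  # one token per step; inherently sequential
        tokens.append(nxt)
        if nxt == EOS:                 # stop early at end-of-sequence
            break
    return tokens

print(decode([10, 11], 3))  # -> [10, 11, 2, 3, 4]
```

Because each call to the model needs the previous call's output as input, no amount of parallel hardware can collapse this loop into a single pass.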
The Padding Problem
Before continuous batching came along, engineers tried to make the decode phase more efficient using static batching. The idea was simple: if the GPU's calculator is bored generating one word at a time, let's give it ten requests to generate at once.
To make this work, the server would group ten requests together. But the tensor operations that run on GPUs expect perfectly rectangular inputs. If the ten requests were different lengths, the server had to add blank, meaningless data—called padding—to the shorter requests until they all matched the length of the longest one.
This created a massive inefficiency. The GPU was spending its time and memory processing blank space. Worse, because the server couldn't accept new requests until the longest response in the batch was completely finished, the GPU's utilization would slowly drop as shorter requests completed and their slots sat empty. It was like a bus driver refusing to let new passengers on at a stop because one person in the back was riding all the way to the end of the line. The bus is mostly empty, but the people waiting at the bus stop are still forced to stand in the rain. In the world of AI serving, those people standing in the rain are your users, and the empty bus seats represent thousands of dollars of wasted computing power.
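A quick back-of-the-envelope calculation shows how bad the waste gets. The request lengths here are made up for illustration:

```python
def padded_waste(lengths):
    """Fraction of token slots in a static rectangular batch that are padding."""
    longest = max(lengths)
    total_slots = longest * len(lengths)
    return 1 - sum(lengths) / total_slots

# Three short answers batched with one essay-length response:
waste = padded_waste([2, 3, 4, 100])
# waste is 0.7275: nearly three-quarters of the batch is blank padding
```

One long response in the batch is enough to make most of the compute meaningless.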
The Iteration-Level Breakthrough
In 2022, researchers from Seoul National University and FriendliAI published a paper introducing a system called Orca (Yu et al., 2022). They proposed a radical idea: instead of scheduling work at the request level, why not schedule it at the iteration level?
In other words, instead of locking a batch in place until every request is finished, the server should pause after every single word is generated, check the queue, and instantly swap out finished requests for new ones. The researchers demonstrated that this iteration-level scheduling could improve throughput by an astonishing 36.9 times compared to the standard systems of the day, all without increasing the time it took for users to get their responses. It was a paradigm shift that proved the software managing the model was just as critical as the model itself.
This is continuous batching. When a request finishes generating its final word, it is immediately ejected from the GPU. In the very next fraction of a second, a new request from the queue is slotted into that exact space. The batch is never static; it is a flowing river of data, constantly updating its composition at every step of the generation process.
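The scheduling policy itself is simple enough to sketch. This toy simulator (the request tuples and the `max_batch` parameter are invented for illustration) refills free slots from the queue after every single decode iteration:

```python
from collections import deque

def continuous_batch(requests, max_batch=4):
    """Toy iteration-level scheduler. Each request is (id, tokens_needed);
    returns the order in which requests finish."""
    queue = deque(requests)
    active = {}       # request id -> tokens still to generate
    finished = []
    while queue or active:
        # Refill: the instant a slot is free, pull from the queue.
        while queue and len(active) < max_batch:
            rid, need = queue.popleft()
            active[rid] = need
        # One decode iteration: every active request generates one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]        # ejected immediately; slot frees up
                finished.append(rid)
    return finished

print(continuous_batch([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)]))
# -> ['c', 'a', 'd', 'e', 'b']
```

Notice that request "e" starts generating the very iteration after "c" finishes, instead of waiting for the slowest request "b" to complete.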
To make this work without the dreaded padding problem, engineers developed a technique called ragged batching (Ouazan Reboul et al., 2025). Instead of forcing all the requests to be the same length by adding blank space, the server simply concatenates them into one long, continuous stream of data. It then uses a mathematical mask to ensure that the words from user A's request don't accidentally mix with the words from user B's request.
This approach completely eliminates the wasted computation of processing blank padding tokens. When combined with the dynamic swapping of requests at every iteration, ragged batching forms the core engine of continuous batching. It transforms the rigid, blocky structure of static batches into a fluid, highly optimized stream of computation that keeps the GPU fed with useful work at all times.
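A minimal sketch of the idea, using plain Python lists and ignoring the usual causal masking for brevity; `ragged_concat` is an invented name, not a real library call:

```python
def ragged_concat(sequences):
    """Concatenate variable-length requests into one stream (no padding),
    plus a block-diagonal mask keeping each request's tokens separate."""
    flat = [tok for seq in sequences for tok in seq]
    seq_ids = [i for i, seq in enumerate(sequences) for _ in seq]
    # mask[i][j] is True only when tokens i and j belong to the same request
    mask = [[a == b for b in seq_ids] for a in seq_ids]
    return flat, mask

flat, mask = ragged_concat([[1, 2, 3], [4, 5]])
print(flat)                           # -> [1, 2, 3, 4, 5]: zero padding tokens
print(sum(sum(row) for row in mask))  # -> 13 allowed pairs (3*3 + 2*2), not 25
```

A real implementation also applies the standard causal (triangular) mask within each request, but the separation of requests works exactly this way: the mask, not padding, keeps the streams apart.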
The Memory Management Challenge
Continuous batching sounds like an obvious solution, but it is incredibly difficult to implement because of how AI models remember things.
During the prefill phase, the model calculates a massive mathematical representation of your prompt. It saves this representation in the GPU's memory so it doesn't have to recalculate the whole prompt every time it generates a new word. This saved data is called the KV cache.
As the model generates a response, the KV cache grows larger and larger with every new word. If you are constantly swapping requests in and out of the batch, the server has to manage dozens of these growing caches simultaneously. The memory requirements scale linearly with both the batch size and the length of the sequences being generated. If the server miscalculates and runs out of memory, the entire system crashes. This memory pressure is the primary reason why continuous batching is so difficult to engineer; it is not just a scheduling problem, it is a massive memory allocation problem.
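A rough back-of-the-envelope formula makes that linear scaling concrete. The layer and head counts below are illustrative, in the ballpark of a 7B-parameter model served in 16-bit precision:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Rough KV-cache footprint: two tensors (keys and values) per layer,
    per token, at 2 bytes per value in 16-bit precision."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# Illustrative 7B-class shape: 32 layers, 32 KV heads of dimension 128.
# Sixteen concurrent users at 2,048 tokens of context each:
gib = kv_cache_bytes(32, 32, 128, 2048, 16) / 2**30
print(f"{gib:.1f} GiB")  # -> 16.0 GiB; doubling either knob doubles the total
```

Sixteen gigabytes of cache, before counting the model weights themselves, is why a small scheduling miscalculation can exhaust the GPU's memory.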
This is why continuous batching didn't become the industry standard until frameworks like vLLM introduced advanced memory management techniques (Kwon et al., 2023). By treating the GPU's memory like the virtual memory in a standard computer operating system, these frameworks can chop the KV cache into small fixed-size blocks and store them wherever there is free space, nearly eliminating memory fragmentation and making continuous batching reliable in practice. Instead of requiring a massive, contiguous block of memory for every single user, the system can scatter the memory blocks across the GPU and use a simple lookup table to find them when needed.
This innovation, often referred to as PagedAttention, was the missing puzzle piece. It allowed servers to safely manage the chaotic, unpredictable memory demands of continuous batching without reserving massive amounts of empty safety buffer space. Suddenly, the theoretical gains of iteration-level scheduling could be realized in production environments serving millions of users.
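A toy allocator sketch of the paged approach (the class and method names are invented; real systems track far more state) shows why fragmentation stops being a problem: blocks are fixed-size, non-contiguous, and returned to a shared pool the instant a request finishes:

```python
class PagedKVCache:
    """Toy paged KV-cache allocator, loosely modeled on PagedAttention:
    each sequence maps to a list of non-contiguous fixed-size blocks,
    looked up like pages in an OS page table."""
    def __init__(self, num_blocks, block_size=16):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.block_size = block_size
        self.tables = {}    # sequence id -> list of physical block ids
        self.lengths = {}   # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # current block is full (or none yet)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # A finished request hands every block straight back to the pool.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("req-A")    # 5 tokens -> 2 blocks (4 tokens + 1)
print(len(cache.tables["req-A"]))  # -> 2
cache.release("req-A")
print(len(cache.free))             # -> 8: every block reusable immediately
```

Because any free block will do for any request, the only wasted space is the unused tail of each sequence's final block.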
Chunking the Prefill
There is one final hurdle to making continuous batching work in the real world. Remember how the prefill phase requires a massive amount of math, while the decode phase requires very little?
Imagine a server is happily continuously batching ten requests, generating one word at a time. Suddenly, a new user drops a 100-page PDF into the queue. To add this new request to the batch, the server has to perform the massive prefill calculation for that entire PDF.
Because every request in the batch shares the same forward pass, the ten users who were getting their responses generated word-by-word suddenly experience a massive lag spike while the GPU pauses to read the new user's PDF.
To solve this, modern serving systems use a technique called chunked prefill. Instead of forcing the GPU to read the entire 100-page PDF at once, the server chops the document into smaller chunks. It processes the first chunk, then generates a word for the other ten users. It processes the second chunk, then generates another word for the other users.
By interleaving the heavy prefill math with the light decode memory fetches, the server keeps the GPU perfectly balanced. The new user's prompt is processed smoothly, and the existing users never notice a hiccup in their generation speed.
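The interleaving can be sketched as a simple scheduling plan; the function and its per-pass accounting are a toy model, not a real scheduler:

```python
def interleave_chunked_prefill(prefill_tokens, decode_seqs, chunk_size):
    """Toy schedule: each forward pass carries one chunk of the new prompt
    plus one decode token for every sequence already in flight."""
    plan = []
    remaining = prefill_tokens
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        remaining -= chunk
        plan.append({"prefill_chunk": chunk, "decode_tokens": decode_seqs})
    return plan

# A 2,000-token prompt arrives while 10 users are mid-generation:
plan = interleave_chunked_prefill(2000, decode_seqs=10, chunk_size=512)
print([step["prefill_chunk"] for step in plan])  # -> [512, 512, 512, 464]
# Every one of those four passes also advances all 10 existing users by a token.
```

The chunk size becomes the tuning knob: smaller chunks mean smoother latency for existing users but more passes to absorb the new prompt.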
Chunked prefill is the final evolution of continuous batching (Moon, 2024). It acknowledges that the prefill and decode phases have fundamentally different hardware requirements, and it cleverly mixes them together to ensure neither the calculator nor the memory banks are ever sitting idle. It is a masterclass in resource orchestration, turning the chaotic, unpredictable demands of thousands of users into a smooth, continuous hum of computation.
The Configuration Balancing Act
While continuous batching is incredibly powerful, it is not a magic bullet that works perfectly out of the box. Engineers must carefully configure the system to match the specific traffic patterns of their application.
The two most critical settings are the maximum number of concurrent sequences and the maximum number of batched tokens. The first setting dictates how many different users the server is allowed to juggle at once. If this number is set too low, the GPU will sit idle because it isn't being fed enough work. If it is set too high, the server will try to juggle too many requests, run out of memory for the KV cache, and crash.
The second setting, the maximum number of batched tokens, controls how much total data the server is allowed to process in a single forward pass. This is especially important when using chunked prefill, as it determines exactly how large those chunks can be. If the chunks are too large, the prefill phase will take too long, and the users waiting for their next word will experience a noticeable lag spike. If the chunks are too small, the GPU won't be able to use its massive parallel calculation abilities efficiently.
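For concreteness, here is how these two knobs surface in vLLM, one popular open-source serving stack. This is a hedged configuration sketch, not a recommendation: the model name and the numeric values are illustrative starting points only.

```python
# Requires vLLM installed and a GPU; shown here purely as a config sketch.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=256,              # concurrent sequences the scheduler may juggle
    max_num_batched_tokens=8192,   # total tokens per forward pass; caps chunk size
    enable_chunked_prefill=True,   # interleave prefill chunks with decode steps
)
```

Raising `max_num_seqs` trades KV-cache headroom for concurrency; raising `max_num_batched_tokens` trades per-token latency for prefill throughput.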
Finding the perfect balance between these two settings requires constant monitoring and adjustment. A server handling short, snappy customer service chats will need a very different configuration than a server summarizing massive legal documents. This is why platforms like Sandgarden are so valuable; they abstract away the immense complexity of tuning these low-level hardware parameters, allowing developers to focus on building great applications rather than manually tweaking batch sizes and memory limits. When tuned correctly, however, continuous batching is often the single largest contributor to throughput improvements in modern AI deployments (Belfer, 2026).
The Architecture of Scale
Continuous batching is arguably the single most important engineering breakthrough in making large language models commercially viable. Without it, the cost of running services like ChatGPT or Claude would be astronomically higher, as companies would have to buy significantly more GPUs just to handle the wasted idle time. It is the invisible engine that allows these platforms to offer free tiers to millions of users simultaneously without going bankrupt on server costs.
By treating AI generation not as a series of isolated tasks, but as a fluid, continuous stream of computation, engineers have managed to squeeze every last drop of performance out of the hardware. It is a testament to the fact that in the world of artificial intelligence, the software that manages the model is often just as important as the model itself. As models continue to grow in size and complexity, techniques like continuous batching will only become more critical in the quest to make AI faster, cheaper, and more accessible to everyone.


