How LLM Serving Turns Raw AI Power into a Fast, Efficient Restaurant

If an LLM Server is the kitchen—the hardware, the GPUs, the raw cooking power—then LLM Serving is the art and science of running the entire restaurant. It's the complex, high-stakes discipline of taking a massive, powerful language model and making it available to thousands or even millions of users simultaneously. It's about ensuring that every customer gets their answer quickly, that the kitchen doesn't get overwhelmed, and that the whole operation doesn't go bankrupt from the cost of expensive ingredients (in this case, GPU cycles). It's not enough to just have a state-of-the-art oven; serving is what turns a brilliant recipe into a production-grade service that can power chatbots, code assistants, and search engines at scale.

LLM serving is a battle against the two fundamental bottlenecks of the transformer architecture: memory bandwidth and computational cost. Every time an LLM generates a single token, it has to perform a massive calculation that involves reading the model's weights and the ever-growing Key-Value (KV) cache from the GPU's memory. This process is inherently sequential and creates a traffic jam of requests, leading to high latency and underutilized hardware. The entire field of LLM serving is dedicated to finding clever ways to break these bottlenecks, and the innovations of the last few years have been genuinely remarkable (Zhou et al., 2024).

Solving the Memory Bottleneck with PagedAttention

The biggest headache in LLM serving is managing the KV cache. This is the model's short-term memory, where it stores the intermediate attention calculations (the Keys and Values) for every token in a sequence. Without it, the model would have to recompute the entire history for every new token, which would be impossibly slow. But the KV cache itself creates a significant memory problem. Early serving systems would pre-allocate a single, contiguous block of GPU memory for each incoming request, large enough to hold the KV cache for the maximum possible sequence length (e.g., 32,000 tokens). If a user only asks for a 100-token response, the other 31,900 token-slots of memory sit empty and unusable by any other request. This wasted space, known as internal fragmentation, meant that even the most powerful GPUs could only handle a handful of concurrent users, with studies showing that 60-80% of the reserved memory was often wasted (Kwon et al., 2023).
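To make the waste concrete, here is a back-of-the-envelope sketch of that scenario. The layer count, head count, and head dimension are illustrative assumptions, not any particular model's configuration:

```python
# Illustrative arithmetic only: how much KV-cache memory contiguous
# pre-allocation wastes. Model dimensions here are made-up example values.

def kv_cache_bytes(num_tokens, num_layers=32, num_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to store Keys AND Values for num_tokens tokens."""
    per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value
    return num_tokens * per_token

reserved = kv_cache_bytes(32_000)   # pre-allocated for the maximum length
used = kv_cache_bytes(100)          # what a 100-token response actually needs
wasted_fraction = 1 - used / reserved
print(f"reserved: {reserved / 1e9:.1f} GB, wasted: {wasted_fraction:.1%}")
```

For this single worst-case request, nearly all of the reservation sits idle; the 60-80% figure from the vLLM paper is what this averages out to across real workloads.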

Enter PagedAttention, the groundbreaking innovation from the vLLM project at UC Berkeley. Inspired by virtual memory and paging in classical operating systems, PagedAttention divides the KV cache into smaller, fixed-size, non-contiguous blocks. Instead of reserving a huge chunk of memory upfront, the system allocates these blocks on demand as the sequence grows. This simple but brilliant idea virtually eliminates memory fragmentation, reducing wasted memory to under 4% in practice. The result is that the system can pack far more concurrent requests onto a single GPU, improving throughput by 2-4x compared to previous state-of-the-art systems like FasterTransformer and Orca (Kwon et al., 2023).
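The core bookkeeping idea can be sketched in a few lines. This is a toy allocator, not vLLM's actual implementation; the block size and pool size are arbitrary example values:

```python
# Toy sketch of PagedAttention-style allocation: physical KV-cache blocks
# are handed out on demand as a sequence grows, and returned to the free
# pool the moment a request finishes.

class PagedKVCache:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> tokens written so far

    def append_token(self, req):
        """Allocate a new physical block only when the last one is full."""
        n = self.lengths.get(req, 0)
        table = self.block_tables.setdefault(req, [])
        if n % self.block_size == 0:        # current block full (or first token)
            table.append(self.free_blocks.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        """Return a finished request's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(req, []))
        self.lengths.pop(req, None)

cache = PagedKVCache(num_blocks=64, block_size=16)
for _ in range(100):
    cache.append_token("req-A")             # a 100-token sequence
print(len(cache.block_tables["req-A"]))     # 7 blocks: ceil(100 / 16)
```

The 100-token request holds exactly seven 16-token blocks instead of a 32,000-token reservation, which is where the sub-4% waste figure comes from.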

PagedAttention also enables a powerful secondary optimization: prefix caching. When many requests share a common prefix—like a long system prompt in a multi-turn chatbot—the KV cache blocks for that prefix can be computed once and then shared across all those requests. This dramatically reduces the time-to-first-token for subsequent requests that share the same context, which is a huge win for applications like customer service bots or code assistants that use the same detailed instructions for every user.
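A minimal sketch of the sharing logic, assuming a dictionary keyed on the prefix tokens (an illustrative stand-in, not vLLM's actual data structure):

```python
# Toy sketch of prefix caching layered on paged blocks: requests sharing
# the same prompt prefix reuse already-computed KV block ids instead of
# re-running the expensive prefill.

prefix_cache = {}          # prefix tokens (as a tuple) -> cached block ids
prefill_calls = []         # track how often the expensive path actually runs

def expensive_prefill(tokens):
    prefill_calls.append(tokens)
    return list(range(len(tokens)))        # stand-in for real KV blocks

def blocks_for_prompt(tokens):
    key = tuple(tokens)
    if key not in prefix_cache:            # miss: pay for prefill once
        prefix_cache[key] = expensive_prefill(tokens)
    return prefix_cache[key]               # hit: time-to-first-token shrinks

system_prompt = ["You", "are", "a", "helpful", "assistant."]
first = blocks_for_prompt(system_prompt)
second = blocks_for_prompt(system_prompt)  # second request shares the blocks
assert first is second and len(prefill_calls) == 1
```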

Keeping the GPU Busy with Continuous Batching

Another major innovation that works hand-in-hand with PagedAttention is continuous batching. In the old world of static batching, the server would group a bunch of requests together, send them to the GPU, and then wait for every single one to finish before starting the next batch. This created a "head-of-line blocking" problem, where a short, quick request could get stuck waiting for a long, complex one to complete. The GPU, a multi-thousand dollar piece of hardware, was left idle, waiting for the slowest request to finish before it could start on anything new.

Continuous batching, also pioneered by vLLM and adopted by other frameworks like TensorRT-LLM (where it's called "in-flight batching"), solves this by making scheduling decisions at the iteration level. As soon as a single sequence in the batch is finished, the scheduler immediately frees up its KV cache blocks and slots in a new request from the queue. The GPU is never left idle as long as there are requests waiting, which smooths out performance and maximizes hardware utilization. The practical impact is enormous: continuous batching has been shown to deliver up to 24x higher throughput than naive static batching systems, according to benchmarks from Clarifai's analysis of the serving framework landscape (Clarifai, 2026).
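The scheduling difference is easiest to see in a toy simulation. This sketch models each iteration as one decode step per active sequence, with eviction and admission happening between steps (the slot count and request lengths are arbitrary examples):

```python
# Minimal sketch of iteration-level (continuous) batching. Finished
# sequences are evicted and queued requests admitted between decode steps,
# so the batch never drains while work is waiting.

from collections import deque

def serve(requests, max_batch=4):
    """requests: list of (request_id, tokens_to_generate)."""
    queue = deque(requests)
    active = {}                       # request id -> tokens still to generate
    timeline = []                     # batch contents at each iteration
    while queue or active:
        while queue and len(active) < max_batch:    # admit new work now
            rid, n = queue.popleft()
            active[rid] = n
        timeline.append(sorted(active))
        for rid in list(active):                    # one decode step each
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]                     # free its slot immediately
    return timeline

steps = serve([("short", 2), ("long", 6), ("late", 2)], max_batch=2)
print(len(steps))   # 6 iterations; static batching would need 8 for this mix
```

The short request finishes at step 2 and "late" slides straight into its slot, while a static batcher would have held that slot empty until "long" finished.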

Putting Idle Cores to Work with Speculative Decoding

Even with PagedAttention and continuous batching, the fundamental autoregressive nature of LLMs—generating one token at a time—imposes a hard limit on latency. The time it takes to generate each token is dominated by the time it takes to read the model weights from the GPU's high-bandwidth memory (HBM), a memory-bandwidth-bound operation. During this time, the powerful compute cores of the GPU are mostly sitting idle, waiting for the data to arrive. The GPU is like a Formula 1 pit crew standing around while they wait for a single lug nut to be delivered.

Speculative decoding is a clever technique that puts those idle cores to work. It uses a much smaller, faster "draft" model to generate a short sequence of candidate tokens cheaply. Then, the large, powerful "target" model verifies all of these candidate tokens in a single, parallel forward pass—a compute-bound operation that makes full use of the GPU's cores. If the draft model's predictions were correct, the system has effectively generated several tokens for the memory cost of one, dramatically reducing latency. If the predictions were wrong, the system simply discards the incorrect tokens and continues from the last known good one. This technique has been shown to speed up inference by 2-3x in many cases, and the key insight is that the draft model's latency, not its raw accuracy, is the most important factor in determining the speedup (Yan et al., 2024).
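The draft-then-verify loop can be sketched with toy "models" that are just lookups over a fixed string, so the draft sometimes guesses wrong. This shows the greedy accept/reject mechanics only, not the full rejection-sampling scheme used with sampled outputs:

```python
# Toy sketch of speculative decoding with greedy verification. The "draft"
# is deliberately wrong on vowels, so some drafted tokens get rejected.

target_text = list("the quick brown fox")

def target_next(pos):                 # big model: always correct (greedy)
    return target_text[pos]

def draft_next(pos):                  # small model: cheap but fallible
    tok = target_text[pos]
    return "?" if tok in "aeiou" else tok

def speculative_decode(k=4):
    out = []
    while len(out) < len(target_text):
        drafted = []
        for i in range(k):            # cheap: k sequential draft steps
            pos = len(out) + len(drafted)
            if pos >= len(target_text):
                break
            drafted.append(draft_next(pos))
        # expensive: ONE parallel verify pass scores every drafted position
        accepted = 0
        for i, tok in enumerate(drafted):
            if tok == target_next(len(out) + i):
                accepted += 1
            else:
                break                 # reject everything after first mismatch
        out.extend(drafted[:accepted])
        if len(out) < len(target_text):
            out.append(target_next(len(out)))  # target's own token, for free
    return "".join(out)

assert speculative_decode() == "the quick brown fox"
```

The output is always exactly what the target model alone would have produced; the only thing that changes is how many memory-bound passes it takes to get there.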

Google Cloud has highlighted speculative decoding as one of the most impactful optimizations available for latency-sensitive workloads, noting that it "directly breaks the time-between-tokens floor set by memory bandwidth" (Google Cloud, 2026).

Splitting the Work Across Multiple GPUs

For the largest models—the so-called "frontier" models with hundreds of billions of parameters—a single GPU is simply not enough to hold the model's weights, let alone serve it efficiently. This is where model parallelism comes in, allowing a single model to be spread across multiple GPUs working in concert.

Tensor Parallelism splits the individual mathematical operations within a model layer (like a massive matrix multiplication) across several GPUs. Each GPU computes its slice of the operation, and the results are combined at the end. This approach is excellent for reducing the latency of each individual request because the work gets done faster in parallel. However, it requires an enormous amount of high-speed communication between the GPUs (via interconnects like NVIDIA's NVLink) because the GPUs need to constantly share partial results to keep their work synchronized. Performance tends to scale sub-linearly as more GPUs are added, meaning there's a point of diminishing returns where the communication overhead starts to outweigh the parallelism benefits (Databricks, 2023).
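The column-wise split behind this can be illustrated with plain Python lists standing in for GPU shards (a conceptual sketch only; real implementations shard tensors on device and all-gather over NVLink):

```python
# Toy illustration of column-wise tensor parallelism: each "device" holds
# a slice of the weight matrix, computes its partial output independently,
# and the slices are concatenated -- the gather step that makes tensor
# parallelism so communication-heavy in practice.

def matvec(x, W):
    """y = x @ W for a row vector x and matrix W (list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]

W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1]

W_dev0 = [row[:2] for row in W]     # columns 0-1 live on "GPU 0"
W_dev1 = [row[2:] for row in W]     # columns 2-3 live on "GPU 1"

# Each device computes its output slice, then results are gathered.
y_parallel = matvec(x, W_dev0) + matvec(x, W_dev1)
assert y_parallel == matvec(x, W)   # identical to the single-device result
```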

Pipeline Parallelism takes a different approach, assigning different layers of the model to different GPUs. GPU 1 might handle layers 1-10, GPU 2 handles layers 11-20, and so on. The input data flows through this pipeline, with each GPU performing its designated operations before passing the result to the next. This strategy is better for increasing overall throughput, as multiple different requests can be in the pipeline at the same time, each at a different stage. The downside is the potential for "bubbles"—moments when GPUs are idle, waiting for the previous stage to finish. In practice, most large-scale deployments use a hybrid of both strategies, using tensor parallelism within a single node (where GPUs are connected by fast NVLink) and pipeline parallelism across nodes (where the communication is slower).
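The bubble problem is visible even in a toy schedule. This sketch assumes every stage takes exactly one time step per request, which real workloads never do, but it shows where the idle slots come from:

```python
# Toy schedule for a 2-stage pipeline: once the pipeline fills, both
# "GPUs" work on different requests in the same time step; the ramp-up
# and drain steps are the "bubbles" where a stage sits idle.

def pipeline_schedule(num_requests, num_stages=2):
    """Return, per time step, which request each stage is processing."""
    steps, t = [], 0
    while True:
        row = []
        for stage in range(num_stages):
            req = t - stage          # request reaching this stage at time t
            row.append(req if 0 <= req < num_requests else None)
        if all(r is None for r in row) and t > 0:
            break
        steps.append(row)
        t += 1
    return steps

sched = pipeline_schedule(3)
# step 0: [0, None]   <- bubble: stage 2 idle while the pipeline fills
# step 1: [1, 0]      <- steady state: both stages busy on different requests
# step 2: [2, 1]
# step 3: [None, 2]   <- bubble again as the pipeline drains
```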

A newer and increasingly important technique is prefill-decode disaggregation. The LLM inference process has two distinct phases: the prefill phase, where the model processes the entire input prompt in one shot (compute-intensive), and the decode phase, where it generates the output tokens one by one (memory-bandwidth-intensive). These two phases have very different hardware requirements, and forcing them to share the same GPU means one resource is always underutilized. Disaggregation physically separates these phases onto different hardware—dedicated prefill clusters and decode clusters—allowing each to be optimized independently. This is one of the most architecturally significant optimizations in modern serving infrastructure (Google Cloud, 2026).

Shrinking the Model with Quantization

No discussion of LLM serving would be complete without quantization. Model weights are typically stored in 16-bit floating-point format (FP16 or BF16), but they can be compressed to 8-bit integers (INT8) or even 4-bit integers (INT4) with only a modest impact on output quality. Since the decode phase is memory-bandwidth-bound, moving less data from memory to the compute cores directly speeds up token generation. A model with 7 billion parameters in FP16 takes roughly 14 GB of GPU memory; in INT4, that drops to around 3.5 GB, which means it can fit on a much smaller and cheaper GPU.
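The memory arithmetic behind those numbers, plus a minimal per-tensor symmetric round trip showing why quantization is lossy but usually tolerable (the weight values are made-up examples):

```python
# Bytes per parameter at each precision times parameter count gives the
# weight footprint; a single shared scale maps floats onto INT8's range.

BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params, fmt):
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

print(weight_memory_gb(7e9, "FP16"))   # 14.0 GB
print(weight_memory_gb(7e9, "INT4"))   # 3.5 GB

def quantize_int8(values):
    """Map floats onto [-127, 127] with one per-tensor scale."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale

weights = [0.42, -1.5, 0.03, 0.9]
q, scale = quantize_int8(weights)
restored = [qi * scale for qi in q]    # close to, but not exactly, weights
```

Each restored weight is off by at most half a quantization step, which is the rounding error that GPTQ- and AWQ-style methods work to keep away from the weights that matter most.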

The trade-off is accuracy. Aggressive quantization can degrade the model's ability to reason carefully or follow complex instructions. Techniques like GPTQ and AWQ (Activation-aware Weight Quantization) have been developed to minimize this quality loss by intelligently choosing which weights to quantize and by how much. Modern serving frameworks like vLLM support a wide range of quantization formats natively, making it straightforward to deploy a quantized model without a complex compilation step.

The Serving Framework Landscape

A handful of powerful open-source frameworks have emerged as the leaders in the LLM serving space, each embodying a different philosophy and set of trade-offs. Choosing between them is one of the most consequential decisions a machine learning engineering team will make.

vLLM
Key strengths: highest throughput, ease of use, wide community support, broad quantization format support.
Best for: high-concurrency applications like chatbots and RAG where maximizing GPU utilization is the top priority.

TensorRT-LLM & Triton
Key strengths: lowest latency, enterprise-grade features, deep NVIDIA hardware optimization.
Best for: latency-sensitive applications and large-scale production environments where performance and control are paramount.

Hugging Face TGI
Key strengths: broad model support, ease of use, tight Hugging Face ecosystem integration.
Best for: teams already invested in the Hugging Face ecosystem, especially for newer or less common models.

vLLM is often the first choice for teams looking for maximum throughput and ease of use. Its Python-native design and OpenAI-compatible API make it incredibly simple to get started, and its active open-source community means it tends to support new models and techniques quickly. TensorRT-LLM, running on NVIDIA's Triton Inference Server, is the choice for enterprises that need the absolute lowest latency. It achieves this by compiling a hardware-specific inference engine for the target GPU, squeezing out every last drop of performance. The trade-off is a more complex setup and a longer compilation step whenever the model or hardware changes. Hugging Face's Text Generation Inference (TGI) rounds out the top three with its broad model support and tight integration with the Hugging Face Hub, making it the natural choice for teams already living in that ecosystem.
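Because vLLM speaks the OpenAI wire format, querying a deployed server is just an HTTP POST. The sketch below only builds the request body; the model name is a placeholder for whatever model you actually served, and the endpoint in the comment assumes vLLM's default local port:

```python
# Building an OpenAI-compatible completion request for a vLLM server.
# Send the body with any HTTP client or OpenAI-style SDK.

import json

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",   # your served model here
    "prompt": "Explain PagedAttention in one sentence.",
    "max_tokens": 64,
    "temperature": 0.2,
}
body = json.dumps(payload)
# e.g.  curl http://localhost:8000/v1/completions \
#            -H "Content-Type: application/json" -d "$body"
```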

Putting It All Together

The art of LLM serving is about understanding these trade-offs and choosing the right combination of techniques and tools for the job. A production serving stack for a high-traffic chatbot might combine vLLM with PagedAttention and continuous batching, speculative decoding for low-latency responses, INT4 quantization to reduce memory pressure, and tensor parallelism across multiple GPUs to handle peak load. Getting all of these pieces to work together efficiently is a genuine engineering challenge.

It's a rapidly evolving field where new optimizations and frameworks are constantly emerging. Platforms like Sandgarden aim to abstract away much of this complexity, allowing teams to deploy and manage different serving engines and models without having to become deep experts in GPU memory management or scheduling algorithms. By providing a unified layer for inference, Sandgarden helps organizations focus on building their applications, confident that the underlying serving infrastructure is running as efficiently as possible. As models continue to grow and applications become more demanding, the discipline of LLM serving will only become more critical—ensuring that the power of large language models can be delivered to the world quickly, reliably, and without burning through a GPU budget the size of a small country's GDP.