Model Sharding: Splitting a Large Model Across Multiple Devices to Fit in Memory

Model sharding is the practice of dividing a massive artificial intelligence model into smaller, manageable pieces (called shards) and distributing them across multiple computer chips or storage drives. Rather than forcing a single graphics processing unit (GPU) to hold the entire model in its memory, sharding allows a cluster of chips to collectively hold the model, making it possible to train and run AI systems that are hundreds of times larger than any single piece of hardware could support.

Model sharding is the practice of dividing a massive artificial intelligence model into smaller, manageable pieces (called shards) and distributing them across multiple computer chips or storage drives. Rather than forcing a single graphics processing unit (GPU) to hold the entire model in its memory, sharding allows a cluster of chips to collectively hold the model, making it possible to train and run AI systems that are hundreds of times larger than any single piece of hardware could support.

The term itself is a fascinating piece of engineering etymology. It originally comes from database architecture, where engineers in the early 2000s needed a way to split massive, monolithic databases across multiple servers. The legend goes that the term was popularized by the developers of the multiplayer game Ultima Online, who needed to split their player base across different servers and called them "shards" of a shattered crystal. Today, that same concept is the only reason we can build models with hundreds of billions of parameters.

To understand why sharding is so critical, we have to look at the brutal math of modern AI development. A state-of-the-art GPU like an NVIDIA A100 has 80 gigabytes of high-speed memory. A model like GPT-3, with 175 billion parameters, requires roughly 700 gigabytes of memory just to store its weights in standard precision. That is before you even account for the memory needed to actually run data through the model. The math simply does not work. You cannot fit a 700-gigabyte model into an 80-gigabyte box.

Sharding solves this physics problem by breaking the box. Instead of trying to cram the whole model onto one chip, the system slices the model's data into fractions. If you have ten GPUs, each GPU holds exactly one-tenth of the model. When the system needs to perform a calculation, the GPUs rapidly communicate over high-speed network cables, sharing just enough information to complete the math before deleting the temporary data to free up space.

This approach has completely rewritten the economics of AI infrastructure. It allows researchers to treat a cluster of hundreds of GPUs as if it were one giant, unified supercomputer with terabytes of shared memory. Without model sharding, the current era of massive language models would have ended years ago, hard-capped by the physical limits of silicon manufacturing.

‍

The Redundancy Problem

To appreciate the elegance of modern model sharding, you have to understand the incredibly inefficient system it replaced. For years, the standard way to train an AI model across multiple GPUs was a technique called data parallelism.

In standard data parallelism, every single GPU in your cluster holds a complete, identical copy of the entire model. If you have eight GPUs, you have eight full copies of the model. You then take your massive dataset, split it into eight chunks, and send one chunk to each GPU. The GPUs process their data independently, calculate how the model should be updated, and then talk to each other to average out their findings. They all update their identical copies of the model, and the cycle repeats.

This works brilliantly for small models. But as models grew larger, engineers realized they were wasting a staggering amount of memory.

During training, a GPU doesn't just have to store the model's parameters (the weights). It also has to store the gradients (the calculations of how wrong the model was during its last guess) and the optimizer states (the historical tracking data that helps the model learn smoothly). For a standard optimizer like Adam, these extra states take up two to three times as much memory as the parameters themselves.

If you are training a 10-billion-parameter model on 64 GPUs using standard data parallelism, you are storing 64 identical copies of the parameters, 64 identical copies of the gradients, and 64 identical copies of the optimizer states. You are paying for thousands of gigabytes of premium GPU memory just to store redundant data. This realization led researchers at Microsoft to develop a breakthrough technique that changed the trajectory of AI scaling (Rajbhandari et al., 2020).

‍

The ZeRO Breakthrough

In 2020, the DeepSpeed team at Microsoft introduced the Zero Redundancy Optimizer, universally known as ZeRO. Their insight was simple but profound: instead of every GPU holding a full copy of all the training states, what if we sharded those states across the GPUs?

‍ZeRO was introduced in three progressive stages, each one slicing away a different layer of redundancy.

In Stage 1, the system shards only the optimizer states. Every GPU still holds a full copy of the model parameters and the gradients, but it only holds a fraction of the optimizer tracking data. Because optimizer states are the biggest memory hogs in the training process, this simple change immediately frees up massive amounts of memory with almost no communication penalty.

In Stage 2, the system goes further and shards the gradients as well. Now, each GPU only holds the gradients corresponding to its specific shard of the optimizer states. When a GPU finishes its backward pass calculation, it sends its gradients to the specific GPU responsible for that shard, rather than broadcasting them to everyone. This cuts the memory footprint again, allowing for even larger models.

Stage 3 is the holy grail: fully sharded data parallelism. In this stage, the model parameters themselves are sharded. No single GPU holds the full model. Instead, each GPU holds only a tiny fraction of the parameters, gradients, and optimizer states.

When a specific layer of the model needs to be processed, the GPUs rapidly broadcast their parameter shards to each other in an operation called an all-gather. For a brief fraction of a second, the full layer is reconstructed in memory. The calculation is performed, and then the full layer is immediately deleted, leaving only the shards behind. This dynamic reconstruction and deletion happens continuously as data flows through the network.

***ZeRO Sharding Stages: Memory vs. Communication Trade-offs***
Sharding Strategy	What is Sharded?	Memory Reduction	Communication Overhead
Standard Data Parallel	Nothing (Full replication)	None (Baseline)	Low (Only gradients synced)
ZeRO Stage 1	Optimizer States	Up to 4x reduction	Low (Same as baseline)
ZeRO Stage 2	Optimizer States + Gradients	Up to 8x reduction	Medium (Gradient routing)
ZeRO Stage 3	Optimizer States + Gradients + Parameters	Up to 64x reduction (on 64 GPUs)	High (Constant parameter gathering)

‍

This approach was so effective that Meta subsequently built it directly into the core of the PyTorch framework as Fully Sharded Data Parallel, or FSDP (Zhao et al., 2022). Today, almost every massive language model in the world is trained using some variation of this fully sharded architecture.

‍

Breaking the GPU Barrier

Even with fully sharded data parallelism, you eventually hit a wall. If you want to train a model with a trillion parameters, even sharding the model across hundreds of GPUs might not leave enough memory to actually process the data. GPU memory is incredibly fast, but it is also incredibly expensive and strictly limited in capacity.

To solve this, engineers looked beyond the GPU entirely. A modern AI server doesn't just have GPUs; it also has massive amounts of standard system RAM (CPU memory) and terabytes of high-speed solid-state storage (NVMe drives).

The problem is speed. GPU memory can transfer data at over 2,000 gigabytes per second. System RAM transfers data at maybe 200 gigabytes per second. An NVMe drive transfers data at about 10 gigabytes per second. If you try to run an AI model directly off an NVMe drive, the GPUs will spend 99 percent of their time sitting idle, waiting for the data to arrive.

The solution is a technique called ZeRO-Infinity, which extends model sharding across the entire hardware hierarchy (Microsoft Research, 2021). Instead of just sharding the model across GPU memory, ZeRO-Infinity shards the model across GPU memory, CPU memory, and NVMe storage simultaneously.

To hide the massive speed difference, the system uses aggressive prefetching. While the GPU is crunching the math for layer 10, the system is already pulling the shards for layer 11 out of the NVMe drives, moving them into CPU memory, and staging them for transfer to the GPU. By the time the GPU finishes layer 10, layer 11 is waiting and ready to go.

This hierarchical sharding allows researchers to train models of truly staggering scale. Using ZeRO-Infinity, a cluster of 512 GPUs can train a model with over 30 trillion parameters—a scale that was considered science fiction just a few years ago. More importantly, it democratizes access to large models. With NVMe sharding, a researcher with a single high-end GPU can fine-tune a 175-billion-parameter model, simply by using their cheap solid-state storage as overflow memory.

‍

Sharding for Inference

While most of the complex engineering around model sharding focuses on the training phase, sharding is equally critical when it comes time to actually use the model—a phase known as inference.

During inference, you don't need to store gradients or optimizer states, which means the memory footprint is significantly smaller. However, a 70-billion-parameter model still requires roughly 140 gigabytes of memory just to load the weights in 16-bit precision. You still cannot fit that onto a single 80-gigabyte GPU.

For inference, engineers typically rely on a different flavor of sharding called tensor parallelism. Instead of sharding the model layer by layer, or sharding the optimizer states, tensor parallelism takes the massive mathematical matrices inside the model and physically chops them in half.

If you have a matrix multiplication that is too large for one GPU, you split the matrix down the middle. GPU 1 calculates the left half, GPU 2 calculates the right half, and they combine their answers at the end. This requires an immense amount of communication between the chips, which is why it is usually restricted to GPUs that are physically bolted together inside the same server box using ultra-fast interconnects like NVLink.

Modern inference servers, such as vLLM, handle this sharding automatically (Kwon et al., 2023). When you boot up a massive model, you simply tell the server how many GPUs you have available. The software automatically calculates the optimal way to shard the matrices, distributes the shards across the chips, and orchestrates the split-second communication required to generate text.

As models continue to grow, the lines between these different sharding techniques are beginning to blur. The largest training runs today use a combination of fully sharded data parallelism to distribute the optimizer states, tensor parallelism to shard the individual matrices, and pipeline parallelism to slice the model into sequential stages. Managing all of this simultaneously requires robust orchestration software — tools that ensure shards are correctly distributed, communication pathways stay open, and the system can recover gracefully when a single node fails. Fault tolerance, once an afterthought, has become a first-class engineering concern.

‍

Who Actually Benefits

The most underappreciated consequence of model sharding is what it has done for access. Before ZeRO and FSDP, training a frontier model was the exclusive domain of organizations with nine-figure hardware budgets. After them, a team with a modest GPU cluster — or even a single well-equipped workstation — can fine-tune a 70-billion-parameter model using NVMe offloading.

Open-source research communities have taken full advantage of this shift. Groups like EleutherAI and LAION have used sharding techniques to train and release powerful language models to the public, demonstrating that serious AI research does not require a private data center (IBM Developer, 2024). Some AI‑agent tools adopt a philosophy that the complexity of building AI‑powered software should be handled by agents rather than the developer.

The communication overhead between shards remains the primary bottleneck that researchers are working to close. Advances in high-speed interconnects — including optical networking and silicon photonics — are already reducing the latency penalty of all-gather operations. As those gaps narrow, the practical ceiling for sharded models will continue to rise.