Most apps can stay on one GPU longer than you think. Move to multiple GPUs when the model and KV‑cache no longer fit in a single card's VRAM, or when your throughput and latency targets demand it. Parallelism adds communication costs and new failure modes: plan it, test it, and keep your caps tight.
Try Compute today: Launch a vLLM inference server on Compute with 2×, 4×, or 8× GPU presets. Choose France or UAE, stream by default, and keep max_tokens and context caps sensible while you test batch shapes.
When multi‑GPU actually pays off
- Model does not fit. Even with int8, model size (weights + KV‑cache) can exceed a single card’s VRAM.
- Context is long. High concurrency + long prompts/outputs (increased sequence length) push cache past safe headroom.
- Throughput ceiling. You need more tokens/second at the same latency target than one card can deliver.
Tuning the global batch size is important for optimizing throughput and GPU utilization in multi-GPU setups.
If your issue is mostly queueing or oversized caps, fix those first. Multi‑GPU won’t save a bad scheduler.

Modes of parallelism (plain English)
Tensor parallelism. Split individual matrix multiplies across GPUs: each weight matrix is partitioned across the available cards, and the tensor parallel size setting controls how many GPUs share that work. Great for very large models; needs fast links between cards.
Pipeline parallelism. Split the model's layers into stages, with each stage assigned to a different GPU. Micro‑batches flow stage to stage; works best when stages are balanced. The pipeline parallel size setting controls how many stages the model is split into.
Data parallelism. Many replicas handle different requests. Simple and robust; no single request spans GPUs.
Tensor and pipeline parallelism are both forms of model parallelism: the model itself is split across GPUs to fit larger architectures and reduce per‑card memory, each using a different strategy to partition and distribute the computation. Data parallelism, by contrast, keeps a full copy of the model on every replica.
For inference, start with data parallelism (more nodes) before crossing a single request over multiple GPUs.
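As a concrete starting point, here is a minimal sketch of these settings in vLLM; the model name and GPU counts are placeholders, and the parameters assume a recent vLLM release.

```python
# Minimal sketch: tensor parallelism with vLLM (assumes a recent vLLM release;
# the model name and GPU counts are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",   # placeholder model id
    tensor_parallel_size=4,        # split each layer's matmuls across 4 GPUs
    # pipeline_parallel_size=2,    # optionally add pipeline stages across nodes
    gpu_memory_utilization=0.90,   # leave ~10% VRAM headroom
    max_model_len=8192,            # cap context so the KV-cache stays bounded
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The same knobs appear as `--tensor-parallel-size` and `--pipeline-parallel-size` when you serve the model instead of running it offline.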
Pipeline architecture
Pipeline architecture helps you scale large language models across multiple GPUs. Here's what it does: it splits your model into pipeline stages, with each stage running on a different GPU. Instead of cramming your whole model onto one GPU, you partition the layers so each stage handles a chunk of input data, then passes results to the next stage. This lets you train or serve models that are way too big for a single GPU's memory.
Here's the process: your input data gets divided into micro batches that flow through the pipeline. Each pipeline stage processes its micro batch, then hands off results to the next stage. This keeps all your GPUs busy—one might work on the forward pass for one micro batch while another handles the backward pass for a different batch. You want to keep the pipeline full and cut down the "pipeline bubble"—the idle time when GPUs sit around waiting for data from others.
Pipeline parallelism works well for large-scale models where model parameters and activation memory would crush a single GPU. By splitting the model, each GPU only stores and processes a subset of parameters and activations. This makes it possible to work with much larger models and helps balance memory usage and computation across your GPUs.
To set up pipeline parallelism, many frameworks give you a pipeline API—a simple way to split your model into stages, manage micro batches, and handle communication between GPUs. These APIs often support automatic partitioning of model layers, flexible pipeline schedules, and work alongside other parallelism strategies like tensor parallelism and data parallelism. You might combine pipeline parallelism (splitting model layers across GPUs) with tensor parallelism (splitting matrix multiplications within a layer across GPUs) to get the best throughput and efficiency.
Good pipeline architecture focuses on cutting communication overhead and keeping all pipeline stages busy. The less time your GPUs spend waiting for data, the higher your throughput and lower your latency. By carefully splitting your model and tuning your micro batch size, you can shrink the pipeline bubble and get the most from your hardware.
Pipeline architecture gives you a solid way to scale LLMs across multiple GPUs. By breaking your model into pipeline stages and processing input data as micro batches, you can train and serve models that would be too large for a single GPU. Combined with tensor parallelism and data parallelism, pipeline parallelism opens up efficient, large-scale model serving—making it essential for anyone working with big models and big data.
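To make the bubble concrete, here is a small back‑of‑the‑envelope sketch using the common GPipe‑style estimate, where p stages and m micro‑batches leave roughly (p − 1) / (m + p − 1) of each step idle; the stage and micro‑batch counts are illustrative only.

```python
# Back-of-the-envelope pipeline bubble (GPipe-style schedule):
# with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1).
def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 8, 32):
    print(f"4 stages, {m:>2} micro-batches -> "
          f"{bubble_fraction(4, m):.0%} of the step idle")
# 1 micro-batch leaves 75% of GPU time idle; 32 micro-batches leave under 9%.
```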
Memory math & the KV‑cache
- Model weights: roughly 2 bytes/param at FP16; halves at int8; quarters at int4. Model weights are a major contributor to memory usage during training and inference.
- KV‑cache: grows with context × batch. Even if weights fit, cache may not.
- Shard vs replicate: tensor/pipeline can shard state; data parallel replicates it per replica. Sharding saves memory but increases coordination.
Sharding optimizer states and gradients across GPUs, as in ZeRO and similar strategies, can further reduce memory requirements in distributed training. Offloading parts of the model or cache to CPU (system) memory can also help manage memory constraints in large-scale setups.
Rule of thumb: keep VRAM headroom >10–20% at peak or TTFT will creep up.
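For a rough sanity check before provisioning, a sketch like the one below adds weights and KV‑cache and compares the sum against capacity; the layer, head, and card counts are illustrative placeholders, assuming FP16 cache entries.

```python
# Rough VRAM estimate: weights + KV-cache vs. capacity (all figures illustrative).
GiB = 1024 ** 3

def weight_bytes(params_billion, bytes_per_param):
    return params_billion * 1e9 * bytes_per_param   # 2.0 = FP16, 1.0 = int8, 0.5 = int4

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values, per layer, per token, per sequence in the batch
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

weights = weight_bytes(70, 1.0)                                    # 70B model at int8
cache = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                       seq_len=8192, batch=16)                     # FP16 cache
total, capacity = weights + cache, 4 * 80 * GiB                    # 4x ~80 GB cards

print(f"weights {weights / GiB:.0f} GiB + cache {cache / GiB:.0f} GiB "
      f"= {total / GiB:.0f} GiB of {capacity / GiB:.0f} GiB "
      f"({1 - total / capacity:.0%} headroom)")
```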
Batch shapes & scheduling
- Prefer many small decodes over a few huge generations.
- Divide the inputs or input tensors into micro-batches for processing.
- Limit max_tokens per request so decode slots turn over.
- Mix short and long prompts fairly each step (continuous batching).
- Measure TTFT/TPS for each batch shape; the best shape is empirical, not theoretical.
Each micro-batch moves through the pipeline in order. Practical scheduling strategies include round-robin dispatch of micro-batches and partitioning input tensors along the sequence dimension to expose more parallelism.
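As a concrete illustration of tight caps plus streaming, here is a minimal sketch of a streaming request against an OpenAI‑compatible vLLM endpoint; the base URL, API key, and model name are placeholders.

```python
# Minimal sketch: streaming with a tight max_tokens cap against an
# OpenAI-compatible vLLM endpoint (URL, key, and model name are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize pipeline parallelism."}],
    max_tokens=256,   # small cap so decode slots turn over
    stream=True,      # stream so TTFT stays visible to the client
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```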
Topology & interconnects
- Same host, same switch beats cross‑machine hops.
- NVLink from NVIDIA provides high‑bandwidth, low‑latency links between GPUs, which matters most for tensor parallelism and also helps pipeline stages exchange activations. Without NVLink, lean on pipeline parallelism, which communicates less often between cards, to keep overhead manageable.
- NVIDIA GPUDirect RDMA enables direct memory transfers between GPUs across nodes, reducing latency and improving performance in multi‑node setups.
- Pin processes to GPUs; avoid shared PCIe bottlenecks; keep clocks stable (thermals matter).
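One common way to pin replicas to GPUs is setting CUDA_VISIBLE_DEVICES per worker before launch; the sketch below assumes the vllm CLI is installed, and the model name and ports are placeholders.

```python
# One data-parallel replica per GPU via CUDA_VISIBLE_DEVICES
# (assumes the vllm CLI is installed; model name and ports are placeholders).
import os
import subprocess

procs = []
for gpu_id, port in [(0, 8000), (1, 8001)]:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))   # worker sees one GPU
    procs.append(subprocess.Popen(
        ["vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct", "--port", str(port)],
        env=env,
    ))

for p in procs:
    p.wait()
```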
Failure modes & redundancy
- Single‑request across GPUs means one GPU failure can kill the request. Add retries or prefer data parallel replicas for critical paths.
- Hot reloads: drain streams before swapping models.
- OOM cascades: enforce caps and kill stuck streams quickly.
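For critical paths on data‑parallel replicas, a simple client‑side failover loop can mask a single replica loss. This is a generic sketch; the replica URLs and model name are placeholders, not part of any specific deployment.

```python
# Generic failover sketch: retry across data-parallel replicas
# (replica URLs and model name are placeholders).
from openai import OpenAI

REPLICAS = ["http://replica-a:8000/v1", "http://replica-b:8000/v1"]

def complete_with_failover(prompt: str, max_tokens: int = 256) -> str:
    last_error = None
    for base_url in REPLICAS:                        # try each replica in turn
        try:
            client = OpenAI(base_url=base_url, api_key="not-needed", timeout=30)
            resp = client.chat.completions.create(
                model="meta-llama/Llama-3.1-8B-Instruct",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=max_tokens,               # keep the cap tight even on retries
            )
            return resp.choices[0].message.content
        except Exception as err:                     # replica down, timeout, 5xx, ...
            last_error = err
    raise RuntimeError("all replicas failed") from last_error
```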
A testing plan that catches real issues
- Ramp concurrency until TTFT p95 crosses your target; compare 1×, 2×, and 4× GPU counts and note the total GPU count for each run.
- Mixed prompt set (short + long) to surface fairness issues.
- Cancel storms to test KV‑cache reuse and cleanup.
- Node loss: kill one process or GPU, and repeat on different GPUs, to verify in‑flight behavior and failover robustness.
- Hot swap model or quantization; compare TTFT/TPS and quality.
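A minimal ramp harness along these lines measures TTFT at increasing concurrency using the async OpenAI client; the endpoint, model, prompt, and concurrency levels are placeholders you would adapt to your own targets.

```python
# Minimal concurrency ramp: measure TTFT p95 against an OpenAI-compatible
# endpoint (base URL, model name, and prompt are placeholders).
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def one_request() -> float:
    start = time.perf_counter()
    ttft = None
    stream = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        max_tokens=64,
        stream=True,
    )
    async for _ in stream:                       # consume the full stream
        if ttft is None:
            ttft = time.perf_counter() - start   # first chunk = TTFT
    return ttft if ttft is not None else time.perf_counter() - start

async def ramp():
    for concurrency in (1, 4, 16, 64):
        ttfts = sorted(await asyncio.gather(
            *(one_request() for _ in range(concurrency))))
        p95 = ttfts[int(0.95 * (len(ttfts) - 1))]
        print(f"concurrency {concurrency:>3}: TTFT p95 = {p95:.2f}s")

asyncio.run(ramp())
```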
Monitoring that matters
- TTFT p50/p95, TPS p50/p95
- Active batch size, queue length
- GPU mem headroom per GPU, cache hit/miss
- Inter‑GPU link utilization (if available)
- Error rates by type (OOM, timeouts, 5xx)
- Distribution of model parameters and partitions across GPUs, to spot load imbalance
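For the per‑GPU headroom item, a small NVML poll is often enough; this sketch assumes the pynvml (nvidia-ml-py) package and applies the 10% headroom rule of thumb from the memory section.

```python
# Sketch: per-GPU VRAM headroom and utilization via NVML
# (assumes the pynvml / nvidia-ml-py package is installed).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    headroom = mem.free / mem.total
    print(f"GPU {i}: {headroom:.0%} VRAM free, {util.gpu}% busy")
    if headroom < 0.10:          # rule of thumb from the memory section above
        print(f"  warning: GPU {i} is below 10% headroom; expect TTFT creep")
pynvml.nvmlShutdown()
```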
Try Compute today: Scale out on Compute with multi‑GPU presets and an OpenAI‑compatible vLLM server. Keep endpoints close to users and measure before you upgrade hardware.
Scale beyond one GPU without breaking latency
Use multi‑GPU when the model or cache won’t fit, or when you need more tokens per second at the same latency. Prefer data‑parallel replicas first; reach for tensor/pipeline only when you must. Keep caps tight, stream, place nodes close to users, and let TTFT/TPS guide your next step.
FAQ
Do I need NVLink for tensor parallelism?
It helps a lot. Without a fast interconnect, communication can erase gains from splitting layers.
What should I try first: more GPUs or more nodes?
Try more nodes (data parallel) first. It is simpler, isolates failures, and scales well for many workloads.
Why did latency get worse after going multi‑GPU?
Likely communication overhead or a batch shape that triggers cache pressure. Check interconnect bandwidth, trim caps, and re‑measure.
Can multi‑GPU help with long context?
Yes, by spreading memory across cards. But also consider RAG and quantization before adding complexity.
How do I know it’s time to upgrade?
When TTFT p95 rises and TPS flattens at steady traffic despite clean caps and healthy memory headroom on one GPU.
What is the role of the embedding layer in pipeline parallelism?
The embedding layer maps input vocabulary to hidden states. In pipeline parallelism, the embedding layer is often placed at the start of the pipeline and may be tied or shared across model stages to ensure consistency and efficiency.
How are transformer blocks and transformer layers distributed across GPUs?
Transformer blocks and transformer layers are split across GPUs in pipeline and tensor parallelism. Each GPU processes a subset of these layers, allowing the model to scale efficiently and handle larger architectures.
How are expert layers distributed in Mixture of Experts (MoE) models?
Expert layers in MoE architectures are distributed across multiple GPUs. This distribution enables parallel computation of different experts, improving scalability and computational efficiency during training and inference.
What are the challenges of training LLM with large activation memory?
Training LLMs (large language models) requires managing significant activation memory. Specialized frameworks like NeMo help distribute activation data and optimize memory use, which is critical for efficient multi-GPU training.
How do the sequence and sequence dimension affect parallelism strategies?
Parallelism strategies like sequence parallelism partition and distribute activation data along the sequence dimension. This allows efficient handling of long input sequences and better utilization of GPU memory and compute resources.
What does linear propagation of data through pipeline stages mean?
Linear propagation means data moves sequentially through each pipeline stage, with each stage inferring shapes and processing outputs in order, without skip connections or complex routing.
How is pipeline parallelism implemented in popular frameworks?
Pipeline parallelism is implemented in frameworks like Megatron-LM and DeepSpeed by integrating with Data Parallelism (DP), Tensor Parallelism (TP), ZeRO, and various pipeline schedules. These frameworks provide practical configurations and codebases for deploying pipeline parallelism effectively.