
Continuous batching for LLM inference, explained

Real traffic is messy. New requests arrive while others are mid‑generation. Continuous batching matters most for large language model (LLM) inference and other generative workloads in production, where serving efficiency and GPU utilization are critical. If your server waits for a full batch to finish before it starts the next one, GPUs sit idle and users wait.

Continuous batching keeps the queue moving so the GPU rarely pauses, which is what efficient text generation needs. Published benchmarks report throughput improvements of up to 23x over naive batching for LLM inference. It works by scheduling at the token level rather than the request level, so every pass over the model weights serves all active sequences instead of one fixed batch.

The idea has an analogue in continuous manufacturing, where systems process material around the clock to maximize output: the line keeps moving, and the next batch is staged while the current one is still being processed. In a mixing plant, for instance, ingredients for the next batch are measured into an upstream hopper above the mixer so it is ready the moment the previous batch finishes. LLM serving applies the same idea to requests and GPU time.

Prefill vs decode

Prefill is the first pass, where the model reads the entire input sequence (the prompt) and builds the key/value (KV) cache. Decode is the step‑by‑step generation that produces one new token per step for each sequence.

During prefill, input tokens are processed in parallel; during decode, each sequence advances one token at a time. Prefill likes big parallel work; decode benefits from many small steps packed together. Good schedulers treat them differently.
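
A minimal sketch of the two phases, using a toy single‑head attention in NumPy. The shapes and names here are illustrative and do not come from any particular framework:

```python
import numpy as np

d = 64                                         # toy head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    # scaled dot-product attention for one query against the whole cache
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Prefill: process all prompt tokens in one parallel pass and build the KV cache.
prompt = rng.normal(size=(12, d))              # 12 prompt-token embeddings
K_cache = prompt @ Wk                          # (12, d) keys, one big matmul
V_cache = prompt @ Wv                          # (12, d) values

# Decode: one token per step; each step appends a single K/V row to the cache.
x = prompt[-1]                                 # stand-in for the last hidden state
for _ in range(4):
    q = x @ Wq
    x = attend(q, K_cache, V_cache)            # toy "next token" hidden state
    K_cache = np.vstack([K_cache, x @ Wk])     # cache grows by one row per token
    V_cache = np.vstack([V_cache, x @ Wv])
```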

Continuous batching requires no changes to the model and pairs well with memory optimizations. PagedAttention, for example, stores the KV cache in fixed‑size blocks that need not be contiguous and allocates GPU memory on demand rather than reserving it up front, which reduces internal fragmentation.
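
A toy sketch of the block‑pool idea behind that kind of allocation. The block size and pool size are made up, and real PagedAttention in vLLM manages GPU tensors and block tables rather than Python lists:

```python
class BlockPool:
    """Toy free-list of fixed-size KV-cache blocks, handed out on demand."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size               # tokens per block
        self.free = list(range(num_blocks))        # indices of free physical blocks

    def allocate(self, num_tokens: int) -> list[int]:
        needed = -(-num_tokens // self.block_size)     # ceil division
        if needed > len(self.free):
            raise MemoryError("KV cache exhausted; request must wait or be preempted")
        return [self.free.pop() for _ in range(needed)]

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)                   # finished request returns its blocks

pool = BlockPool(num_blocks=1024)
req_a = pool.allocate(200)     # a 200-token prompt takes ceil(200 / 16) = 13 blocks
pool.release(req_a)            # blocks go straight back for the next request
```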

How continuous batching works

  • Admit new work every step. The scheduler works at the iteration level: at each decode step it can pull new requests into the batch instead of waiting for a batch boundary, which keeps the GPU busy (a toy version of this loop appears right after this list).
  • Share the KV‑cache. The server splits cache memory into small blocks and hands them out as sequences grow. Blocks from finished requests go back to a pool, so new requests do not stall waiting for memory.
  • Grow and shrink on the fly. As requests finish or get canceled, the active batch size changes without a restart, so memory and bandwidth are spent only on live sequences.
  • Stream as you go. Clients see tokens as soon as decode starts for their request, even while other requests are still being admitted.
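
Here is a toy simulation of that admit/decode/retire loop. The batch limit, request count, and output lengths are made‑up numbers, and a real engine would run a model forward pass where this loop just decrements a counter:

```python
import random
from collections import deque

random.seed(0)
MAX_BATCH = 8                                   # assumed number of decode slots
queue = deque({"id": i, "remaining": random.randint(5, 40)} for i in range(20))
active, step = [], 0

while queue or active:
    # 1. Admit new work every step, up to the batch limit (iteration-level scheduling).
    while queue and len(active) < MAX_BATCH:
        active.append(queue.popleft())

    # 2. Run one decode step for every active request: one new token each.
    for req in active:
        req["remaining"] -= 1

    # 3. Retire finished requests immediately; their KV blocks would return to the pool.
    done = [r for r in active if r["remaining"] == 0]
    active = [r for r in active if r["remaining"] > 0]
    step += 1
    if done:
        print(f"step {step}: finished {[r['id'] for r in done]}, batch size now {len(active)}")
```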

Because scheduling happens per token, every pass over the model weights serves all active sequences, which amortizes the cost of reading those weights. Throughput and latency improve across all percentiles. The underlying reason is that LLM decode is memory‑IO bound, not compute bound: more time goes into moving weights and KV‑cache data through GPU memory than into the arithmetic itself.
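
A rough back‑of‑the‑envelope sketch of why batching helps when decode is limited by memory bandwidth. The model size and bandwidth figures below are assumptions chosen for illustration, not measurements of any specific GPU:

```python
# Assumptions: a 7B-parameter model in FP16 and roughly 2 TB/s of HBM bandwidth.
weight_bytes = 7e9 * 2            # ~14 GB of weights streamed once per decode step
hbm_bandwidth = 2e12              # bytes per second

step_time = weight_bytes / hbm_bandwidth            # ~7 ms just to read the weights
for batch in (1, 8, 64):
    # Every sequence in the batch shares the same weight read, so the tokens/s ceiling
    # scales with batch size until compute or KV-cache traffic becomes the bottleneck.
    print(f"batch {batch:>3}: ~{batch / step_time:,.0f} tokens/s upper bound")
```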

Why it helps TTFT and TPS

  • Shorter queues. New requests hop into the schedule quickly, which lowers time to first token (TTFT). A benchmarking script that ramps load is the easiest way to see the effect (see the sketch under "Tests to run" below).
  • Higher utilization. Decode steps are packed, so tokens per second (TPS) stays high even when prompts and outputs differ in length. Benchmark results consistently show this throughput gain from continuous batching.
  • Smoother spikes. Bursts of traffic get absorbed by the scheduler instead of turning into long idle gaps. Because decode is memory‑IO bound, throughput is largely determined by how many sequences' KV caches fit in high‑bandwidth GPU memory at once.
  • Friendlier to live traffic. Unlike static batching, requests are admitted as they arrive, which improves both latency and throughput and suits latency‑sensitive production deployments.

Limits and trade‑offs

  • Memory pressure. Long prompts and long outputs grow the KV‑cache. Reserving GPU memory for every request's maximum possible length wastes whatever those requests never use, and when headroom is tight, TTFT rises and TPS falls.
  • Fairness vs speed. Aggressive packing can starve long prompts. Models also differ in maximum context length, which changes how mixed‑length requests should be scheduled. Add simple fairness rules and output caps.
  • Jitter. Streaming interleaves many users, and variable input lengths add to the wobble in token timing. That is fine for chat but less so for hard real‑time tasks. Continuous batching also needs a capable scheduler and a robust serving backend to manage the variable workload.

Running continuous batching well takes real engineering: the scheduler, memory manager, and serving stack are more complex than a request‑at‑a‑time server, and standing them up calls for specialized expertise and meaningful investment in hardware and software infrastructure.

Tuning checklist

This section describes key steps to create an efficient batching setup for your models.

Start with defaults, then change one thing at a time:

  1. Cap max_tokens. Large caps waste budget and block others.
  2. Set a sane context length. Use RAG to keep prompts short when you can.
  3. Pick batch limits. Allow enough parallel decode slots to keep the GPU busy, but avoid thrashing.
  4. Right‑size hardware. If cache misses climb and TTFT drifts up, you likely need more VRAM or a smaller model. The latest models may require more VRAM or optimized hardware to run efficiently.
  5. Stream by default. Users feel progress and your queue stays healthy.
  6. Use short system prompts. Fewer tokens in, fewer tokens to cache.
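
As one concrete example, the vLLM Python API exposes knobs that map to items 1 to 3 above. This is a minimal sketch: the model name is a placeholder, the values are starting points rather than recommendations, and argument names can differ between vLLM versions:

```python
from vllm import LLM, SamplingParams

# Caps and limits from the checklist: context length, parallel decode slots, memory headroom.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; swap in your own
    max_model_len=8192,                        # item 2: a sane context length
    max_num_seqs=64,                           # item 3: parallel decode slots per step
    gpu_memory_utilization=0.90,               # leave some headroom for spikes
)

params = SamplingParams(max_tokens=256)        # item 1: cap max_tokens
outputs = llm.generate(["Summarize continuous batching in two sentences."], params)
print(outputs[0].outputs[0].text)
```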

Tests to run

For thorough validation, run these tests over multiple iterations, varying prompt lengths and concurrency levels between runs to cover a range of scenarios.

  • Load ramp. Increase concurrency step by step and watch TTFT and TPS.
  • Mixed prompts. Combine short and long prompts to see how the scheduler behaves.
  • Cancel storms. Fire many cancels to check cleanup and cache reuse.
  • Token caps. Verify that caps work and errors are readable.
  • Hot reload. Restart the server or reload a model and confirm recovery.
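
A minimal load‑ramp sketch against an OpenAI‑compatible endpoint, assuming the openai Python SDK. The base URL, model name, prompt, and concurrency levels are placeholders, and a real benchmarking script would also warm up and repeat each level:

```python
import asyncio, statistics, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

async def one_request(prompt: str):
    start = time.perf_counter()
    ttft, tokens = None, 0
    stream = await client.chat.completions.create(
        model="my-model",                       # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            tokens += 1                         # roughly one token per streamed chunk
            if ttft is None:
                ttft = time.perf_counter() - start
    return ttft, tokens / (time.perf_counter() - start)

async def ramp():
    for concurrency in (1, 4, 16, 64):          # load ramp: step concurrency up
        results = await asyncio.gather(
            *[one_request("Tell me a short fact.") for _ in range(concurrency)]
        )
        ttfts = [r[0] for r in results if r[0] is not None]
        print(f"concurrency {concurrency:>3}: TTFT p50 {statistics.median(ttfts):.3f}s, "
              f"mean TPS/request {statistics.mean(r[1] for r in results):.1f}")

asyncio.run(ramp())
```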

Instrumentation that pays off

Track these at minimum:

  • TTFT p50/p95 and TPS p50/p95 under load
  • Active batch size and queue length
  • GPU memory use and KV‑cache hit rate (plus attention kernel timings if your serving stack exposes them)
  • Error codes: OOM, timeouts, 5xx
  • Per‑request prompt and output token counts (response lengths tell you a lot about generation behavior; skip raw text logs unless you truly need them)
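
Once you collect per‑request samples, the percentile math is small. The values below are stand‑ins for real measurements:

```python
import numpy as np

ttft_seconds = np.array([0.21, 0.35, 0.28, 1.40, 0.25, 0.30])   # stand-in samples
print("TTFT p50:", np.percentile(ttft_seconds, 50))
print("TTFT p95:", np.percentile(ttft_seconds, 95))
```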

Wrap‑up

Continuous batching is not magic. It is a practical way to keep GPUs busy and users happy when traffic is uneven. Start with safe caps, measure TTFT and TPS, and adjust batch limits where the numbers say it matters.

Try Compute today
When you are ready, launch a vLLM endpoint on Compute. Choose hardware, set caps, and get an HTTPS URL that works with OpenAI SDKs.

FAQ

What is prefill vs decode?

Prefill reads the entire prompt once to set up memory. Decode generates tokens step by step using that memory.

How big should a batch be?

Big enough to keep the GPU busy during decode without causing cache thrash. Test with your real prompts and cap max tokens.

Why does TTFT get worse at high load?

Usually memory pressure or oversized outputs. Trim prompts, cap outputs, and check cache hit rate.

Does continuous batching work with streaming?

Yes. Streaming is the default for many servers. Users see tokens while the scheduler keeps admitting other requests.

Do I need multi‑GPU for this to help?

No. Single‑GPU nodes benefit a lot. Multi‑GPU helps when memory or throughput needs exceed one card.

How many tokens per second is ChatGPT?

That depends on the model and current load, and OpenAI does not publish official figures. A single user's stream typically arrives at tens to low hundreds of tokens per second, while an optimized serving cluster can generate thousands of tokens per second in aggregate across many concurrent requests.

How many words is 1,000 tokens?

On average, 1,000 tokens correspond to about 750 words, though this can vary depending on the language and tokenization method.

What does a token mean in AI?

A token is a unit of text used in natural language processing, often a word or subword piece that the model processes during inference or training.

How many tokens per second can a human read?

Humans generally read around 200 to 300 words per minute, which roughly translates to 250 to 400 tokens per minute, or about 4 to 7 tokens per second.

What is TTFT?

TTFT stands for Time To First Token, the latency measured from receiving a request to the moment the first token of the model output is generated.

How to measure TTFT?

TTFT is measured by recording when the server receives a request (or when the client sends it) and when the first token of that request's response is returned, then taking the difference. Benchmarking scripts usually report p50 and p95 over many requests.

What is the TPOT metric?

TPOT (Time Per Output Token) is the average time taken to generate each output token after the first one, roughly (total generation time − TTFT) ÷ (output tokens − 1). It is the per‑request complement to TPS: lower TPOT means faster streaming.

What is TPS in LLM?

TPS means Tokens Per Second, a measure of the model's throughput during inference, indicating how many tokens are generated each second.

What is a KV cache?

KV cache refers to the key-value cache that stores intermediate key and value tensors computed during the attention mechanism to speed up subsequent token generation.

What is GPU KV cache?

GPU KV cache is the storage of key-value pairs in GPU memory used during model inference to optimize attention computations and reduce redundant calculations.

What is the KV cache in LLM?

In large language models, the KV cache holds cached key and value vectors from previous tokens to efficiently compute attention for new tokens without recalculating past states.

What is key-value store cache?

A key-value store cache is a data storage method where data is stored as pairs of keys and corresponding values, enabling fast retrieval; in LLMs, this concept applies to caching intermediate computations.

How does dynamic batching work?

Dynamic batching groups incoming inference requests into batches dynamically based on arrival times and batch size limits, running batches either when full or after a timeout to balance latency and throughput.
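
A toy illustration of the "full batch or timeout" rule. The batch size and wait window are illustrative, and real servers do this inside the inference engine rather than in application code:

```python
import queue, time

requests = queue.Queue()
MAX_BATCH = 8
MAX_WAIT = 0.05     # seconds to wait before running a partial batch

def next_batch():
    batch, deadline = [], time.monotonic() + MAX_WAIT
    while len(batch) < MAX_BATCH:
        timeout = deadline - time.monotonic()
        if timeout <= 0:
            break                       # timeout hit: run whatever we have
        try:
            batch.append(requests.get(timeout=timeout))
        except queue.Empty:
            break                       # queue stayed empty for the whole window
    return batch                        # run the model on this batch (possibly partial)
```

Continuous batching goes a step further by re‑forming the batch at every decode step instead of only between batches.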

What is the difference between static batching and dynamic batching?

Static batching waits for a set number of requests to arrive and processes them as one fixed batch, which can increase latency substantially and requires a well‑managed queue to keep the model fed, so it suits workloads where latency is not a concern. Dynamic batching runs a batch once it is full or after a timeout, which improves latency and resource utilization.

What does batching mean in shipping?

In shipping, batching refers to grouping multiple orders or shipments together to optimize transport efficiency and reduce costs.

What is dynamic batching in Unity?

In Unity, dynamic batching is a rendering optimization technique that combines multiple small meshes into a single draw call dynamically to improve graphics performance.
