Training gets the attention. Inference carries the load. Traffic is spiky, prompts vary in length, and people expect words on screen almost immediately. To keep that promise, you need a serving setup that treats memory, batching, and cost as first-class concerns, and that balances the low latency interactive users expect against the throughput a busy API needs.
Need a dedicated endpoint you can tune? On Compute, you can launch a vLLM inference server on RTX 4090 or multi‑GPU presets. You get an HTTPS URL that works with OpenAI SDKs. Choose a region to keep data close to users.
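If you already use the OpenAI SDK, switching to a dedicated endpoint is usually just a base-URL change. A minimal sketch; the URL, API key, and model name below are placeholders, not real values:

```python
# Minimal sketch: point the official OpenAI Python SDK at an
# OpenAI-compatible vLLM endpoint. Base URL, API key, and model name
# are placeholders -- substitute the values from your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.example.com/v1",  # hypothetical endpoint URL
    api_key="YOUR_API_KEY",                           # key issued by your provider
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",         # whatever model the server hosts
    messages=[{"role": "user", "content": "Give me one sentence about GPUs."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```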
Why inference is hard
Requests arrive in bursts. Some prompts are short, others carry long chats. The model builds a key/value (KV) cache as it generates tokens, and that cache lives in GPU memory. Manage it poorly and latency grows while throughput collapses. Longer input sequences need more cache memory and more prefill compute, which pushes up time to first token (TTFT), and the GPUs you have set a hard ceiling on how much load the system can absorb.
Your goal is simple to say: keep latency low while serving as many tokens per second as your users need, without blowing the budget. The biggest challenge is computational cost, so every choice below is a trade between latency, throughput, and spend.
The numbers to watch
These are the metrics most teams measure when evaluating LLM inference performance:
- Time to first token (TTFT). How quickly the response starts. Users feel this number most; it is driven by request queuing, prefill work, and network latency.
- Tokens per second (TPS). How fast text flows once it starts. This sets chat feel and API capacity.
- Throughput. How many concurrent requests, or total tokens per second, the system sustains while staying inside your latency target.
- Inter-token latency (ITL). The average gap between consecutive output tokens, which sets how smooth streaming feels. Acceptable values depend on the use case: chatbots need tight ITL, offline batch jobs do not.
End-to-end latency, from sending the request to receiving the final token, ties these together and is what users actually experience. Because tokenizers differ, token-based metrics also make it easier to compare speed and cost across models.
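A quick way to see TTFT and decode rate on your own endpoint is to time a streaming request. A minimal sketch against an OpenAI-compatible server; the URL, key, and model are assumptions, and counting one token per stream chunk is an approximation:

```python
# Rough sketch of measuring TTFT and decode TPS on one streaming request.
# Assumes an OpenAI-compatible endpoint; URL, key, and model are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint.example.com/v1", api_key="YOUR_API_KEY")

start = time.perf_counter()
first_token_at = None
tokens = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    stream=True,
    max_tokens=128,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first visible output: TTFT
        tokens += 1                               # approximation: one chunk ~ one token

end = time.perf_counter()
if first_token_at is not None:
    ttft = first_token_at - start
    tps = tokens / (end - first_token_at) if end > first_token_at else float("nan")
    print(f"TTFT: {ttft:.3f}s  |  decode TPS: {tps:.1f}")
```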
Serving architectures
Single GPU. Straightforward for 7B–13B models, proofs of concept, and small apps.
Multi GPU. One host, several cards. Use tensor or pipeline parallelism to fit larger models or raise throughput. As concurrency increases, throughput rises until the cards or the interconnect saturate.
Horizontal scale. Many nodes behind a gateway. Add load balancing, sticky sessions for cache reuse (a toy routing sketch follows this list), and a scheduler that knows about prompt and output lengths.
Serverless endpoints. Good for sharp spikes when you can accept cold starts and variable cost.
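One simple way to get cache reuse at the gateway is to route every request from the same conversation to the same backend. A toy sketch of that affinity idea, with made-up backend addresses; a real gateway would also weigh queue depth, prompt length, and health:

```python
# Toy sketch: route requests for the same session to the same backend so
# prefix/KV-cache reuse has a chance to help. The backend pool is hypothetical.
import hashlib

BACKENDS = [
    "http://vllm-0:8000",
    "http://vllm-1:8000",
    "http://vllm-2:8000",
]

def pick_backend(session_id: str) -> str:
    # Hash the session id so the same conversation always lands on one server.
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

print(pick_backend("user-1234"))  # stable mapping for this session id
```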
Prefer predictable performance? Try Compute and launch a vLLM server on a single 4090 or scale to a multi‑GPU preset. You get dedicated capacity and clear pricing.
Engines at a glance
vLLM. Strong concurrency from continuous batching and smart KV‑cache paging. Ships an OpenAI‑compatible HTTP server.
Text Generation Inference (TGI). Solid choice in the Hugging Face ecosystem with mature tooling.
TensorRT‑LLM. NVIDIA’s path to top speed on supported hardware. Best when you can invest in optimization.
Ollama. Great locally or for simple single‑box setups. Less focused on high‑traffic APIs.
Pick based on traffic profile, model support, and how much tuning you want to own.
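As a concrete taste of the engine side, here is a minimal offline-generation sketch with vLLM's Python API. The model name and settings are assumptions, and option names can vary between vLLM versions, so treat this as a starting point rather than a reference:

```python
# Minimal vLLM sketch: load a model and let the engine batch the prompts.
# Model name and settings are assumptions; check the vLLM docs for your version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any supported Hugging Face model id
    gpu_memory_utilization=0.90,               # leave headroom for the KV cache
    max_model_len=8192,                        # cap context to bound cache growth
)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Summarize why KV-cache paging matters.",
     "List three LLM serving metrics."],
    params,
)
for out in outputs:
    print(out.outputs[0].text.strip())
```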
Context and memory
Long prompts and long chats grow the KV cache; without careful paging, VRAM disappears and latency climbs. Longer input sequences mean more prefill work and higher TTFT, larger batches multiply the cache across every request in flight, and the model's maximum context length caps how many input and output tokens it can handle at once. Three levers help most teams (a rough cache-size estimate follows the list):
- Retrieval‑augmented generation (RAG). Keep prompts short by fetching the right context at request time, and cap output length to control memory and cost.
- Efficient caching. Engines like vLLM split the cache into small blocks, reuse what they can, and evict what they must.
- Quantization. Reduce the precision of model weights (and sometimes activations) to cut hardware requirements. Evaluate the impact with representative test data; more on this below.
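To see why long contexts eat VRAM so quickly, a back-of-the-envelope KV-cache estimate helps. The layer and head counts below roughly match a 7B-class model with full attention in fp16; they are illustrative assumptions, and real engines (grouped-query attention, quantized caches, paging) will land lower:

```python
# Back-of-the-envelope KV-cache sizing. Figures are illustrative assumptions
# for a 7B-class model with full attention in fp16; real deployments differ.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values, stored per layer, per head, per token, per request
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

gib = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                     seq_len=4096, batch=8) / 2**30
print(f"~{gib:.1f} GiB of KV cache")  # roughly 16 GiB before weights are counted
```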
Quantization, briefly
Lower precision saves memory and can improve throughput. AWQ and GPTQ int8/int4 are common choices. Expect small quality losses: benchmark with your own data before you commit, and note that a short fine-tune is sometimes needed to recover accuracy.
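The memory math behind quantization is simple: weight footprint scales with bits per parameter. A rough sketch, weights only, ignoring activations, the KV cache, and engine overhead:

```python
# Rough weight-memory estimate at different precisions. Real footprints are
# higher once the KV cache, activations, and engine overhead are added.
def weight_gib(params_billions, bits_per_param):
    return params_billions * 1e9 * bits_per_param / 8 / 2**30

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"7B model, {label}: ~{weight_gib(7, bits):.1f} GiB of weights")
```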
Hardware choices
- Memory first. VRAM sets model size and batch depth. Twenty‑four gigabytes fits many 7B models with room to batch. Larger models or long contexts often need 48–80 GB or multiple GPUs, which raises infrastructure cost.
- Compute next. Extra cores help during prefill, and kernel-level optimizations such as FlashAttention reorder the attention computation to reduce memory-bandwidth pressure. Batch well and decode stays efficient.
- Network placement. Put the endpoint near users. Network latency stacks quickly.
Choose hardware and batching strategy together: the cheapest setup is the one that meets your latency target with the least idle capacity.
EU users? Deploy Compute in France. Markets in the Middle East? Choose a UAE region. Keep traffic close.
Costs that matter
- GPU hours. Your main line item. Right‑size hardware to model and traffic.
- Idle time. Autoscale or shut down when quiet, or pay for instant availability.
- Token waste. Long prompts and high max_tokens burn money. Stream responses and cap outputs.
- Batch size. Larger batches use the GPU more efficiently and raise throughput, but add latency for individual requests. Stream responses so users see output early, and grow the batch only while the throughput gain still justifies the extra latency.
A rough model: estimate daily tokens generated, divide by expected tokens per second per GPU, then convert to GPU hours; the sketch below walks through the arithmetic. As concurrency rises, total TPS grows until the system saturates, after which latency climbs with no extra throughput. Compare the estimate with real traffic, add headroom for spikes, and expect real-world numbers to differ with hardware and infrastructure.
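A minimal version of that calculation. Every number is a placeholder to replace with your own measured throughput and your provider's actual pricing:

```python
# Napkin capacity/cost model. All inputs are placeholders -- swap in your own
# measured tokens-per-second and your provider's real GPU pricing.
daily_output_tokens = 50_000_000   # tokens your app generates per day
tokens_per_sec_per_gpu = 600       # measured aggregate decode TPS per GPU
gpu_price_per_hour = 0.70          # example price, not a quote
headroom = 1.5                     # spare capacity for spikes

gpu_seconds = daily_output_tokens / tokens_per_sec_per_gpu
gpu_hours = gpu_seconds / 3600 * headroom
print(f"{gpu_hours:.1f} GPU-hours/day -> ${gpu_hours * gpu_price_per_hour:.2f}/day")
```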
Reliability and observability
Benchmark before launch, then keep measuring in production. The same metrics that size the system also reveal capacity limits and bottlenecks once real traffic arrives.
Track at least:
- Request rate, queue length, TTFT, TPS
- GPU utilization, GPU memory use, and KV-cache hit rate during inference
- Error types: OOM, timeouts, 5xx
- Model Bandwidth Utilization (MBU), to compare efficiency across inference systems
Different benchmarking tools define and calculate metrics like TTFT and TPS in slightly different ways, so compare like with like. Combine load testing with benchmarking to see the full latency curve: how batch size and concurrency trade off against throughput and response time for your configuration.
Alert when TTFT rises or TPS falls under steady load; that is often a sign of memory pressure or bad batching. A minimal alert sketch follows.
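A toy version of that check, assuming you already collect per-request TTFT and TPS samples somewhere you can query. In practice this logic lives in your metrics stack, and the thresholds here are illustrative, not recommendations:

```python
# Toy alert check over recent request samples. Thresholds are illustrative.
def p95(values):
    ordered = sorted(values)
    return ordered[int(0.95 * (len(ordered) - 1))]

def should_alert(ttft_samples_s, tps_samples, ttft_p95_limit=1.5, tps_floor=30.0):
    # Fire if tail TTFT creeps past the limit or average decode rate sags.
    return p95(ttft_samples_s) > ttft_p95_limit or (sum(tps_samples) / len(tps_samples)) < tps_floor

print(should_alert([0.4, 0.5, 0.6, 2.1, 0.5, 0.45, 0.7, 0.5],
                   [55, 48, 60, 20, 52, 58, 49, 51]))
# False here: p95 TTFT stays under the limit and average TPS is healthy.
```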
Security and data residency
Terminate TLS, rotate keys, keep access scoped, and avoid logging raw prompts unless you must. If you work in Europe, keep data in‑region and document retention and deletion.
Try Compute today!
Compute endpoints use HTTPS by default. Pick a European location to keep data in region.
Build or buy
Own it if you need full control and have time for tuning. Use a managed, dedicated endpoint if you want speed to value and predictable spend. Keep an exit path either way. Compute vLLM servers provide a dedicated endpoint with OpenAI‑compatible routes. Swap the base URL in your SDK and go live.
Further Reading
- Switch from OpenAI without breaking code
- Choose a serving engine
- Continuous batching
- Long context vs RAG
- Quantization without wrecking quality
- The right GPU for 2025
- Metrics that matter
- EU privacy checklist
- Production checklist
- Add streaming the simple way
- RAG at scale
- Rate limiting & quotas
- Multi-GPU serving
- Fix TTFT/TPS
- Tokenization traps
- Buyer’s guide to OpenAI-compatible APIs
FAQ
What is TTFT and why does it matter?
Time to first token is the gap between sending a prompt and seeing the first token. Short TTFT improves perceived speed and trust. People feel this number more than any other. End-to-end request latency (e2e_latency) includes the time from when a request is sent to when the final token is received, providing a broader measure of user experience.
How many concurrent users can one GPU serve?
It depends on model size, context length, and batching. A well‑tuned 7B model with short prompts and streaming can serve many users on a single 24 GB card. Long contexts cut that number quickly.
Does a long context beat RAG?
Not always. Long contexts are simple but costly. RAG keeps prompts tight and lets you scale retrieval independently. Many teams use a hybrid.
Do I need multi‑GPU right away?
Start single GPU if you can. Move to multi‑GPU when memory or throughput demands it. Test parallel modes and watch cache health.
Can I keep data in the EU?
Yes. Place the endpoint in an EU region, use HTTPS, control access, and define clear retention policies.
What is LLM inference?
LLM inference is the process where a trained large language model generates a response to an input prompt, running the prompt's tokens through its network of parameters to predict the most likely sequence of output tokens.
What are the stages of LLM inference?
LLM inference typically involves two stages: the prefill phase, where the input tokens are processed, and the decoding phase, where the model generates output tokens one at a time.
What is the difference between inference and training LLM?
Training adjusts the model's parameters on large datasets; inference uses the already-trained model to generate outputs without changing those parameters.
What are LLM inference engines?
These are software systems designed to efficiently run LLMs for generating outputs, optimizing for latency, throughput, and resource usage.
What is vLLM for?
vLLM is an inference engine focused on strong concurrency with continuous batching and efficient key-value cache management to optimize LLM serving.
What is the difference between vLLM and LLM?
LLM refers to the large language model itself, while vLLM is an engine or framework for serving LLMs efficiently in production.
Is vLLM faster than Ollama?
vLLM is optimized for high concurrency and throughput, often making it faster for serving multiple requests compared to Ollama, which is better suited for simpler setups.
Why is vLLM so fast?
Because it uses continuous batching and smart key-value cache paging to maximize GPU utilization and reduce latency.
What does LLM serving mean?
LLM serving refers to deploying and running large language models to respond to user requests in real time or batch modes.
What is an LLM serving engine?
It is a platform or software that hosts and manages LLMs, handling inference requests efficiently.
What is an LLM server?
A server configured to run LLM inference workloads, providing access to model predictions via APIs or other interfaces.
What does LLM as a judge mean?
It refers to using LLMs to evaluate or score outputs, such as assessing model quality or ranking responses.
What is tokens per second?
Tokens per second (TPS) measures how many tokens an LLM generates or processes in one second, indicating throughput.
How many tokens per second is ChatGPT?
ChatGPT's TPS varies by deployment and hardware but typically ranges from a few dozen to over a hundred tokens per second.
How many words is 1,000 tokens?
Approximately 750 English words, as one token roughly corresponds to 0.75 words.
What does a token mean in AI?
A token is the smallest unit of text that a language model processes, which can be a word, subword, or character.
What is TTFT?
Time to first token (TTFT) is the latency from sending a request to receiving the first generated token.
How to measure TTFT?
By recording the time difference between submitting a prompt and receiving the first output token from the model.
What is the TPOT metric in LLM?
Time per output token (TPOT) measures the average time taken to generate each output token after the first one.
What is time to first token Nvidia?
Nvidia uses the standard definition: the latency from submitting a request to receiving the first generated token, reported in its LLM inference benchmarks on Nvidia hardware.
What is a KV cache?
A key-value cache stores intermediate attention results during decoding to avoid recomputing past tokens.
What is GPU KV cache?
It is the storage of key-value cache data within GPU memory to accelerate LLM token generation.
What is the KV cache in LLM?
The KV cache holds keys and values from previous tokens to efficiently compute attention for new tokens.
What is key-value store cache?
A data structure storing pairs of keys and values, used in LLMs to cache intermediate computations.
What is continuous batching?
A technique where incoming requests are continuously batched together to maximize GPU utilization and throughput.
What is a continuous batch?
A batch of inference requests formed dynamically as they arrive, processed without waiting for fixed intervals.
What is the difference between continuous batching and in-flight batching?
The two terms largely describe the same idea: requests join and leave the batch while generation is in flight. "In-flight batching" is the name TensorRT-LLM uses; "continuous batching" is the more common term.
What does batching mean in banking?
In banking, batching refers to grouping transactions to process them collectively, unrelated to LLM serving.
What is LLM throughput and latency?
Throughput is how many tokens or requests an LLM can process per second; latency is the time taken to generate responses.
How to reduce latency in LLMs?
By optimizing batching strategies, using efficient hardware, reducing input sequence length, and leveraging caching.
Which is better, 50 ms or 40 ms latency?
40 ms latency is better as it means faster response times.
What is the biggest problem with LLM?
High computational cost and latency, especially for large models with long contexts.
What is the throughput of an LLM?
It is the number of tokens or requests an LLM can process per second under given conditions.
How to test LLM throughput?
By measuring tokens generated over time under controlled loads and concurrency.
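One simple approach is to fire several concurrent streaming requests and count tokens over the wall-clock window. A rough sketch against an OpenAI-compatible endpoint; the URL, key, and model are placeholders, and counting one token per chunk is an approximation:

```python
# Rough concurrency load test: send N requests at once, report aggregate TPS.
# Endpoint, key, and model are placeholders; chunk counting approximates tokens.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint.example.com/v1", api_key="YOUR_API_KEY")

def one_request(prompt: str) -> int:
    tokens = 0
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=128,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            tokens += 1
    return tokens

concurrency = 16
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    totals = list(pool.map(one_request,
                           [f"Write a haiku about test {i}." for i in range(concurrency)]))
elapsed = time.perf_counter() - start
print(f"{sum(totals)} tokens in {elapsed:.1f}s -> "
      f"{sum(totals) / elapsed:.1f} TPS at concurrency {concurrency}")
```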
Is LLM CPU or GPU intensive?
LLM inference is primarily GPU intensive due to large matrix computations.
How to increase LLM throughput?
By batching requests, using optimized inference engines, and deploying on powerful GPUs.
What GPU to use for LLM?
GPUs with high memory and memory bandwidth like Nvidia RTX 4090 or A100 are commonly used.
Does anything LLM use GPU?
Yes, LLM inference and training heavily rely on GPUs for parallel computation.
Do I need a GPU to run LLM locally?
For large models, a GPU is recommended; small models can run on CPUs but with reduced performance.
Is RTX 4090 good for LLM?
Yes, RTX 4090 offers high VRAM and compute power suitable for many LLM inference tasks.