Users feel delay before they see words: latency shapes how people judge a generative AI system more than almost anything else. Systems fail when queues stretch and memory runs out. A small set of metrics can warn you early and point to the right fix. Keep the list short, wire it well, and make alerts boring.
On Compute, a vLLM endpoint gives you an HTTPS URL with OpenAI‑style routes and predictable capacity. Place it close to users, then watch TTFT, TPS, and cache headroom.
Core metrics
These five numbers answer most day-to-day questions about an LLM inference service. Prompt length, concurrency, and system configuration all influence them, so measure against your own traffic rather than someone else's benchmark.
Time to first token (TTFT). The gap between sending a request and receiving the first token. It is the number users feel most. Most of it is prefill: the model processes the input tokens and builds its KV cache before it can generate anything. Track p50 and p95.
Tokens per second (TPS). Throughput once tokens start flowing. Higher is better for both UX and capacity. Per-request TPS is output tokens divided by generation time; total TPS sums output tokens per second across all in-flight requests and is the key system-throughput number. Its per-request inverse is inter-token latency, the average gap between consecutive tokens. Track p50 and p95, and measure TTFT and TPS together (see the sketch at the end of this section).
Queue length. Requests waiting for a decode slot. Rising length with flat traffic equals trouble.
GPU memory headroom. Free VRAM during peaks. Low headroom predicts OOM failures, slow starts, and cache evictions.
Cache hit rate. Health of the KV‑cache and any prompt caches. Lower hit rates often explain TTFT creep.
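TTFT and TPS can both be measured from the client with a single streaming call. A minimal sketch, assuming an OpenAI-compatible vLLM endpoint and the openai Python client; the base URL, API key, and model name are placeholders for your own deployment:

```python
import time
from openai import OpenAI  # pip install openai

# Placeholder endpoint, key, and model: point these at your own vLLM deployment.
client = OpenAI(base_url="https://your-endpoint.example.com/v1", api_key="YOUR_KEY")

def measure_request(prompt: str, model: str = "your-model") -> tuple[float, float]:
    """Stream one completion and return (ttft_seconds, decode_tokens_per_second)."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for event in stream:
        if event.choices and event.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # most streaming servers send roughly one token per chunk
    end = time.perf_counter()

    ttft = (first_token_at or end) - start
    decode_time = end - (first_token_at or end)
    tps = chunks / decode_time if decode_time > 0 else 0.0
    return ttft, tps

if __name__ == "__main__":
    ttft, tps = measure_request("Explain KV caching in two sentences.")
    print(f"TTFT: {ttft * 1000:.0f} ms, TPS: {tps:.1f} tok/s")
```

Chunk counts only approximate token counts; if your server reports token usage on the stream, prefer that number, and log both values per request rather than recomputing them ad hoc.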
Supporting metrics
Request rate (RPS). Requests per second. Pairs with queue length to explain pressure: the same RPS with a longer queue means the service got slower, not busier.
Prefill vs decode time. Splitting these isolates long prompts from slow generation: prefill scales with input tokens, decode with output tokens.
Error rate by type. OOM, timeouts, 4xx/5xx. Group by route and model so one bad caller or one model cannot hide inside an aggregate.
Network latency. Client ↔ endpoint RTT by region. Spikes here can mimic server slowness.
Thermals and clocks. Throttling shows up as flat TPS and rising TTFT with no change in traffic.
Instrumentation that pays off
You do not need much instrumentation, but what you add has to be consistent end to end. The pieces below pay for themselves quickly; a server-side sketch follows the list.
- Client timers. Stamp request_start, first_token, and last_token; TTFT and TPS fall straight out of these three timestamps.
- Server spans. Capture enqueue time, prefill, decode, and stream flush so you can tell where a slow request spent its time.
- Stable IDs. Send a request_id from client to server and into logs.
- Token counters. Log prompt tokens and output tokens per request. Skip raw text unless you truly need it.
- Sampling. Keep full detail for a slice of traffic and rollups for the rest.
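What the server spans look like depends on your stack; here is a minimal sketch using a plain context manager rather than any particular tracing library, with illustrative span names:

```python
import logging
import time
import uuid
from contextlib import contextmanager

log = logging.getLogger("inference")

@contextmanager
def span(name: str, record: dict):
    """Time one phase of a request and stash the duration in the event record."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record[f"{name}_ms"] = round((time.perf_counter() - start) * 1000, 1)

def handle_request(prompt_tokens: list[int], request_id: str | None = None) -> dict:
    request_id = request_id or str(uuid.uuid4())  # stable ID, reused in client logs
    event = {"request_id": request_id, "prompt_tokens": len(prompt_tokens)}

    with span("queue", event):
        ...  # wait for a decode slot
    with span("prefill", event):
        ...  # run prefill, build the KV cache
    with span("decode", event):
        ...  # stream output tokens

    log.info("request_done", extra=event)
    return event
```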
Minimal events to log per request
- timestamp, route, model
- request_id, user/session id (hashed)
- prompt_tokens, output_tokens
- ttft_ms, tps_tokens_per_sec
- queue_length_at_start
- gpu_mem_free_mb, cache_hit_rate
- status (ok / timeout / oom / error_code)
- cpu/gpu utilization (optional; only if cheap to collect)
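Put together, a fully populated event might look like this (all values invented for illustration):

```python
# One example event record; field names match the list above.
event = {
    "timestamp": "2025-01-15T10:42:07Z",
    "route": "/v1/chat/completions",
    "model": "your-model",
    "request_id": "3f9c2e1a-7b4d-4a2e-9c01-5d6e7f8a9b0c",
    "user_id_hash": "a1b2c3d4",
    "prompt_tokens": 412,
    "output_tokens": 187,
    "ttft_ms": 320,
    "tps_tokens_per_sec": 42.5,
    "queue_length_at_start": 3,
    "gpu_mem_free_mb": 6144,
    "cache_hit_rate": 0.91,
    "status": "ok",
}
```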
Dashboards that answer real questions
Build each dashboard around a question you actually ask during an incident, not around whatever your tooling exports by default.
“Are users feeling slow right now?”
TTFT p50/p95 over time with a traffic overlay. Drill by region and model.
“Can we take more load?”
TPS p50/p95, queue length, and GPU memory headroom. Flat TPS + rising queue = not yet.
“Why did latency spike?”
Split prefill vs decode, add network RTT. Long prefill → prompts too big. Long decode → caps or batch shape.
“Are we wasting tokens?”
Distribution of output tokens vs your caps. Big tail = loose settings.
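Feeding these panels is mostly a rollup over the logged events. A minimal sketch of the p50/p95 computation, assuming TTFT samples pulled from events like the record above:

```python
import statistics

def p50_p95(values: list[float]) -> tuple[float, float]:
    """Median and 95th percentile of a list of samples (needs at least 2 samples)."""
    qs = statistics.quantiles(values, n=100)  # 99 cut points: qs[49] ~ p50, qs[94] ~ p95
    return qs[49], qs[94]

# Example: TTFT samples in milliseconds pulled from one dashboard window.
ttft_ms = [210, 240, 260, 280, 300, 320, 350, 400, 520, 910]
p50, p95 = p50_p95(ttft_ms)
print(f"TTFT p50={p50:.0f} ms, p95={p95:.0f} ms")
```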
Alerting & SLOs
Set thresholds you can defend. Examples to start:
- TTFT p95 > 1,000 ms for 5 minutes at steady RPS.
- TPS p50 drops by 30% with flat RPS.
- GPU free < 10% for 5 minutes.
- Timeouts > 1% or OOM > 0.2% over 10 minutes.
Define a service level objective (SLO), such as “TTFT p95 ≤ 800 ms and error rate ≤ 1% over 28 days.” An SLO may back a service level agreement (SLA), the formal contract between provider and customer. Track the error budget and page when you burn it too fast (see the sketch below).
Track SLO compliance from the same logged events that drive your dashboards; a separate pipeline drifts out of sync.
Note: Tailor SLO targets to what your users actually notice and what your traffic can support; borrowed numbers are rarely defensible.
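A minimal sketch of the burn check, assuming you can count bad (failed or too-slow) requests and total requests over a recent window; the function and its inputs are illustrative rather than any specific SRE tool's API:

```python
def burn_rate(slo_target: float, bad: int, total: int) -> float:
    """Error-budget burn rate for one measurement window.

    1.0 means the budget is being spent at exactly the pace that exhausts it by
    the end of the SLO window; sustained values well above 1.0 warrant paging.
    """
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target  # e.g. 0.01 for a 99% target
    return (bad / total) / budget

# Example: 99% SLO, 600 failed or too-slow requests out of 10,000 in the window.
print(burn_rate(slo_target=0.99, bad=600, total=10_000))  # -> 6.0
```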
Load tests you should actually run
Five tests catch most of the surprises production would otherwise find for you:
- Ramp test. Increase concurrency stepwise; total TPS rises until the system saturates, then flattens or declines. Continue until TTFT p95 breaks your target (a sketch follows this list).
- Burst test. Fire a sudden wave equal to your expected peak and observe total TPS under load.
- Mixed prompts. Blend short and long requests; watch scheduler behavior.
- Cancel storm. Stop many streams mid‑generation; check cleanup.
- Failover. Restart a node; verify recovery and client retries.
Note: Rerun these tests after any change to model, hardware, or batching configuration; results do not transfer.
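A minimal ramp-test sketch, assuming the same OpenAI-compatible endpoint and openai Python client as earlier; the URL, key, model name, concurrency levels, and TTFT target are placeholders to adjust for your own service:

```python
import asyncio
import time
from openai import AsyncOpenAI  # pip install openai

# Placeholder endpoint, key, and model: point these at your own deployment.
client = AsyncOpenAI(base_url="https://your-endpoint.example.com/v1", api_key="YOUR_KEY")
PROMPT = "Summarize the benefits of KV caching in three sentences."

async def one_request() -> float:
    """Return TTFT in seconds for a single streamed request."""
    start = time.perf_counter()
    ttft = None
    stream = await client.chat.completions.create(
        model="your-model",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=128,
        stream=True,
    )
    async for event in stream:
        if ttft is None and event.choices and event.choices[0].delta.content:
            ttft = time.perf_counter() - start  # first token seen
    return ttft if ttft is not None else time.perf_counter() - start

async def ramp(levels=(1, 2, 4, 8, 16, 32), ttft_p95_target_ms=1000):
    for concurrency in levels:
        ttfts = await asyncio.gather(*(one_request() for _ in range(concurrency)))
        p95 = sorted(ttfts)[min(len(ttfts) - 1, int(0.95 * len(ttfts)))] * 1000
        print(f"concurrency={concurrency:3d}  TTFT p95={p95:.0f} ms")
        if p95 > ttft_p95_target_ms:
            print("Target broken; this is your practical concurrency ceiling.")
            break

if __name__ == "__main__":
    asyncio.run(ramp())
```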
Common pitfalls
- Chasing single‑request speed. Concurrency wins more often.
- Uncapped outputs. Long generations starve everyone else and drive up cost.
- Verbose system prompts. They inflate prefill time and cost, and the model often ignores the extra instructions anyway.
- No region strategy. Cross‑region calls add avoidable latency.
- Counting tokens wrong. Keep tokenizer and model aligned.
Last thoughts
A small, stable set of metrics beats a wall of charts. Watch TTFT and TPS, keep queues short, and leave headroom in memory. Fix prompts and caps before you change hardware.
Try Compute today
Prefer predictable ops? Launch a vLLM endpoint on Compute in France or UAE, cap tokens, and track TTFT/TPS from day one.
FAQ
What is TTFT?
Time to first token: the delay from sending a request until the first token of the model's output arrives. It is dominated by prefill, where the model processes the input and builds its KV cache. Users feel it more than any other metric.
What is a good TPS?
Enough to keep chat smooth and queues short at your traffic, so measure with your own prompts. Total TPS across concurrent requests matters more than single-stream speed; optimize batch shape and output caps to raise it.
How many metrics do we really need?
Five core ones cover most issues: TTFT, TPS, queue length, GPU memory headroom, and cache hit rate. Add request rate and error types for context.
How do we test alerts without waking people up?
Use synthetic traffic in a staging environment, or route the alert to a test channel and fire a controlled burst in production during working hours.
Do we need distributed tracing?
Helpful once you run multiple nodes or a gateway. Start with request IDs and clear spans; add tracing as you grow.
How to measure TTFT?
Record the timestamp when the request is sent and the timestamp when the first output token arrives; TTFT is the difference. The streaming sketch in the core metrics section above does exactly this.
What is the TPOT metric?
TPOT (time per output token) is the average time to generate each token after the first one, i.e. the steady-state inter-token latency: (total generation time − TTFT) / (output tokens − 1).
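As a quick sketch, assuming you already capture TTFT, total generation time, and output token counts as described above:

```python
def tpot_ms(ttft_s: float, total_s: float, output_tokens: int) -> float:
    """Average time per output token after the first, in milliseconds."""
    if output_tokens <= 1:
        return float("nan")
    return (total_s - ttft_s) / (output_tokens - 1) * 1000

print(tpot_ms(ttft_s=0.32, total_s=4.82, output_tokens=181))  # -> 25.0
```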
What is TPS in LLM?
TPS stands for tokens per second: how many tokens a model generates each second during inference. Per-request TPS describes one stream; total TPS sums output tokens per second across all concurrent requests and is the better measure of system throughput.
What is a token per second?
One token generated per second of output time. Compute it as the number of tokens generated divided by the elapsed generation time.
How many tokens per second is ChatGPT?
ChatGPT typically generates tokens at a rate of about 20 to 30 tokens per second, depending on server load and model version.
How many words is 1,000 tokens?
Roughly 750 words: an English token averages about 0.75 words, though the exact ratio depends on the tokenizer and the text.
Is 7 tokens per second good?
Seven tokens per second is relatively slow for many real-time applications but may be acceptable for longer, less time-sensitive generations.
How to measure inference?
Track latency (TTFT and total generation time) and throughput (TPS and requests per second) under prompts and concurrency that match real traffic.
What are 5 examples of an inference?
Text completion, question answering, translation, summarization, and code generation.
What are the 5 main steps to inference?
1) Tokenize the input prompt into input tokens. 2) Prefill: run the input through the model and build the KV cache. 3) Generate the first output token (the end of TTFT). 4) Decode token by token, reusing the KV cache. 5) Stop at an end-of-sequence token or the output cap and return the response.
What does inference mean in court?
In court, inference refers to a conclusion reached based on evidence and reasoning rather than explicit statements.
What do you mean by observability?
Observability is the ability to understand a system's internal state from its outputs, such as logs, metrics, and traces.
What is observability vs monitoring?
Monitoring watches predefined metrics and alerts on known failure modes; observability lets you explore questions you did not anticipate, using the same telemetry plus logs and traces.
What are the three pillars of observability?
Metrics, logs, and traces; together they give enough visibility to answer both known and unanticipated questions.
What is observability in DevOps?
In DevOps, observability helps teams detect, diagnose, and resolve issues proactively by collecting and analyzing telemetry from the services they run.
What is a KV cache?
A KV (key-value) cache stores the attention keys and values computed for earlier tokens so they do not have to be recomputed for every new token, which speeds up generation.
What is GPU KV cache?
The same KV cache held in GPU memory, where attention is computed. Its size grows with context length and the number of concurrent sequences, which is why it competes with model weights for VRAM headroom.
What is the KV cache in LLM?
It holds previously computed attention keys and values so the model avoids redundant calculation when generating each new token.
What is key-value store cache?
A key-value store cache is a data structure that stores data indexed by unique keys for fast retrieval.
What do you mean by throughput?
Throughput is the amount of work a system completes per unit of time, such as tokens generated per second; compute it as output divided by elapsed time.
What is throughput vs bandwidth?
Throughput measures data actually processed over time; bandwidth is the maximum possible transfer rate.
What is an example of throughput?
Generating 100 output tokens per second during LLM inference is a throughput of 100 TPS.
What's another word for throughput?
Output rate or processing rate.
Is 95 latency good?
A 95th percentile latency under 500 milliseconds is generally considered good for real-time applications.
What is p95 tail latency?
P95 tail latency is the latency value below which 95% of requests complete, indicating worst-case performance for most users.
What is P50 latency?
P50 latency is the median latency, meaning half of the requests complete faster and half slower than this value.
Is p95 an average?
No. p95 is a percentile: the value below which 95% of observations fall. Averages hide tail latency, which is usually what users complain about.
What is the concept of error budget?
An error budget is the amount of unreliability an SLO permits over its window; spending it deliberately is how you balance reliability work against shipping changes.
How do you calculate error budget?
Error budget = (1 − SLO target) × total requests or total time. For example, a 99.5% SLO over 10 million monthly requests leaves a budget of 50,000 failed or too-slow requests.
What is the difference between error budget and SLO?
The SLO sets the target (for example, 99% of requests under 800 ms TTFT); the error budget is the allowance left over, i.e. how much you can miss that target before the SLO is breached.
What is a budget error?
In this context it usually means the error budget has been exhausted, a signal that reliability work should take priority over new changes until the budget recovers.
What is a GPU memory?
GPU memory (VRAM) is the dedicated memory on a graphics processing unit, used during inference to hold model weights, activations, and the KV cache.
How much GPU memory do I need?
It depends on model size, precision, batch size, and context length: weights take roughly parameter count × bytes per parameter, and the KV cache grows with every concurrent token in flight. Larger models and bigger batches need more.
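A rough back-of-the-envelope sketch, assuming FP16 weights and an FP16 KV cache and ignoring grouped-query attention, quantization, and runtime overhead; the layer count and hidden size below are typical for a 7B-class model and should be swapped for your model's actual configuration:

```python
def estimate_vram_gb(params_b: float, n_layers: int, hidden_size: int,
                     max_tokens_in_flight: int, bytes_per_value: int = 2) -> float:
    """Very rough VRAM estimate: FP16 weights plus FP16 KV cache, no overhead."""
    weights = params_b * 1e9 * bytes_per_value
    # KV cache per token: keys + values for every layer across the hidden dimension.
    kv_per_token = 2 * n_layers * hidden_size * bytes_per_value
    return (weights + kv_per_token * max_tokens_in_flight) / 1e9

# Example: 7B model, 32 layers, hidden size 4096, 8 concurrent sequences x 4k context.
print(f"{estimate_vram_gb(7, 32, 4096, 8 * 4096):.1f} GB")  # ~31 GB before overhead
```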
What is using GPU memory?
“Using GPU memory” refers to allocating VRAM for model weights, the KV cache, and intermediate activations during inference.
How do I check GPU memory?
On NVIDIA GPUs, run nvidia-smi for a point-in-time view, or export the same counters continuously into your monitoring stack so headroom shows up on the dashboards above.
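For a programmatic check, a minimal sketch assuming the nvidia-ml-py (pynvml) package and an NVIDIA driver are installed:

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)    # sizes in bytes
    free_mb = mem.free / 1024**2
    total_mb = mem.total / 1024**2
    headroom = mem.free / mem.total
    print(f"free: {free_mb:.0f} MiB / {total_mb:.0f} MiB ({headroom:.0%} headroom)")
finally:
    pynvml.nvmlShutdown()
```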