Slow responses usually come from three things: prompts are too big, batches are shaped poorly, or the cache is out of room. Fix those before you shop for more GPUs.
Try Compute today: Launch a dedicated vLLM endpoint on Compute in France (EU), USA, or UAE. Set tight caps, keep traffic in‑region, and measure TTFT/TPS with your own prompts.
Symptoms and likely causes
High TTFT usually points to oversized prompts, a cold or overflowing KV cache, or an endpoint far from users. Low or uneven TPS usually points to poorly shaped batches, long uncapped outputs, or a queue that keeps growing under load.
Quick triage (5‑minute checklist)
- Region — is the endpoint close to users?
- Streaming — is stream: true on?
- Caps — are max_tokens and context limits set per route?
- Prompt size — trim history and system prompts; shorter inputs mean less prefill work, so TTFT drops.
- Headroom — VRAM ≥ 10–20% free at peak?
- Queue — is queue length steady under load, or climbing?
- Logs — do you record TTFT/TPS per request with request IDs?
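If you don't log TTFT/TPS yet, a lightweight wrapper is enough to start. Here's a minimal Python sketch that wraps any token stream and logs TTFT, token count, and decode TPS under a request ID; the stream source and log format are placeholders for whatever your stack uses.

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("llm-metrics")

def log_stream_metrics(token_stream, request_id=None):
    """Wrap any token iterator and log TTFT and decode TPS for that request."""
    request_id = request_id or str(uuid.uuid4())
    start = time.monotonic()
    first_token_at = None
    token_count = 0
    for token in token_stream:
        if first_token_at is None:
            first_token_at = time.monotonic()   # time to first token
        token_count += 1
        yield token
    end = time.monotonic()
    ttft = (first_token_at - start) if first_token_at else -1.0
    decode_time = (end - first_token_at) if first_token_at else 0.0
    tps = token_count / decode_time if decode_time > 0 else 0.0
    log.info("request_id=%s ttft_s=%.3f tokens=%d tps=%.1f",
             request_id, ttft, token_count, tps)

# usage: wrap whatever yields output tokens for a request
# for tok in log_stream_metrics(stream_chat_tokens(prompt), request_id=req_id): ...
```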
Fast fixes (today)
Client side
- Turn on streaming. Users see tokens sooner, perceived latency drops, and early cancels free capacity (see the client sketch after this list).
- Trim prompts. Remove boilerplate, collapse history, keep examples minimal.
- Tighten caps. Set max_tokens per route (chat: 128–256; summaries: 256–512).
- Retries with jitter. Only on 429/5xx/timeouts; cap attempts.
- Abort on stop. Wire a Stop button that cancels the stream to free server slots.
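Several of these client-side fixes fit in one small function. The sketch below assumes an OpenAI-compatible vLLM endpoint; the URL, key, and model name are placeholders. It streams the response, enforces a per-route max_tokens cap, and retries only on 429/5xx/timeouts with jittered backoff.

```python
import json
import random
import time

import requests

ENDPOINT = "https://your-endpoint.example/v1/chat/completions"  # placeholder URL
API_KEY = "replace-me"                                          # placeholder key
RETRYABLE = {429, 500, 502, 503, 504}

def stream_chat(prompt: str, max_tokens: int = 256, max_attempts: int = 3) -> str:
    payload = {
        "model": "your-model",                        # pin the exact deployed model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,                     # tight per-route cap
        "stream": True,                               # stream so users see tokens early
    }
    last_err: Exception = RuntimeError("no attempts made")
    for attempt in range(max_attempts):
        try:
            with requests.post(
                ENDPOINT,
                json=payload,
                headers={"Authorization": f"Bearer {API_KEY}"},
                stream=True,
                timeout=(5, 60),                      # connect / read timeouts
            ) as resp:
                if resp.status_code in RETRYABLE:     # retry only 429/5xx
                    last_err = requests.HTTPError(f"retryable status {resp.status_code}")
                else:
                    resp.raise_for_status()           # other 4xx fail fast, no retry
                    pieces = []
                    for line in resp.iter_lines():
                        if not line or not line.startswith(b"data: "):
                            continue
                        data = line[len(b"data: "):]
                        if data == b"[DONE]":
                            break
                        event = json.loads(data)
                        if event.get("choices"):
                            pieces.append(event["choices"][0]["delta"].get("content") or "")
                    return "".join(pieces)
        except (requests.Timeout, requests.ConnectionError) as err:
            last_err = err
        if attempt < max_attempts - 1:
            time.sleep((2 ** attempt) + random.uniform(0, 1))  # backoff with jitter
    raise last_err
```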
Server side
- Right‑size batch shape. Keep many small decodes over a few long ones.
- Protect the cache. Enforce context caps; evict fairly; watch hit/miss.
- Disable proxy buffering on streaming routes; set keep‑alive timeouts.
- Token‑aware limits. Set TPM and per‑key concurrency to avoid starvation (a limiter sketch follows this list).
- Pin models. Avoid surprise upgrades that change speeds.
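For token-aware limits, a simple in-process sketch illustrates the idea. The TPM budget, concurrency cap, and class names below are illustrative assumptions; in production you would back this with your gateway or a shared store.

```python
import threading
import time
from collections import defaultdict

TPM_LIMIT = 60_000          # tokens per minute allowed per API key (assumption)
MAX_CONCURRENCY = 4         # in-flight requests allowed per API key (assumption)

class KeyLimiter:
    """Per-key tokens-per-minute budget plus a per-key concurrency cap."""

    def __init__(self):
        self.lock = threading.Lock()
        self.window_start = defaultdict(float)
        self.tokens_used = defaultdict(int)
        self.in_flight = defaultdict(int)

    def try_admit(self, key: str, estimated_tokens: int) -> bool:
        """Admit a request only if the key has TPM budget and a free slot."""
        now = time.monotonic()
        with self.lock:
            # reset the per-minute window when it expires
            if now - self.window_start[key] >= 60:
                self.window_start[key] = now
                self.tokens_used[key] = 0
            if self.in_flight[key] >= MAX_CONCURRENCY:
                return False  # too many concurrent requests for this key
            if self.tokens_used[key] + estimated_tokens > TPM_LIMIT:
                return False  # would blow the tokens-per-minute budget
            self.tokens_used[key] += estimated_tokens
            self.in_flight[key] += 1
            return True

    def release(self, key: str, actual_tokens: int, estimated_tokens: int) -> None:
        """Call when the request finishes; reconcile the token estimate."""
        with self.lock:
            self.in_flight[key] -= 1
            self.tokens_used[key] += actual_tokens - estimated_tokens
```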
Durable fixes (this quarter)
- Switch to continuous batching. Admit/retire requests every step and measure fairness; engines like vLLM do this out of the box.
- Adopt RAG over long contexts. Fetch only what you need; prompts shrink and TTFT drops (a retrieval sketch follows this list).
- Quantize wisely. Try int8 first; move to int4 only after quality checks, since accuracy loss depends on the model and the method (GPTQ, AWQ).
- Place endpoints by region. EU in France; US in USA‑East; Middle East in UAE.
- Consider multi‑GPU only after you prove caps and cache are healthy, and weigh model size: parameter count drives both speed and memory needs.
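To make the RAG point concrete, here's a toy retrieval sketch. The word-overlap scorer and sample documents are stand-ins for a real embedding index; the point is that a small, budgeted context replaces a long pasted history.

```python
def score(query: str, chunk: str) -> int:
    """Toy relevance score: count shared words between query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_prompt(question: str, chunks: list[str], top_k: int = 3,
                 budget_chars: int = 1500) -> str:
    """Build a short prompt from only the top-ranked chunks, under a hard budget."""
    ranked = sorted(chunks, key=lambda c: score(question, c), reverse=True)
    context, used = [], 0
    for chunk in ranked[:top_k]:
        if used + len(chunk) > budget_chars:   # hard cap keeps the prompt short
            break
        context.append(chunk)
        used += len(chunk)
    return ("Answer using only the context below.\n\n"
            + "\n---\n".join(context)
            + f"\n\nQuestion: {question}\nAnswer:")

docs = ["Refunds are processed within 5 business days.",
        "Shipping to the EU takes 3-7 days.",
        "Support is available 24/7 via chat."]
print(build_prompt("How long do refunds take?", docs))
```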
Quantization and other advanced techniques
Quantization helps you run large language models faster and use less memory. You convert model weights from higher precision formats like 16-bit floats to lower precision ones like 4-bit integers. This shrinks your model size and cuts memory needs. More of your model and its kv cache fits in GPU memory, so you get faster data access and lower latency when the model runs. When you're building generative AI, this means better performance and lower costs, whether you're handling many requests or working with bigger models.
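The memory math is easy to sanity-check. A rough sketch for a hypothetical 7B-parameter model, ignoring activations and the KV cache:

```python
# Back-of-the-envelope weight memory for a hypothetical 7B-parameter model.
params = 7e9
for name, bytes_per_weight in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_weight / 1024**3
    print(f"{name}: ~{gib:.1f} GiB of weights")
# fp16: ~13.0 GiB, int8: ~6.5 GiB, int4: ~3.3 GiB; the saved VRAM goes to KV cache.
```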
You've got several quantization methods to choose from. Each comes with trade-offs. Post-training techniques like GPTQ and AWQ work well for LLMs. AWQ uses a hardware-aware, data-driven approach to compress model weights. It often gives you better performance and less accuracy loss on modern instruction-tuned models. Pick the right method for your needs. Smaller models and lower precision boost speed and cut costs, but they might hurt output quality if you don't test carefully.
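As a concrete example, loading an AWQ checkpoint with vLLM's offline API looks roughly like this; the model name and settings are placeholders for your deployment.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-model-AWQ",   # placeholder: a checkpoint already quantized with AWQ
    quantization="awq",                # tell vLLM the weights are AWQ-quantized
    max_model_len=4096,                # cap context to protect the KV cache
)
params = SamplingParams(temperature=0.2, max_tokens=256)
out = llm.generate(["Summarize the refund policy in two sentences."], params)
print(out[0].outputs[0].text)
```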
Continuous batching keeps your LLM serving at high throughput. Instead of waiting for a full batch of requests, the scheduler admits new requests and retires finished ones at every decode step, so your GPU stays busy with minimal idle time. Frameworks like vLLM use this approach: they generate output tokens for in-flight requests while accepting new ones, which improves both throughput and how fast users see responses. When you need low latency and high responsiveness under real traffic, continuous batching is the right default.
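A minimal sketch of that behavior with vLLM's offline API: hand the engine a mixed batch and let its scheduler interleave short and long requests. The model name and prompts are placeholders.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/your-model", max_model_len=4096)  # placeholder model
prompts = (
    ["Reply with a one-line greeting."] * 16                 # many short decodes
    + ["Write a detailed 300-word product summary."] * 2     # a few long ones
)
# The engine batches continuously: short requests finish and free slots while
# long ones keep decoding, instead of everyone waiting for the slowest request.
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
for out in outputs:
    print(len(out.outputs[0].token_ids), "tokens for:", out.prompt[:40])
```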
FlashAttention speeds up LLMs through better attention mechanisms. It restructures attention computation to reduce memory bandwidth bottlenecks. Your model can process longer sequence lengths and larger contexts more efficiently. This helps when you're working with huge amounts of data or generating long outputs.
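You rarely call FlashAttention directly, but the operation it accelerates is the standard fused attention call. The sketch below uses PyTorch's scaled_dot_product_attention, which can dispatch to a flash-style kernel on supported GPUs; it assumes a CUDA device and illustrative shapes.

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 32, 4096, 128
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused attention avoids materializing the full seq_len x seq_len score matrix,
# which is what makes long contexts tractable.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (1, 32, 4096, 128)
```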
Your hardware and configuration choices matter. Use GPUs with enough kv cache and optimize your memory hierarchy. Pick the right model size and sequence length for what you're building. You'll balance speed, cost, and output quality. Larger models usually give better results but need more resources. Smaller models run faster and cost less.
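A quick way to reason about "enough kv cache" is to estimate bytes per token. The numbers below assume a hypothetical Llama-7B-style config in fp16; real models (for example, those using grouped-query attention) will differ.

```python
# Rough KV-cache sizing for a hypothetical 32-layer, 4096-hidden, fp16 model.
layers, hidden, bytes_per_elem = 32, 4096, 2
kv_bytes_per_token = 2 * layers * hidden * bytes_per_elem   # K and V per token
seq_len, concurrent_seqs = 4096, 8
total_gib = kv_bytes_per_token * seq_len * concurrent_seqs / 1024**3
print(f"~{kv_bytes_per_token / 1024:.0f} KiB per token, ~{total_gib:.1f} GiB for the batch")
# ~512 KiB per token, ~16.0 GiB for 8 concurrent 4096-token sequences
```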
Combine quantization, continuous batching, and techniques like FlashAttention. You'll get better performance, lower latency, and reduced costs for your large language models. Understand the trade-offs and tailor your approach to your specific needs. You can deliver faster, more efficient generative AI services without spending extra on hardware.
A test plan that catches the real problems
- Seed set — 30–60 real prompts (short + long), including a few short prompts that produce long outputs.
- Ramp — increase RPS until TTFT p95 crosses your target; log output tokens per second at each step (see the ramp sketch below).
- Mix — combine short and long prompts to expose fairness issues.
- Cancel storms — ensure KV‑cache frees fast on abort.
- Hot swap — change model or quantization, then compare TTFT/TPS and output quality against the previous setup.
- Failure drill — kill one node; check retries and user messaging.
Track results across test runs so regressions show up early, and review generated outputs for quality as well as speed. One common planning mistake: assuming quantization mainly speeds up arithmetic, when its main win is memory footprint and bandwidth. And remember that a token can be a word, part of a word, or punctuation, so token counts are not word counts.
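Here's a minimal ramp sketch against an OpenAI-compatible endpoint: it raises concurrency step by step, measures TTFT per request, and stops once p95 crosses a target. The URL, key, model, target, and seed prompt are placeholders.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "https://your-endpoint.example/v1/chat/completions"  # placeholder URL
HEADERS = {"Authorization": "Bearer replace-me"}                # placeholder key

def measure_ttft(prompt: str) -> float:
    """Time from request start to the first streamed token event."""
    payload = {"model": "your-model", "stream": True, "max_tokens": 128,
               "messages": [{"role": "user", "content": prompt}]}
    start = time.monotonic()
    with requests.post(ENDPOINT, json=payload, headers=HEADERS,
                       stream=True, timeout=(5, 60)) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line.startswith(b"data: "):
                return time.monotonic() - start
    return float("inf")

prompts = ["Summarize our refund policy."] * 200    # seed with your real prompts instead

for concurrency in (2, 4, 8, 16, 32):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        ttfts = sorted(pool.map(measure_ttft, prompts[:concurrency * 5]))
    p95 = statistics.quantiles(ttfts, n=20)[-1]      # 95th percentile
    print(f"concurrency={concurrency} ttft_p95={p95:.2f}s")
    if p95 > 1.0:                                    # stop once you cross your target
        break
```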
Try Compute today: Run a vLLM server on Compute. Put it near users, watch TTFT/TPS, and scale only when numbers tell you to.
Fix TTFT and TPS before buying more GPU
Start with prompts, caps, and streaming before you consider hardware upgrades. Keep the cache healthy and batches steady. Place the endpoint close to users. When TTFT drops and tokens per second climbs, you've solved the real problem instead of masking it with hardware.
FAQ
What is TTFT and why does it matter?
Time to first token is when users feel speed. Long TTFT signals big prompts, cold caches, or far regions.
How do I get more TPS without hurting latency?
Keep outputs short, shape batches for many small decodes, and enforce token‑aware limits so large jobs don’t starve others.
Do longer contexts always help?
No. Long contexts raise cost and TTFT. Use retrieval to keep prompts short.
When should I move to multi‑GPU?
Only when the model or cache no longer fits and you’ve already tuned prompts, caps, and scheduling.
How do I know if the kv cache is the problem?
Watch GPU memory headroom and cache hit rate. If TTFT rises while headroom shrinks, tighten context and clear stuck streams.