
Which GPU should you use for LLM inference

Most inference problems are memory problems in disguise. If the model and its cache fit with headroom, you can batch requests and keep latency steady. If memory is tight, everything slows down. Start with VRAM, then think about speed and price.

The underlying GPU architecture matters too: memory bandwidth, tensor cores, and interconnect speed determine how efficiently a card serves large models and how well it scales past a single GPU.

On Compute, you can launch a vLLM server on single‑ or multi‑GPU presets, including 4090‑class and 5090‑class options where available. Pick France or UAE regions to keep endpoints close to users.

A quick decision path

  1. Pick the smallest model that solves the task. Try 7B before 13B. Use evals, not vibes.
  2. Estimate context and outputs honestly. Long chats and big prompts eat memory.
  3. Target concurrency. How many users at once with acceptable TTFT/TPS?
  4. Choose VRAM to fit model + cache + batch. Larger models, longer contexts, and bigger batches all need more memory; if you are close to the limit, step up a tier or use a lower‑precision format (int8, FP8, int4) to trade a little accuracy for headroom and throughput. A back‑of‑the‑envelope sketch follows this list.
  5. Decide single vs multi‑GPU. Go multi when one card cannot meet memory or throughput needs; extra GPUs add communication overhead, complexity, and cost.
  6. Place the endpoint near users. Regional latency matters more than micro‑optimizations.
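
To make step 4 concrete, here is a back‑of‑the‑envelope fit check in Python. The layer count, hidden size, KV‑head count, and 20% overhead factor are illustrative assumptions for a Llama‑style 7B model, not measured values; swap in your model's real config and verify against an actual deployment.

```python
# Rough VRAM fit check: weights + KV-cache + overhead vs. a candidate card.
# All defaults are illustrative assumptions for a Llama-style 7B model.

def estimate_vram_gb(
    params_b: float = 7.0,            # parameters, in billions
    bytes_per_param: float = 2.0,     # 2 = FP16/BF16, 1 = int8, 0.5 = int4
    n_layers: int = 32,
    hidden_size: int = 4096,
    n_heads: int = 32,
    n_kv_heads: int = 32,             # e.g. 8 for GQA models such as Mistral 7B
    kv_bytes_per_value: float = 2.0,  # FP16 KV-cache
    context_tokens: int = 4096,
    concurrent_seqs: int = 4,
    overhead_factor: float = 1.2,     # activations, CUDA graphs, fragmentation (assumption)
) -> float:
    weights = params_b * 1e9 * bytes_per_param
    head_dim = hidden_size // n_heads
    # K and V per token: 2 * layers * kv_heads * head_dim * bytes
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_value
    kv_cache = kv_per_token * context_tokens * concurrent_seqs
    return (weights + kv_cache) * overhead_factor / 1e9

if __name__ == "__main__":
    need = estimate_vram_gb()
    vram = 24  # candidate card, GB
    print(f"Estimated working set: {need:.1f} GB vs a {vram} GB card")
    print("Fits with headroom" if need < 0.9 * vram
          else "Step up a tier, quantize, or trim context/batch")
```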

Start in seconds with the fastest, most affordable cloud GPU clusters.

Launch an instance in under a minute. Enjoy flexible pricing, powerful hardware, and 24/7 support. Scale as you grow—no long-term commitment needed.

Try Compute now

Model‑to‑VRAM cheat sheet (ballpark)

These are rough ranges for weights only. You still need headroom for the KV‑cache and batching.

  • 7B, FP16: ~14–16 GB
  • 7B, int8: ~7–9 GB
  • 7B, int4: ~4–6 GB
  • 13B, FP16: ~26–28 GB
  • 13B, int8: ~13–16 GB
  • 13B, int4: ~7–9 GB

In practice, serving an LLM is mostly a memory‑planning exercise: budget for the weights, the KV‑cache, and the batch you intend to run before worrying about raw compute.

Add cache headroom: longer contexts and higher concurrency can double or triple the working set. If VRAM sits >90% under load, expect TTFT to creep up.

Single vs multi‑GPU

Single‑GPU is simpler and often faster for 7B‑class models with moderate context. Start here if you can.

Multi‑GPU helps when the model or context does not fit, or when you need more throughput at the same latency target. Use tensor or pipeline parallelism and test batch shapes. Parallelism adds communication overhead, so measure with your real prompts rather than synthetic benchmarks.
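
As a sketch of what scaling out looks like in code, this is roughly how you would shard a model across two GPUs with vLLM's tensor parallelism. The model name, context cap, and memory‑utilization value are placeholders, not recommendations; the right tensor_parallel_size depends on your hardware and your measurements.

```python
# Sketch: serve one model across two GPUs with vLLM tensor parallelism.
# The checkpoint name is a placeholder; use the model you actually evaluated.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-13b-model",  # placeholder checkpoint
    tensor_parallel_size=2,           # shard weights and attention across 2 GPUs
    gpu_memory_utilization=0.90,      # leave some headroom; tune from measurements
    max_model_len=8192,               # cap context to bound KV-cache growth
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Explain in one sentence why batch shape matters."], params)
print(out[0].outputs[0].text)
```

When launching the OpenAI‑compatible server instead, the same setting is exposed as the --tensor-parallel-size flag.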

Consumer vs data‑center parts

Consumer GPUs (e.g., 4090‑class, 5090‑class): great price‑performance for 7B–13B models. The RTX 4090 (Ada Lovelace, 24 GB) gives developers and small teams strong throughput for LLM inference alongside creative workloads, and it works well for dedicated endpoints where you control traffic.

Data‑center GPUs (e.g., A100 80GB, H100 80GB, L40S 48GB): built for sustained load, with ECC memory, large memory pools, and, on the A100/H100, NVLink interconnects. The A100 (Ampere) combines high memory capacity with Multi‑Instance GPU (MIG) partitioning, so one card can serve several isolated workloads. The H100 (Hopper) adds a Transformer Engine that accelerates training and inference of transformer models. The L40S (Ada Lovelace) covers mixed AI and graphics workloads with 48 GB of VRAM. Reach for these when you need long contexts, bigger models, or strict reliability.

If you need ECC, long uptimes, or NVLink, lean data‑center. If you want maximum tokens per euro on small‑to‑mid models, consumer cards win.

Latency and throughput, briefly

  • TTFT is dominated by queueing and prefill. Bigger prompts and tighter memory headroom increase it.
  • Tokens per second (TPS) rises with healthy batching and decode efficiency. More VRAM → larger active batch → higher TPS. Tensor Cores and lower‑precision formats (FP16/FP8/int8) raise decode throughput with little accuracy loss. A simple way to measure both metrics is sketched after this list.
  • Network placement can add 50–100 ms in a blink; keep endpoints close to users.
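
A minimal way to see TTFT and TPS for yourself is to stream one completion and time it. The sketch below assumes an OpenAI‑compatible endpoint (such as a vLLM server) at a placeholder base URL and model name, and it approximates tokens by counting streamed chunks.

```python
# Sketch: rough TTFT and decode TPS against an OpenAI-compatible endpoint.
# Base URL and model name are placeholders for your own deployment.
import time
from openai import OpenAI

client = OpenAI(base_url="http://your-endpoint:8000/v1", api_key="unused")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Explain KV-cache growth in two sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # one chunk is roughly one token; good enough for a ballpark

end = time.perf_counter()
if first_token_at is not None:
    print(f"TTFT: {first_token_at - start:.2f}s")
    print(f"TPS (approx): {chunks / max(end - first_token_at, 1e-6):.1f}")
```

Run it repeatedly under realistic concurrency to get p50/p95 numbers rather than a single sample.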

Power, thermals, and reliability

Hot cards throttle. Sustained load needs good airflow and power headroom. Data‑center parts are built for this; consumer cards can do it with care. Monitor temps and clocks.
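
For a quick, scriptable view of temperature, clocks, and memory headroom, the NVML Python bindings (the pynvml / nvidia-ml-py package) expose the same counters nvidia-smi reads. A minimal polling sketch, assuming a single GPU at index 0:

```python
# Sketch: poll GPU 0 for memory headroom, temperature, and SM clock via NVML.
# Requires the pynvml (nvidia-ml-py) package and an NVIDIA driver.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(5):  # poll a few times; run this while serving real traffic
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
    headroom = 100 * (1 - mem.used / mem.total)
    print(f"memory headroom {headroom:.0f}%  temp {temp}C  SM clock {sm_clock} MHz")
    time.sleep(2)

pynvml.nvmlShutdown()
```

A sustained drop in SM clock at high temperature under load is the signature of throttling.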

Region placement

Place the endpoint where most users are. EU users benefit from France. Middle‑East markets benefit from UAE. Cross‑region calls add latency you cannot optimize away in code.

A budgeting approach you can reuse

  1. Estimate tokens/day. Include prompt + output.
  2. Divide by TPS/GPU at your target quality and model.
  3. That gives GPU‑hours/day. Multiply by your hourly rate (a short script version of these steps follows the list).
  4. Run a sensitivity check. Vary context and max tokens; these swing cost the most.
  5. Decide on redundancy. One spare node costs money but saves incidents.
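
Here is the same arithmetic as a tiny script, so the sensitivity check is easy to rerun when traffic or context assumptions change. Every input value below is a placeholder; use your own measurements and your provider's actual rate.

```python
# Sketch: GPU budget from tokens/day, measured TPS, and an hourly rate.
# All inputs are placeholders; plug in your own numbers.

tokens_per_day = 50_000_000   # prompt + output tokens, estimated
tps_per_gpu = 900             # measured tokens/sec for your model at target quality
hourly_rate_eur = 1.50        # your provider's per-GPU hourly price
spare_nodes = 1               # redundancy you keep warm around the clock

gpu_hours = tokens_per_day / tps_per_gpu / 3600
cost = gpu_hours * hourly_rate_eur + spare_nodes * 24 * hourly_rate_eur
print(f"GPU-hours/day: {gpu_hours:.1f}  cost/day: {cost:.2f} EUR")

# Sensitivity check: longer contexts and bigger outputs usually mean lower TPS per GPU.
for factor in (1.0, 0.7, 0.5):
    hours = tokens_per_day / (tps_per_gpu * factor) / 3600
    print(f"TPS x{factor:.1f} -> {hours:.1f} GPU-hours/day")
```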

Monitoring that pays off

  • TTFT p50/p95 under rising load
  • TPS p50/p95 at steady traffic
  • GPU memory headroom and cache hit rate
  • Thermal throttling events
  • Error rates (OOM, timeouts, 5xx)

Quick checklist

  • Start with the smallest model that passes evals.
  • Choose VRAM with headroom for context and batch.
  • Prefer single‑GPU until you must scale out.
  • Stream responses and cap max_tokens.
  • Place endpoints in the region users live in.
  • Watch TTFT/TPS, memory, temps, and errors.

Try Compute today
On Compute, pick from single 4090‑class all the way to multi‑GPU presets, with France and UAE regions. Launch a vLLM server and point your OpenAI client at the new base URL.
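
For reference, pointing an OpenAI client at a vLLM endpoint is a small change: set the base URL to the server and pass the model name it is serving. Both values below are placeholders for your own deployment.

```python
# Sketch: use the OpenAI Python client against a vLLM OpenAI-compatible server.
# Replace the base URL and model name with your deployment's values.
from openai import OpenAI

client = OpenAI(base_url="http://your-endpoint:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```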

Final recommendations for choosing the best GPU for LLM inference

Pick GPUs by VRAM first, then speed, then price. Keep endpoints close to users, stream responses, and watch TTFT and memory. Let clean measurements, not spec sheets, drive upgrades.

Ready to test? Launch a vLLM endpoint on Compute, choose your region and preset, and compare TTFT/TPS before you commit to a bigger card.

FAQ

What GPU is enough for a 7B chat model?

A 24 GB card usually works well, especially with int8 or int4 variants and sensible caps. Keep headroom for the cache and batching.

When do I need multi‑GPU?

When the model or context will not fit on one card with headroom, or when you need higher throughput at the same latency target.

Do I need NVLink?

Helpful for very large models and long contexts across multiple GPUs. For 7B–13B with moderate context, you can often stay on a single card.

4090 vs A100 vs H100—how should I think about it?

4090‑class offers strong price‑performance for small‑to‑mid models. A100/H100 add big VRAM pools, ECC, and fast interconnects for heavy duty, long context, and strict uptime. The H100 also includes a Transformer Engine that speeds up transformer training and inference; NVIDIA quotes up to 30x faster inference and up to 9x faster training than the A100, but those figures come from specific configurations, so expect smaller gains on typical serving workloads.

What changes for long context (32k+)?

Cache growth dominates. Either step up to more VRAM per node, trim prompts via RAG, or split across GPUs with care.

Will quantization let me drop a GPU tier?

Often yes. Start with int8; move to int4 only if your evals hold steady.
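
If you take that route with vLLM, loading a pre‑quantized checkpoint is usually a one‑argument change. The checkpoint name below is a placeholder for whichever AWQ or GPTQ build of your model you have evaluated; rerun your evals after switching.

```python
# Sketch: load a pre-quantized (AWQ) checkpoint in vLLM to shrink the VRAM footprint.
# The checkpoint name is a placeholder; keep your evals as the gate for switching.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-7b-model-awq",  # placeholder AWQ build of your model
    quantization="awq",                  # or "gptq", matching the checkpoint
    max_model_len=4096,
)
out = llm.generate(["Quick smoke test."], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```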

Do you need GPU for LLM inference?

Yes, GPUs are essential for efficient LLM inference as they provide the parallel processing power needed to handle the large number of parameters and matrix operations involved. While CPUs can run inference, GPUs significantly accelerate the process and reduce latency.

What GPU do I need for LLM?

The choice depends on the model size and workload. For smaller models like 7B, consumer GPUs with around 24 GB VRAM (e.g., RTX 4090) often suffice. Larger models or workloads requiring long context windows may need data-center GPUs like the NVIDIA A100 or H100, which offer more memory and features like NVLink. The RTX 4090's 24 GB of GDDR6X memory is enough to run models in the 7B–13B range, and to fine-tune them with memory-efficient methods.

What GPU does OpenAI use for inference?

OpenAI typically uses high-end data-center GPUs such as NVIDIA A100 and H100 for inference to handle large-scale models efficiently, benefiting from their large memory capacity, tensor cores, and multi-instance GPU capabilities.

How to choose GPU for inference?

Consider the model size, required VRAM, throughput needs, latency targets, and budget. Start by ensuring the GPU has enough memory for the model and its cache, then evaluate performance factors like CUDA cores, tensor cores, and memory bandwidth. Also, consider single vs multi-GPU setups based on workload scale.

How much faster is A100 than 4090?

Performance varies by workload. The A100's 80 GB of HBM and higher memory bandwidth help with large models, long contexts, and big batches, while the RTX 4090's raw throughput often matches or beats it on smaller models at a much lower price. Measure with your own model and batch shapes before deciding.

What is the difference between H100 and A100 vs RTX 4090?

The H100 and A100 are data-center GPUs optimized for AI workloads, with larger VRAM pools, ECC, NVLink, and multi-instance GPU support. The RTX 4090 is a consumer GPU with excellent performance per euro for smaller models, but it lacks those enterprise features and the large memory pools of the H100/A100.

Is the Nvidia A100 still relevant?

Yes, the A100 remains highly relevant for large-scale AI training and inference, offering an excellent balance of performance, memory capacity, and enterprise features, especially for workloads requiring large models and multi-GPU configurations.

Why is 4090 being discontinued?

There is no official confirmation that the NVIDIA RTX 4090 is being discontinued. Any rumors should be verified through official NVIDIA announcements. Typically, product discontinuation is due to new generation releases or supply chain changes.

How much VRAM does a 7B model need?

A 7B model typically requires around 14–16 GB of VRAM in FP16 precision, with less needed if using quantization techniques like int8 or int4. Additional memory headroom is needed for cache and batching.

What GPU will run 7B model?

GPUs with at least 16 GB VRAM, such as the NVIDIA RTX 4090 or A100 40GB, can run 7B models efficiently, especially when using quantization and optimized batching.

How much VRAM does Vega 7 have?

The AMD Vega 7 integrated GPU typically shares system memory and does not have dedicated VRAM. The amount available depends on system configuration, usually ranging from 2 to 4 GB of shared memory.

What GPU do you need for Mistral 7B?

Mistral 7B, being a 7 billion parameter model, requires a GPU with at least 16 GB of VRAM for efficient inference, such as the NVIDIA RTX 4090 or equivalent data-center GPUs, with quantization potentially reducing memory needs.
