RTX 4090 and 5090s can match — and sometimes beat — the A100

Consumer GPUs aren’t just for gaming anymore. Here’s what our tests show.

‍

Consumer GPUs are catching up. Our latest benchmarks show that the RTX 5090 — and even the 4090 — can match or beat an A100 for small and medium LLM inference. Faster responses, higher throughput, and lower costs make them a serious option for anyone building or scaling AI workloads.

---

The A100 has long been the gold standard for high-performance inference. But in our latest benchmarks, the new RTX 5090 — and even the older 4090 — are proving that consumer-grade GPUs can hold their own. In some cases, they outperform the A100 while costing far less.

We ran inference tests on an 8B LLaMA 3.1 Instruct model using the vLLM benchmark suite and the ShareGPT dataset. The goal was simple: see how the 4090 and 5090 stack up against an A100 for small to medium LLM deployments, both in low-load (interactive) and high-load (throughput-heavy) scenarios.

The short version

‍

RTX 5090 beat the A100 on latency and slightly on throughput in this setup.
- Latency (1 rps): 5090 cut TTFT to ~45 ms vs ~296 ms on A100 (huge for interactive apps) and lowered end-to-end latency by ~14%.
- Throughput (heavy load): 5090 delivered ~3802 tokens/s vs ~3748 tokens/s on A100 (~1.4% higher).
Two 5090s roughly doubled throughput to ~7604 tokens/s, about ~2× an A100 in this test.
RTX 4090 trailed the A100 on both latency and throughput here. It’s strong for its class, but not an A100 replacement at these settings.

If you’re serving small to medium models (like an 8B) and you care about snappy first token and steady tokens/s, a single 5090 already meets or edges past an A100 in our runs. If you scale out with two 5090s, you can clear ~2× the tokens/s of a lone A100 while keeping hardware costs flexible.

That doesn’t make datacenter GPUs obsolete. VRAM still rules for larger models and longer contexts, and A100s shine where memory headroom and multi-instance partitioning matter. But for many production 8B workloads, well-configured consumer GPUs are a practical alternative with real-world gains, especially on TTFT where user perception lives.

Read on for more details on the benchmark.

Benchmark Objectives

‍

Evaluate latency and throughput across different GPU classes.
Determine whether one or multiple consumer-grade GPUs can surpass or match the A100 for small and medium-sized models.
Provide verifiable results for infrastructure decision-making (cost-effective deployment strategies).

‍

Static Configuration

Parameter	Value
Context Length	8192 tokens
Output Length	512 tokens
Model	meta-llama/Meta-Llama-3.1-8B-Instruct
Precision	BF16
Batch Size (auto)	Based on GPU memory
Dataset	82.6 TFLOPs
Benchmark Tool	450 W

‍

Test Scenarios

‍

1. Moderate Load (Latency Test)

Attribute	Value
Request Rate	1 req/s
Number of Prompts	100
Goal	Capture average latency (TTFT, E2E)

2. Extreme Load (Throughput Test)

Attribute	Value
Request Rate	1100 req/s
Number of Prompts	1500
Goal	Measure maximum output token throughput (tokens/sec)

‍

Results & Analysis

‍

Scenario 1 – Latency under Moderate Load (1 req/s)

GPU	Avg ITL (ms)	Avg TPOT (ms)	Avg TTFT (ms)	Avg E2E latency (ms)	Notes
RTX 4090	19	19	349.9	9759.07
RTX 5090	12.14	12.14	45.41	6058.57	E2E: 14% faster than A100. TTFT: 84% faster
A100	13.25	13.25	296.44	7080.9

All GPUs handle moderate load scenarios effectively. However, the RTX 5090 significantly outperforms all other tested GPUs, including the high-end A100, in all latency categories:

The RTX 5090 delivered 14% faster end-to-end latency than the A100.

Time-to-first-token was where it really shined — 84% faster than the A100. That’s a big deal for chatbots, real-time assistants, and anything where responsiveness matters.

The 4090 landed close to the A100’s performance, making it a strong budget-friendly alternative.

‍

Scenario 2 – Throughput under Extreme Load (1100 req/s)

GPU	Avg Token Throughput (Tokens/sec)	Sustained RPS
RTX 4090	737.65	1.47
RTX 5090	3802.09	7.58
A100	3748.16	7.46

The RTX 5090 edged out the A100 in raw throughput, hitting 3,802 tokens/sec vs the A100’s 3,748.

Pairing two 5090s doubled throughput to 7,604 tokens/sec, more than 100% above the A100. And you still spend less than on a single datacenter card.

‍

What this means for you

‍

Across both low-load and high-load inference scenarios with medium-sized model (8B), high-end consumer-grade GPUs demonstrate comparable or superior performance to the A100 datacenter-grade GPU.

Under moderate load (1 req/s), the RTX 4090 offers latencies close to the A100 performances, and the RTX 5090 delivers superior performances.
Under extreme load (1100 req/s), the RTX 5090 achieves slightly higher throughput than the A100, while dual RTX 5090s are expected to deliver ~100% more token throughput, respectively.

While the A100 remains advantageous for certain workloads requiring larger VRAM, these results show that for medium-sized models, some consumer-grade GPUs are viable alternatives, especially when cost, and scalability are key considerations.

If you’re deploying small to medium LLMs, a well-configured 5090 — or a small cluster of them — can rival datacenter-grade hardware. You’ll trade some VRAM headroom, but gain serious cost savings and scalability options. For startups, research teams, or anyone who needs high performance without locking into expensive hardware, consumer GPUs are no longer a compromise.

← Back