Serve AI models faster than they can think
Spin up a GPU in seconds. Keep first-token time low and tokens per second high. Pay only for the time you use.


Why teams choose Compute for inference

Managed inference with vLLM
Start serving in minutes using our vLLM template.

All-inclusive pricing
No egress charges. Per-second billing with prepaid credits and optional auto top-up.

Flexible networking
HTTPS, TCP, or UDP. Expose the ports your service needs.

In-region runs
France and UAE clusters keep traffic close to your users.

How it works

Pick a 4090 or 5090 tier.

Launch from a clean PyTorch or vLLM image.

Enable networking (HTTPS/TCP/UDP) and point your app to the endpoint (see the client sketch after these steps).

Save it as a custom template for next time.
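
Once the endpoint is exposed, any OpenAI-compatible client can talk to it. Below is a minimal sketch, assuming the template runs vLLM's standard OpenAI-compatible server behind the HTTPS port you opened; the base URL, API key, and model name are placeholders to replace with your own.

# Minimal client sketch for a vLLM OpenAI-compatible endpoint.
# The base_url, api_key, and model name are placeholders, not real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.example/v1",  # the HTTPS endpoint you exposed
    api_key="placeholder",                        # only needed if you configured auth
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",     # whichever model your instance serves
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)

The same endpoint can be called from any language or framework that speaks the OpenAI API.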
Most models sit idle for long stretches. Pay for the minutes you use, not the hours you don’t.

Performance snapshot

From our benchmark: a dual RTX 5090 configuration reached 7,604 tokens/second with ~45 ms time-to-first-token on Llama-3.1-8B.
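
To see what time-to-first-token and throughput look like for your own prompts, the sketch below measures both over a streaming request to any OpenAI-compatible endpoint. It is an illustration, not the harness behind the numbers above; the URL, model, and prompt are placeholders.

# Rough measurement of time-to-first-token and tokens/second over a streaming request.
# Illustrative only: the endpoint, model, and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint.example/v1", api_key="placeholder")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a 200-word product description."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # streamed chunks, a rough proxy for generated tokens

if first_token_at is not None:
    elapsed = time.perf_counter() - start
    print(f"time to first token: {(first_token_at - start) * 1000:.0f} ms")
    print(f"approx tokens/second: {chunks / elapsed:.1f}")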

Pricing at a glance
On-demand GPUs, billed per second via prepaid credits (worked example below)
All-inclusive: compute, storage, and data transfer included
Welcome bonus: up to €250 on first purchase
Available tiers: RTX 5090 and RTX 4090.
GPUs are on-demand today. Spot capacity is coming soon.
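
As a back-of-the-envelope illustration of per-second billing for a bursty workload (the hourly rate below is a made-up placeholder, not our price list):

# Hypothetical cost comparison under per-second billing; the rate is a placeholder.
HOURLY_RATE_EUR = 1.00                       # made-up €/hour for one GPU
PER_SECOND_RATE = HOURLY_RATE_EUR / 3600

busy_seconds_per_day = 3 * 3600              # instance runs ~3 busy hours, stopped when idle
daily_cost_bursty = busy_seconds_per_day * PER_SECOND_RATE
daily_cost_always_on = 24 * HOURLY_RATE_EUR  # same box left running around the clock

print(f"stopped when idle: €{daily_cost_bursty:.2f}/day")
print(f"always on:         €{daily_cost_always_on:.2f}/day")

The point of the comparison is the idle time you stop paying for, not the specific rate.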

What people run on Compute

Conversational AI for support and tutoring

LLM endpoints tuned for apps and APIs

Voice models for real-time transcription or captions

Do you have any questions?
Do you support vLLM?
Yes. Use the vLLM template to serve models quickly.
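For a quick smoke test inside the instance before you expose an endpoint, vLLM's offline Python API also works. This is a generic vLLM example, not something specific to our template, and the model name is a placeholder.

# Generic vLLM offline-inference smoke test (model name is a placeholder).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain time-to-first-token in one sentence."], params)
print(outputs[0].outputs[0].text)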
Can I keep the service behind HTTPS?
Yes. HTTPS is available alongside TCP and UDP.
Can I pause my instance?
Yes. Stop/Start is available without extra fees for a limited time. See details on the blog.
Which regions are live?
France and UAE.
Do you store my inputs or outputs?
No. Logs and data stay in your instance unless you choose to persist them.
Where does my data live?
Runs stay in the region you choose.