
Compute now supports vLLM inference servers

Compute just got a major update. You can now launch an inference server with vLLM in only a few clicks. Pick a model, choose your hardware, and you’re ready to go.

It’s fast to get started, but if you want to fine-tune things, the settings are there: context length, sampling, memory use, and more.
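To give a feel for what those settings map to, here's a minimal sketch using vLLM's own Python API directly (not the Compute UI). The model name and the specific values for context length, sampling, and memory fraction are illustrative, not defaults we recommend.

```python
from vllm import LLM, SamplingParams

# Illustrative values only: pick the model and limits that match your hardware.
llm = LLM(
    model="tiiuae/Falcon3-7B-Instruct",   # any model from the catalog
    max_model_len=8192,                   # context length
    gpu_memory_utilization=0.90,          # fraction of GPU memory vLLM may use
)

# Sampling is configured per request.
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

outputs = llm.generate(["Explain what an inference server does."], params)
print(outputs[0].outputs[0].text)
```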

We’re starting with Falcon3. Right now, you’ll find Falcon3 3B, Falcon3 Mamba-7B, Falcon3 7B, and Falcon3 10B in the catalog. But that’s only the beginning. Llama, Mistral, Qwen, and GPT-OSS are on the way.

And if the model you need isn’t listed, let us know. We’ll add it!

Watch the video: Introduction to the vLLM Feature and Its Configuration

The instance flow has also been rebuilt. It’s easier to follow and works the same whether you’re spinning up a general-purpose GPU instance or an inference server. You’ll notice more connection options, too. HTTPS stays the default, but you can now open TCP and UDP ports, keep SSH sessions alive across interruptions with tmux, or launch straight into Jupyter.
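Over the default HTTPS connection, the vLLM server exposes an OpenAI-compatible API, so a client call might look like the sketch below. The hostname and API key are placeholders; use the endpoint and credentials shown for your instance.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-instance.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                           # placeholder credential
)

response = client.chat.completions.create(
    model="tiiuae/Falcon3-7B-Instruct",  # the model served on your instance
    messages=[{"role": "user", "content": "Hello from my new inference server!"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```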

Pricing is still simple. You see the hourly cost before you launch, and billing is per second, paid in credits. You can start small on a single RTX 4090 or scale up to an eight-way RTX 5090 cluster, depending on the model you choose. Servers are live in the UAE and France, with more locations on the way.
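A quick back-of-the-envelope check of per-second billing, with a purely hypothetical hourly rate; substitute the price shown in the launch dialog.

```python
hourly_rate = 0.80          # credits per hour (illustrative, not a quoted price)
runtime_seconds = 35 * 60   # e.g. a 35-minute benchmark run

cost = hourly_rate / 3600 * runtime_seconds
print(f"Estimated cost: {cost:.2f} credits")  # -> Estimated cost: 0.47 credits
```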

This is a big step for Compute. Inference is built in, easy to use, and flexible enough to handle serious workloads. We can’t wait to see what you run on it, and which models you’ll ask us to add next.

Launch your first inference server
