
Move from OpenAI to your own LLM endpoint without breaking anything

Teams switch for three reasons: cost, control, and data. You want a private endpoint, predictable performance, and a bill you can explain. The good news is you do not have to rebuild your app to get there. If your client talks to OpenAI today, you can point it to your own server and keep most of your code the same.

Try Compute today
Want a ready endpoint? Launch a vLLM server on Compute, pick a region, and get an HTTPS URL that works with OpenAI SDKs. Change the base URL and key in your client and you are live.

What “OpenAI‑compatible” should mean

Compatibility is not a slogan. It means the server exposes the same routes and payload shapes: /v1/chat/completions, /v1/completions, and streaming with Server‑Sent Events. You still send model, messages, max_tokens, temperature, and the rest. Errors should follow similar structures so your existing handling keeps working.
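To see what that means on the wire, here is a minimal sketch that posts directly to /v1/chat/completions with Python's requests library. The endpoint, key, and model name are placeholders; swap in your own.

import requests

# Raw request against an OpenAI-compatible route; endpoint, key, and model are placeholders.
resp = requests.post(
    "https://YOUR-ENDPOINT/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_KEY"},
    json={
        "model": "f3-7b-instruct",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 50,
        "temperature": 0.5,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])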

Map your models and parameters

Create a small model map in config. Your catalog will differ from OpenAI’s. Keep names stable and document defaults. Start with simple, safe settings: temperature 0.3–0.7, top_p 0.9, and tight max_tokens. Review stop sequences, presence/frequency penalties, and any server‑specific fields.
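One way to structure that map, as a rough sketch. The aliases, model name, and defaults here are illustrative, not a required schema.

MODEL_MAP = {
    "chat-default": {
        "model": "f3-7b-instruct",  # the served model name on your endpoint
        "temperature": 0.5,
        "top_p": 0.9,
        "max_tokens": 300,
    },
    "chat-precise": {
        "model": "f3-7b-instruct",
        "temperature": 0.3,
        "top_p": 0.9,
        "max_tokens": 200,
    },
}

def params_for(alias: str) -> dict:
    # Return a copy so call sites can override per request without mutating defaults.
    return dict(MODEL_MAP[alias])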

Update your client

Python

from openai import OpenAI
client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="f3-7b-instruct",  # example
    messages=[{"role": "user", "content": "Write a short status update."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)

Node

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "https://YOUR-ENDPOINT/v1", apiKey: process.env.KEY });

const resp = await client.chat.completions.create({
  model: "f3-7b-instruct",
  messages: [{ role: "user", content: "Summarize yesterday's meeting notes." }],
  max_tokens: 200
});
console.log(resp.choices[0].message.content);

Keep your retry logic. Keep timeouts sensible. Log the request ID (or provide one) so you can trace problems across client and server.
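A sketch of what that can look like with the OpenAI Python SDK. The X-Request-ID header name is an assumption; match whatever your gateway actually logs.

import uuid
from openai import OpenAI

# Per-request timeout and SDK-level retries; values here are illustrative.
client = OpenAI(
    base_url="https://YOUR-ENDPOINT/v1",
    api_key="YOUR_KEY",
    timeout=30.0,
    max_retries=2,
)

request_id = str(uuid.uuid4())
resp = client.chat.completions.create(
    model="f3-7b-instruct",
    messages=[{"role": "user", "content": "Ping."}],
    max_tokens=20,
    extra_headers={"X-Request-ID": request_id},  # log this value on both client and server
)
print(request_id, resp.choices[0].message.content)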

Start in seconds with the fastest, most affordable cloud GPU clusters.

Launch an instance in under a minute. Enjoy flexible pricing, powerful hardware, and 24/7 support. Scale as you grow—no long-term commitment needed.

Try Compute now

Streaming done right

Use Server‑Sent Events when your UI expects a live stream. Test both paths: happy flow and abrupt cancels. Verify that partial generations are usable if a user stops early. Make sure your client decodes chunks correctly and clears buffers between requests.
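A minimal streaming sketch with the Python SDK, assuming the same placeholder endpoint and model as above. It prints tokens as they arrive and keeps the partial text if the user cancels.

from openai import OpenAI

client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")

parts = []
try:
    stream = client.chat.completions.create(
        model="f3-7b-instruct",
        messages=[{"role": "user", "content": "Draft a short status update."}],
        max_tokens=200,
        stream=True,  # server sends Server-Sent Events; the SDK yields chunks
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
            print(chunk.choices[0].delta.content, end="", flush=True)
except KeyboardInterrupt:
    pass  # abrupt cancel: whatever is in `parts` is still usable
finally:
    print()
    partial_text = "".join(parts)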

Plan for rate limits and retries

Stick to exponential backoff with jitter, and make requests idempotent when you can. On the server side, return 429 for backoff signals, include a Retry-After hint so clients know when to try again, and set token‑aware limits so one chat with huge prompts does not starve others.
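A backoff sketch in Python. The attempt count and base delay are arbitrary defaults, and in real code you would narrow the exception handling to your client's rate-limit and timeout errors.

import random
import time

def with_backoff(call, max_attempts=5, base_delay=0.5):
    # Retry a callable with exponential backoff plus jitter.
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # narrow to rate-limit/timeout errors in real code
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

Wrap the call site, for example: with_backoff(lambda: client.chat.completions.create(...)).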

Keep costs under control

Budget drift comes from long prompts and loose max_tokens. Cap both. Stream results. Cache system prompts. If you use RAG, keep chunks small and relevant. Watch for tokenization mismatches across SDKs.
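A guardrail sketch; the character budget is a crude stand-in for a real token count, so swap in your tokenizer if you need exact numbers.

MAX_OUTPUT_TOKENS = 300
MAX_PROMPT_CHARS = 8000  # crude proxy; replace with a tokenizer-based budget

def clamp_request(messages, max_tokens):
    # Cap the output and drop the oldest non-system turns until the prompt fits.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(len(m["content"]) for m in system + rest) > MAX_PROMPT_CHARS:
        rest.pop(0)
    return system + rest, min(max_tokens, MAX_OUTPUT_TOKENS)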

Measure before and after

Track metrics that matter: time to first token (TTFT) and tokens per second (TPS) for the same tasks on both endpoints. A self‑hosted server near your users, with efficient batching, should improve both. If it does not, right‑size hardware, shorten prompts, and check cache health.
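A measurement sketch using a streaming call. Chunk count is only a rough proxy for tokens, so use the model's tokenizer when you need exact TPS.

import time
from openai import OpenAI

client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="f3-7b-instruct",
    messages=[{"role": "user", "content": "List three testing tips."}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        chunks += 1
end = time.perf_counter()

ttft = (first_token_at or end) - start
tps = chunks / (end - first_token_at) if first_token_at and end > first_token_at else 0.0
print(f"TTFT: {ttft:.2f}s  TPS (approx): {tps:.1f}")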

Privacy and residency

Decide what you log, for how long, and who can read it. Avoid storing raw prompts and outputs unless you must. Use HTTPS. Rotate keys. If you operate in Europe, keep data in‑region and document retention and deletion.
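If you do log requests, here is a sketch of storing a fingerprint instead of the raw prompt; the field names are illustrative.

import hashlib
import json
import time

def log_request(prompt: str, model: str, request_id: str):
    # Record a hash and length instead of the raw prompt text.
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
    }
    print(json.dumps(record))  # replace stdout with your log sink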

Try Compute
Compute endpoints are HTTPS by default. Choose France for EU users or a UAE region for nearby markets.

Migration checklist

  1. Write a model and parameter map.
  2. Swap base URL and key in your client.
  3. Test chat and streaming with a fixed prompt set (see the sketch after this list).
  4. Verify error shapes and retry behavior.
  5. Cap max_tokens and trim prompts.
  6. Add token‑aware rate limits on the gateway.
  7. Log TTFT, TPS, queue length, and memory use.
  8. Pick a region close to users and confirm latency.
  9. Review logging and retention policies.
  10. Roll out gradually and watch the numbers.
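For step 3, a small harness sketch that runs a fixed prompt set against both endpoints and records latency and success. The keys and model names are placeholders.

import time
from openai import OpenAI

PROMPTS = [
    "Write a short status update.",
    "Summarize: the launch moved to Friday.",
]

CLIENTS = {
    "openai": (OpenAI(api_key="OPENAI_KEY"), "gpt-4o-mini"),
    "self_hosted": (OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY"), "f3-7b-instruct"),
}

for name, (client, model) in CLIENTS.items():
    for prompt in PROMPTS:
        start = time.perf_counter()
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=150,
            )
            ok = bool(resp.choices[0].message.content)
        except Exception:
            ok = False
        print(f"{name}: ok={ok} latency={time.perf_counter() - start:.2f}s")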

FAQ

Does streaming work the same way?
Yes, if the server supports Server‑Sent Events for /chat/completions. Test chunk decoding and early stop behavior in your client.

What about tokenization differences?
Different models use different tokenizers. Keep the tokenizer and model in sync and double‑check counts in your budget math.

How do I test compatibility quickly?
Run a small suite of chat prompts against both endpoints and compare responses, latencies, and error codes. Add one streaming test and one timeout test.

Can I keep data in the EU?
Yes. Host the endpoint in an EU region, use HTTPS, control access with keys and IP lists, and define retention limits.
