“OpenAI‑compatible” should mean you can change a base URL, rotate a key, and ship—a true drop‑in replacement. Many APIs get close, a few get it right, and some hide quirks that break production. Use the checks below to find out before you migrate.
Try Compute today: Launch a vLLM inference server on Compute in France (EU), USA, or UAE. You get an HTTPS endpoint with OpenAI‑style routes. Point your existing OpenAI SDK at the new base URL and start testing.
What compatibility should mean
- Supported routes: /v1/chat/completions, /v1/completions (optional), /v1/embeddings.
- Supported payload shapes: model, messages, max_tokens, temperature, top_p, stop, stream.
- Supported streaming via Server‑Sent Events (SSE) with a chunk format your SDK already parses.
- Supported error schema with stable fields (type, message, code, param, request_id).
- Supported rate‑limit headers and honest Retry‑After.
- Reasonable defaults with clear docs (tokens, sampling, timeouts).
- Supported security & residency knobs: region, logging policy, retention.
Quick start: swap the base URL
Python
Authenticate by passing your API key in the api_key parameter and point base_url at the new endpoint:
from openai import OpenAI

# Only the base URL and key change; the call is your existing OpenAI code.
client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="f3-7b-instruct",
    messages=[{"role": "user", "content": "Write a one‑sentence status update."}],
    max_tokens=60,
)
print(resp.choices[0].message.content)
Node
In Node, set apiKey and baseURL; everything else stays the same:
import OpenAI from "openai";

// Only baseURL and apiKey change; the request itself is unchanged.
const client = new OpenAI({ baseURL: "https://YOUR-ENDPOINT/v1", apiKey: process.env.KEY });

const r = await client.chat.completions.create({
  model: "f3-7b-instruct",
  messages: [{ role: "user", content: "Give me one key risk for this project." }],
  max_tokens: 60
});
console.log(r.choices[0].message.content);
If this works without code changes beyond the base URL, you’re close. Now verify the edges.
Endpoints to test (minimal but telling)
- Chat completions — non‑streaming and streaming.
- Completions — if you still use text endpoints.
- Embeddings — batch >1 input; verify ordering and vector size (see the sketch after this list).
- Models list — GET /v1/models; check IDs match docs and access policy.
- Moderation (optional) — if the vendor claims compatibility, verify the schema before you depend on it.
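A minimal smoke test for the models and embeddings checks, using the Python SDK; the embedding model ID below (f3-embed) is a placeholder, so substitute whatever your provider lists:
from openai import OpenAI

client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")

# Models list: IDs should match the docs and your access policy.
model_ids = [m.id for m in client.models.list().data]
print("models:", model_ids)

# Embeddings: batch of three inputs; results must come back in input order,
# one vector per input, all with the same dimension.
inputs = ["first sentence", "second sentence", "third sentence"]
emb = client.embeddings.create(model="f3-embed", input=inputs)  # placeholder model ID
assert len(emb.data) == len(inputs)
assert [d.index for d in emb.data] == list(range(len(inputs)))
dims = {len(d.embedding) for d in emb.data}
assert len(dims) == 1, f"inconsistent vector sizes: {dims}"
print("embedding dimension:", dims.pop())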
What to assert
- Status codes: 200 on success, 4xx on client errors, 5xx on server.
- Field names: choices[].message.content, usage.prompt_tokens, usage.completion_tokens.
- Order stability: embedding results align with their inputs; result i corresponds to input i.
- Token usage: prompt_tokens and completion_tokens sum to total_tokens and look plausible for your prompts.
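A sketch of these assertions with the Python SDK, reusing the quick‑start call; finish_reason values beyond stop and length exist (for tools and content filters), so widen the check if you use those features:
from openai import OpenAI

client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="f3-7b-instruct",
    messages=[{"role": "user", "content": "Write a one-sentence status update."}],
    max_tokens=60,
)

# Field names your existing code already depends on.
assert resp.choices[0].message.content
assert resp.choices[0].finish_reason in ("stop", "length")
# Usage numbers should be present and sum correctly.
assert resp.usage.prompt_tokens > 0
assert resp.usage.completion_tokens > 0
assert resp.usage.total_tokens == resp.usage.prompt_tokens + resp.usage.completion_tokens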
Streaming & SSE (where many APIs drift)
- SSE headers set (Content-Type: text/event-stream) and no proxy buffering.
- Chunk format: your SDK can parse the incremental deltas; the final chunks carry finish_reason and, where the provider supports it, usage.
- Cancellation: aborting the request stops the stream and frees server resources quickly.
- First‑token time (TTFT): measure p50/p95; compare with your current provider.
Node stream loop (should Just Work)
const stream = await client.chat.completions.create({
  model: "f3-7b-instruct",
  messages: [{ role: "user", content: "Stream 3 short sentences." }],
  stream: true,
  max_tokens: 120
});

// Each chunk carries an incremental delta; print tokens as they arrive.
for await (const chunk of stream) {
  const delta = chunk.choices?.[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}
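To check the wire format itself (headers, the data: framing, the [DONE] sentinel) without an SDK, here is a minimal sketch using the Python requests library; the endpoint URL and key are placeholders:
import json
import requests

resp = requests.post(
    "https://YOUR-ENDPOINT/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_KEY"},
    json={
        "model": "f3-7b-instruct",
        "messages": [{"role": "user", "content": "Stream 3 short sentences."}],
        "stream": True,
        "max_tokens": 120,
    },
    stream=True,
)
resp.raise_for_status()
print("content-type:", resp.headers.get("content-type"))  # expect text/event-stream

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue
    payload = line[len("data: "):]
    if payload == "[DONE]":  # end-of-stream sentinel
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"].get("content")
    if delta:
        print(delta, end="", flush=True)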
Errors, limits, and headers
- Structured errors: machine‑readable fields (type, message, code, request_id); watch for inconsistent codes or unexpected payload shapes across providers.
- Rate limits: clear docs + headers; Retry‑After must match reality, since some providers send headers that do not reflect actual enforcement.
- 429 behavior: clients recover with jittered backoff; streams end cleanly when limits hit.
- Timeouts: confirm sensible server timeouts and keep‑alive for streaming.
Example error payload
{
  "error": {
    "type": "rate_limit_exceeded",
    "message": "Key exceeded 60k tokens/minute.",
    "code": "tpm_exceeded",
    "request_id": "..."
  }
}
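A minimal retry sketch for the 429 path that honors Retry‑After when present and falls back to jittered exponential backoff. It assumes the OpenAI Python SDK, which raises RateLimitError and exposes the HTTP response (and its headers) on the exception; the SDK also retries on its own, so this mainly makes the behavior explicit for testing:
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")

def chat_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="f3-7b-instruct", messages=messages, max_tokens=60
            )
        except RateLimitError as e:
            # Prefer the server's Retry-After header if it is present and honest;
            # otherwise fall back to exponential backoff with jitter.
            retry_after = e.response.headers.get("retry-after")
            if retry_after:
                delay = float(retry_after)
            else:
                delay = min(2 ** attempt, 30) + random.uniform(0, 1)
            time.sleep(delay)
    raise RuntimeError("still rate limited after retries")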
Model names, defaults, and sampling
- Model IDs: consistent and documented; if you migrate from OpenAI names, keep a small mapping from old IDs to new ones (see the sketch after this list).
- Defaults: temperature, top_p, presence/frequency penalties; sane max_tokens.
- Stop sequences: honored and not silently altered.
- Tool calling / JSON modes (if offered): stable schemas and examples.
- Quantization and context length: disclosed limits and trade‑offs.
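One way to keep that mapping (and the base URL) in config so switching back stays a one‑line change; the model names and environment variable names here are illustrative:
import os

from openai import OpenAI

# Map the model IDs your code already uses to whatever the new provider exposes.
MODEL_MAP = {
    "gpt-4o-mini": "f3-7b-instruct",
    "text-embedding-3-small": "f3-embed",
}

client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "https://YOUR-ENDPOINT/v1"),
    api_key=os.environ["LLM_API_KEY"],
)

def chat(model, messages, **kwargs):
    # Translate the model ID at the edge; the rest of the codebase stays unchanged.
    return client.chat.completions.create(
        model=MODEL_MAP.get(model, model), messages=messages, **kwargs
    )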
Eval prompts and parity checks
- Build a seed set (30–60 prompts) from your real workload and run it against both providers.
- Measure TTFT and tokens/sec alongside quality (see the sketch after this list).
- Keep caps tight during tests to expose batching fairness.
- Compare functionality parity: streaming behavior, stop sequences, error types, and tokenizer counts.
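A rough way to get TTFT and tokens/sec per prompt with the Python SDK; it counts streamed chunks as a proxy for tokens, which is close enough for comparisons, and you can swap in server‑reported usage if your provider includes it on the final chunk:
import time

from openai import OpenAI

client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")

def measure(prompt, model="f3-7b-instruct", max_tokens=120):
    start = time.perf_counter()
    first_token = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=max_tokens,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            if first_token is None:
                first_token = time.perf_counter() - start  # TTFT
            chunks += 1
    total = time.perf_counter() - start
    decode_time = max(total - (first_token or 0.0), 1e-6)
    return {"ttft_s": first_token, "approx_tokens_per_s": chunks / decode_time}

for prompt in ["Summarize this week's status in two sentences."]:
    print(prompt, measure(prompt))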
Security, privacy, and residency
- Region choice with documented locations, so data stays where your policy requires.
- Logging policy: whether logs hold token counts or raw text; default retention and deletion.
- Keys and access: per‑service keys, rotation, IP allowlists for admin.
- Contracts: DPA/BAA where required; subprocessor list and change policy.
Try Compute today: Point your OpenAI client at a vLLM endpoint on Compute. Keep traffic in‑region, stream tokens, and enforce caps. Run the checklist above against your real prompts.
A simple test plan that makes “OpenAI‑compatible” real
Change the base URL, then verify endpoints, streaming, and errors with your own prompts. Keep the base URL and model IDs in config so you stay in control during migration and testing. Measure time to first token and tokens per second, check rate‑limit headers and defaults, and demand clear docs on region and logs. If it passes this list, you can switch with confidence—and switch back just as easily.
FAQ
What does “OpenAI‑compatible” actually cover?
Routes, payloads, streaming, errors, and rate‑limit behavior that existing OpenAI SDKs and clients already handle; the checklist above covers the specifics.
Do I need to change my tokenizer or prompts?
No for the client wrapper; maybe for the model. When migrating, verify what has changed in counts and defaults, then keep a small mapping from old to new model IDs.
How do I test SSE without a library?
Use curl with the -N option to disable buffering, or the browser Streams API. You should see incremental data: chunks and a clear [DONE] end‑of‑stream marker.
What if usage numbers don’t match my estimates?
Treat server‑side usage as truth. Align your client’s counter to the model tokenizer and normalize inputs.
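If the provider documents its tokenizer (vLLM deployments typically use the model's Hugging Face tokenizer), here is a quick sanity check of your client‑side estimate against server‑reported usage; the tokenizer repo ID below is a placeholder:
from openai import OpenAI
from transformers import AutoTokenizer

client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")

# Placeholder repo ID; use the tokenizer your provider documents.
tok = AutoTokenizer.from_pretrained("your-org/f3-7b-instruct")

prompt = "Write a one-sentence status update."
local_estimate = len(tok.encode(prompt))

resp = client.chat.completions.create(
    model="f3-7b-instruct",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=60,
)
# Expect a small, consistent gap: the server adds chat-template tokens
# around your message before counting prompt_tokens.
print("local estimate:", local_estimate, "server prompt_tokens:", resp.usage.prompt_tokens)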
Can I rely on Retry‑After for backoff?
Yes—if the provider is honest. Implementations vary, so respect Retry‑After when it is present but still back off with jitter; streams should end cleanly when you hit a soft quota.
How do I keep the option to switch back?
Wrap the base URL and model IDs in config. Keep your eval set and metrics dashboards ready so you can compare at any time.