
Add streaming to your LLM app the simple way

Streaming is the easiest win for UX and cost. Users see words sooner, can stop when they have enough, and you spend fewer tokens on answers nobody finishes reading. It also cuts perceived latency: the first words appear while the rest of the response is still being generated. You only need two things: a server that streams, and a client that reads chunks without buffering.

Try Compute today
Launch a vLLM inference server on Compute. You get an HTTPS endpoint with OpenAI‑style routes that stream by default. Point your existing OpenAI SDK at the new base URL and start measuring TTFT.

Streaming tokens: SSE vs WebSockets in plain English

Server‑Sent Events (SSE). A one‑way stream from server to client over a plain HTTP connection. Simple, proxy‑friendly, and a natural fit for token streaming. The EventSource API is standardized in the WHATWG HTML Living Standard, so it works in every modern browser; most HTTP clients can also read streaming responses directly.
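
For illustration, a minimal browser sketch, assuming a hypothetical GET endpoint at /events that emits SSE data: lines and an #output element in the page. Note that EventSource can only issue GET requests, so POST‑based chat routes are usually read with fetch() and the Streams API instead (see the browser sketch later in this post).

// Browser sketch: subscribe to a hypothetical SSE endpoint at /events
const source = new EventSource("/events");

source.onmessage = (event) => {
  // Each SSE "data:" line arrives as event.data
  document.querySelector("#output").textContent += event.data;
};

source.onerror = () => {
  // EventSource reconnects automatically; close it once you are done
  source.close();
};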

WebSockets. Two‑way messages over a persistent socket, managed in the browser through the WebSocket object. Useful when the client must send events mid‑stream (typing, cursor sync, collaborative edits).

Rule of thumb: use SSE for chat unless you truly need bi‑directional messaging. Either way, token streaming means the server returns tokens one by one as the model generates them, so output starts appearing before the whole response is finished.

When to use which

  • SSE: chat responses, summaries, code gen, anything read‑only to the browser.
  • WebSockets: collaborative editors, voice streams, tools that push mid‑generation updates from client to server.
  • Mix: start with SSE; add WebSockets later for the few places that truly need two‑way messaging.

Implement Server‑Sent Events (SSE)

Node (OpenAI SDK, OpenAI‑compatible server)

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "https://YOUR-ENDPOINT/v1", apiKey: process.env.KEY });

const stream = await client.chat.completions.create({
 model: "f3-7b-instruct",
 messages: [{ role: "user", content: "Draft a short update about project status." }],
 stream: true,
 max_tokens: 200
});

for await (const chunk of stream) {
 const delta = chunk.choices?.[0]?.delta?.content;
 if (delta) process.stdout.write(delta);
}

Python (OpenAI SDK, OpenAI‑compatible server)

from openai import OpenAI
client = OpenAI(base_url="https://YOUR-ENDPOINT/v1", api_key="YOUR_KEY")

stream = client.chat.completions.create(
    model="f3-7b-instruct",
    messages=[{"role": "user", "content": "Write a one‑paragraph summary."}],
    max_tokens=200,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Cancel fast when users stop reading:

const controller = new AbortController();
// Pass the signal as a request option (second argument):
// client.chat.completions.create({ ..., stream: true }, { signal: controller.signal })
// Later, when the user hits stop:
controller.abort();
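
Put together, a minimal sketch of a cancellable stream, assuming the same client as the Node example above and a messages array you have already built:

const controller = new AbortController();
// Wire your stop button or request handler to controller.abort()

try {
  const stream = await client.chat.completions.create(
    { model: "f3-7b-instruct", messages, stream: true, max_tokens: 200 },
    { signal: controller.signal }
  );
  for await (const chunk of stream) {
    const delta = chunk.choices?.[0]?.delta?.content;
    if (delta) process.stdout.write(delta);
  }
} catch (err) {
  // Aborting rejects the pending read; re‑throw only if we did not cancel on purpose
  if (!controller.signal.aborted) throw err;
}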

Implement a WebSocket server (when you need two‑way)

A WebSocket connection starts life as an HTTP request that the client asks to upgrade. During the handshake, client and server exchange headers such as Sec-WebSocket-Key, Sec-WebSocket-Version, and optionally Sec-WebSocket-Protocol; once the upgrade succeeds, the connection becomes a persistent, bidirectional socket.

// Server sketch using ws
import { WebSocketServer } from "ws";
const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws) => {
 ws.on("message", async (msg) => {
   const { prompt } = JSON.parse(msg.toString());
   // call your OpenAI‑compatible endpoint with stream=true
   for await (const token of generateStream(prompt)) {
     ws.send(token); // backpressure: check ws.bufferedAmount
   }
 });

 ws.on("close", () => {
   // Handle cleanup when the connection closes
   // Free resources or perform any necessary cleanup here
 });
});
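
generateStream above is a placeholder. A minimal sketch of one way to implement it, assuming the same OpenAI‑compatible endpoint and model name used earlier:

// Hypothetical helper: stream deltas from the OpenAI‑compatible endpoint
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "https://YOUR-ENDPOINT/v1", apiKey: process.env.KEY });

async function* generateStream(prompt) {
  const stream = await client.chat.completions.create({
    model: "f3-7b-instruct",
    messages: [{ role: "user", content: prompt }],
    stream: true,
    max_tokens: 200,
  });
  for await (const chunk of stream) {
    const delta = chunk.choices?.[0]?.delta?.content;
    if (delta) yield delta; // yield each delta as soon as it arrives
  }
}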

When the connection closes, ws emits a close event; use it to abort any in‑flight generation and free resources tied to that connection.

Handle backpressure: pause sending when ws.bufferedAmount grows large and resume once it drains. In browsers, read streaming responses with the Streams API: call response.body.getReader() and loop on reader.read(), which pulls chunks only as fast as you consume them.
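
For the browser side, a minimal sketch of reading a streamed chat completion with fetch() and the Streams API, assuming the same endpoint and model as above and a hypothetical #output element; OpenAI‑style streams arrive as SSE frames of the form data: {json}, ending with data: [DONE]:

// Browser sketch: read OpenAI‑style SSE chunks with fetch() and the Streams API
const res = await fetch("https://YOUR-ENDPOINT/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json", Authorization: "Bearer YOUR_KEY" },
  body: JSON.stringify({
    model: "f3-7b-instruct",
    messages: [{ role: "user", content: "Write a one-paragraph summary." }],
    stream: true,
  }),
});

const output = document.querySelector("#output"); // hypothetical target element
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop(); // keep any partial line for the next read
  for (const line of lines) {
    if (!line.startsWith("data:")) continue; // skip comments and blank lines
    const data = line.slice(5).trim();
    if (!data || data === "[DONE]") continue;
    const delta = JSON.parse(data).choices?.[0]?.delta?.content;
    if (delta) output.textContent += delta;
  }
}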

Cancellations, retries, backpressure

  • Cancel quickly. Wire a stop button that aborts the HTTP request or closes the socket. Free server‑side KV‑cache blocks right away.
  • Retries. Use exponential backoff with jitter on connect and first‑token errors (see the sketch after this list). Do not retry after you have streamed user‑visible text unless you clearly restart the answer.
  • Backpressure. In Node, honor the return value of res.write() and wait for the drain event; on WebSockets, watch bufferedAmount. Keep chunks small; prefer \n‑delimited JSON lines for custom streams. Cap the number of concurrent requests or streams so the server does not overload.
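
A minimal retry sketch for the connect / first‑token phase, assuming a hypothetical startStream() function that fails before any user‑visible text has been emitted:

// Hypothetical helper: retry connect/first‑token failures with exponential backoff + jitter
async function withRetries(startStream, maxAttempts = 4) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await startStream();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      const base = 500 * 2 ** attempt; // 500 ms, 1 s, 2 s, ...
      const delay = Math.random() * base; // full jitter
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}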

Gateways and proxies (avoid buffering surprises)

  • Keep response buffering off for streaming routes.
  • Use HTTP/1.1 or HTTP/2 with keep‑alive; avoid intermediaries that coalesce chunks.
  • Set sensible idle timeouts so long generations do not drop mid‑stream.
  • Send Content-Type: text/event-stream for SSE with Cache-Control: no-store and Connection: keep-alive (a minimal route sketch follows this list).
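
A minimal SSE route sketch using Node's built‑in http module; the X-Accel-Buffering header is a common hint for nginx‑style proxies to disable per‑response buffering (assumption: your proxy honors it):

import http from "node:http";

http.createServer((req, res) => {
  // SSE headers: no caching, no proxy buffering, keep the connection open
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-store",
    "Connection": "keep-alive",
    "X-Accel-Buffering": "no",
  });

  // One SSE frame per token; a blank line terminates each frame
  const send = (token) => res.write(`data: ${JSON.stringify({ token })}\n\n`);

  send("hello"); // replace with deltas from your model stream
  req.on("close", () => res.end()); // client went away: stop writing and clean up
}).listen(3000);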

UX tips that make streaming feel fast

  • Show tokens as they arrive; keep a subtle caret.
  • Let users stop and copy easily.
  • Print partial code blocks cleanly; close fences when streams end.
  • Keep system prompts short; tight output caps stop rambling.
  • Log the first token timestamp to spot regressions (a small measurement sketch follows).
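
A small sketch of logging time to first token (TTFT) and rough tokens per second, wrapped around the same streaming loop as the Node example above:

// Measure TTFT and approximate tokens/second around the streaming loop
const start = performance.now();
let firstTokenAt = null;
let tokenCount = 0;

for await (const chunk of stream) {
  const delta = chunk.choices?.[0]?.delta?.content;
  if (!delta) continue;
  if (firstTokenAt === null) firstTokenAt = performance.now();
  tokenCount++; // counts chunks, a reasonable proxy for tokens
  process.stdout.write(delta);
}

const ttftMs = firstTokenAt - start;
const tps = tokenCount / ((performance.now() - firstTokenAt) / 1000);
console.log(`\nTTFT ${ttftMs.toFixed(0)} ms, ~${tps.toFixed(1)} tokens/s`);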

Try Compute today
Deploy a vLLM endpoint on Compute. Streaming is on by default. Place it near users, set strict output caps, and watch TTFT and TPS improve.

Token streaming that feels fast and costs less

Use SSE for most chat. Reach for WebSockets only when you need two‑way messages. Cancel promptly, cap outputs, and turn off proxy buffering. Measure time to first token and tokens per second, then tune caps and batch limits before changing hardware.

FAQ

What is a Server‑Sent Event (SSE)?

A one‑way HTTP stream the server pushes to the client. Token streaming maps well to SSE.

What is the difference between SSE and WebSockets?

SSE is one‑way and simple; WebSockets are two‑way and better for interactive apps. For chat output, SSE is usually enough.

Does the OpenAI‑compatible API support streaming?

Yes—use stream: true or an SSE client. You will receive incremental tokens until the model stops or you cancel.

How do I cancel a stream?

Abort the HTTP request (SSE) or close the WebSocket. Always free server resources on cancel.

Why do I see output arrive in a single chunk?

An intermediary is buffering the response. Disable buffering for the route and keep the connection alive.

Can I stream in browsers without a library?

Yes—EventSource for SSE, or fetch() with the Streams API to read chunks.
