How to Add WebSocket Streaming to LLM Apps
WebSocket streaming makes an LLM app feel faster because users see tokens, tool status, and progress events as they happen. The model may still take 8 seconds to finish, but the UI can respond in the first 300 to 800 ms with useful feedback.
For production teams, streaming is not only a UI feature. It changes how you handle state, authentication, errors, tracing, retries, and observability. A WebSocket connection stays open, so you need to design it like a session, not like a one-off HTTP request.
When to use WebSocket streaming for LLM apps
Use WebSockets when your LLM workflow needs frequent server-to-client updates during one run. Common cases include:
- Streaming assistant tokens into a chat UI.
- Showing tool call progress, such as “searching docs” or “querying database.”
- Running multi-step agents where each step should be visible to the user.
- Sending intermediate status updates for long workflows, such as report generation.
- Keeping one connected session open across prompt execution, tool calls, and completion events.
If you only need a single response after the model finishes, regular HTTP is simpler. If you need one-way streaming and do not need bidirectional messages, Server-Sent Events can also work. Choose WebSockets when the frontend may send messages back during the run, such as cancellation, user input, approval, or tool confirmation.
The basic architecture
A typical WebSocket-based LLM flow has four parts:
- The frontend opens an authenticated WebSocket connection.
- The frontend sends a structured “start run” event with the prompt input and metadata.
- The backend starts the LLM call, streams tokens and workflow events back to the client, and records trace data.
- The backend sends a final completion or error event, then closes the run cleanly.
Do not send random text blobs over the socket. Use a typed event schema. This gives your UI, logs, tests, and traces a stable contract.
Define a streaming event schema first
Before you write the WebSocket handler, define the messages that can pass between client and server. A practical schema includes run IDs, event types, timestamps, and structured payloads.
{
  "type": "run.started",
  "runId": "run_01J8ZV6K8X9N7F4A2B3C",
  "timestamp": "2026-05-29T15:04:12.000Z",
  "payload": {
    "promptName": "support_agent",
    "promptVersion": "v12",
    "model": "gpt-4.1-mini"
  }
}
Here are common events for an LLM streaming workflow:
{
  "client_to_server": [
    {
      "type": "run.start",
      "payload": {
        "input": "Help me debug this billing webhook error",
        "metadata": {
          "userId": "user_123",
          "sessionId": "sess_456"
        }
      }
    },
    {
      "type": "run.cancel",
      "runId": "run_01J8ZV6K8X9N7F4A2B3C"
    }
  ],
  "server_to_client": [
    {
      "type": "run.started",
      "runId": "run_01J8ZV6K8X9N7F4A2B3C",
      "payload": {
        "model": "gpt-4.1-mini"
      }
    },
    {
      "type": "token",
      "runId": "run_01J8ZV6K8X9N7F4A2B3C",
      "payload": {
        "text": "First,"
      }
    },
    {
      "type": "tool.started",
      "runId": "run_01J8ZV6K8X9N7F4A2B3C",
      "payload": {
        "toolName": "search_docs",
        "toolCallId": "tool_789"
      }
    },
    {
      "type": "tool.failed",
      "runId": "run_01J8ZV6K8X9N7F4A2B3C",
      "payload": {
        "toolName": "search_docs",
        "toolCallId": "tool_789",
        "error": "Docs API returned 503"
      }
    },
    {
      "type": "run.completed",
      "runId": "run_01J8ZV6K8X9N7F4A2B3C",
      "payload": {
        "finishReason": "stop",
        "usage": {
          "inputTokens": 812,
          "outputTokens": 236
        }
      }
    },
    {
      "type": "run.failed",
      "runId": "run_01J8ZV6K8X9N7F4A2B3C",
      "payload": {
        "message": "Model provider timeout",
        "retryable": true
      }
    }
  ]
}
Every event should include a runId once the run starts. This one field saves hours of debugging because you can connect the browser session, backend logs, provider request, prompt version, tool calls, and trace.
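One way to enforce the schema on the server is a small validator that runs before any handler logic. The sketch below is a minimal example, not a full schema library: the function name parseClientEvent and the exact error strings are assumptions, and the allowed types come from the client_to_server list above.

```javascript
// Hypothetical allow-list of client event types, taken from the schema above.
const CLIENT_EVENT_TYPES = new Set(["run.start", "run.cancel"]);

// Parse and validate one raw client message.
// Returns { ok: true, event } or { ok: false, error } instead of throwing.
function parseClientEvent(raw) {
  let event;
  try {
    event = JSON.parse(raw);
  } catch {
    return { ok: false, error: "Invalid JSON" };
  }
  if (event === null || typeof event !== "object" || Array.isArray(event)) {
    return { ok: false, error: "Event must be a JSON object" };
  }
  if (!CLIENT_EVENT_TYPES.has(event.type)) {
    return { ok: false, error: "Unsupported event type" };
  }
  if (event.type === "run.start") {
    const input = event.payload?.input;
    if (typeof input !== "string" || input.trim() === "") {
      return { ok: false, error: "run.start requires payload.input" };
    }
  }
  if (event.type === "run.cancel" && typeof event.runId !== "string") {
    return { ok: false, error: "run.cancel requires runId" };
  }
  return { ok: true, event };
}
```

Returning a result object rather than throwing keeps the handler's control flow flat: a failed parse maps directly to a client.error event.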
Backend WebSocket handler example
The example below uses Node.js with the ws package. The same structure works with FastAPI, Django Channels, Next.js route handlers with a WebSocket adapter, or a dedicated realtime service.
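import { WebSocketServer, WebSocket } from "ws";
import crypto from "crypto";

const wss = new WebSocketServer({ port: 8080 });

// Stamp an event with a timestamp and send it, but only while the socket is open.
function send(ws, event) {
  if (ws.readyState !== WebSocket.OPEN) return;
  ws.send(JSON.stringify({
    timestamp: new Date().toISOString(),
    ...event
  }));
}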
function requireAuth(req) {
  const token = new URL(req.url, "http://localhost").searchParams.get("token");
  if (!token) {
    throw new Error("Missing auth token");
  }
  // Replace this with your real auth check.
  // For production, validate a short-lived signed token or session cookie.
  if (token !== process.env.DEMO_WS_TOKEN) {
    throw new Error("Invalid auth token");
  }
  return { userId: "user_123" };
}
async function streamModel({ input, signal, onToken, onToolStart, onToolError }) {
  // Replace this function with your model provider stream.
  // This mock shows the shape of the control flow.
  // If your provider SDK accepts an AbortSignal, pass `signal` through to it.
  onToken("I’ll ");
  onToken("check ");
  onToken("the ");
  onToken("webhook ");
  onToken("failure. ");

  // Generate the ID once so tool.started and tool.failed refer to the same call.
  const toolCallId = "tool_" + crypto.randomUUID();
  onToolStart({
    toolName: "search_logs",
    toolCallId
  });

  try {
    // Simulate tool execution.
    await new Promise((resolve) => setTimeout(resolve, 300));
    throw new Error("Log service returned 503");
  } catch (error) {
    onToolError({
      toolName: "search_logs",
      toolCallId,
      error: error.message
    });
  }

  // Stop early if the caller cancelled the run while the tool was running.
  if (signal?.aborted) {
    throw new Error("Run cancelled");
  }

  onToken("The log service is unavailable, so I can’t verify the event yet.");
  return {
    finishReason: "stop",
    usage: {
      inputTokens: 128,
      outputTokens: 42
    }
  };
}
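wss.on("connection", (ws, req) => {
  let user;
  let activeRunId = null;
  let activeAbort = null; // AbortController for the active run, if any
  let closed = false;

  // Send only while this connection is still open.
  const emit = (event) => {
    if (!closed) send(ws, event);
  };

  try {
    user = requireAuth(req);
  } catch (error) {
    send(ws, {
      type: "auth.failed",
      payload: { message: error.message }
    });
    ws.close(1008, "Unauthorized");
    return;
  }

  ws.on("message", async (raw) => {
    let message;
    try {
      message = JSON.parse(raw.toString());
    } catch {
      emit({ type: "client.error", payload: { message: "Invalid JSON" } });
      return;
    }

    if (message?.type === "run.cancel") {
      if (!activeRunId || message.runId !== activeRunId) {
        emit({ type: "client.error", payload: { message: "No active run with that runId" } });
        return;
      }
      // Signal the model call to stop, then report the final state for this run.
      activeAbort.abort();
      emit({ type: "run.cancelled", runId: activeRunId, payload: {} });
      activeRunId = null;
      activeAbort = null;
      return;
    }

    if (message?.type !== "run.start") {
      emit({ type: "client.error", payload: { message: "Unsupported event type" } });
      return;
    }

    // One run per connection at a time, so a second start cannot clobber the first.
    if (activeRunId) {
      emit({ type: "client.error", payload: { message: "A run is already active on this connection" } });
      return;
    }

    const input = message.payload?.input;
    if (typeof input !== "string" || input.trim() === "") {
      emit({ type: "client.error", payload: { message: "run.start requires payload.input" } });
      return;
    }

    const runId = "run_" + crypto.randomUUID();
    const abort = new AbortController();
    activeRunId = runId;
    activeAbort = abort;

    // Drop events from a run that has been cancelled, so nothing follows run.cancelled.
    const emitForRun = (event) => {
      if (activeRunId === runId) emit({ ...event, runId });
    };

    emitForRun({
      type: "run.started",
      payload: {
        userId: user.userId,
        model: "gpt-4.1-mini",
        promptName: "debug_assistant",
        promptVersion: "v3"
      }
    });

    try {
      const result = await streamModel({
        input,
        signal: abort.signal,
        onToken: (text) => emitForRun({ type: "token", payload: { text } }),
        onToolStart: (tool) => emitForRun({ type: "tool.started", payload: tool }),
        onToolError: (toolError) => emitForRun({ type: "tool.failed", payload: toolError })
      });
      emitForRun({ type: "run.completed", payload: result });
    } catch (error) {
      emitForRun({
        type: "run.failed",
        payload: {
          message: error.message,
          // Demo default; classify provider errors before marking them retryable.
          retryable: true
        }
      });
    } finally {
      // Clear state only if this run is still the active one.
      if (activeRunId === runId) {
        activeRunId = null;
        activeAbort = null;
      }
    }
  });

  ws.on("close", () => {
    closed = true;
    if (activeRunId) {
      console.log("WebSocket closed during active run", {
        runId: activeRunId,
        userId: user.userId
      });
    }
  });

  ws.on("error", (error) => {
    console.error("WebSocket error", {
      message: error.message,
      runId: activeRunId,
      userId: user.userId
    });
  });
});
This handler does a few important things: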
- Authenticates the connection before accepting work.
- Parses structured client events instead of trusting raw strings.
- Assigns a run ID and sends it to the frontend immediately.
- Sends tool errors as events so they do not disappear inside backend logs.
- Logs close and error events with the active run ID.
Frontend streaming UI example
On the frontend, treat the WebSocket as a stateful connection. Track connection state, run state, streamed text, tool events, and errors separately.
import { useEffect, useRef, useState } from "react";

export function StreamingAssistant() {
  const wsRef = useRef(null);
  const [status, setStatus] = useState("disconnected");
  const [runId, setRunId] = useState(null);
  const [input, setInput] = useState("");
  const [answer, setAnswer] = useState("");
  const [events, setEvents] = useState([]);
  const [error, setError] = useState(null);

  useEffect(() => {
    // Demo only: a missing token becomes an empty string instead of the text "null".
    const token = encodeURIComponent(window.localStorage.getItem("ws_token") ?? "");
    const ws = new WebSocket(`wss://api.example.com/llm-stream?token=${token}`);
    wsRef.current = ws;

    ws.onopen = () => setStatus("connected");

    ws.onmessage = (message) => {
      let event;
      try {
        event = JSON.parse(message.data);
      } catch {
        setError("Received a malformed event from the server");
        return;
      }
      setEvents((current) => [...current, event]);

      if (event.type === "run.started") {
        setRunId(event.runId);
        setStatus("running");
        setAnswer("");
        setError(null);
      }
      if (event.type === "token") {
        setAnswer((current) => current + event.payload.text);
      }
      if (event.type === "tool.started") {
        setStatus(`running tool: ${event.payload.toolName}`);
      }
      if (event.type === "tool.failed") {
        setError(`${event.payload.toolName} failed: ${event.payload.error}`);
      }
      if (event.type === "run.completed") {
        setStatus("completed");
      }
      if (event.type === "run.cancelled") {
        setStatus("cancelled");
      }
      if (event.type === "run.failed") {
        setStatus("failed");
        setError(event.payload.message);
      }
      if (event.type === "client.error" || event.type === "auth.failed") {
        setError(event.payload.message);
      }
    };

    ws.onerror = () => {
      setStatus("socket_error");
    };

    ws.onclose = () => {
      setStatus("disconnected");
    };

    return () => {
      ws.close(1000, "Component unmounted");
    };
  }, []);

  function startRun() {
    if (!wsRef.current || wsRef.current.readyState !== WebSocket.OPEN) {
      setError("WebSocket is not connected");
      return;
    }
    wsRef.current.send(JSON.stringify({
      type: "run.start",
      payload: {
        input,
        metadata: {
          page: "debug_assistant",
          clientTimestamp: new Date().toISOString()
        }
      }
    }));
  }

  function cancelRun() {
    // Sending on a socket that is not open would throw, so check the state first.
    if (!wsRef.current || wsRef.current.readyState !== WebSocket.OPEN || !runId) return;
    wsRef.current.send(JSON.stringify({
      type: "run.cancel",
      runId
    }));
  }

  return (
    <section>
      <p>Status: {status}</p>
      <p>Run ID: {runId || "none"}</p>
      <textarea
        value={input}
        onChange={(event) => setInput(event.target.value)}
        placeholder="Describe the issue..."
      />
      <button type="button" onClick={startRun}>
        Send
      </button>
      <button type="button" onClick={cancelRun}>
        Cancel
      </button>
      {error && <p role="alert">{error}</p>}
      <article>
        <h2>Assistant</h2>
        <p>{answer}</p>
      </article>
      <details>
        <summary>Stream events</summary>
        <pre>{JSON.stringify(events, null, 2)}</pre>
      </details>
    </section>
  );
}
In a real UI, you may not show raw events to users. During development, though, the event log panel helps your team inspect ordering issues, missing run IDs, malformed tool events, and provider failures.
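Keeping run state, streamed text, tool events, and errors separate can also be expressed as a pure reducer, which is easier to unit test than handlers wired to a live socket. The sketch below is one possible shape, not part of the component above; the names initialState and reduceStreamEvent are assumptions, and the event types follow the schema defined earlier.

```javascript
// Hypothetical UI state for one streamed run.
const initialState = { status: "idle", runId: null, answer: "", tools: [], error: null };

// Fold one server event into the UI state without mutating the previous state.
function reduceStreamEvent(state, event) {
  switch (event.type) {
    case "run.started":
      return { ...initialState, status: "running", runId: event.runId };
    case "token":
      // Ignore tokens that belong to an older run.
      if (event.runId !== state.runId) return state;
      return { ...state, answer: state.answer + event.payload.text };
    case "tool.started":
      return { ...state, tools: [...state.tools, { ...event.payload, status: "running" }] };
    case "tool.failed":
      return {
        ...state,
        tools: state.tools.map((tool) =>
          tool.toolCallId === event.payload.toolCallId
            ? { ...tool, status: "failed", error: event.payload.error }
            : tool
        )
      };
    case "run.completed":
      return { ...state, status: "completed" };
    case "run.cancelled":
      return { ...state, status: "cancelled" };
    case "run.failed":
      return { ...state, status: "failed", error: event.payload.message };
    default:
      return state;
  }
}
```

In React, a function like this could be passed to useReducer, with the onmessage handler dispatching each parsed event.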
Trace the run from prompt to streamed response
A streamed LLM run can fail in several places: prompt assembly, retrieval, model streaming, tool execution, frontend rendering, or connection handling. You need trace data that follows the run across these steps.
A useful trace for one run might look like this:
{
  "runId": "run_01J8ZV6K8X9N7F4A2B3C",
  "userId": "user_123",
  "sessionId": "sess_456",
  "prompt": {
    "name": "debug_assistant",
    "version": "v3",
    "inputVariables": {
      "issue": "billing webhook error"
    }
  },
  "modelRequest": {
    "provider": "openai",
    "model": "gpt-4.1-mini",
    "startedAt": "2026-05-29T15:04:12.100Z"
  },
  "stream": {
    "firstTokenAt": "2026-05-29T15:04:12.720Z",
    "completedAt": "2026-05-29T15:04:17.440Z",
    "tokenEvents": 64
  },
  "toolCalls": [
    {
      "toolCallId": "tool_789",
      "name": "search_logs",
      "status": "failed",
      "error": "Log service returned 503"
    }
  ],
  "result": {
    "status": "completed",
    "finishReason": "stop",
    "inputTokens": 812,
    "outputTokens": 236
  }
}
This is where LLM observability becomes practical. You want to search by run ID, compare prompt versions, inspect tool failures, and see whether the model produced a weak answer because the prompt was wrong, context was missing, or a downstream system failed.
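The stream section of a trace like this can be collected by feeding every outgoing event through a small recorder. The following is a sketch under assumptions: createRunTrace is a hypothetical helper, timestamps are epoch milliseconds rather than ISO strings, and the clock is injectable only to make the example testable.

```javascript
// Hypothetical in-memory trace recorder for one run.
function createRunTrace(runId, now = () => Date.now()) {
  const trace = {
    runId,
    startedAt: now(),
    firstTokenAt: null,
    completedAt: null,
    tokenEvents: 0,
    toolCalls: [],
    status: "running"
  };
  return {
    // Call this with every event the server sends for the run.
    record(event) {
      const at = now();
      if (event.type === "token") {
        trace.tokenEvents += 1;
        if (trace.firstTokenAt === null) trace.firstTokenAt = at;
      } else if (event.type === "tool.failed") {
        trace.toolCalls.push({ ...event.payload, status: "failed" });
      } else if (event.type === "run.completed" || event.type === "run.failed") {
        trace.completedAt = at;
        trace.status = event.type === "run.completed" ? "completed" : "failed";
      }
    },
    // Derived latency numbers are null until the matching event has been seen.
    summary() {
      return {
        ...trace,
        timeToFirstTokenMs: trace.firstTokenAt === null ? null : trace.firstTokenAt - trace.startedAt,
        durationMs: trace.completedAt === null ? null : trace.completedAt - trace.startedAt
      };
    }
  };
}
```

Time to first token and total duration are the two latency numbers streaming changes most, so it helps to store them as explicit fields rather than recomputing them from raw logs.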
Handle authentication before the first run event
Skipping authentication is one of the fastest ways to turn a useful streaming endpoint into a security issue. WebSockets do not automatically make auth easier than HTTP.
Common authentication options include:
- Short-lived signed URL tokens: generate a token through an authenticated HTTP request, then use it to open the WebSocket.
- Session cookies: validate the user session during the upgrade request, if your stack supports it safely.
- Bearer tokens in the first message: possible, but weaker because you accept a socket before checking identity.
For production, prefer short-lived tokens. For example, your frontend can request a WebSocket token that expires in 60 seconds, then connect to wss://api.example.com/llm-stream?token=.... Query strings often end up in access logs and proxy logs, so a short expiry limits what a leaked token is worth. The server validates the token during connection setup and closes with code 1008 if the check fails.
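A short-lived token can be as simple as an HMAC-signed payload with an expiry. The sketch below uses Node's built-in crypto module and is only one way to do it: the function names, the WS_TOKEN_SECRET variable, and the payload.signature format are all assumptions, and many teams would use a JWT library instead.

```javascript
// CommonJS require so the sketch runs standalone; in an ES module use
// `import crypto from "node:crypto"` instead.
const crypto = require("node:crypto");

// Hypothetical secret; in practice load it from your secret manager.
const SECRET = process.env.WS_TOKEN_SECRET || "dev-only-secret";

function sign(payload) {
  return crypto.createHmac("sha256", SECRET).update(payload).digest("base64url");
}

// Issued by an authenticated HTTP endpoint just before the client opens the socket.
function issueWsToken(userId, ttlMs = 60_000, now = Date.now()) {
  const payload = Buffer.from(JSON.stringify({ userId, exp: now + ttlMs })).toString("base64url");
  return `${payload}.${sign(payload)}`;
}

// Called during connection setup. Returns { userId } or null if the token is invalid.
function verifyWsToken(token, now = Date.now()) {
  const [payload, signature] = String(token).split(".");
  if (!payload || !signature) return null;
  const given = Buffer.from(signature);
  const expected = Buffer.from(sign(payload));
  // timingSafeEqual requires equal lengths, so check that first.
  if (given.length !== expected.length || !crypto.timingSafeEqual(given, expected)) return null;
  let claims;
  try {
    claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8"));
  } catch {
    return null;
  }
  if (typeof claims.exp !== "number" || claims.exp <= now) return null;
  return { userId: claims.userId };
}
```

On a null result, the server would close the socket with code 1008, as described above.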
Do not treat WebSockets like stateless HTTP
HTTP request handlers usually receive input, return output, and end. WebSocket handlers keep state across time. That means you must track active runs, cancellation, connection health, and cleanup.
At minimum, track these fields per connection:
- userId: the authenticated user or service account.
- connectionId: a unique ID for the socket connection.
- activeRunId: the current LLM run, if one is active.
- startedAt: when the connection opened.
- lastMessageAt: when the last event arrived from the client.
You should also send heartbeat pings or app-level keepalive events. Many proxies and load balancers close idle sockets after 30 to 120 seconds. If your agent can spend a long time inside a tool call, send a status event every few seconds so the UI does not look frozen.
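The per-connection fields and the heartbeat can be sketched together. This example follows the ping/pong pattern commonly used with the ws package (ping(), terminate(), and the pong event); createConnectionState, attachHeartbeat, and the isAlive flag are hypothetical names, and the 25-second interval in the usage note is an assumption.

```javascript
// Per-connection state with the fields listed above, plus a liveness flag.
function createConnectionState(userId, connectionId, now = Date.now()) {
  return {
    userId,
    connectionId,
    activeRunId: null,
    startedAt: now,
    lastMessageAt: now,
    isAlive: true
  };
}

// Wire up pong tracking and return a function to call on a fixed interval.
// Each call terminates the socket if no pong arrived since the previous ping.
function attachHeartbeat(socket, state) {
  socket.on("pong", () => {
    state.isAlive = true;
  });
  return function beat() {
    if (!state.isAlive) {
      socket.terminate(); // peer missed a full heartbeat interval
      return false;
    }
    state.isAlive = false;
    socket.ping();
    return true;
  };
}
```

A server might call const beat = attachHeartbeat(ws, state) on connection, run setInterval(beat, 25000), and clear the interval in the close handler. Pings keep the socket alive at the proxy level; app-level status events are still needed so the UI has something to show.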
Make tool calls visible and error-safe
Tool calls often break streaming implementations. The model starts streaming, calls a tool, the tool fails, and the frontend never receives a clear explanation. The user sees a hanging spinner. The backend logs show an exception, but the run has no clean end state.
Send tool lifecycle events explicitly:
- tool.started when the tool begins.
- tool.output if you want to stream partial tool output.
- tool.completed when the tool returns successfully.
- tool.failed when the tool throws, times out, or returns invalid data.
Then decide how the run should continue. Some tool failures should end the run. Others should let the model explain the limitation and produce a partial answer. Either path should produce a final run.completed or run.failed event.
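One way to guarantee that every tool call produces a lifecycle event is to route all tools through a single wrapper. The sketch below assumes a hypothetical runTool helper, an emit callback that sends an event to the client, and a 10-second default timeout; none of these come from a specific library.

```javascript
// Run one tool call and always emit tool.started followed by
// either tool.completed or tool.failed. Never throws.
async function runTool({ toolName, toolCallId, execute, emit, timeoutMs = 10_000 }) {
  emit({ type: "tool.started", payload: { toolName, toolCallId } });
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${toolName} timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  try {
    // Whichever settles first wins: the tool or the timeout.
    const output = await Promise.race([execute(), timeout]);
    emit({ type: "tool.completed", payload: { toolName, toolCallId } });
    return { ok: true, output };
  } catch (error) {
    emit({ type: "tool.failed", payload: { toolName, toolCallId, error: error.message } });
    return { ok: false, error: error.message };
  } finally {
    clearTimeout(timer);
  }
}
```

Because the wrapper returns a result instead of throwing, the caller decides whether a failed tool ends the run with run.failed or lets the model continue toward run.completed with a partial answer.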
Close connections cleanly
A streaming run should have a clear ending. When the run finishes, send run.completed. When it fails, send run.failed. When the user cancels, send run.cancelled. Avoid leaving the UI to infer the outcome from a socket close.
Use close codes intentionally:
- 1000: normal closure.
- 1001: server going away or deploy in progress.
- 1008: policy violation, often failed auth.
- 1011: unexpected server error.
If your backend deploys often, design reconnect behavior carefully. A reconnect should not silently duplicate an expensive model run. Use idempotency keys or a client-generated request ID if users may retry after a disconnect.
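The two ideas above, deduplicating retried requests and reacting to close codes, can be sketched in a few lines. Everything here is an assumption for illustration: the requestId field, the in-memory Map (a real deployment would more likely use a shared store with an expiry), and the reconnect policy itself, which you should adapt to your product.

```javascript
// In-memory dedupe table keyed by a client-generated request ID.
const runsByRequestId = new Map();

// Start a run only the first time a request ID is seen.
// Replays of the same request get the original run ID back.
function startRunOnce(requestId, startRun) {
  if (runsByRequestId.has(requestId)) {
    return { runId: runsByRequestId.get(requestId), deduped: true };
  }
  const runId = startRun();
  runsByRequestId.set(requestId, runId);
  return { runId, deduped: false };
}

// Client-side policy sketch: which close codes justify an automatic reconnect.
function shouldReconnect(closeCode) {
  switch (closeCode) {
    case 1000:
      return false; // normal closure, nothing to recover
    case 1008:
      return false; // policy violation: fetch a fresh token before trying again
    case 1001:
      return true; // server going away, for example during a deploy
    case 1011:
      return true; // unexpected server error, retry with backoff
    default:
      return true; // for example 1006, an abnormal closure after a network drop
  }
}
```

After a reconnect, the client would resend its request with the same requestId, and the server would attach it to the existing run instead of starting a new model call.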
Add evaluations for streamed workflows
Streaming does not remove the need for quality checks. In fact, streaming can make failures harder to notice because the UI may look responsive even when the final answer is wrong.
Save the final response, prompt version, model, inputs, tool outputs, and trace metadata. Then run LLM evaluations against completed runs. For example, you can evaluate whether a support assistant:
- Answered the user’s question directly.
- Used retrieved documentation accurately.
- Handled tool failure honestly.
- Avoided making up account-specific facts.
- Escalated when required.
For subjective checks, teams often use LLM-as-a-judge to score completed outputs against a rubric. Keep these evals tied to prompt versions and run IDs so you can compare changes before shipping a new prompt.
Production checklist
Before you ship WebSocket streaming in an LLM app, check these items:
- Use wss:// in production. Do not stream model outputs over an unencrypted socket.
- Authenticate during connection setup. Close unauthorized sockets before accepting run events.
- Use structured events. Avoid untyped blobs that mix tokens, errors, and status messages.
- Send the run ID early. Show it in debug UI and include it in every backend log line.
- Handle cancellation. Users need a way to stop long or expensive runs.
- Set timeouts. Add limits for idle connections, model calls, and tool calls.
- Emit final states. Every run should complete, fail, or cancel clearly.
- Record traces. Store prompt version, model, inputs, outputs, tool calls, token usage, latency, and errors.
- Test reconnect behavior. Simulate tab refreshes, mobile network drops, deploy restarts, and proxy timeouts.
- Run evals on completed outputs. Streaming improves UX, but evals tell you whether the answer was good.
Common mistakes to avoid
Sending plain text instead of events
Plain text streaming works for a demo, then breaks when you need tool status, errors, retries, or usage data. Use event types from the start.
Letting tool errors disappear
Catch tool errors and send them to the client as structured events. Also record them in your trace. A failed tool call is part of the LLM run, not an unrelated backend exception.
Forgetting to log run IDs
If a user reports “the assistant froze,” you need the run ID. Log it in the WebSocket handler, model call, tool calls, provider callbacks, and frontend debug state.
Leaving sockets open forever
Add idle timeouts and clean closure logic. A busy product can accumulate thousands of stale connections if browser tabs, mobile clients, or proxies disconnect poorly.
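An idle sweep is one simple way to enforce this. The sketch below assumes a hypothetical registry: a Map from connectionId to an object holding the socket, lastMessageAt, and activeRunId. Skipping connections with an active run and the two-minute default are design assumptions, not requirements.

```javascript
// Close connections that have sent nothing for longer than idleMs.
// Returns the IDs of the connections that were closed.
function sweepIdleConnections(connections, { idleMs = 120_000, now = Date.now() } = {}) {
  const closedIds = [];
  for (const [connectionId, conn] of connections) {
    const idleFor = now - conn.lastMessageAt;
    // Leave connections alone while a run is still streaming to them.
    if (idleFor > idleMs && conn.activeRunId === null) {
      conn.socket.close(1000, "Idle timeout");
      connections.delete(connectionId);
      closedIds.push(connectionId);
    }
  }
  return closedIds;
}
```

A server could run this on an interval, for example once a minute, and log the returned IDs so stale connections show up in metrics.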
Retrying without idempotency
If the frontend reconnects and resends the same request, you may run the same expensive prompt twice. Add a client request ID and dedupe retries where possible.
A simple implementation plan
- Define your event schema with run.start, run.started, token, tool.started, tool.failed, run.completed, and run.failed.
- Add WebSocket authentication with short-lived tokens.
- Build the backend handler and assign a run ID for every run.
- Stream model tokens through structured token events.
- Send tool lifecycle events and catch tool errors.
- Render streamed tokens in the frontend and show run status separately.
- Record traces for prompt input, prompt version, model settings, output, tool calls, latency, and usage.
- Add evals for completed runs before changing prompts or models in production.
WebSocket streaming is worth the extra engineering when your LLM app needs realtime feedback, agent progress, cancellation, or rich workflow events. Keep the protocol structured, secure the connection, make errors visible, and connect every event to a traceable run ID.
PromptLayer helps AI teams manage prompts, trace LLM runs, debug streamed workflows, and evaluate production outputs. If you are adding WebSocket streaming to an LLM app, create a PromptLayer account to track each run from prompt to streamed response to completion.