Why Streaming Matters
LLM responses can take 5-30 seconds to fully generate. Streaming fundamentally changes user perception: instead of staring at a loading spinner for 10 seconds, users see the first word in about 200ms and can start reading immediately. The psychological effect is dramatic: streaming makes a 10-second response feel nearly instant because the user’s brain is busy processing content as it arrives, not counting seconds.

| Metric | Non-Streaming | Streaming |
|---|---|---|
| Time to First Token | 5-30s | 100-500ms |
| Perceived Latency | Full wait | Near instant |
| User Experience | Frustrating | Responsive |
OpenAI Streaming
Basic Streaming
Collecting Streamed Response
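A minimal sketch of collecting a streamed response while still displaying each piece as it arrives. It assumes chunks shaped like OpenAI Chat Completions stream chunks (`chunk.choices[0].delta.content`, which can be `None`). The `fake_stream` helper is a hypothetical stand-in for a real `stream=True` call so the sketch runs offline.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, Iterator


def fake_stream(pieces: Iterable[str]) -> Iterator[SimpleNamespace]:
    """Hypothetical stand-in for a streaming API call (not a real client)."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
    # Assumption: the last chunk of a stream carries no text content.
    empty = SimpleNamespace(content=None)
    yield SimpleNamespace(choices=[SimpleNamespace(delta=empty)])


def show(token: str) -> None:
    """Default display callback: print without a newline and flush at once."""
    print(token, end="", flush=True)


def collect_stream(chunks: Iterable, on_token: Callable[[str], None] = show) -> str:
    """Display each text fragment as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices at all
            continue
        text = chunk.choices[0].delta.content
        if text:  # skip None and empty fragments
            on_token(text)
            parts.append(text)
    return "".join(parts)
```

Calling `collect_stream(fake_stream(["Hel", "lo"]))` prints the fragments as they arrive and returns the joined string, which you can then log or store in the conversation history.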
Async Streaming
Streaming with Tool Calls
This is where streaming gets tricky. When the model decides to call a tool, the tool call arguments arrive as fragments across multiple chunks, so you need to accumulate them before you can parse the JSON and execute the function. Think of it like receiving a fax of a phone number digit by digit: you can’t dial until you have all the digits.
FastAPI Streaming Endpoints
SSE (Server-Sent Events) is the most common pattern for AI chat applications. Unlike WebSockets, SSE is unidirectional (server to client), works over regular HTTP, survives proxy/load balancer configurations, and auto-reconnects on disconnect. Use SSE unless you need bidirectional streaming (like collaborative editing or voice chat).
Server-Sent Events (SSE)
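As a rough illustration of the wire format, the sketch below builds SSE frames by hand. The function names, the `{"content": ...}` payload shape, and the `[DONE]` sentinel are assumptions chosen for this example, not something the SSE format requires.

```python
import json
from typing import Iterable, Iterator, Optional

# Headers commonly sent with SSE responses. "X-Accel-Buffering: no" asks
# Nginx not to buffer the stream.
SSE_HEADERS = {
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",
}


def format_sse(data: str, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Build one SSE frame: optional id/event fields, data lines, blank line."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # A payload containing newlines must be split across several data: lines.
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"


def sse_token_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a JSON payload, then emit a completion sentinel."""
    for i, token in enumerate(tokens):
        yield format_sse(json.dumps({"content": token}), event_id=str(i))
    yield format_sse("[DONE]")
```

In FastAPI, a generator like this can be returned as `StreamingResponse(sse_token_stream(tokens), media_type="text/event-stream", headers=SSE_HEADERS)`. JSON-encoding each token keeps newlines inside a token from breaking the frame.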
Streaming with Token Counting
WebSocket Streaming
For bidirectional real-time communication:
Client-Side WebSocket
Streaming with LangChain
Async Streaming with LangChain
Production Streaming Patterns
In production, streaming introduces failure modes you don’t see with regular request-response. The connection can drop mid-stream, the LLM can time out after generating half an answer, rate limits can hit between chunks, and clients can disconnect while you’re still paying for generation. These patterns address the real-world problems.
Graceful Error Handling
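One way to handle a failure that happens after the response has started is to report it inside the stream, since the HTTP status code has already been sent. The sketch below assumes an SSE transport; the `error` event name and the payload fields are hypothetical choices for illustration.

```python
import asyncio
import json
from typing import AsyncIterator


async def safe_sse_stream(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Forward tokens as SSE frames; turn a mid-stream failure into an event."""
    sent = 0
    try:
        async for token in tokens:
            sent += 1
            payload = json.dumps({"content": token})
            yield f"data: {payload}\n\n"
        yield "data: [DONE]\n\n"
    except Exception as exc:  # broad on purpose: headers are already sent
        # Tell the client what failed and how much it already received,
        # so it can keep the partial answer and offer a retry.
        payload = json.dumps({"error": type(exc).__name__, "tokens_sent": sent})
        yield f"event: error\ndata: {payload}\n\n"


async def flaky_tokens() -> AsyncIterator[str]:
    """Hypothetical upstream that fails after two tokens."""
    yield "Hel"
    yield "lo"
    raise TimeoutError("upstream timed out")


async def demo() -> None:
    async for frame in safe_sse_stream(flaky_tokens()):
        print(frame, end="")


if __name__ == "__main__":
    asyncio.run(demo())
```

The client keeps whatever text arrived before the error frame, which is usually better than discarding a half-finished answer.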
Timeout and Cancellation
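A possible shape for a per-chunk timeout, sketched with `asyncio.wait_for`. The wrapper name and the idea of timing each chunk separately (instead of the whole response) are assumptions made for this example.

```python
import asyncio
from typing import AsyncIterator


async def with_chunk_timeout(stream: AsyncIterator[str],
                             timeout: float) -> AsyncIterator[str]:
    """Yield chunks from `stream`; raise asyncio.TimeoutError if any single
    chunk takes longer than `timeout` seconds to arrive."""
    iterator = stream.__aiter__()
    try:
        while True:
            try:
                chunk = await asyncio.wait_for(iterator.__anext__(), timeout)
            except StopAsyncIteration:
                break  # upstream finished normally
            yield chunk
    finally:
        # Close the upstream generator so its own cleanup code runs.
        aclose = getattr(iterator, "aclose", None)
        if aclose is not None:
            await aclose()


async def slow_tokens() -> AsyncIterator[str]:
    """Hypothetical upstream that stalls after the first token."""
    yield "first"
    await asyncio.sleep(1.0)
    yield "second"


async def demo() -> list:
    received = []
    try:
        async for chunk in with_chunk_timeout(slow_tokens(), timeout=0.05):
            received.append(chunk)
    except asyncio.TimeoutError:
        received.append("<timeout>")
    return received
```

Timing each chunk separately lets a long answer keep streaming as long as tokens keep arriving, while a stalled upstream is abandoned quickly.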
Streaming with Rate Limiting
SSE vs WebSocket Decision Table
| Criterion | SSE | WebSocket |
|---|---|---|
| Direction | Server-to-client only | Bidirectional |
| Protocol | Standard HTTP | Upgrade from HTTP |
| Auto-reconnect | Built-in (EventSource API) | Manual implementation required |
| Proxy/CDN support | Works through most proxies | Requires proxy WebSocket support |
| Binary data | Text only (base64 for binary) | Native binary frames |
| Serverless friendly | Yes (standard HTTP response) | No (requires persistent connection) |
| Use for AI chat | Preferred | Only if you need client-to-server streaming too |
Streaming Edge Cases
Structured output mid-stream parsing
You cannot `json.loads()` a partial stream. Solutions: (1) Use Instructor’s `create_partial` for Pydantic models that update as fields arrive. (2) For raw JSON, accumulate the full string and parse only on `[DONE]`. (3) For real-time UI updates during structured streaming, use a streaming JSON parser like `ijson` that emits events as key-value pairs become complete.
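Option (2) can be sketched in a few lines. The function name is hypothetical, and the sketch assumes the stream signals completion with a literal `[DONE]` fragment.

```python
import json
from typing import Iterable, Optional


def parse_when_complete(fragments: Iterable[str],
                        done_marker: str = "[DONE]") -> Optional[dict]:
    """Buffer JSON fragments and parse only once the done marker arrives."""
    buffer = []
    for fragment in fragments:
        if fragment == done_marker:
            return json.loads("".join(buffer))
        buffer.append(fragment)
    return None  # stream ended without the completion marker
```

Parsing any prefix of the stream, such as the first fragment alone, raises `json.JSONDecodeError`, which is why the buffer is only parsed at the end.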
Tool calls arrive as fragments during streaming
Do not parse `tool_call.function.arguments` on each chunk: it is a fragment, not valid JSON. Accumulate the fragments per tool call index and parse once the stream ends.
Client falls behind (backpressure)
A client that reads more slowly than the model generates can leave `send_text()` blocking. Fix: implement a bounded queue per client. If the queue is full, drop the oldest unacknowledged tokens and send a “gap” marker so the client requests the full response on completion.
Key Takeaways
Always Stream
Use SSE for Simplicity
Handle Errors Gracefully
Manage Resources
What’s Next
Prompt Versioning & Management
Interview Deep-Dive
You are building a production AI chat application. Explain the architectural trade-offs between Server-Sent Events and WebSockets for streaming LLM responses, and when you would pick each.
- SSE (Server-Sent Events) is unidirectional: the server pushes data to the client over a standard HTTP connection. WebSockets are bidirectional: both client and server can send messages at any time over a persistent TCP connection. For LLM streaming, the data flow is inherently unidirectional during generation — the server streams tokens to the client — so SSE is the natural fit.
- SSE has several operational advantages. It works over standard HTTP, which means it survives proxies, load balancers, CDNs, and corporate firewalls that often block or mishandle WebSocket upgrade requests. The browser’s `EventSource` API reconnects automatically and sends the `Last-Event-ID` header, so the server can resume the stream as long as it tracks event IDs. It is simpler to implement, debug, and monitor: you can curl an SSE endpoint and see the stream in your terminal. You cannot do that as easily with WebSockets.
- WebSockets become necessary when you need bidirectional streaming. Real-time voice chat (audio streaming in both directions simultaneously), collaborative editing where multiple users push changes, or any scenario where the client needs to send data mid-stream (like canceling a generation while tokens are still flowing). For a standard chat UI where the user sends a message, waits for the response, then sends the next message, WebSockets add complexity without benefit.
- The concrete architectural difference: with SSE, each message from the user creates a new HTTP request that returns a streaming response. The connection closes when generation completes. With WebSockets, you maintain a persistent connection across multiple messages, which means you need connection lifecycle management, heartbeats, reconnection logic, and state management on the server. For a chat app with 10K concurrent users, that is 10K persistent WebSocket connections your server must hold open, versus SSE where connections open and close per message.
- My recommendation for most AI chat products: SSE for the generation stream, with a regular REST endpoint for sending messages. If you later need features like real-time typing indicators or multi-user presence, add WebSockets for those specific features alongside SSE for the main generation stream. Do not force everything through WebSockets just because you might need bidirectional communication someday.
When tokens reach the browser in bursts instead of one at a time, something between the application and the client is buffering the response. First, the reverse proxy may be buffering: add `proxy_buffering off;` to the Nginx location block, or set the `X-Accel-Buffering: no` response header. Second, your application framework is buffering. Some WSGI servers and middleware buffer response bodies. Ensure you are using an ASGI server (like Uvicorn) with a proper `StreamingResponse` that flushes after each chunk. Third, the CDN or cloud load balancer is buffering. AWS ALB, Cloudflare, and similar services may buffer SSE streams. Each has specific configuration to disable buffering for streaming endpoints. I would diagnose by testing at each layer: curl directly to the application server (bypasses all proxies), then through Nginx, then through the full stack, and identify where the bursting begins.
How do you handle streaming with tool calls? Walk me through the specific challenge and how the data arrives differently compared to regular text streaming.
- With regular text streaming, each chunk contains a fragment of the assistant’s message — a few tokens of text that you can display immediately. The client appends each fragment to the UI as it arrives. Simple and progressive.
- Tool calls break this pattern entirely. When the model decides to call a tool, the tool call arguments arrive as fragmented JSON strings spread across multiple chunks. You might receive: chunk 1 contains `{"loc`, chunk 2 contains `ation":`, chunk 3 contains `"NYC"}`. You cannot parse or act on the tool call until all fragments are assembled into valid JSON. Meanwhile, you cannot display the fragments to the user because they are machine-readable function arguments, not human-readable text.
- The implementation challenge is that a single response can interleave text content and tool calls, or contain multiple parallel tool calls. Each tool call has an index that identifies which call it belongs to. You need an accumulator that: tracks each tool call by index, concatenates argument fragments per index, detects when the stream ends or transitions to a new tool call, and only then parses the complete JSON and executes the function.
- The UX challenge is what to show the user during tool call accumulation. The user sees nothing while arguments stream in, creating a dead period. Best practice is to show a status indicator: “Searching for weather data…” as soon as you detect the tool call name (which arrives early in the stream, before the arguments). This gives the user feedback that work is happening. Once the tool executes and you feed results back to the model, the second-pass text response streams normally.
- An additional production concern: the accumulated arguments can be syntactically invalid JSON even once the stream is complete. You need a try/except around `json.loads` after accumulation, and a recovery strategy: either retry the API call or return an error to the user. I have seen this happen with complex nested schemas where the model generates a trailing comma or mismatched bracket.
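The accumulator described in the list above could look roughly like this. It assumes tool call deltas shaped like those in OpenAI Chat Completions stream chunks (`index`, `id`, `function.name`, `function.arguments`). The class name and the `fake_delta` helper are hypothetical and exist only so the sketch runs offline.

```python
import json
from types import SimpleNamespace


class ToolCallAccumulator:
    """Collect streamed tool call fragments, keyed by tool call index."""

    def __init__(self):
        self.calls = {}  # index -> {"id", "name", "arguments"}

    def add(self, tool_call_deltas):
        """Merge the tool call fragments from one chunk (may be None)."""
        for d in tool_call_deltas or []:
            call = self.calls.setdefault(
                d.index, {"id": None, "name": "", "arguments": ""})
            if getattr(d, "id", None):
                call["id"] = d.id
            fn = getattr(d, "function", None)
            if fn is not None:
                if getattr(fn, "name", None):
                    call["name"] += fn.name
                if getattr(fn, "arguments", None):
                    call["arguments"] += fn.arguments  # fragment, not JSON

    def finish(self):
        """Parse every call once the stream has ended; never raises."""
        results = []
        for index in sorted(self.calls):
            call = self.calls[index]
            try:
                parsed, error = json.loads(call["arguments"] or "{}"), None
            except json.JSONDecodeError as exc:
                parsed, error = None, str(exc)
            results.append({**call, "parsed": parsed, "error": error})
        return results


def fake_delta(index, id=None, name=None, arguments=None):
    """Hypothetical helper that fakes one streamed tool call fragment."""
    fn = SimpleNamespace(name=name, arguments=arguments)
    return SimpleNamespace(index=index, id=id, function=fn)
```

Because the name is available as soon as the first fragment for an index arrives, `acc.calls[0]["name"]` can drive a status indicator before the arguments are complete.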
When the model requests several tool calls in one response (three, in this example), execute them concurrently with `asyncio.gather` or equivalent, not sequentially. Sequential execution means the total wait time is the sum of all three tool latencies. Concurrent execution means it is the maximum of the three. For three API calls averaging 200ms each, that is roughly 200ms versus 600ms, a 3x improvement. After all three results come back, you append them all to the message history (each with its corresponding `tool_call_id`) and make one more API call to get the final response. The model sees all three results simultaneously and synthesizes a coherent answer. The key implementation detail: you must include every tool result for every tool call. If the model requested three tools and you only return two results, the API returns an error. If one tool fails, return an error object as its result rather than omitting it, so the model can reason about the failure and explain it to the user.
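A sketch of that concurrent execution, assuming each parsed tool call is a dict with `id`, `name`, and `arguments` keys and that tools are async functions held in a hypothetical `registry` dict. The message shape mirrors the `tool_call_id` convention mentioned above.

```python
import asyncio
import json


async def run_tool(call: dict, registry: dict) -> dict:
    """Run one tool call; on failure return an error object, never raise."""
    try:
        tool = registry[call["name"]]
        result = await tool(**call["arguments"])
    except Exception as exc:  # includes unknown tool names (KeyError)
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return {
        "role": "tool",
        "tool_call_id": call["id"],
        "content": json.dumps(result),
    }


async def run_tools_concurrently(calls: list, registry: dict) -> list:
    """Run all tool calls at once; results keep the order of `calls`."""
    return await asyncio.gather(*(run_tool(c, registry) for c in calls))


async def get_weather(city: str) -> dict:
    """Hypothetical tool used only for this example."""
    await asyncio.sleep(0.05)  # stands in for a network call
    return {"city": city, "temp_c": 21}


async def broken_tool() -> dict:
    """Hypothetical tool that always fails."""
    raise RuntimeError("upstream down")
```

Every call produces exactly one tool message, so the follow-up request always contains a result for each `tool_call_id`, even when a tool fails.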
Your streaming endpoint is running in production. One user opens 50 browser tabs and starts 50 concurrent streams, causing your OpenAI rate limits to spike and affecting other users. How do you design rate limiting for streaming specifically?
- Streaming rate limiting has two dimensions that traditional request rate limiting does not: concurrent connections and total generation tokens. A single streaming request can hold a connection open for 30 seconds and consume 2000 tokens, while a non-streaming request is a single brief round-trip. You need to limit both the request rate (messages per minute) and concurrent streams (simultaneous open connections per user).
- For concurrent stream limiting, I would implement a server-side counter per user. When a stream starts, increment the counter. When it ends (complete, error, or client disconnect), decrement it. If the counter exceeds the limit (I would start with 3 concurrent streams per user), reject the new request with a 429 status code and a clear message: “Maximum concurrent streams reached. Please wait for an existing response to complete.” This prevents the 50-tab scenario directly.
- For request rate limiting, use a sliding window: maximum 10-20 requests per minute per user for a chat application. Use Redis or an in-memory counter with TTL. The sliding window is better than a fixed window because it prevents burst-at-boundary attacks where a user sends 20 requests at second 59 and another 20 at second 61 of the next window.
- The OpenAI rate limit protection is a separate concern. You should have a global token bucket or semaphore that limits total concurrent requests to the OpenAI API across all users, sized to stay within your rate limit. If the global limit is hit, queue incoming requests rather than rejecting them, with a timeout. This way, a single abusive user fills the queue but other users’ requests are still eventually served rather than immediately rejected.
- One subtlety specific to streaming: detecting client disconnects. If the user closes the browser tab mid-stream, the server may keep generating tokens and paying for them until the stream finishes. Implement disconnect detection by catching `ConnectionResetError` or using middleware that checks if the client is still connected before yielding each chunk. When a disconnect is detected, cancel the upstream OpenAI stream immediately.
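The per-user limits from the list above can be sketched as a single in-memory class. This is a single-process illustration with hypothetical names; a multi-worker deployment would need shared state such as Redis, as the text notes.

```python
import time
from collections import defaultdict, deque
from contextlib import contextmanager


class TooManyStreams(Exception):
    """Raised when a user is over a limit; map this to HTTP 429."""


class StreamLimiter:
    """Per-user cap on concurrent streams plus a sliding-window request cap."""

    def __init__(self, max_concurrent=3, max_requests=20,
                 window_seconds=60.0, clock=time.monotonic):
        self.max_concurrent = max_concurrent
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._active = defaultdict(int)    # user -> open streams
        self._recent = defaultdict(deque)  # user -> request start times

    def try_acquire(self, user_id: str) -> bool:
        now = self._clock()
        recent = self._recent[user_id]
        # Sliding window: forget requests older than the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if self._active[user_id] >= self.max_concurrent:
            return False
        if len(recent) >= self.max_requests:
            return False
        self._active[user_id] += 1
        recent.append(now)
        return True

    def release(self, user_id: str) -> None:
        self._active[user_id] = max(0, self._active[user_id] - 1)

    @contextmanager
    def stream_slot(self, user_id: str):
        """Hold a slot for the life of one stream; always released on exit."""
        if not self.try_acquire(user_id):
            raise TooManyStreams("Maximum concurrent streams reached.")
        try:
            yield
        finally:
            self.release(user_id)
```

Wrapping the stream in `with limiter.stream_slot(user_id):` releases the slot on completion, error, or client disconnect, which covers the three exit paths listed above.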
Walk me through how you would implement a streaming response that can be canceled mid-generation, including the cost implications.
- Cancellation has three layers: client-side, server-side, and upstream API. The client needs to signal cancellation (user clicks a stop button), the server needs to stop the generator and clean up, and ideally you stop the upstream API call to stop paying for tokens you will not use.
- On the client side with SSE, the client calls `eventSource.close()`, which drops the HTTP connection. With WebSockets, the client sends a cancel message like `{"type": "cancel", "stream_id": "abc123"}`. The server needs to detect the disconnection or receive the cancel message.
- On the server side, the key is making the stream generator cancellation-aware. In Python with FastAPI, the `StreamingResponse` generator can check a cancellation flag between each chunk yield. When using WebSockets, the cancel message sets a flag that the generator checks on its next iteration. The generator should break out of its loop immediately, perform cleanup (close the OpenAI stream, update metrics), and return.
- For the upstream OpenAI API call, the Python client supports `stream.close()`, which terminates the HTTP connection to OpenAI. Crucially, you are still billed for tokens already generated before cancellation. If the model has generated 500 tokens when the user cancels at token 501, you pay for 500 output tokens. You do not get those back. This means cancellation saves future token costs but not past ones.
- The cost implication creates an interesting optimization: if a user consistently cancels after the first paragraph, consider reducing `max_tokens` for that user, or implementing a “pause” feature where the model generates a paragraph at a time and the user can choose to continue or stop. This aligns token generation with actual consumption.
- Implementation detail that trips people up: when the client disconnects during SSE, the server does not get an immediate notification in many frameworks. The server discovers the disconnect only when the next `yield` fails with a write error. This means there can be a lag between the user clicking “stop” and the server actually stopping generation. To minimize this lag, yield frequently (every token) rather than batching tokens before yielding.
In FastAPI (Starlette), use `request.is_disconnected()`, which checks the connection state without writing. Poll this every 500ms in a background task and set a cancellation event when it returns true. Alternatively, use a heartbeat mechanism: send a lightweight SSE comment (`: heartbeat\n\n`) every second. If the write fails, you know immediately. This reduces the detection window from 5 seconds to 1 second, cutting waste by 80%.
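The polling approach might be wired together as follows. This is a sketch under assumptions: `is_disconnected` is any async callable returning a bool (in Starlette, `request.is_disconnected` could be passed here), and `upstream` is an async generator standing in for the model stream.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def watch_disconnect(is_disconnected: Callable[[], Awaitable[bool]],
                           cancel: asyncio.Event,
                           interval: float = 0.5) -> None:
    """Poll the connection state and set `cancel` once the client is gone."""
    while not cancel.is_set():
        if await is_disconnected():
            cancel.set()
            return
        await asyncio.sleep(interval)


async def cancellable_stream(upstream: AsyncIterator[str],
                             is_disconnected: Callable[[], Awaitable[bool]],
                             interval: float = 0.5) -> AsyncIterator[str]:
    """Relay tokens until the upstream ends or the client disconnects."""
    cancel = asyncio.Event()
    watcher = asyncio.create_task(
        watch_disconnect(is_disconnected, cancel, interval))
    try:
        async for token in upstream:
            if cancel.is_set():
                break  # stop relaying; cleanup happens in finally
            yield token
    finally:
        watcher.cancel()
        # Closing the upstream generator is what stops further generation.
        aclose = getattr(upstream, "aclose", None)
        if aclose is not None:
            await aclose()
```

The flag is only checked when a token arrives, so the worst-case lag is one polling interval plus the gap between two tokens.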