Transport Protocol Comparison
| Protocol | Direction | Connection | Proxy/CDN Support | Reconnection | Best For |
|---|---|---|---|---|---|
| HTTP/REST | Request-response | Short-lived | Excellent | N/A | Simple queries, non-streaming |
| SSE | Server-to-client only | Persistent (HTTP) | Good (standard HTTP) | Built-in auto-reconnect | AI chat streaming, notifications |
| WebSocket | Bidirectional | Persistent (upgrade) | Requires config | Manual implementation | Voice chat, collaborative editing |
| Realtime API | Bidirectional audio | Persistent (WebSocket) | Requires config | Manual implementation | Voice assistants, live audio |
- Default to SSE for AI chat applications: it covers the large majority of use cases with the least infrastructure pain.
- Use WebSocket only when the client needs to send frequent, low-latency messages (voice streaming, collaborative real-time editing, gaming).
- Use the Realtime API when you need native audio-in/audio-out without separate STT/TTS round-trips.
- Stick with HTTP for batch processing, non-interactive workloads, or environments where persistent connections are impractical (serverless functions with hard timeout limits).
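The decision rules above can be sketched as a small function. This is only an illustration of the order in which the rules apply; the `Transport` enum, the function name, and the flag names are hypothetical and do not come from any library.

```python
from enum import Enum


class Transport(Enum):
    HTTP = "HTTP/REST"
    SSE = "SSE"
    WEBSOCKET = "WebSocket"
    REALTIME_API = "Realtime API"


def choose_transport(
    *,
    streaming: bool = True,
    persistent_connections_ok: bool = True,
    client_sends_frequent_messages: bool = False,
    native_audio: bool = False,
) -> Transport:
    """Apply the rules of thumb from the list above, most restrictive first."""
    # Batch or non-interactive work, or hard timeout limits: plain HTTP.
    if not streaming or not persistent_connections_ok:
        return Transport.HTTP
    # Audio in and audio out without separate STT/TTS round-trips.
    if native_audio:
        return Transport.REALTIME_API
    # Frequent, low-latency client-to-server messages.
    if client_sends_frequent_messages:
        return Transport.WEBSOCKET
    # Everything else, including ordinary AI chat streaming.
    return Transport.SSE
```

Calling `choose_transport()` with no arguments returns `Transport.SSE`, which mirrors the "default to SSE" advice.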
WebSocket Chat
Basic WebSocket Server
WebSocket Client
Server-Sent Events (SSE)
SSE is the simpler alternative to WebSockets when you only need server-to-client streaming, which covers most AI chat use cases. The client sends a regular HTTP request, and the server keeps the connection open to push data. SSE needs no special protocol, works through proxies and load balancers, and the browser's EventSource client reconnects automatically on failure. Use SSE unless you specifically need bidirectional communication.
OpenAI Realtime API
The Realtime API is fundamentally different from the standard chat API. Instead of request-response, it is a persistent WebSocket connection where audio flows in both directions simultaneously. The model listens while it speaks, detects when you start talking (voice activity detection), and responds with audio directly, with no separate STT/TTS round-trips. This cuts latency from seconds to hundreds of milliseconds.
Audio Conversation
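As a rough sketch of what sending audio over that connection looks like at the message level, the snippet below splits raw PCM audio into short frames and wraps each one in an `input_audio_buffer.append` client event with base64-encoded audio. The helper names are hypothetical, the 4800-byte frame size assumes 16-bit mono PCM at 24 kHz (about 100 ms per frame), and the event shape should be checked against the current Realtime API reference. Opening and authenticating the WebSocket is deliberately left out.

```python
import base64
import json
from typing import Iterator


def chunk_pcm(pcm: bytes, chunk_bytes: int = 4800) -> Iterator[bytes]:
    """Split raw PCM audio into fixed-size frames (the last one may be shorter)."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def audio_append_event(chunk: bytes) -> str:
    """Serialize one audio frame as a JSON client event.

    Assumption: the event type and the base64 "audio" field follow the
    Realtime API's input audio buffer events.
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })
```

Each string returned by `audio_append_event` would be sent as one WebSocket text message while the microphone is recording.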
Voice Activity Detection
Latency Optimization
Typing Indicators and Progress
Realtime Collaboration
Realtime Best Practices
- Use WebSockets for bidirectional communication, SSE for server-to-client only (SSE is simpler and more proxy-friendly)
- Implement typing indicators — without them, users have no idea if the system is working or frozen
- Stream tokens as they are generated — users can start reading while the model is still generating
- Handle disconnections gracefully — mobile users lose connection constantly; reconnect and resume without losing context
- Monitor Time-to-First-Token (TTFT) as your primary latency metric — total time matters less than perceived responsiveness
- Watch out for Nginx/Cloudflare buffering: it can silently destroy your streaming experience
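To make the TTFT advice above concrete, here is a minimal sketch of a wrapper that records time-to-first-token and total time while passing tokens through unchanged. `StreamTimer` is a hypothetical helper, not part of any SDK, and it assumes the token source is a plain synchronous iterable.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


class StreamTimer:
    """Wrap a token iterable and record TTFT and total streaming time."""

    def __init__(
        self,
        tokens: Iterable[str],
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self._tokens = tokens
        self._clock = clock
        self.ttft: Optional[float] = None   # seconds until the first token
        self.total: Optional[float] = None  # seconds until the stream ended
        self.token_count = 0

    def __iter__(self) -> Iterator[str]:
        # Timing starts when iteration begins, so start iterating as soon
        # as the request is issued.
        start = self._clock()
        try:
            for token in self._tokens:
                if self.ttft is None:
                    self.ttft = self._clock() - start
                self.token_count += 1
                yield token
        finally:
            # Runs on normal completion and when the consumer stops early.
            self.total = self._clock() - start
```

After iterating `StreamTimer(stream)` to forward tokens to the client, `ttft` and `total` can be logged or exported to whatever metrics system is in use.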
Edge Cases and Production Gotchas
Client disconnects mid-stream
Problem: The user closes the tab or loses cellular signal while the LLM is still generating. Your server keeps generating tokens and paying for them, but nobody is listening.
Fix: Check the connection state before sending each chunk. In a FastAPI WebSocket handler, wrap send_text in a try/except for WebSocketDisconnect. For SSE, the StreamingResponse generator should catch asyncio.CancelledError, which is raised inside the generator when the response task is cancelled after the client disconnects, and re-raise it after cleanup. Always clean up resources (cancel the OpenAI stream, decrement active connection counters) in a finally block.
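The cleanup pattern can be sketched with plain asyncio, without a web framework. In this illustration `fake_llm_stream` stands in for the upstream LLM stream, `relay` for the request handler, and `task.cancel()` for the framework cancelling the handler on disconnect; all three names are hypothetical.

```python
import asyncio
from typing import AsyncGenerator, Dict, List


async def fake_llm_stream(n: int) -> AsyncGenerator[str, None]:
    """Stand-in for an upstream token stream (hypothetical)."""
    for i in range(n):
        await asyncio.sleep(0.01)
        yield f"tok{i}"


async def relay(
    tokens: AsyncGenerator[str, None], sent: List[str], state: Dict[str, int]
) -> None:
    """Forward tokens to the client and always release resources."""
    state["active"] += 1
    try:
        async for token in tokens:
            sent.append(token)  # a real handler would send to the client here
    except asyncio.CancelledError:
        state["cancelled"] = 1  # record the disconnect, then propagate it
        raise
    finally:
        await tokens.aclose()   # stop the upstream stream
        state["active"] -= 1    # decrement the active connection counter


async def demo() -> Dict[str, int]:
    state = {"active": 0, "cancelled": 0}
    sent: List[str] = []
    task = asyncio.create_task(relay(fake_llm_stream(1000), sent, state))
    while not sent:             # wait until at least one token went out
        await asyncio.sleep(0.001)
    task.cancel()               # simulate the client disconnecting
    try:
        await task
    except asyncio.CancelledError:
        pass
    state["sent"] = len(sent)
    return state
```

Running `asyncio.run(demo())` shows the counter back at zero and far fewer than 1000 tokens forwarded, which is the behavior you want when nobody is listening.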
Nginx/Cloudflare buffering destroys streaming
Problem: Tokens arrive in large bursts instead of one-by-one because the reverse proxy buffers the response before forwarding it.
Fix: For Nginx, add proxy_buffering off; and send the X-Accel-Buffering: no response header. For Cloudflare, disable “Rocket Loader” and “Auto Minify.” For AWS ALB, set the idle timeout higher than your longest expected stream. This is the most common “streaming isn’t working” issue in production, and it is usually an infrastructure problem rather than an application-layer bug.
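The header half of the Nginx fix can be applied from the application itself. The sketch below builds the response headers for a streaming endpoint; `no_buffering_headers` is a hypothetical helper, and `Cache-Control: no-cache` is an addition commonly paired with event streams rather than something the fix above requires.

```python
from typing import Dict, Optional


def no_buffering_headers(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Headers that ask intermediaries not to buffer or cache a stream."""
    headers = {
        # Tells Nginx to disable proxy buffering for this response only.
        "X-Accel-Buffering": "no",
        # Assumption: caching a live stream is never wanted.
        "Cache-Control": "no-cache",
    }
    if extra:
        headers.update(extra)
    return headers
```

In FastAPI, the resulting dictionary can be passed as the `headers` argument of `StreamingResponse` alongside `media_type="text/event-stream"`.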
Mobile networks with high latency and packet loss
Problem: WebSocket connections drop frequently on 3G/4G. Users see partial responses and frozen UIs.
Fix: Implement message sequence numbers so the client can detect gaps. Add a reconnection protocol that resumes from the last received sequence number. For SSE, the browser's EventSource sends the Last-Event-ID header automatically on reconnect, so the server only has to replay the events after that ID. That is another reason to prefer SSE over WebSocket for mobile clients.
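A minimal sketch of the resume logic, assuming the server keeps recent events for a stream in memory: each event gets an increasing ID, is framed in the SSE wire format with an `id:` line, and can be replayed from the ID the client last saw. `ReplayBuffer` and `format_sse` are hypothetical names; the same buffer works for WebSocket sequence numbers.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


def format_sse(event_id: int, data: str) -> str:
    """Frame one event in the SSE wire format, one data line per text line."""
    lines = [f"id: {event_id}"]
    lines += [f"data: {part}" for part in data.split("\n")]
    return "\n".join(lines) + "\n\n"


class ReplayBuffer:
    """Keep the most recent events so a reconnecting client can catch up."""

    def __init__(self, max_events: int = 1000) -> None:
        self._events: Deque[Tuple[int, str]] = deque(maxlen=max_events)
        self._next_id = 1

    def append(self, data: str) -> int:
        """Store an event and return the sequence number assigned to it."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, data))
        return event_id

    def since(self, last_event_id: Optional[int]) -> List[Tuple[int, str]]:
        """Return events newer than the ID from the Last-Event-ID header."""
        if last_event_id is None:
            return list(self._events)
        return [(i, d) for i, d in self._events if i > last_event_id]
```

On reconnect, the handler would parse `Last-Event-ID`, send `format_sse(...)` for everything returned by `since(...)`, and then continue with live tokens. Events older than `max_events` are lost, so a client that was offline too long still needs a full restart path.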
Multiple concurrent streams from one user
Problem: A user opens three tabs, each initiating a stream. Your server is now running three concurrent LLM calls for one person, tripling cost and potentially hitting rate limits.
Fix: Track active streams per user (using the RateLimitedStreamer pattern above). Either reject new streams while one is active, or cancel the previous stream when a new one starts. Include the stream ID in the WebSocket session so the client can resume rather than restart.
Practice Exercise
Build a realtime AI application that:
- Uses WebSockets for bidirectional communication
- Implements voice activity detection
- Streams responses with latency metrics
- Shows typing indicators and progress
- Supports multiple concurrent users
Focus on:
- Minimizing time to first token
- A smooth streaming experience
- Graceful error handling
- Scalable connection management