Voice AI enables natural human-computer interaction through speech recognition and synthesis. The voice pipeline has three stages: Speech-to-Text (STT) converts audio to text, your AI processes the text (chat, RAG, agents), and Text-to-Speech (TTS) converts the response back to audio. Each stage adds latency, so the end-to-end experience depends on optimizing all three. The gold standard is under 500ms total round-trip, fast enough that conversations feel natural. This chapter covers building production voice applications.
Speech-to-Text with Whisper
Basic Transcription
Multi-Language Support
Audio Processing Pipeline
Text-to-Speech
OpenAI TTS
Voice Selection System
Real-Time Transcription
WebSocket-Based Transcription
Audio Analysis
Speaker Diarization Setup
Audio Content Analysis
- Pre-process audio: reduce noise and normalize volume before transcription. Whisper handles noisy audio reasonably well, but clean audio produces significantly better results. `ffmpeg -i input.mp3 -af "highpass=f=200,lowpass=f=3000,volume=2" output.mp3` is a good starting point.
- Use 16kHz mono as your standard format. It’s Whisper’s native format, reduces file sizes by 4x vs. 44.1kHz stereo, and doesn’t sacrifice transcription quality.
- Chunk long audio at natural pause points (silence detection) rather than at fixed intervals. Cutting mid-sentence confuses the model and produces garbled word boundaries.
- Cache transcriptions aggressively. Audio files don’t change: the same meeting recording always produces the same transcript. No reason to pay for re-transcription.
- Latency budget: For real-time voice assistants, aim for STT (200ms) + LLM (500ms) + TTS (200ms) = under 1 second total. Use `tts-1` (not `tts-1-hd`) and `gpt-4o-mini` (not `gpt-4o`) to hit this target.
- Pitfall: Long silences and background music can confuse Whisper’s language detection. If you know the language, always specify it explicitly with the `language` parameter.
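The caching advice above can be sketched as a small content-addressed cache. This is a minimal illustration, not a production design: `cached_transcribe`, the `transcribe` callable, and the on-disk JSON layout are all hypothetical names chosen for this example, and the actual STT call is injected so the sketch does not assume any particular API client.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_transcribe(
    audio_path: Path,
    transcribe: Callable[[Path], str],  # hypothetical: wraps your real STT call
    cache_dir: Path,
) -> str:
    """Return the transcript for audio_path, calling `transcribe` only on a cache miss."""
    # Key on file contents rather than the filename, so renamed copies still hit the cache.
    digest = hashlib.sha256(audio_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))["text"]

    text = transcribe(audio_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps({"text": text}), encoding="utf-8")
    return text
```

If you vary the model or the `language` parameter between runs, those settings probably belong in the cache key as well, since they can change the transcript for the same audio.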
Practice Exercise
Build a voice-enabled assistant that:
- Accepts continuous voice input
- Transcribes in real-time with speaker detection
- Generates natural voice responses
- Supports multiple languages
- Provides meeting summaries and action items
- Low-latency response times
- Graceful handling of audio quality issues
- Natural conversation flow
- Persistent conversation context
Interview Deep-Dive
You are building a voice assistant that needs to feel conversational -- the user speaks, the assistant responds with audio, and the round-trip should feel natural. Walk me through the latency budget and where each millisecond goes.
- The latency target for a natural-feeling voice conversation is under 1 second total round-trip, ideally under 800ms. Anything above 1.5 seconds feels like talking to someone with a bad satellite connection — the user starts wondering if the system heard them. The budget breaks down into three stages, each with hard constraints.
- Stage one: Speech-to-Text (STT). Whisper API typically returns in 200-500ms depending on audio length. For short utterances (under 10 seconds), expect 200-300ms. This is a network call with audio upload, so the user’s upload bandwidth matters. Optimization: use 16kHz mono audio (Whisper’s native format) to minimize upload size. Pre-process on the client: run silence detection while recording, and send the audio the moment the user stops speaking rather than waiting for a button press. The gap between the user finishing their sentence and the audio hitting the API is dead latency that feels like the system is not listening.
- Stage two: LLM processing. This is your main variable. GPT-4o takes 500-1500ms for the first token depending on prompt complexity. GPT-4o-mini takes 200-500ms. For voice assistants, always use the fastest model that produces acceptable quality. Keep the system prompt short — every token in the prompt adds to time-to-first-token. Stream the LLM response and start TTS on the first complete sentence rather than waiting for the full response. This overlap is the single biggest latency optimization in the pipeline.
- Stage three: Text-to-Speech (TTS). OpenAI’s `tts-1` model is optimized for speed at ~200ms latency. The HD variant (`tts-1-hd`) takes ~400ms. For real-time conversation, always use `tts-1`. Stream the audio output to the client as chunks arrive rather than waiting for the full audio file.
- The critical optimization is pipelining stages two and three. As the LLM generates text token by token, buffer until you have a complete sentence, then immediately start TTS on that sentence while the LLM continues generating the next sentence. The user hears the first sentence of the response while the rest is still being generated. This hides the LLM’s total generation time behind the playback time of the first sentence.
- Total budget with pipelining: STT (250ms) + LLM first sentence (400ms) + TTS first sentence (200ms) = 850ms to first audio. The user starts hearing the response in under a second, and subsequent sentences arrive seamlessly because TTS for sentence N is happening while the LLM generates sentence N+1.
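The sentence-buffering step in the pipelining described above can be sketched as a generator. This is a simplified illustration under stated assumptions: `sentences_from_tokens` is a hypothetical helper, the token stream is any iterable of text fragments (for example, deltas from a streaming LLM response), and a sentence is naively defined as text ending in `.`, `!`, or `?` followed by whitespace.

```python
import re
from typing import Iterable, Iterator

# Naive boundary: whitespace that follows ., !, or ? (assumption for this sketch).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text fragments."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; the last may still grow.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hel", "lo the", "re. How", " are", " you? I am", " fine"]
    for sentence in sentences_from_tokens(stream):
        # In a real pipeline, each sentence would be handed to TTS here
        # while the LLM keeps generating the next one.
        print(sentence)
```

The regex will also split after abbreviations such as "Dr. Smith", which costs a slightly awkward TTS pause rather than correctness. A production system would likely want a more careful boundary detector.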
Whisper transcribes a meeting recording, but the output has no speaker labels -- it is one continuous block of text. How do you add speaker diarization, and what are the accuracy trade-offs?
- Whisper does not natively support speaker diarization — it produces a single transcript without any speaker attribution. This is one of its biggest limitations for meeting and conversation transcription. There are three approaches, each with different accuracy and cost trade-offs.
- Approach one: LLM-based post-processing. Take Whisper’s timestamped segment output and feed it to GPT-4o with a prompt asking it to identify speaker changes based on conversational patterns, topic shifts, and contextual clues (people referring to each other by name, role-based content like “as the designer, I think…”). This is cheap and requires no additional ML infrastructure, but accuracy is 60-75% at best. The LLM is guessing based on content, not audio characteristics. It works reasonably well for 2-3 speakers with distinct roles but degrades rapidly with more speakers or when speakers have similar roles.
- Approach two: dedicated diarization model. Use pyannote.audio (open source) or a commercial service like AssemblyAI. These models analyze the audio signal itself — pitch, cadence, spectral features — to identify speaker boundaries. Accuracy is 85-95% for 2-4 speakers in clean audio. The workflow: run diarization to get speaker-labeled time segments, run Whisper separately to get text with timestamps, then align the two outputs by matching timestamps. The diarization model says “Speaker A from 0:00 to 0:15, Speaker B from 0:16 to 0:30,” and you map Whisper’s text segments into those speaker windows.
- Approach three: use a service that combines both. AssemblyAI, Deepgram, and Google Cloud Speech-to-Text offer integrated transcription + diarization in a single API call. This is the easiest to implement and often the most accurate because the STT and diarization models share information. The trade-off is vendor lock-in and cost ($0.01-0.05 per minute of audio).
- My recommendation for most teams: start with approach three (integrated service) for production, and use approach one (LLM post-processing) for prototype or low-stakes use cases. Approach two is for teams that need the accuracy of dedicated diarization but want to self-host due to data privacy requirements.
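The timestamp alignment step in approach two can be sketched as follows. This is a minimal example, assuming the diarization output and the transcript have already been parsed into simple records. `Turn`, `Segment`, and `label_segments` are hypothetical names for this illustration, not the output types of pyannote.audio or Whisper. Each transcript segment is assigned to the speaker whose turn overlaps it the most.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    """One speaker turn from a diarization model (hypothetical shape)."""
    speaker: str
    start: float  # seconds
    end: float


@dataclass
class Segment:
    """One timestamped transcript segment (hypothetical shape)."""
    start: float  # seconds
    end: float
    text: str


def overlap_seconds(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, or 0 if they are disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(
    segments: List[Segment], turns: List[Turn], unknown: str = "UNKNOWN"
) -> List[Tuple[str, str]]:
    """Pair each segment's text with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap_seconds(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg.text))
    return labeled
```

Segments that straddle a speaker change are assigned wholesale to the dominant speaker here. Word-level timestamps, where available, would allow splitting such segments more precisely.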
You need to process a 3-hour podcast for transcription, analysis, and chapter generation. The Whisper API has a 25MB file limit. Design the end-to-end pipeline.
- A 3-hour podcast in standard MP3 (128kbps) is approximately 170MB — well over the 25MB limit. The pipeline has four stages: preprocessing, chunked transcription, assembly, and analysis.
- Stage one: preprocessing. Convert the audio to 16kHz mono MP3 at 64kbps using ffmpeg. This roughly halves the file size relative to the 128kbps source, with negligible quality loss for speech transcription (Whisper processes at 16kHz internally anyway). The 3-hour file drops from 170MB to about 86MB, still over the limit, so chunking is required. Also apply basic audio cleanup: a high-pass filter at 200Hz to remove rumble, and volume normalization to prevent quiet sections from being missed.
- Stage two: chunked transcription. Split the audio into 10-minute segments with 10 seconds of overlap. Ten-minute chunks at 64kbps are about 4.7MB each, well under the 25MB limit. The overlap is critical: without it, words at chunk boundaries get cut mid-syllable and Whisper produces garbled text at the edges. Use silence detection (ffmpeg’s `silencedetect` filter) to find natural pause points near the 10-minute mark rather than cutting at exact timestamps. This prevents splitting mid-word or mid-sentence. Transcribe each chunk using the Whisper API with `response_format="verbose_json"` to get timestamps.
- Stage three: transcript assembly. Merge the chunk transcripts by aligning the overlapping regions. In the overlap zone, both chunks produced text for the same audio. Take the text from the first chunk’s overlap (which has better context from preceding content) and discard the second chunk’s overlap text. Concatenate the trimmed transcripts. Adjust timestamps to be relative to the full episode (chunk 2’s timestamps start at 10:00, not 0:00).
- Stage four: analysis. With the full transcript assembled, use GPT-4o to generate: episode summary, key topics discussed, chapter markers with timestamps (critical for podcast listeners who want to skip to specific topics), notable quotes, and SEO keywords. For the chapter markers specifically, instruct the model to identify major topic transitions and assign a timestamp and title to each. A 3-hour podcast typically has 10-20 natural chapters.
- Cost and latency: 18 chunks transcribed via Whisper at ~$0.006 per minute come to ~$1.08 total for transcription. One GPT-4o call for analysis with the full transcript (roughly 30K-50K tokens) costs about $2.50 per episode. With parallel chunk transcription, the pipeline completes in 3-5 minutes.
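Stages two and three above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: `plan_chunks` and `merge_transcripts` are hypothetical helpers, silence positions are assumed to have been extracted already (for example, from ffmpeg's `silencedetect` output), and transcript segments are assumed to be `(start, end, text)` tuples with times relative to their chunk.

```python
from typing import List, Tuple

Chunk = Tuple[float, float]           # (start, end) in episode seconds
Segment = Tuple[float, float, str]    # (start, end, text)


def plan_chunks(
    duration: float,
    silences: List[float],
    target: float = 600.0,   # nominal chunk length: 10 minutes
    overlap: float = 10.0,   # each chunk starts this many seconds before the previous cut
    window: float = 30.0,    # how far from the nominal cut a silence may be
) -> List[Chunk]:
    """Choose chunk boundaries near multiples of `target`, snapped to silences when possible."""
    if target <= overlap + window:
        raise ValueError("target must exceed overlap + window to guarantee progress")
    chunks: List[Chunk] = []
    start = 0.0
    while True:
        nominal = start + target
        if nominal >= duration:
            chunks.append((start, duration))
            return chunks
        # Prefer the silence closest to the nominal cut; fall back to the nominal time.
        candidates = [s for s in silences if abs(s - nominal) <= window]
        cut = min(candidates, key=lambda s: abs(s - nominal)) if candidates else nominal
        chunks.append((start, cut))
        start = cut - overlap


def merge_transcripts(
    chunks: List[Chunk], chunk_segments: List[List[Segment]], overlap: float = 10.0
) -> List[Segment]:
    """Shift chunk-relative timestamps to episode time, keeping the earlier chunk's overlap text."""
    merged: List[Segment] = []
    for i, ((chunk_start, _), segments) in enumerate(zip(chunks, chunk_segments)):
        for seg_start, seg_end, text in segments:
            # The first `overlap` seconds of every later chunk repeat the previous chunk.
            if i > 0 and seg_start < overlap:
                continue
            merged.append((chunk_start + seg_start, chunk_start + seg_end, text))
    return merged
```

Dropping segments by start time is a simplification: a segment that begins inside the overlap but ends after it would lose its tail. Handling that case cleanly probably requires word-level timestamps or a text-similarity match across the overlap.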
Without the `language` parameter, Whisper auto-detects the language based on the first 30 seconds. If the host starts in English, Whisper will attempt to transcribe the entire chunk in English, producing garbage during the Spanish sections. There are three strategies. First, if the language switches happen at predictable boundaries (alternating segments), detect the switch points using a language identification model, split there, and transcribe each segment with the correct `language` parameter. Second, if the switching is fluid (code-switching mid-sentence, which is common in bilingual speakers), use Whisper’s translation mode instead of transcription: set `task="translate"` to get everything in English (in the hosted API, this is the separate translations endpoint). You lose the Spanish text, but you get a coherent single-language transcript. Third, use a multilingual-aware transcription service like AssemblyAI that handles code-switching natively. Whisper’s large-v3 model (self-hosted) handles mixed language better than the API’s `whisper-1`, so if quality matters, consider running Whisper locally with the large model, which can detect and transcribe language switches within a single audio segment.