Voice AI enables natural human-computer interaction through speech recognition and synthesis. The voice pipeline has three stages: Speech-to-Text (STT) converts audio to text, your AI processes the text (chat, RAG, agents), and Text-to-Speech (TTS) converts the response back to audio. Each stage adds latency, so the end-to-end experience depends on optimizing all three. The gold standard is under 500ms total round-trip, fast enough that conversations feel natural; the practical budgets later in this chapter aim for under one second. This chapter covers building production voice applications.

Speech-to-Text with Whisper

Basic Transcription

from openai import OpenAI
from pathlib import Path


def transcribe_audio(audio_path: str) -> str:
    """Transcribe audio file using Whisper."""
    client = OpenAI()
    
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    
    return transcript.text


def transcribe_with_timestamps(audio_path: str) -> dict:
    """Transcribe with word-level timestamps.
    
    Why timestamps? They enable: subtitle generation, speaker diarization,
    jump-to-moment in audio/video, and aligning transcript with visual content.
    Word-level timestamps are especially useful for karaoke-style highlighting
    in real-time transcription UIs.
    """
    client = OpenAI()
    
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"]
        )
    
    return {
        "text": transcript.text,
        "segments": transcript.segments,
        "words": transcript.words
    }


# Usage
result = transcribe_with_timestamps("meeting.mp3")
print(f"Full transcript: {result['text'][:200]}...")

print("\nSegments:")
for segment in result['segments'][:3]:
    # Recent versions of the openai SDK return segment objects, not dicts,
    # so fields are read as attributes
    print(f"  [{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")

Multi-Language Support

from openai import OpenAI


class MultiLanguageTranscriber:
    """Transcribe audio in multiple languages."""
    
    def __init__(self):
        self.client = OpenAI()
    
    def transcribe(
        self,
        audio_path: str,
        language: str = None,
        translate: bool = False
    ) -> dict:
        """Transcribe or translate audio."""
        with open(audio_path, "rb") as audio_file:
            if translate:
                # Translate to English
                result = self.client.audio.translations.create(
                    model="whisper-1",
                    file=audio_file
                )
            else:
                # Transcribe in original language
                kwargs = {"model": "whisper-1", "file": audio_file}
                if language:
                    kwargs["language"] = language
                
                result = self.client.audio.transcriptions.create(**kwargs)
        
        return {"text": result.text, "translated": translate}
    
    def detect_and_transcribe(self, audio_path: str) -> dict:
        """Detect language and transcribe."""
        # First pass: get verbose response to detect language
        with open(audio_path, "rb") as audio_file:
            result = self.client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                response_format="verbose_json"
            )
        
        return {
            "text": result.text,
            "language": result.language,
            "duration": result.duration
        }


# Usage
transcriber = MultiLanguageTranscriber()

# Transcribe in specific language
result = transcriber.transcribe("spanish_audio.mp3", language="es")
print(f"Spanish: {result['text']}")

# Translate to English
result = transcriber.transcribe("french_audio.mp3", translate=True)
print(f"Translated: {result['text']}")

Audio Processing Pipeline

from openai import OpenAI
from pathlib import Path
from dataclasses import dataclass
import subprocess
import tempfile
import os


@dataclass
class AudioChunk:
    """A chunk of processed audio."""
    path: str
    start_time: float
    duration: float


class AudioProcessor:
    """Process audio files for transcription.
    
    The Whisper API has a 25MB file size limit. For longer audio (podcasts,
    meetings, lectures), you need to chunk the audio into segments, transcribe
    each, and concatenate. This class handles the full pipeline including
    format validation, duration detection, and fixed-interval splitting.
    
    Tip: Always pre-process audio to 16kHz mono before transcription -- 
    it reduces file size and is Whisper's native format.
    """
    
    SUPPORTED_FORMATS = [".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"]
    MAX_FILE_SIZE = 25 * 1024 * 1024  # 25 MB -- Whisper API limit
    
    def __init__(self):
        self.client = OpenAI()
    
    def validate_file(self, path: str) -> bool:
        """Check if file is valid for transcription."""
        file_path = Path(path)
        
        if not file_path.exists():
            raise FileNotFoundError(f"File not found: {path}")
        
        if file_path.suffix.lower() not in self.SUPPORTED_FORMATS:
            raise ValueError(f"Unsupported format: {file_path.suffix}")
        
        if file_path.stat().st_size > self.MAX_FILE_SIZE:
            return False  # Needs chunking
        
        return True
    
    def get_duration(self, path: str) -> float:
        """Get audio duration in seconds using ffprobe."""
        cmd = [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path
        ]
        
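        # check=True raises CalledProcessError if ffprobe fails, rather than
        # failing later when float() receives an empty string
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return float(result.stdout.strip())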
    
    def split_audio(
        self,
        path: str,
        chunk_duration: float = 600  # 10 minutes
    ) -> list[AudioChunk]:
        """Split audio into chunks."""
        total_duration = self.get_duration(path)
        chunks = []
        
        temp_dir = tempfile.mkdtemp()
        
        current_time = 0
        chunk_num = 0
        
        while current_time < total_duration:
            chunk_path = os.path.join(temp_dir, f"chunk_{chunk_num}.mp3")
            duration = min(chunk_duration, total_duration - current_time)
            
            # Use ffmpeg to extract chunk
            cmd = [
                "ffmpeg", "-y",
                "-i", path,
                "-ss", str(current_time),
                "-t", str(duration),
                "-acodec", "libmp3lame",
                "-q:a", "2",
                chunk_path
            ]
            
            # check=True stops on ffmpeg errors instead of continuing with a missing chunk
            subprocess.run(cmd, capture_output=True, check=True)
            
            chunks.append(AudioChunk(
                path=chunk_path,
                start_time=current_time,
                duration=duration
            ))
            
            current_time += duration
            chunk_num += 1
        
        return chunks
    
    def transcribe_long_audio(
        self,
        path: str,
        chunk_duration: float = 600
    ) -> str:
        """Transcribe long audio files by chunking."""
        if self.validate_file(path):
            # File is small enough, transcribe directly
            with open(path, "rb") as f:
                result = self.client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f
                )
            return result.text
        
        # Split and transcribe chunks
        chunks = self.split_audio(path, chunk_duration)
        transcripts = []
        
        for chunk in chunks:
            with open(chunk.path, "rb") as f:
                result = self.client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f
                )
            transcripts.append(result.text)
            
            # Clean up chunk file
            os.remove(chunk.path)
        
        return " ".join(transcripts)


# Usage
processor = AudioProcessor()

# Transcribe a long podcast episode
transcript = processor.transcribe_long_audio("podcast_episode.mp3")
print(f"Transcript ({len(transcript)} chars): {transcript[:500]}...")
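
The 16kHz mono tip from the AudioProcessor docstring can be applied with a small ffmpeg wrapper. This is a minimal sketch: build_resample_command and resample_for_transcription are hypothetical helper names, not part of the class above, and the second one assumes ffmpeg is installed and on the PATH.

```python
import subprocess


def build_resample_command(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that converts `src` to mono audio at `sample_rate` Hz."""
    return [
        "ffmpeg", "-y",           # overwrite dst if it already exists
        "-i", src,
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the target rate
        dst,
    ]


def resample_for_transcription(src: str, dst: str) -> str:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_resample_command(src, dst), check=True, capture_output=True)
    return dst


# Only builds and prints the command, so this runs without ffmpeg or audio files
cmd = build_resample_command("podcast_episode.mp3", "podcast_16k.mp3")
print(" ".join(cmd))
```

Separating command construction from execution keeps the flags easy to inspect and unit-test without touching real audio.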

Text-to-Speech

OpenAI TTS

from openai import OpenAI
from pathlib import Path


def text_to_speech(
    text: str,
    output_path: str,
    voice: str = "alloy",
    model: str = "tts-1"
) -> str:
    """Convert text to speech.
    
    Two models available:
    - tts-1: Optimized for speed, with the lower latency of the two. Good enough for real-time.
    - tts-1-hd: Optimized for quality at higher latency. Better for pre-generated content.
    
    For real-time voice assistants, prefer tts-1: users notice latency 
    more than audio quality in conversation.
    """
    client = OpenAI()
    
    # Available voices: alloy (neutral), echo (warm), fable (British/narrative), 
    # onyx (deep/authoritative), nova (friendly/energetic), shimmer (calm/soothing)
    response = client.audio.speech.create(
        model=model,
        voice=voice,
        input=text
    )
    
    # write_to_file replaces stream_to_file, which the SDK has deprecated
    response.write_to_file(output_path)
    return output_path


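def text_to_speech_streaming(text: str, voice: str = "alloy"):
    """Stream TTS audio as it is generated."""
    client = OpenAI()
    
    # with_streaming_response defers reading the body, so chunks are yielded
    # as they arrive instead of after the whole file has been downloaded
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice=voice,
        input=text,
        response_format="opus"  # Good for streaming
    ) as response:
        # Iterate over audio chunks
        for chunk in response.iter_bytes():
            yield chunk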


# Usage
# Generate speech file
text_to_speech(
    "Hello! This is a test of the text-to-speech system.",
    "output.mp3",
    voice="nova"
)

# Streaming usage
for audio_chunk in text_to_speech_streaming("Streaming audio test"):
    # Process or play audio chunk
    pass

Voice Selection System

from openai import OpenAI
from dataclasses import dataclass
from enum import Enum


class VoiceStyle(Enum):
    PROFESSIONAL = "professional"
    CASUAL = "casual"
    ENERGETIC = "energetic"
    CALM = "calm"
    NARRATIVE = "narrative"


@dataclass
class VoiceProfile:
    """Profile for a TTS voice."""
    name: str
    openai_voice: str
    style: VoiceStyle
    description: str


class VoiceSelector:
    """Select appropriate voice for content."""
    
    VOICES = [
        VoiceProfile("Professional", "onyx", VoiceStyle.PROFESSIONAL, 
                    "Deep, authoritative voice for business content"),
        VoiceProfile("Friendly", "nova", VoiceStyle.CASUAL,
                    "Warm, approachable voice for casual content"),
        VoiceProfile("Dynamic", "echo", VoiceStyle.ENERGETIC,
                    "Energetic voice for marketing and presentations"),
        VoiceProfile("Soothing", "shimmer", VoiceStyle.CALM,
                    "Calm, gentle voice for meditation and relaxation"),
        VoiceProfile("Storyteller", "fable", VoiceStyle.NARRATIVE,
                    "Expressive voice for stories and narratives"),
        VoiceProfile("Neutral", "alloy", VoiceStyle.CASUAL,
                    "Balanced, versatile voice for general use"),
    ]
    
    def __init__(self):
        self.client = OpenAI()
    
    def select_voice(
        self,
        content_type: str = None,
        style: VoiceStyle = None
    ) -> VoiceProfile:
        """Select voice based on content or style."""
        if style:
            matches = [v for v in self.VOICES if v.style == style]
            return matches[0] if matches else self.VOICES[0]
        
        # Auto-detect based on content type
        content_voice_map = {
            "business": VoiceStyle.PROFESSIONAL,
            "tutorial": VoiceStyle.CASUAL,
            "marketing": VoiceStyle.ENERGETIC,
            "meditation": VoiceStyle.CALM,
            "story": VoiceStyle.NARRATIVE,
        }
        
        style = content_voice_map.get(content_type, VoiceStyle.CASUAL)
        return self.select_voice(style=style)
    
    def generate_speech(
        self,
        text: str,
        output_path: str,
        content_type: str = None,
        style: VoiceStyle = None,
        hd_quality: bool = False
    ) -> dict:
        """Generate speech with auto-selected voice."""
        voice = self.select_voice(content_type, style)
        
        response = self.client.audio.speech.create(
            model="tts-1-hd" if hd_quality else "tts-1",
            voice=voice.openai_voice,
            input=text
        )
        
        # write_to_file replaces the deprecated stream_to_file
        response.write_to_file(output_path)
        
        return {
            "path": output_path,
            "voice": voice.name,
            "openai_voice": voice.openai_voice,
            "style": voice.style.value
        }


# Usage
selector = VoiceSelector()

# Generate business presentation audio
result = selector.generate_speech(
    "Welcome to our quarterly earnings report...",
    "presentation.mp3",
    content_type="business",
    hd_quality=True
)
print(f"Generated with voice: {result['voice']}")

# Generate meditation audio
result = selector.generate_speech(
    "Take a deep breath and relax...",
    "meditation.mp3",
    style=VoiceStyle.CALM
)
print(f"Generated with voice: {result['voice']}")

Real-Time Transcription

WebSocket-Based Transcription

import asyncio
from openai import OpenAI
import tempfile
import wave
import struct


class RealtimeTranscriber:
    """Real-time audio transcription."""
    
    def __init__(
        self,
        chunk_duration: float = 5.0,
        sample_rate: int = 16000
    ):
        self.client = OpenAI()
        self.chunk_duration = chunk_duration
        self.sample_rate = sample_rate
        self.buffer = []
    
    async def process_audio_stream(
        self,
        audio_generator,
        callback
    ):
        """Process streaming audio and transcribe."""
        samples_per_chunk = int(self.sample_rate * self.chunk_duration)
        
        async for audio_data in audio_generator:
            self.buffer.extend(audio_data)
            
            while len(self.buffer) >= samples_per_chunk:
                chunk = self.buffer[:samples_per_chunk]
                self.buffer = self.buffer[samples_per_chunk:]
                
                # Transcribe chunk
                transcript = await self._transcribe_chunk(chunk)
                
                if transcript:
                    await callback(transcript)
    
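    async def _transcribe_chunk(self, audio_samples: list) -> str:
        """Transcribe an audio chunk."""
        # Write the chunk to a WAV file inside a temp directory that is
        # removed automatically, so no files are left behind
        with tempfile.TemporaryDirectory() as tmp_dir:
            wav_path = f"{tmp_dir}/chunk.wav"
            with wave.open(wav_path, "wb") as wav:
                wav.setnchannels(1)
                wav.setsampwidth(2)  # 16-bit
                wav.setframerate(self.sample_rate)
                
                # Convert float samples to 16-bit integers, clipping to
                # [-1.0, 1.0] so out-of-range values cannot overflow
                int_samples = [
                    int(max(-1.0, min(1.0, s)) * 32767) for s in audio_samples
                ]
                wav.writeframes(struct.pack(f"<{len(int_samples)}h", *int_samples))
            
            def transcribe_file() -> str:
                with open(wav_path, "rb") as audio_file:
                    result = self.client.audio.transcriptions.create(
                        model="whisper-1",
                        file=audio_file
                    )
                return result.text
            
            # Run the blocking API call in a worker thread so the event loop
            # can keep receiving audio
            return await asyncio.to_thread(transcribe_file)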


class VoiceAssistant:
    """Voice-based assistant with real-time transcription."""
    
    def __init__(self):
        self.client = OpenAI()
        self.transcriber = RealtimeTranscriber()
        self.conversation_history = []
    
    def process_voice_input(self, audio_path: str) -> dict:
        """Process voice input and generate response."""
        # Transcribe
        with open(audio_path, "rb") as f:
            transcription = self.client.audio.transcriptions.create(
                model="whisper-1",
                file=f
            )
        
        user_text = transcription.text
        
        # Add to conversation
        self.conversation_history.append({
            "role": "user",
            "content": user_text
        })
        
        # Generate response
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a helpful voice assistant. Keep responses concise and conversational."},
                *self.conversation_history
            ]
        )
        
        assistant_text = response.choices[0].message.content
        
        self.conversation_history.append({
            "role": "assistant",
            "content": assistant_text
        })
        
        # Generate speech response
        audio_response = self.client.audio.speech.create(
            model="tts-1",
            voice="nova",
            input=assistant_text
        )
        
        # Save response audio
        response_path = "response.mp3"
        # write_to_file replaces the deprecated stream_to_file
        audio_response.write_to_file(response_path)
        
        return {
            "user_text": user_text,
            "assistant_text": assistant_text,
            "audio_path": response_path
        }


# Usage
assistant = VoiceAssistant()

# Process voice input
result = assistant.process_voice_input("user_question.mp3")
print(f"User said: {result['user_text']}")
print(f"Assistant: {result['assistant_text']}")
print(f"Audio response: {result['audio_path']}")
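
Because each stage of the pipeline adds latency, it helps to measure where the time goes before optimizing. The sketch below is one possible way to do that. StageTimer is a hypothetical helper, and the time.sleep calls are stand-ins for the real transcription, chat, and speech requests.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collect wall-clock durations (in milliseconds) for named pipeline stages."""

    def __init__(self):
        self.durations_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures are timed too
            self.durations_ms[name] = (time.perf_counter() - start) * 1000

    def total_ms(self) -> float:
        return sum(self.durations_ms.values())


timer = StageTimer()
with timer.stage("stt"):
    time.sleep(0.01)  # stand-in for the transcription request
with timer.stage("llm"):
    time.sleep(0.01)  # stand-in for the chat completion request
with timer.stage("tts"):
    time.sleep(0.01)  # stand-in for the speech request

for name, ms in timer.durations_ms.items():
    print(f"{name}: {ms:.0f} ms")
print(f"total: {timer.total_ms():.0f} ms")
```

Wrapping each call in process_voice_input with a stage like this would show which of the three steps dominates for your audio lengths and network.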

Audio Analysis

Speaker Diarization Setup

from openai import OpenAI
from dataclasses import dataclass
import json


@dataclass
class SpeakerSegment:
    """A segment of speech from one speaker."""
    speaker_id: str
    start_time: float
    end_time: float
    text: str


class MeetingTranscriber:
    """Transcribe meetings with speaker identification.
    
    Important limitation: Whisper does NOT natively support speaker diarization
    (identifying who said what). This implementation uses a workaround: transcribe
    first with timestamps, then use an LLM to identify speaker changes based on
    content patterns, voice descriptions, and conversational flow. For production
    diarization, consider dedicated services like pyannote.audio or AssemblyAI.
    """
    
    def __init__(self):
        self.client = OpenAI()
    
    def transcribe_meeting(self, audio_path: str) -> dict:
        """Transcribe a meeting and use LLM to identify speakers."""
        # Get detailed transcription
        with open(audio_path, "rb") as f:
            transcription = self.client.audio.transcriptions.create(
                model="whisper-1",
                file=f,
                response_format="verbose_json",
                timestamp_granularities=["segment"]
            )
        
        # Use LLM to identify speakers and structure content
        # Recent versions of the openai SDK return segment objects, not dicts,
        # so fields are read as attributes
        segments_text = "\n".join([
            f"[{s.start:.1f}s - {s.end:.1f}s]: {s.text}"
            for s in transcription.segments
        ])
        
        analysis_prompt = f"""Analyze this meeting transcript and identify different speakers.
Label speakers as Speaker 1, Speaker 2, etc.

Transcript with timestamps:
{segments_text}

Return as JSON:
{{
    "speakers": [{{"id": "Speaker 1", "description": "brief description"}}],
    "segments": [
        {{"speaker": "Speaker 1", "start": 0.0, "end": 5.0, "text": "..."}}
    ],
    "summary": "brief meeting summary",
    "action_items": ["list of action items"]
}}"""
        
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": analysis_prompt}],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)
    
    def generate_meeting_notes(self, analysis: dict) -> str:
        """Generate formatted meeting notes."""
        notes = ["# Meeting Notes\n"]
        
        notes.append("## Summary")
        notes.append(analysis.get("summary", "No summary available."))
        notes.append("")
        
        notes.append("## Participants")
        for speaker in analysis.get("speakers", []):
            notes.append(f"- {speaker['id']}: {speaker.get('description', '')}")
        notes.append("")
        
        notes.append("## Transcript")
        for segment in analysis.get("segments", []):
            notes.append(f"**{segment['speaker']}** ({segment['start']:.0f}s): {segment['text']}")
        notes.append("")
        
        notes.append("## Action Items")
        for item in analysis.get("action_items", []):
            notes.append(f"- [ ] {item}")
        
        return "\n".join(notes)


# Usage
transcriber = MeetingTranscriber()

analysis = transcriber.transcribe_meeting("team_meeting.mp3")
notes = transcriber.generate_meeting_notes(analysis)

print(notes)

Audio Content Analysis

from openai import OpenAI
import json


class AudioAnalyzer:
    """Analyze audio content for various purposes."""
    
    def __init__(self):
        self.client = OpenAI()
    
    def analyze_podcast(self, audio_path: str) -> dict:
        """Analyze a podcast episode."""
        # Transcribe
        with open(audio_path, "rb") as f:
            transcription = self.client.audio.transcriptions.create(
                model="whisper-1",
                file=f
            )
        
        # Analyze content
        prompt = f"""Analyze this podcast transcript:

{transcription.text}

Provide a comprehensive analysis as JSON:
{{
    "title_suggestion": "suggested episode title",
    "topics": ["main topics discussed"],
    "key_points": ["key takeaways"],
    "quotes": ["notable quotable moments"],
    "sentiment": "overall tone (positive/neutral/negative)",
    "target_audience": "who would benefit from this",
    "seo_keywords": ["relevant keywords"],
    "chapter_markers": [
        {{"title": "Introduction", "description": "brief description"}}
    ],
    "summary": "2-3 paragraph summary"
}}"""
        
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)
    
    def transcribe_and_translate(
        self,
        audio_path: str,
        target_languages: list[str]
    ) -> dict:
        """Transcribe and translate to multiple languages."""
        # Get English transcript
        with open(audio_path, "rb") as f:
            english = self.client.audio.translations.create(
                model="whisper-1",
                file=f
            )
        
        translations = {"en": english.text}
        
        # Translate to target languages
        for lang in target_languages:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "system",
                        "content": f"Translate the following text to {lang}. Maintain the original meaning and tone."
                    },
                    {"role": "user", "content": english.text}
                ]
            )
            
            translations[lang] = response.choices[0].message.content
        
        return translations


# Usage
analyzer = AudioAnalyzer()

# Analyze podcast
analysis = analyzer.analyze_podcast("episode.mp3")
print(f"Suggested title: {analysis['title_suggestion']}")
print(f"Topics: {analysis['topics']}")

# Multi-language transcription
translations = analyzer.transcribe_and_translate(
    "speech.mp3",
    ["es", "fr", "de"]
)
for lang, text in translations.items():
    print(f"{lang}: {text[:100]}...")

Voice AI Best Practices
  • Pre-process audio: reduce noise and normalize volume before transcription. Whisper handles noisy audio reasonably well, but clean audio produces significantly better results. ffmpeg -i input.mp3 -af "highpass=f=200,lowpass=f=3000,volume=2" output.mp3 is a good starting point; note that volume=2 applies a fixed 2x gain, so use ffmpeg’s loudnorm filter instead if you need true loudness normalization.
  • Use 16kHz mono as your standard format. Whisper processes audio at 16kHz internally, so this doesn’t sacrifice transcription quality, and uncompressed 16kHz mono is roughly 5.5x smaller than 44.1kHz stereo.
  • Chunk long audio at natural pause points (silence detection) rather than at fixed intervals. Cutting mid-sentence confuses the model and produces garbled word boundaries.
  • Cache transcriptions aggressively. Audio files don’t change — the same meeting recording always produces the same transcript. No reason to pay for re-transcription.
  • Latency budget: For real-time voice assistants, aim for STT (200ms) + LLM (500ms) + TTS (200ms) = under 1 second total. Use tts-1 (not tts-1-hd) and gpt-4o-mini (not gpt-4o) to hit this target.
  • Pitfall: Long silences and background music can confuse Whisper’s language detection. If you know the language, always specify it explicitly with the language parameter.
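
The caching advice above can be sketched by keying transcripts on a hash of the audio bytes, so a renamed or re-uploaded copy of the same recording still hits the cache. The names cached_transcribe and file_sha256 and the .transcript_cache directory are assumptions for illustration; the transcribe argument is any function that takes a path and returns text, such as transcribe_audio from earlier in this chapter.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def file_sha256(path: str) -> str:
    """Hash file contents in 1 MB blocks so large recordings are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_transcribe(
    path: str,
    transcribe: Callable[[str], str],
    cache_dir: str = ".transcript_cache",
) -> str:
    """Return the stored transcript if this exact audio was transcribed before."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    entry = cache / f"{file_sha256(path)}.json"

    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["text"]

    text = transcribe(path)  # only reached on a cache miss
    entry.write_text(json.dumps({"text": text}), encoding="utf-8")
    return text
```

A call such as cached_transcribe("meeting.mp3", transcribe_audio) would pay for the API request once and read from disk afterwards. If you pass options that change the output (language, prompt), they would need to be part of the cache key as well.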

Practice Exercise

Build a voice-enabled assistant that:
  1. Accepts continuous voice input
  2. Transcribes in real-time with speaker detection
  3. Generates natural voice responses
  4. Supports multiple languages
  5. Provides meeting summaries and action items
Focus on:
  • Low-latency response times
  • Graceful handling of audio quality issues
  • Natural conversation flow
  • Persistent conversation context

Interview Deep-Dive

Strong Answer:
  • The latency target for a natural-feeling voice conversation is under 1 second total round-trip, ideally under 800ms. Anything above 1.5 seconds feels like talking to someone with a bad satellite connection — the user starts wondering if the system heard them. The budget breaks down into three stages, each with hard constraints.
  • Stage one: Speech-to-Text (STT). Whisper API typically returns in 200-500ms depending on audio length. For short utterances (under 10 seconds), expect 200-300ms. This is a network call with audio upload, so the user’s upload bandwidth matters. Optimization: use 16kHz mono audio (Whisper’s native format) to minimize upload size. Pre-process on the client: run silence detection while recording, and send the audio the moment the user stops speaking rather than waiting for a button press. The gap between the user finishing their sentence and the audio hitting the API is dead latency that feels like the system is not listening.
  • Stage two: LLM processing. This is your main variable. GPT-4o takes 500-1500ms for the first token depending on prompt complexity. GPT-4o-mini takes 200-500ms. For voice assistants, always use the fastest model that produces acceptable quality. Keep the system prompt short — every token in the prompt adds to time-to-first-token. Stream the LLM response and start TTS on the first complete sentence rather than waiting for the full response. This overlap is the single biggest latency optimization in the pipeline.
  • Stage three: Text-to-Speech (TTS). OpenAI’s tts-1 model is optimized for speed at ~200ms latency. The HD variant (tts-1-hd) takes ~400ms. For real-time conversation, always use tts-1. Stream the audio output to the client as chunks arrive rather than waiting for the full audio file.
  • The critical optimization is pipelining stages two and three. As the LLM generates text token by token, buffer until you have a complete sentence, then immediately start TTS on that sentence while the LLM continues generating the next sentence. The user hears the first sentence of the response while the rest is still being generated. This hides the LLM’s total generation time behind the playback time of the first sentence.
  • Total budget with pipelining: STT (250ms) + LLM first sentence (400ms) + TTS first sentence (200ms) = 850ms to first audio. The user starts hearing the response in under a second, and subsequent sentences arrive seamlessly because TTS for sentence N is happening while the LLM generates sentence N+1.
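
The sentence-buffering step that makes this pipelining possible can be sketched as a generator that consumes streamed tokens and emits each sentence as soon as it is complete. This is a simplified illustration: sentences_from_stream is a hypothetical helper, the token list stands in for a streamed LLM response, and the regex is a naive heuristic that would also split after abbreviations such as "Dr.".

```python
import re
from typing import Iterable, Iterator

# A sentence boundary is '.', '!' or '?' followed by whitespace
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush the remainder when the stream ends


tokens = ["Sure", "!", " The", " weather", " is", " sunny", ".", " Anything", " else", "?"]
for sentence in sentences_from_stream(tokens):
    # In a real assistant, each sentence would be handed to TTS here
    print(sentence)
```

Each yielded sentence can be sent to TTS immediately while the loop keeps consuming tokens, which is what lets playback of sentence N overlap with generation of sentence N+1.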
Follow-up: The user is on a slow 3G mobile connection. Upload bandwidth is limited. How does this change your architecture?
On a slow connection, the audio upload for STT becomes the dominant latency. A 5-second audio clip in MP3 at 128kbps is about 80KB, which takes 2-3 seconds to upload on 3G. Three mitigations. First, compress audio aggressively on the client: use Opus codec at 16kbps instead of MP3 at 128kbps, an 8x size reduction with acceptable quality for speech. Opus was designed for low-bandwidth voice. Second, implement on-device STT using a lightweight model like Whisper’s tiny or base variant compiled for mobile. This eliminates the upload entirely: transcription happens locally in 100-300ms for short utterances. Send only the text to the server. Third, for the response audio, use progressive playback: stream Opus audio chunks from the server so the user starts hearing the response before the full audio is downloaded. The combination of on-device STT and streamed TTS means the network only carries small text payloads for the LLM step and streamed audio chunks for the response, which is manageable even on 3G.
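
The size and upload-time figures in this answer follow from simple bitrate arithmetic, sketched below. The 256 kbps uplink is an assumed value chosen for illustration; real 3G upload rates vary widely, and the calculation ignores round trips and protocol overhead.

```python
def encoded_size_bytes(duration_s: float, bitrate_kbps: float) -> float:
    """Size of constant-bitrate audio: seconds * kilobits per second / 8 bits per byte."""
    return duration_s * bitrate_kbps * 1000 / 8


def upload_seconds(size_bytes: float, uplink_kbps: float) -> float:
    """Ideal transfer time for a payload over a link of the given speed."""
    return size_bytes * 8 / (uplink_kbps * 1000)


UPLINK_KBPS = 256  # assumption: a plausible 3G uplink, not a measured value

for codec, bitrate in [("MP3", 128), ("Opus", 16)]:
    size = encoded_size_bytes(5, bitrate)
    seconds = upload_seconds(size, UPLINK_KBPS)
    print(f"{codec} @ {bitrate} kbps: {size / 1000:.0f} KB, {seconds:.1f} s to upload")
```

Under that assumed uplink, the 5-second MP3 clip is 80KB and takes 2.5 seconds to send, while the Opus version is 10KB, which is the 8x reduction described above.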
Strong Answer:
  • Whisper does not natively support speaker diarization — it produces a single transcript without any speaker attribution. This is one of its biggest limitations for meeting and conversation transcription. There are three approaches, each with different accuracy and cost trade-offs.
  • Approach one: LLM-based post-processing. Take Whisper’s timestamped segment output and feed it to GPT-4o with a prompt asking it to identify speaker changes based on conversational patterns, topic shifts, and contextual clues (people referring to each other by name, role-based content like “as the designer, I think…”). This is cheap and requires no additional ML infrastructure, but accuracy is 60-75% at best. The LLM is guessing based on content, not audio characteristics. It works reasonably well for 2-3 speakers with distinct roles but degrades rapidly with more speakers or when speakers have similar roles.
  • Approach two: dedicated diarization model. Use pyannote.audio (open source) or a commercial service like AssemblyAI. These models analyze the audio signal itself — pitch, cadence, spectral features — to identify speaker boundaries. Accuracy is 85-95% for 2-4 speakers in clean audio. The workflow: run diarization to get speaker-labeled time segments, run Whisper separately to get text with timestamps, then align the two outputs by matching timestamps. The diarization model says “Speaker A from 0:00 to 0:15, Speaker B from 0:16 to 0:30,” and you map Whisper’s text segments into those speaker windows.
  • Approach three: use a service that combines both. AssemblyAI, Deepgram, and Google Cloud Speech-to-Text offer integrated transcription + diarization in a single API call. This is the easiest to implement and often the most accurate because the STT and diarization models share information. The trade-off is vendor lock-in and cost ($0.01-0.05 per minute of audio).
  • My recommendation for most teams: start with approach three (integrated service) for production, and use approach one (LLM post-processing) for prototype or low-stakes use cases. Approach two is for teams that need the accuracy of dedicated diarization but want to self-host due to data privacy requirements.
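
The alignment step in approach two can be sketched as assigning each transcript segment to the speaker turn it overlaps most. The Turn and TextSegment types and the sample data are assumptions for illustration; in practice the turns would come from a diarization model and the segments from Whisper's timestamped output.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One diarization result: who spoke between start and end (seconds)."""
    speaker: str
    start: float
    end: float


@dataclass
class TextSegment:
    """One transcript segment with its time span in seconds."""
    start: float
    end: float
    text: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: list[TextSegment], turns: list[Turn]) -> list[tuple[str, str]]:
    """Label each segment with the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg.start, seg.end, t.start, t.end),
            default=None,
        )
        if best is None or overlap(seg.start, seg.end, best.start, best.end) == 0:
            labeled.append(("unknown", seg.text))
        else:
            labeled.append((best.speaker, seg.text))
    return labeled


turns = [Turn("Speaker A", 0.0, 15.0), Turn("Speaker B", 16.0, 30.0)]
segments = [
    TextSegment(0.5, 7.0, "Let's review the roadmap."),
    TextSegment(14.0, 21.0, "Sounds good, I have two updates."),
    TextSegment(40.0, 42.0, "(inaudible)"),
]
for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

The second segment straddles both turns and goes to Speaker B because 5 of its 7 seconds fall inside B's window. Segments that span a speaker change are where word-level timestamps would give a cleaner split.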
Follow-up: The meeting has 8 participants and the diarization model keeps merging two speakers who have similar voices into one speaker ID. How do you handle this?
Speaker merging is the most common diarization failure mode, especially when speakers have similar vocal characteristics (same gender, similar age, similar accent). Three mitigations. First, if you have a participant list ahead of time, use speaker enrollment: collect short audio samples of each speaker (from previous meetings or an onboarding recording) and use them as reference profiles. pyannote.audio provides speaker embedding models that can be used for this: compare each segment against the enrolled profiles instead of clustering blindly. This dramatically improves accuracy for known speakers. Second, use channel separation if available: if participants are on separate audio channels (common in conference call recordings), diarization becomes trivial because each channel is one speaker. Check whether your conferencing platform can record a separate audio track per participant (support varies by platform and plan) rather than relying on a mixed-down mono recording. Third, post-processing correction: after diarization, use an LLM to review the transcript for inconsistencies. If “Speaker 3” is assigned dialogue about both marketing strategy and engineering architecture in alternating segments, the LLM can flag these as likely speaker merge errors. A human reviewer can then correct the 10-15 flagged segments rather than reviewing the entire transcript.
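
The enrollment idea can be sketched as a nearest-profile lookup using cosine similarity. Everything here is illustrative: identify_speaker is a hypothetical helper, the 3-dimensional vectors are toy values (real speaker embeddings come from an embedding model and have hundreds of dimensions), and the 0.7 threshold is an assumption that would need tuning on real data.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(
    segment_embedding: list[float],
    enrolled: dict[str, list[float]],
    threshold: float = 0.7,
) -> Optional[str]:
    """Return the most similar enrolled speaker, or None if nobody is close enough."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(segment_embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy profiles for two speakers with similar voices
enrolled = {
    "Alice": [0.9, 0.1, 0.0],
    "Bob": [0.8, 0.3, 0.1],
}

print(identify_speaker([0.82, 0.28, 0.09], enrolled))  # closest to Bob's profile
print(identify_speaker([0.0, 0.0, 1.0], enrolled))     # matches nobody
```

Returning None for low-similarity segments lets you route them to the LLM or human review step instead of forcing a guess between two similar voices.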
Strong Answer:
  • A 3-hour podcast in standard MP3 (128kbps) is approximately 170MB — well over the 25MB limit. The pipeline has four stages: preprocessing, chunked transcription, assembly, and analysis.
  • Stage one: preprocessing. Convert the audio to 16kHz mono MP3 at 64kbps using ffmpeg. Halving the bitrate roughly halves the file size, with negligible quality loss for speech transcription (Whisper processes at 16kHz internally anyway). The 3-hour file drops from about 170MB to about 86MB, which is still well over the limit, so chunking is required. Also apply basic audio cleanup: a high-pass filter at 200Hz to remove rumble, and volume normalization so that quiet sections are not missed.
  • Stage two: chunked transcription. Split the audio into 10-minute segments with 10 seconds of overlap. Ten-minute chunks at 64kbps are about 4.8MB each, well under the 25MB limit. The overlap is critical: without it, words at chunk boundaries get cut mid-syllable and Whisper produces garbled text at the edges. Use silence detection (ffmpeg’s silencedetect filter) to find natural pause points near the 10-minute mark rather than cutting at exact timestamps. This prevents splitting mid-word or mid-sentence. Transcribe each chunk using the Whisper API with response_format="verbose_json" to get timestamps.
  • Stage three: transcript assembly. Merge the chunk transcripts by aligning the overlapping regions. In the overlap zone, both chunks produced text for the same audio. Take the text from the first chunk’s overlap (which has better context from preceding content) and discard the second chunk’s overlap text. Concatenate the trimmed transcripts. Adjust timestamps to be relative to the full episode (chunk 2’s timestamps start at 10:00, not 0:00).
  • Stage four: analysis. With the full transcript assembled, use GPT-4o to generate: episode summary, key topics discussed, chapter markers with timestamps (critical for podcast listeners who want to skip to specific topics), notable quotes, and SEO keywords. For the chapter markers specifically, instruct the model to identify major topic transitions and assign a timestamp and title to each. A 3-hour podcast typically has 10-20 natural chapters.
  • Cost and latency: 18 chunks transcribed via Whisper at ~$0.006/minute = $1.08 total for transcription. One GPT-4o call for analysis with the full transcript (roughly 30K-50K tokens) costs about $0.50-1.00. Total pipeline cost: under $2.50 per episode. With parallel chunk transcription, the pipeline completes in 3-5 minutes.
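The chunk planning, overlap handling, and cost arithmetic from the stages above can be sketched in plain Python. This is a simplified illustration under stated assumptions: it cuts at fixed 10-minute marks rather than at detected silences, it drops any segment from a later chunk that starts inside the overlap zone, and the `Segment` type, the function names, and the price constant are hypothetical stand-ins rather than part of any library.

```python
from dataclasses import dataclass

WHISPER_USD_PER_MINUTE = 0.006  # price quoted in the text; verify current pricing


@dataclass
class Segment:
    start: float  # seconds, relative to the start of its chunk
    end: float
    text: str


def plan_chunks(total_s: float, chunk_s: float = 600.0, overlap_s: float = 10.0):
    """Return (start, end) pairs; each chunk runs overlap_s into the next one."""
    chunks = []
    start = 0.0
    while start < total_s:
        chunks.append((start, min(start + chunk_s + overlap_s, total_s)))
        start += chunk_s
    return chunks


def size_mb(duration_s: float, bitrate_kbps: int) -> float:
    """Approximate size of constant-bitrate audio in megabytes (10^6 bytes)."""
    return duration_s * bitrate_kbps * 1000 / 8 / 1_000_000


def transcription_cost(chunks) -> float:
    """Cost of transcribing every chunk, including the re-transcribed overlap."""
    billed_minutes = sum(end - start for start, end in chunks) / 60
    return billed_minutes * WHISPER_USD_PER_MINUTE


def merge_chunks(chunk_segments, chunks, overlap_s: float = 10.0):
    """Shift segments to episode time, keeping the earlier chunk's overlap text."""
    merged = []
    for index, ((chunk_start, _), segments) in enumerate(zip(chunks, chunk_segments)):
        for seg in segments:
            # For every chunk after the first, the first overlap_s seconds were
            # already covered by the previous chunk, so skip those segments.
            if index > 0 and seg.start < overlap_s:
                continue
            merged.append(
                Segment(seg.start + chunk_start, seg.end + chunk_start, seg.text)
            )
    return merged


chunks = plan_chunks(3 * 3600)
print(len(chunks), "chunks")
print(f"{size_mb(3 * 3600, 128):.1f} MB for the original 128kbps file")
print(f"{size_mb(610, 64):.2f} MB per 64kbps chunk including overlap")
print(f"${transcription_cost(chunks):.2f} estimated transcription cost")
```

The cost estimate comes out slightly above the per-episode figure quoted above because each overlap is transcribed twice. A production version would also need to handle a segment that straddles the overlap boundary, for example by comparing the text of the two chunks in that region.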
Follow-up: The host switches between English and Spanish throughout the podcast. How does this affect your pipeline?
Mixed-language audio is a known challenge for Whisper. If you do not specify the language parameter, Whisper auto-detects the language from the first 30 seconds of audio. If the host starts in English, Whisper will attempt to transcribe the entire chunk in English, producing garbage during the Spanish sections. Three strategies. First, if the language switches happen at predictable boundaries (alternating segments), detect the switch points using a language identification model, split there, and transcribe each segment with the correct language parameter. Second, if the switching is fluid (code-switching mid-sentence, which is common in bilingual speakers), use Whisper’s translation mode instead of transcription to get everything in English: set task="translate" when running open-source Whisper, or call the translations endpoint when using the OpenAI API. You lose the Spanish text, but you get a coherent single-language transcript. Third, use a multilingual-aware transcription service like AssemblyAI that handles code-switching natively. Whisper’s large-v3 model (self-hosted) handles mixed language better than the API’s whisper-1, so if quality matters, consider running Whisper locally with the large model, which tends to cope better with language switches within a single audio segment.
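For the first strategy, the output of a language identification model usually needs to be collapsed into contiguous runs before splitting the audio. The sketch below assumes a hypothetical upstream model that labels fixed windows as `(start, end, language)` tuples; `group_by_language` is an illustrative helper, not a library function.

```python
def group_by_language(
    windows: list[tuple[float, float, str]],
) -> list[tuple[float, float, str]]:
    """Merge consecutive (start, end, language) windows that share a language."""
    segments: list[tuple[float, float, str]] = []
    for start, end, lang in windows:
        if segments and segments[-1][2] == lang:
            # Same language as the previous run: extend that run's end time.
            prev_start = segments[-1][0]
            segments[-1] = (prev_start, end, lang)
        else:
            segments.append((start, end, lang))
    return segments


# Hypothetical per-window labels from a language identification model.
windows = [
    (0.0, 30.0, "en"),
    (30.0, 60.0, "en"),
    (60.0, 90.0, "es"),
    (90.0, 120.0, "en"),
]
for start, end, lang in group_by_language(windows):
    print(f"{start:.0f}s-{end:.0f}s -> transcribe with language={lang!r}")
```

Each resulting run can then be cut out of the audio and sent for transcription with the matching `language` value, so that auto-detection on the opening seconds no longer decides the language for the whole chunk.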