December 2025 Update: Comprehensive guide to multimodal AI including GPT-4 Vision, audio processing, and real-time voice with OpenAI’s Realtime API.

What is Multimodal AI?

Think of how humans understand the world: you don’t just read text. You see images, hear sounds, watch videos, and combine all of those signals to form understanding. A doctor doesn’t just read a patient’s chart; they look at X-rays, listen to heart sounds, and observe body language. Multimodal AI works the same way: it processes, reasons across, and generates multiple types of content, which unlocks capabilities that text-only models cannot match.
Text ──────┐
Image ─────┤
Audio ─────┼───▶ Multimodal LLM ───▶ Text/Image/Audio Output
Video ─────┘
| Capability | Use Cases | Models | Typical Latency | Cost Range |
| --- | --- | --- | --- | --- |
| Vision | Image analysis, OCR, diagram understanding | GPT-4o, Claude 3.5, Gemini | 1-5s per image | $0.01-0.04 per image |
| Audio | Transcription, TTS, voice understanding | Whisper, TTS, Realtime API | 0.5-2s per minute of audio | $0.006/min (Whisper) |
| Video | Scene analysis, content moderation | Gemini 1.5 Pro | 5-30s per minute of video | $0.05-0.20 per minute |
| Real-time Voice | Voice assistants, phone agents | Realtime API, Gemini Live | Sub-500ms round-trip | $0.06/min (audio input) |

Choosing the Right Modality Approach

| Scenario | Recommended Approach | Why |
| --- | --- | --- |
| One-off image analysis | Direct base64 or URL to GPT-4o | Simplest path, high quality |
| Batch image processing (100+ images) | Queue-based with detail: "low" | 85% cheaper, sufficient for classification |
| Real-time conversation | Realtime API over WebSocket | Sub-500ms latency, natural feel |
| Transcribe then respond | Whisper + Chat Completions + TTS | Cheaper, more control over each step |
| Document understanding with charts | Multimodal RAG with high detail | Charts carry information text alone cannot convey |
| Accessibility alt-text generation | GPT-4o Vision with specific prompt | Vision models excel at descriptive captions |

Vision: Image Understanding

Analyzing Images with GPT-4o

from openai import OpenAI
import base64
from pathlib import Path

client = OpenAI()

def encode_image(image_path: str) -> str:
    """Encode image to base64.
    
    Why base64? The API expects images as text data, not binary files.
    Base64 converts binary image bytes into a text string that can be
    embedded directly in the JSON request payload.
    """
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def analyze_image(
    image_path: str,
    prompt: str = "Describe this image in detail."
) -> str:
    """Analyze an image with GPT-4o Vision"""
    
    base64_image = encode_image(image_path)
    
    # Determine MIME type
    suffix = Path(image_path).suffix.lower()
    mime_types = {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".gif": "image/gif",
        ".webp": "image/webp"
    }
    mime_type = mime_types.get(suffix, "image/jpeg")
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{mime_type};base64,{base64_image}",
                            # "high" = more tokens, better accuracy for dense images
                            # "low" = fewer tokens, fine for simple images, ~85% cheaper
                            # "auto" = let the model decide based on image size
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    
    return response.choices[0].message.content

# Usage
result = analyze_image(
    "screenshot.png",
    "Extract all text from this screenshot and format it as markdown."
)
print(result)

Vision Detail Levels: When to Use What

The detail parameter is the single biggest cost lever for vision tasks. Most teams default to "high" and overpay by 5-10x.
| Detail Level | Token Cost | Best For | Avoid When |
| --- | --- | --- | --- |
| "low" | ~85 tokens fixed | Simple classification, presence detection, general scene description | Image contains fine text, dense tables, or small UI elements |
| "high" | 85-1,700+ tokens (scales with resolution) | OCR, chart reading, UI analysis, medical imaging | Batch processing thousands of images, simple yes/no questions |
| "auto" | Varies | When you cannot predict image complexity | You need predictable costs per request |

Edge case: Images larger than 2048x2048 are automatically resized before processing. If your source images are high-resolution scans or photographs, downsample them yourself to control quality and avoid unexpected token costs.
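
To make the cost difference between detail levels concrete, here is a minimal sketch of a per-image token estimator. It assumes the tile-based calculation OpenAI has published for GPT-4o-class models (fit inside 2048x2048, shrink the shortest side to 768px, then 170 tokens per 512px tile plus an 85-token base). The function name and the "never upscale" behavior are assumptions for illustration, and other models may count tokens differently, so treat the result as an estimate rather than a billing figure.

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate vision input tokens for one image (assumed tile-based formula)."""
    if detail == "low":
        return 85  # flat cost regardless of resolution

    # Step 1: shrink to fit inside a 2048x2048 square (assumption: never upscale).
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 2: shrink so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 3: 170 tokens per 512px tile, plus an 85-token base.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

for size in [(1024, 1024), (2048, 4096)]:
    print(size, estimate_image_tokens(*size, detail="high"), estimate_image_tokens(*size, detail="low"))
```

Running an estimate like this over a sample of your real images is a cheap way to decide whether "high" detail is worth the extra tokens before committing to a batch job.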

Analyzing Images from URLs

def analyze_image_url(url: str, prompt: str) -> str:
    """Analyze an image from URL"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": url}
                    }
                ]
            }
        ]
    )
    return response.choices[0].message.content

# Analyze multiple images
def compare_images(image_urls: list[str], comparison_prompt: str) -> str:
    """Compare multiple images"""
    content = [{"type": "text", "text": comparison_prompt}]
    
    for url in image_urls:
        content.append({
            "type": "image_url",
            "image_url": {"url": url}
        })
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}]
    )
    return response.choices[0].message.content
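
The URL helpers above leave `detail` unset, so the API falls back to its default. For batch or comparison workloads it can help to set the detail level explicitly on every image. The sketch below only builds the `content` list; `build_vision_content` is a hypothetical helper name, not part of the OpenAI SDK.

```python
def build_vision_content(
    prompt: str,
    image_urls: list[str],
    detail: str = "low",
) -> list[dict]:
    """Build a chat `content` list with an explicit detail level per image."""
    if detail not in {"low", "high", "auto"}:
        raise ValueError(f"detail must be 'low', 'high', or 'auto', got {detail!r}")

    content: list[dict] = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({
            "type": "image_url",
            "image_url": {"url": url, "detail": detail},
        })
    return content

content = build_vision_content(
    "Which of these two screenshots shows an error dialog?",
    ["https://example.com/a.png", "https://example.com/b.png"],  # placeholder URLs
)
print(len(content))
```

The returned list can be passed as `messages=[{"role": "user", "content": content}]` in the same `client.chat.completions.create` call used above.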

Vision Use Cases

class VisionAssistant:
    """Specialized vision analysis tasks"""
    
    def __init__(self):
        self.client = OpenAI()
    
    def extract_text_ocr(self, image_path: str) -> str:
        """OCR: Extract text from image"""
        return analyze_image(
            image_path,
            """Extract ALL text from this image exactly as it appears.
            Maintain formatting, structure, and layout.
            If it's a table, format as markdown table.
            If it's code, format as code block with language."""
        )
    
    def analyze_chart(self, image_path: str) -> dict:
        """Analyze a chart or graph"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": """Analyze this chart/graph and return JSON:
{
    "chart_type": "bar|line|pie|scatter|etc",
    "title": "chart title if visible",
    "x_axis": "x-axis label",
    "y_axis": "y-axis label", 
    "key_insights": ["insight1", "insight2"],
    "data_points": [{"label": "...", "value": "..."}]
}"""
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/png;base64,{encode_image(image_path)}"
                            }
                        }
                    ]
                }
            ],
            response_format={"type": "json_object"}
        )
        import json
        return json.loads(response.choices[0].message.content)
    
    def describe_for_accessibility(self, image_path: str) -> str:
        """Generate alt text for accessibility"""
        return analyze_image(
            image_path,
            """Generate comprehensive alt text for this image for screen readers.
            Include: main subject, context, important details, text if any.
            Keep it under 150 words but be descriptive."""
        )
    
    def analyze_ui_screenshot(self, image_path: str) -> dict:
        """Analyze UI/UX of a screenshot"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": """Analyze this UI screenshot and return JSON:
{
    "page_type": "login|dashboard|form|etc",
    "ui_elements": ["button", "form", "nav"],
    "accessibility_issues": [],
    "ux_suggestions": [],
    "detected_text": []
}"""
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/png;base64,{encode_image(image_path)}"
                            }
                        }
                    ]
                }
            ],
            response_format={"type": "json_object"}
        )
        import json
        return json.loads(response.choices[0].message.content)

Audio: Speech and Sound

Audio processing with LLMs is like having a universal translator that also takes dictation. Whisper handles speech-to-text (including across languages), TTS converts text back to natural-sounding speech, and the two together form a complete audio pipeline. The key insight: Whisper is not just transcription — it understands context, handles accents, and can even translate between languages in a single step.

Audio Pipeline Decision Framework

Before writing code, decide which pipeline architecture you need:
| Architecture | Latency | Cost | Quality | When to Use |
| --- | --- | --- | --- | --- |
| Whisper -> Chat -> TTS (sequential) | 3-8s total | Lowest | Full control over each step | Async processing, batch jobs, cost-sensitive |
| Realtime API (streaming) | Sub-500ms | Highest ($0.06/min in + $0.24/min out) | Most natural conversation | Voice assistants, phone agents, live customer support |
| Whisper -> Chat (text response) | 2-5s | Low | N/A (no audio output) | Meeting transcription with Q&A, voice search |
| Chat -> TTS (audio output only) | 1-3s | Medium | Good for narration | Podcast generation, audiobook creation, notifications |

Speech-to-Text with Whisper

def transcribe_audio(audio_path: str, language: str | None = None) -> dict:
    """Transcribe audio to text using Whisper.
    
    Practical tip: Whisper supports files up to 25 MB. For longer recordings,
    split them into chunks using pydub or ffmpeg first. Specifying the language
    explicitly improves accuracy. Without it, Whisper auto-detects but may
    struggle with short clips or heavily accented speech.
    """
    # Only pass `language` when the caller supplied one; omitting the
    # parameter lets Whisper auto-detect.
    extra = {"language": language} if language else {}
    with open(audio_path, "rb") as f:
        response = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",  # Get timestamps (vs. plain "text")
            timestamp_granularities=["word", "segment"],  # Word-level for subtitles
            **extra,  # Optional: "en", "es", etc. Improves accuracy
        )
    
    return {
        "text": response.text,
        "language": response.language,
        "segments": response.segments,
        "words": response.words
    }

# Transcribe with timestamps
result = transcribe_audio("meeting.mp3")
print(f"Transcription: {result['text'][:500]}...")

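# Print with timestamps. Recent openai-python releases return segments as
# typed objects, so use attribute access rather than dict-style indexing.
for segment in result["segments"]:
    print(f"[{segment.start:.2f}s] {segment.text}")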
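
Since the timestamps are mainly useful for subtitles, here is a small sketch that turns segment timings into SRT text. It takes plain `(start, end, text)` tuples so it does not depend on the SDK's response types; both function names are hypothetical helpers. Whisper can also return SRT directly via `response_format="srt"`, so a formatter like this is mostly useful when you want to edit or merge segments first.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds to the HH:MM:SS,mmm form used by SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) tuples as numbered SRT cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}\n"
            f"{text.strip()}"
        )
    return "\n\n".join(blocks) + "\n"

print(segments_to_srt([(0.0, 1.5, " Hello everyone."), (1.5, 4.25, " Let's get started.")]))
```

With the transcription result above, the input could be built as `[(s.start, s.end, s.text) for s in result["segments"]]`, assuming the segments expose those attributes.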

Text-to-Speech

from pathlib import Path

def text_to_speech(
    text: str,
    output_path: str,
    voice: str = "alloy",  # alloy, echo, fable, onyx, nova, shimmer
    model: str = "tts-1-hd"  # tts-1 (fast, cheaper) or tts-1-hd (higher quality)
) -> str:
    """Convert text to speech.
    
    Practical tip: tts-1 has noticeable artifacts in quiet passages but is 
    2x faster. Use tts-1-hd for customer-facing audio and tts-1 for internal
    tools or previews. Voice choice matters too -- "nova" and "shimmer" tend
    to sound more natural for conversational content.
    """
    # Stream the audio straight to disk instead of buffering it in memory
    with client.audio.speech.with_streaming_response.create(
        model=model,
        voice=voice,
        input=text
    ) as response:
        response.stream_to_file(output_path)
    return output_path

# Generate speech
text_to_speech(
    "Welcome to our AI-powered assistant. How can I help you today?",
    "welcome.mp3",
    voice="nova"
)

# Generate with different voices for comparison
voices = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]
for voice in voices:
    text_to_speech(
        "Hello, this is a voice sample.",
        f"sample_{voice}.mp3",
        voice=voice
    )

Audio Translation

def translate_audio(audio_path: str) -> str:
    """Translate audio from any language to English"""
    with open(audio_path, "rb") as f:
        response = client.audio.translations.create(
            model="whisper-1",
            file=f
        )
    return response.text

# Translate Spanish audio to English text
english_text = translate_audio("spanish_meeting.mp3")

Real-Time Voice

OpenAI Realtime API

The Realtime API is fundamentally different from the transcribe-then-generate-then-speak pipeline. Instead of three sequential steps (each adding latency), it processes audio input and produces audio output over a single streaming connection, like a phone call rather than a walkie-talkie. This cuts round-trip latency from 3-5 seconds down to under 500ms, which is the threshold where conversations feel natural. A minimal client for real-time voice conversations:
import asyncio
import os
import websockets
import json
import base64
import pyaudio

REALTIME_URL = "wss://api.openai.com/v1/realtime"

class RealtimeVoiceAgent:
    """Real-time voice conversation agent"""
    
    def __init__(self, api_key: str, model: str = "gpt-4o-realtime-preview"):
        self.api_key = api_key
        self.model = model
        self.ws = None
        self.audio = pyaudio.PyAudio()
    
    async def connect(self):
        """Connect to Realtime API"""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "OpenAI-Beta": "realtime=v1"
        }
        
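        # Note: `extra_headers` is the keyword used by older websockets
        # releases; websockets 14+ expects `additional_headers` instead.
        self.ws = await websockets.connect(
            f"{REALTIME_URL}?model={self.model}",
            extra_headers=headers
        )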
        
        # Configure session
        await self.ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a helpful voice assistant. Be concise and conversational.",
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {
                    "type": "server_vad",  # Voice activity detection
                    "threshold": 0.5,
                    "silence_duration_ms": 500
                }
            }
        }))
    
    async def send_audio(self, audio_data: bytes):
        """Send audio chunk to API"""
        await self.ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(audio_data).decode()
        }))
    
    async def receive_messages(self):
        """Receive and handle messages"""
        async for message in self.ws:
            event = json.loads(message)
            await self.handle_event(event)
    
    async def handle_event(self, event: dict):
        """Handle different event types"""
        event_type = event.get("type")
        
        if event_type == "response.audio.delta":
            # Play audio chunk
            audio_data = base64.b64decode(event["delta"])
            self.play_audio(audio_data)
        
        elif event_type == "response.text.delta":
            # Print text response
            print(event["delta"], end="", flush=True)
        
        elif event_type == "response.done":
            print("\n[Response complete]")
        
        elif event_type == "error":
            print(f"Error: {event['error']}")
    
    def play_audio(self, audio_data: bytes):
        """Play audio through speakers"""
        stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=24000,
            output=True
        )
        stream.write(audio_data)
        stream.close()
    
    async def start_conversation(self):
        """Start a voice conversation"""
        await self.connect()
        
        # Start receiving in background
        receive_task = asyncio.create_task(self.receive_messages())
        
        # Capture and send microphone audio
        stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=24000,
            input=True,
            frames_per_buffer=1024
        )
        
        try:
            while True:
                audio_data = stream.read(1024)
                await self.send_audio(audio_data)
                await asyncio.sleep(0.01)
        finally:
            stream.close()
            receive_task.cancel()
            await self.ws.close()

# Usage
async def main():
    agent = RealtimeVoiceAgent(api_key=os.getenv("OPENAI_API_KEY"))
    await agent.start_conversation()

asyncio.run(main())

Realtime API vs. Sequential Pipeline: Edge Cases

When Realtime API breaks down:
  • Long tool calls: If your function takes more than 2-3 seconds (e.g., database queries, external API calls), the silence feels unnatural. The sequential pipeline handles this better because you can play a “thinking” audio clip while processing.
  • Multi-language in one session: Realtime API voice is configured per session. If a user switches languages mid-conversation, you need to update the session — which can cause a brief disruption.
  • Noisy environments: Server-side VAD (voice activity detection) may trigger on background noise, causing the model to respond to non-speech. In noisy use cases, use client-side VAD with a higher threshold and send input_audio_buffer.commit manually.
  • Cost at scale: A 10-minute conversation costs roughly $3.00 with the Realtime API vs. $0.30 with the sequential pipeline. At 10,000 conversations/day, that is $30K/day vs. $3K/day.
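
For the noisy-environment case, a client-side gate can be as simple as an energy threshold on each microphone chunk. The sketch below is one assumed approach (RMS energy on 16-bit mono PCM), not the Realtime API's own VAD; the function names and the 0.02 threshold are hypothetical and need tuning per device and environment.

```python
import array
import math
import sys

def rms_level(pcm16: bytes) -> float:
    """RMS amplitude of little-endian 16-bit mono PCM, normalized to 0.0-1.0."""
    samples = array.array("h")
    samples.frombytes(pcm16[: len(pcm16) - len(pcm16) % 2])  # drop a dangling byte
    if sys.byteorder == "big":
        samples.byteswap()  # array("h") is native-endian; PCM16 here is little-endian
    if not samples:
        return 0.0
    mean_square = sum(s * s for s in samples) / len(samples)
    return math.sqrt(mean_square) / 32768.0

def is_speech(pcm16: bytes, threshold: float = 0.02) -> bool:
    """Treat a chunk as speech when its RMS level reaches the threshold."""
    return rms_level(pcm16) >= threshold

silence = bytes(2048)
print(is_speech(silence))
```

In the capture loop, chunks that fail `is_speech` would simply not be sent, and the client would send `input_audio_buffer.commit` after a run of quiet chunks. This presumes server-side turn detection has been disabled in the session configuration.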

Voice Agent with Function Calling

class RealtimeVoiceAgentWithTools(RealtimeVoiceAgent):
    """Voice agent with function calling capabilities"""
    
    def __init__(self, api_key: str):
        super().__init__(api_key)
        self.tools = {
            "get_weather": self.get_weather,
            "set_reminder": self.set_reminder,
            "search_web": self.search_web
        }
    
    async def connect(self):
        await super().connect()
        
        # Add tools to session
        await self.ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get current weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            },
                            "required": ["location"]
                        }
                    },
                    {
                        "type": "function",
                        "name": "set_reminder",
                        "description": "Set a reminder",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "message": {"type": "string"},
                                "time": {"type": "string"}
                            },
                            "required": ["message", "time"]
                        }
                    }
                ]
            }
        }))
    
    async def handle_event(self, event: dict):
        event_type = event.get("type")
        
        if event_type == "response.function_call_arguments.done":
            # Execute function
            func_name = event["name"]
            args = json.loads(event["arguments"])
            
            if func_name in self.tools:
                result = await self.tools[func_name](**args)
                
                # Send result back
                await self.ws.send(json.dumps({
                    "type": "conversation.item.create",
                    "item": {
                        "type": "function_call_output",
                        "call_id": event["call_id"],
                        "output": json.dumps(result)
                    }
                }))
                
                # Continue response
                await self.ws.send(json.dumps({
                    "type": "response.create"
                }))
        else:
            await super().handle_event(event)
    
    async def get_weather(self, location: str) -> dict:
        return {"location": location, "temp": "72°F", "condition": "Sunny"}
    
    async def set_reminder(self, message: str, time: str) -> dict:
        return {"status": "set", "message": message, "time": time}
    
    async def search_web(self, query: str) -> dict:
        return {"results": f"Search results for: {query}"}

Image Generation

Image generation is the “output” side of multimodal AI. DALL-E 3 is significantly better than DALL-E 2 at following detailed prompts and rendering text, but it costs more and only generates one image at a time. A practical workflow: use DALL-E 3 for final images and DALL-E 2 for quick variations and experiments.

Image Generation Model Comparison

| Feature | DALL-E 3 | DALL-E 2 | Stable Diffusion (self-hosted) |
| --- | --- | --- | --- |
| Prompt adherence | Excellent: follows detailed prompts | Moderate: misses details | Good with proper prompting |
| Text rendering | Good (readable text in images) | Poor | Poor without ControlNet |
| Cost per image | $0.04-0.12 | $0.02 | Hardware cost only (~$0.001) |
| Speed | 5-15s | 3-8s | 2-10s (GPU dependent) |
| Max images per request | 1 | 10 | Unlimited |
| Editing/inpainting | Not supported | Supported | Supported |
| Variations | Not supported | Supported | Supported |
| Style control | vivid or natural | None | Full control via models/LoRA |
| Self-hostable | No | No | Yes |
| Best for | Final production images | Rapid prototyping, variations | High-volume, custom styles |

DALL-E 3 Integration

def generate_image(
    prompt: str,
    size: str = "1024x1024",  # 1024x1024, 1792x1024, 1024x1792
    quality: str = "standard",  # standard or hd
    style: str = "vivid"  # vivid or natural
) -> str:
    """Generate an image with DALL-E 3"""
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        quality=quality,
        style=style,
        n=1
    )
    
    return response.data[0].url

# Generate image
url = generate_image(
    "A futuristic city with flying cars and neon lights, cyberpunk style",
    size="1792x1024",
    quality="hd"
)
print(f"Image URL: {url}")

# Generate and save
import requests

def generate_and_save(prompt: str, output_path: str, **kwargs) -> str:
    url = generate_image(prompt, **kwargs)
    
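    # Fail loudly on HTTP errors instead of writing an error page to disk
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(output_path, "wb") as f:
        f.write(response.content)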
    
    return output_path

Image Editing

def edit_image(
    image_path: str,
    mask_path: str,
    prompt: str,
    size: str = "1024x1024"
) -> str:
    """Edit an image using DALL-E 2"""
    with open(image_path, "rb") as img, open(mask_path, "rb") as mask:
        response = client.images.edit(
            model="dall-e-2",
            image=img,
            mask=mask,
            prompt=prompt,
            size=size,
            n=1
        )
    
    return response.data[0].url

def create_variations(image_path: str, n: int = 4) -> list[str]:
    """Create variations of an image"""
    with open(image_path, "rb") as f:
        response = client.images.create_variation(
            model="dall-e-2",
            image=f,
            n=n,
            size="1024x1024"
        )
    
    return [img.url for img in response.data]

Multimodal RAG

Traditional RAG only handles text, which means it is blind to charts, diagrams, screenshots, and images embedded in documents. Multimodal RAG fixes this by extracting both text and images from documents, then feeding both to a vision-capable model at query time. This is particularly powerful for technical documentation, financial reports with charts, and scientific papers with figures: any domain where the visual content carries information that the text alone cannot convey. A minimal sketch that combines vision with RAG for document understanding:
from dataclasses import dataclass

@dataclass
class MultimodalDocument:
    text: str
    images: list[str]  # Base64 encoded images
    metadata: dict

class MultimodalRAG:
    """RAG system that handles text and images"""
    
    def __init__(self):
        self.documents: list[MultimodalDocument] = []
    
    def add_pdf_with_images(self, pdf_path: str):
        """Extract text and images from PDF"""
        import fitz  # PyMuPDF
        
        doc = fitz.open(pdf_path)
        
        for page in doc:
            text = page.get_text()
            images = []
            
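            for img in page.get_images():
                xref = img[0]
                pix = fitz.Pixmap(doc, xref)
                # PNG cannot store CMYK, so convert such pixmaps to RGB first
                if pix.n - pix.alpha >= 4:
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                # tobytes() defaults to PNG, matching the data URL used in query()
                images.append(base64.b64encode(pix.tobytes()).decode())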
            
            self.documents.append(MultimodalDocument(
                text=text,
                images=images,
                metadata={"page": page.number, "source": pdf_path}
            ))
    
    def query(self, question: str) -> str:
        """Query with multimodal context.
        
        Pitfall: Each image consumes significant tokens (a high-detail image
        can use 1000+ tokens). Limit images aggressively or you will hit
        context limits and cost will spike. A good rule of thumb: 2-3 images
        per query max, and use "low" detail unless the image has fine text.
        """
        # Build multimodal context
        content = [
            {"type": "text", "text": f"Question: {question}\n\nContext from documents:"}
        ]
        
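        # Add text and images from the stored documents.
        # Note: taking the first 5 documents is a stand-in for real retrieval
        # (e.g. embedding similarity). Limit to 5 docs and 2 images each to
        # stay within the context window.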
        for doc in self.documents[:5]:
            content.append({
                "type": "text",
                "text": doc.text[:1000]
            })
            
            for img in doc.images[:2]:
                content.append({
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{img}"}
                })
        
        content.append({
            "type": "text",
            "text": "Based on the above context (text and images), answer the question."
        })
        
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": content}],
            max_tokens=1000
        )
        
        return response.choices[0].message.content

Multimodal RAG Edge Cases

  • Images without text context: If a PDF page is mostly a diagram with no surrounding text, your text-based retrieval will never find it. Solution: generate a text description of each image at ingestion time using GPT-4o Vision and store that description alongside the image for retrieval.
  • Token budget explosion: A single high-detail image consumes 1,000+ tokens. A 10-page document with 3 images per page = 30,000+ tokens just for images, before any text. Set strict limits: maximum 3 images per query, use detail: "low" for initial retrieval, and only switch to detail: "high" for the final answer if the image is relevant.
  • Screenshots of code: Vision models can read code from screenshots, but they make subtle errors (confusing 0 and O, missing indentation). For code-heavy documents, prefer OCR followed by syntax-aware cleanup over direct vision analysis.
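
The token-budget limits can be enforced in code before the request is built. The following is a minimal sketch under stated assumptions: the per-image costs are the rough figures quoted in this section (85 tokens for low detail, about 1,000 for high detail), and `CandidateImage`, `select_images`, and the 2,500-token default budget are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

LOW_DETAIL_TOKENS = 85      # fixed low-detail cost quoted in this guide
HIGH_DETAIL_TOKENS = 1000   # rough high-detail figure quoted in this guide

@dataclass
class CandidateImage:
    image_id: str
    relevance: float         # retrieval score, higher is better
    needs_fine_detail: bool  # e.g. charts, tables, small text

def select_images(
    candidates: list[CandidateImage],
    max_images: int = 3,
    token_budget: int = 2500,
) -> list[tuple[str, str]]:
    """Pick the most relevant images and a detail level for each, within budget."""
    chosen: list[tuple[str, str]] = []
    spent = 0
    for image in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        if len(chosen) == max_images:
            break
        detail = "high" if image.needs_fine_detail else "low"
        cost = HIGH_DETAIL_TOKENS if detail == "high" else LOW_DETAIL_TOKENS
        if spent + cost > token_budget:
            # Fall back to low detail rather than dropping the image entirely.
            detail, cost = "low", LOW_DETAIL_TOKENS
            if spent + cost > token_budget:
                break
        chosen.append((image.image_id, detail))
        spent += cost
    return chosen

# The unbounded case from the text: 10 pages x 3 images at high detail.
print(10 * 3 * HIGH_DETAIL_TOKENS)
```

The selected `(image_id, detail)` pairs map directly onto the `detail` field of each `image_url` entry, so the budget decision and the request payload stay in sync.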

Key Takeaways

Vision is Powerful

GPT-4o can analyze images, charts, screenshots, and documents

Audio is Easy

Whisper + TTS = complete audio pipeline in a few lines

Real-time is Here

Build voice assistants with the Realtime API

Combine Modalities

Multimodal RAG unlocks powerful document understanding

What’s Next

DSPy Framework

Learn declarative AI programming with Stanford’s DSPy framework

Interview Deep-Dive

Strong Answer:
  • The core challenge is that traditional text-only RAG is blind to visual content, and a significant portion of information in real-world documents lives in charts, diagrams, tables, and images. A financial report where the key trend is only visible in a line chart, or a technical manual where the architecture diagram is the most information-dense element on the page — text-only RAG misses all of this.
  • My architecture has four stages. Stage one is multimodal extraction: I use a library like PyMuPDF or pdf2image to extract both the raw text and the images from each page. For tables, I extract them as images rather than trying to parse the table structure from text, because vision models are now better at reading tables from screenshots than most table-parsing libraries are at reconstructing them from PDF internals.
  • Stage two is dual-track embedding. The text chunks get embedded with a text embedding model (text-embedding-3-small). The images get described by a vision model — I send each image to GPT-4o with a prompt like “Describe this chart/diagram in detail, including all data points, labels, trends, and key takeaways” and embed the resulting description. This way, both text and visual content live in the same vector space and can be retrieved by the same query.
  • Stage three is retrieval with multimodal context. When a user asks a question, I retrieve the top-k most relevant chunks (both text and image descriptions). For any retrieved image description, I also include the original image as a base64-encoded visual in the generation prompt. This is critical — the description alone loses nuance, but the description helps with retrieval while the original image provides full fidelity during generation.
  • Stage four is generation with explicit grounding. The prompt instructs the model to answer based on both the text context and the visual content, and to cite which specific document, page, or figure supports each claim. This grounding instruction reduces hallucination and gives users a way to verify the answer.
  • The main cost consideration is that each image in the generation prompt consumes 85-1,700 tokens depending on the detail level. I limit to 2-3 images per query and use detail: "low" (85 tokens fixed) for simple images like logos or icons, reserving detail: "high" for charts and tables where fine detail matters.
Follow-up: How do you handle the situation where the answer requires reasoning across a text passage on page 3 and a chart on page 7?

This is the cross-reference problem, and it is genuinely hard. Single-pass retrieval will retrieve either the text or the chart but rarely both, because the query may be semantically close to one but not the other.

My approach is two-pass retrieval. The first pass retrieves based on the original query. The second pass takes the content retrieved in pass one and generates a follow-up retrieval query designed to find complementary information. If pass one found the text passage about “Q3 revenue growth of 15%,” pass two searches for “Q3 revenue chart” or “revenue trend visualization.” This iterative retrieval significantly improves recall for cross-reference questions.

The alternative approach is to store page-level context that links nearby text and images: when you retrieve a text chunk from page 3, you also pull any images from pages 2-4 as potentially relevant context. This is cheaper (no second retrieval pass) but less precise.
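
The cheaper page-neighborhood alternative is simple enough to sketch directly. The code below assumes each stored image records its page number; `PageImage` and `images_near_hits` are hypothetical names. It also shows why this approach is less precise: with a one-page window, a hit on page 3 never reaches a chart on page 7.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PageImage:
    page: int
    image_id: str

def images_near_hits(
    hit_pages: list[int],
    images: list[PageImage],
    window: int = 1,
) -> list[PageImage]:
    """Return images within `window` pages of any retrieved text chunk."""
    nearby = []
    for image in images:
        if any(abs(image.page - page) <= window for page in hit_pages):
            nearby.append(image)
    return nearby

images = [PageImage(2, "fig-a"), PageImage(4, "fig-b"), PageImage(7, "revenue-chart")]
print([img.image_id for img in images_near_hits([3], images)])
```

Widening the window recovers the distant chart at the price of pulling in more unrelated images, which is the precision trade-off described above.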
Strong Answer:
  • The traditional pipeline has three sequential steps: Whisper transcribes speech to text (200-500ms), the LLM generates a response from that text (500-2000ms), and TTS converts the response to speech (200-500ms). Total round-trip is 900-3000ms minimum. Each step introduces latency and each is a separate API call with its own failure surface. The advantages are: complete control over each step (you can modify the transcript before sending to the LLM, you can post-process the text before TTS), ability to use different providers for each component (Whisper for STT, Claude for reasoning, ElevenLabs for high-quality voice), and lower cost because you only pay for the compute you use.
  • The Realtime API collapses all three steps into a single WebSocket connection. Audio goes in, audio comes out, with sub-500ms round-trip latency. This latency difference is transformative — 500ms feels like a natural conversational pause, while 2000ms feels like talking to someone with a bad phone connection. The Realtime API also maintains conversation state natively, handles voice activity detection (knowing when the user has stopped speaking), and supports interruption (the user can speak while the model is still responding, and it will stop and process the new input).
  • The trade-offs: the Realtime API is more expensive, at roughly $0.06 per minute for audio input plus $0.24 per minute for audio output at current pricing. A 10-minute conversation costs about $3. The same conversation through the pipeline approach might cost $0.50-0.80. You also lose visibility into the intermediate steps: you cannot inspect the transcript or the text response, which makes debugging harder. And you are locked into OpenAI’s ecosystem; the pipeline approach lets you swap any component.
  • My recommendation is: use the Realtime API for interactive voice experiences where latency is the primary UX differentiator — phone agents, voice assistants, real-time translation. Use the pipeline approach for batch processing (transcribing meetings, generating podcast summaries), for applications where you need intermediate text for logging or compliance, and for applications where cost sensitivity outweighs latency requirements.
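The arithmetic behind this comparison is easy to reproduce. The sketch below uses only the figures quoted above; the function names are illustrative, and the $3 figure assumes both input and output audio are billed for the full 10 minutes, which is an upper bound for a real conversation.

```python
# Per-stage latency ranges in milliseconds, as quoted in the text.
PIPELINE_STAGE_LATENCY_MS = {
    "speech_to_text": (200, 500),
    "llm_response": (500, 2000),
    "text_to_speech": (200, 500),
}

# Realtime API audio prices quoted in the text (USD per minute).
REALTIME_INPUT_USD_PER_MIN = 0.06
REALTIME_OUTPUT_USD_PER_MIN = 0.24


def pipeline_latency_range_ms() -> tuple[int, int]:
    """Sum the best-case and worst-case latency of the sequential stages."""
    best = sum(low for low, _ in PIPELINE_STAGE_LATENCY_MS.values())
    worst = sum(high for _, high in PIPELINE_STAGE_LATENCY_MS.values())
    return best, worst


def realtime_cost_usd(input_minutes: float, output_minutes: float) -> float:
    """Estimate Realtime API audio cost for a conversation."""
    cost = (input_minutes * REALTIME_INPUT_USD_PER_MIN
            + output_minutes * REALTIME_OUTPUT_USD_PER_MIN)
    return round(cost, 2)


def recommend_approach(interactive: bool, needs_transcript: bool,
                       cost_sensitive: bool) -> str:
    """Encode the recommendation above as a simple decision rule."""
    if interactive and not needs_transcript and not cost_sensitive:
        return "realtime"
    return "pipeline"
```

Here `pipeline_latency_range_ms()` gives the 900-3000 ms range and `realtime_cost_usd(10, 10)` gives the roughly $3 figure. The decision rule is deliberately coarse; a real system would weigh these factors rather than treat them as booleans.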
Follow-up: How do you handle the case where a voice agent needs to call external tools (check a database, look up an order) mid-conversation without an awkward pause?

This is one of the most challenging UX problems in voice AI. When the agent needs to call a tool, there is an unavoidable delay while the tool executes. The Realtime API supports function calling within the WebSocket session: the model emits a function call event, you execute the function and send the result back, and the model continues generating audio. But if the tool takes 2 seconds, the user hears silence for 2 seconds, which feels like the call dropped.

My approach is conversational bridging: I configure the agent to generate a natural filler phrase before the tool call (“Let me check that for you” or “One moment while I pull up your order”), which buys 2-3 seconds of perceived activity. For longer tool calls (5+ seconds), I use a streaming approach: send a partial response (“I found your order, it was placed on…”) while still waiting for full details, then update with complete information. The implementation requires careful management of the conversation state in the WebSocket session, because you are essentially managing concurrent audio generation and tool execution on the same connection.
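The concurrency pattern behind conversational bridging can be shown without the WebSocket plumbing. In this sketch, `tool` and `speak` are hypothetical async callables: `speak` stands in for whatever sends audio to the user, and `tool` for the database or order lookup. It does not use the Realtime API itself; it only illustrates starting the tool call and the filler phrase together.

```python
import asyncio
from typing import Any, Callable, Coroutine


async def call_tool_with_bridge(
    tool: Callable[[], Coroutine[Any, Any, str]],
    speak: Callable[[str], Coroutine[Any, Any, None]],
    filler: str = "Let me check that for you.",
) -> str:
    """Run a slow tool call while a filler phrase covers the silence."""
    # Schedule the tool call first so it runs while the filler is spoken.
    tool_task = asyncio.create_task(tool())
    await speak(filler)
    # By the time the filler finishes, the tool may already be done.
    result = await tool_task
    await speak(result)
    return result
```

Because the tool is scheduled before the filler is spoken, the filler's duration overlaps with the tool's latency instead of adding to it.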
Strong Answer:
  • The cost of vision is primarily driven by the detail parameter and the image resolution. At detail: "low", every image costs a fixed 85 tokens regardless of resolution, roughly $0.0002 per image at GPT-4o pricing. At detail: "high", the cost scales with resolution: a 1024x1024 image costs roughly 765 tokens, a 2048x2048 image costs roughly 1,700 tokens. That is up to a 20x difference in token consumption between low and high detail.
  • For batch processing thousands of images, the optimization strategy is a two-stage pipeline. Stage one is triage: run every image through the model at detail: "low" with a simple classification prompt — “Does this image contain text, charts, or detailed data? Answer yes or no.” Images that answer “no” (simple photos, logos, decorative images) are processed entirely at low detail. Images that answer “yes” are promoted to high detail for a second pass. In a typical document processing workload, 60-70% of images can be fully processed at low detail, saving 85% of token costs on those images.
  • The second optimization is resolution management. Images larger than 2048x2048 are automatically resized by the API, but you are still charged for the upload bandwidth and the API’s resizing is not always optimal. I pre-resize images to the minimum resolution that preserves the information I need. For OCR tasks, 1024px on the long edge is usually sufficient. For chart reading, 1536px preserves axis labels and data points. For simple classification (“is this a dog or a cat?”), 512px is plenty.
  • The third optimization is batching related images. Instead of sending 10 images as 10 separate API calls (each with its own system prompt and overhead), I send them as a single message with 10 image content blocks. The system prompt tokens are paid once instead of 10 times. For a 500-token system prompt, this saves 4,500 tokens across 10 images.
  • At scale, say 100,000 images per month, the difference between naive high-detail processing ($4,000/month) and an optimized pipeline ($400/month) is an order of magnitude. I have seen teams blow their entire AI budget on vision costs because they defaulted to detail: "high" for everything.
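The triage and batching arithmetic above can be sketched as follows. The `needs_high_detail` callable is a hypothetical stand-in for the low-detail classification call, the 765-token default is the 1024x1024 figure quoted earlier, and the estimate assumes the triage call doubles as the low-detail processing call for images that stay at low detail.

```python
from typing import Callable, Iterable

LOW_DETAIL_TOKENS = 85  # fixed per-image cost at detail: "low", per the text


def triage_images(
    image_ids: Iterable[str],
    needs_high_detail: Callable[[str], bool],
) -> dict[str, str]:
    """Assign each image a detail level based on a cheap first-pass check."""
    return {
        image_id: "high" if needs_high_detail(image_id) else "low"
        for image_id in image_ids
    }


def estimate_image_tokens(plan: dict[str, str],
                          high_detail_tokens: int = 765) -> int:
    """Every image pays for the low-detail pass; promoted images pay again."""
    total = 0
    for detail in plan.values():
        total += LOW_DETAIL_TOKENS
        if detail == "high":
            total += high_detail_tokens
    return total


def batching_savings(system_prompt_tokens: int, n_images: int) -> int:
    """System prompt tokens saved by sending n images in one request
    instead of n separate requests."""
    return system_prompt_tokens * (n_images - 1)
```

For example, `batching_savings(500, 10)` reproduces the 4,500-token figure from the batching bullet. The triage estimate only covers image tokens; prompt and output tokens for each pass would be added on top.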
Follow-up: What are the accuracy limitations of vision models that you have encountered in production?

Three consistent limitations.

First, small text: vision models struggle with text smaller than about 10px in the image, especially handwritten text or text at angles. If your use case involves reading fine print on product labels or small annotations on engineering drawings, you need high resolution and even then accuracy drops to 70-80%. I always validate OCR results from vision models against a traditional OCR engine like Tesseract for critical applications.

Second, spatial reasoning: for questions like “Is element A to the left of element B?” or “Which bar in this chart is tallest?”, vision models answer incorrectly about 15-20% of the time. They are much better at describing what is in an image than reasoning about spatial relationships within it.

Third, counting: “How many people are in this photo?” is surprisingly unreliable above 6-7 objects. The model might say 8 when there are 11. For any application that requires precise counting, I do not trust the vision model and use a dedicated object detection model instead.