Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
December 2025 Update: Comprehensive guide to multimodal AI including GPT-4 Vision, audio processing, and real-time voice with OpenAI’s Realtime API.
What is Multimodal AI?
Think of how humans understand the world: you don’t just read text — you see images, hear sounds, watch videos, and combine all of those signals to form understanding. A doctor doesn’t just read a patient’s chart; they look at X-rays, listen to heart sounds, and observe body language. Multimodal AI works the same way — it processes and reasons across multiple types of content simultaneously, which unlocks capabilities that text-only models simply cannot match. Multimodal AI processes and generates multiple types of content:| Capability | Use Cases | Models | Typical Latency | Cost Range |
|---|---|---|---|---|
| Vision | Image analysis, OCR, diagram understanding | GPT-4o, Claude 3.5, Gemini | 1-5s per image | $0.01-0.04 per image |
| Audio | Transcription, TTS, voice understanding | Whisper, TTS, Realtime API | 0.5-2s per minute of audio | $0.006/min (Whisper) |
| Video | Scene analysis, content moderation | Gemini 1.5 Pro | 5-30s per minute of video | $0.05-0.20 per minute |
| Real-time Voice | Voice assistants, phone agents | Realtime API, Gemini Live | Sub-500ms round-trip | $0.06/min (audio input) |
Choosing the Right Modality Approach
| Scenario | Recommended Approach | Why |
|---|---|---|
| One-off image analysis | Direct base64 or URL to GPT-4o | Simplest path, high quality |
| Batch image processing (100+ images) | Queue-based with detail: "low" | 85% cheaper, sufficient for classification |
| Real-time conversation | Realtime API over WebSocket | Sub-500ms latency, natural feel |
| Transcribe then respond | Whisper + Chat Completions + TTS | Cheaper, more control over each step |
| Document understanding with charts | Multimodal RAG with high detail | Charts carry information text alone cannot convey |
| Accessibility alt-text generation | GPT-4o Vision with specific prompt | Vision models excel at descriptive captions |
Vision: Image Understanding
Analyzing Images with GPT-4o
Vision Detail Levels: When to Use What
Thedetail parameter is the single biggest cost lever for vision tasks. Most teams default to "high" and overpay by 5-10x.
| Detail Level | Token Cost | Best For | Avoid When |
|---|---|---|---|
"low" | ~85 tokens fixed | Simple classification, presence detection, general scene description | Image contains fine text, dense tables, or small UI elements |
"high" | 85-1,700+ tokens (scales with resolution) | OCR, chart reading, UI analysis, medical imaging | Batch processing thousands of images, simple yes/no questions |
"auto" | Varies | When you cannot predict image complexity | You need predictable costs per request |
Analyzing Images from URLs
Vision Use Cases
Audio: Speech and Sound
Audio processing with LLMs is like having a universal translator that also takes dictation. Whisper handles speech-to-text (including across languages), TTS converts text back to natural-sounding speech, and the two together form a complete audio pipeline. The key insight: Whisper is not just transcription — it understands context, handles accents, and can even translate between languages in a single step.Audio Pipeline Decision Framework
Before writing code, decide which pipeline architecture you need:| Architecture | Latency | Cost | Quality | When to Use |
|---|---|---|---|---|
| Whisper -> Chat -> TTS (sequential) | 3-8s total | Lowest | Full control over each step | Async processing, batch jobs, cost-sensitive |
| Realtime API (streaming) | Sub-500ms | Highest (0.24/min out) | Most natural conversation | Voice assistants, phone agents, live customer support |
| Whisper -> Chat (text response) | 2-5s | Low | N/A (no audio output) | Meeting transcription with Q&A, voice search |
| Chat -> TTS (audio output only) | 1-3s | Medium | Good for narration | Podcast generation, audiobook creation, notifications |
Speech-to-Text with Whisper
Text-to-Speech
Audio Translation
Real-Time Voice
OpenAI Realtime API
The Realtime API is fundamentally different from the transcribe-then-generate-then-speak pipeline. Instead of three sequential steps (each adding latency), it processes audio input and produces audio output in a single streaming connection — like a phone call rather than a walkie-talkie. This cuts round-trip latency from 3-5 seconds down to under 500ms, which is the threshold where conversations feel natural. For real-time voice conversations with AI:Realtime API vs. Sequential Pipeline: Edge Cases
When Realtime API breaks down:- Long tool calls: If your function takes more than 2-3 seconds (e.g., database queries, external API calls), the silence feels unnatural. The sequential pipeline handles this better because you can play a “thinking” audio clip while processing.
- Multi-language in one session: Realtime API voice is configured per session. If a user switches languages mid-conversation, you need to update the session — which can cause a brief disruption.
- Noisy environments: Server-side VAD (voice activity detection) may trigger on background noise, causing the model to respond to non-speech. In noisy use cases, use client-side VAD with a higher threshold and send
input_audio_buffer.commitmanually. - Cost at scale: A 10-minute conversation costs roughly 0.30 with the sequential pipeline. At 10,000 conversations/day, that is 3K/day.
Voice Agent with Function Calling
Image Generation
Image generation is the “output” side of multimodal AI. DALL-E 3 is significantly better than DALL-E 2 at following detailed prompts and rendering text, but it costs more and only generates one image at a time. A practical workflow: use DALL-E 3 for final images and DALL-E 2 for quick variations and experiments.Image Generation Model Comparison
| Feature | DALL-E 3 | DALL-E 2 | Stable Diffusion (self-hosted) |
|---|---|---|---|
| Prompt adherence | Excellent — follows detailed prompts | Moderate — misses details | Good with proper prompting |
| Text rendering | Good (readable text in images) | Poor | Poor without ControlNet |
| Cost per image | $0.04-0.12 | $0.02 | Hardware cost only (~$0.001) |
| Speed | 5-15s | 3-8s | 2-10s (GPU dependent) |
| Max images per request | 1 | 10 | Unlimited |
| Editing/inpainting | Not supported | Supported | Supported |
| Variations | Not supported | Supported | Supported |
| Style control | vivid or natural | None | Full control via models/LoRA |
| Self-hostable | No | No | Yes |
| Best for | Final production images | Rapid prototyping, variations | High-volume, custom styles |
DALL-E 3 Integration
Image Editing
Multimodal RAG
Traditional RAG only handles text, which means it is blind to charts, diagrams, screenshots, and images embedded in documents. Multimodal RAG fixes this by extracting both text and images from documents, then feeding both to a vision-capable model during retrieval. This is particularly powerful for technical documentation, financial reports with charts, and scientific papers with figures — any domain where the visual content carries information that the text alone cannot convey. Combine vision with RAG for document understanding:Multimodal RAG Edge Cases
Edge case — images without text context: If a PDF page is mostly a diagram with no surrounding text, your text-based retrieval will never find it. Solution: generate a text description of each image at ingestion time using GPT-4o Vision and store that description alongside the image for retrieval. Edge case — token budget explosion: A single high-detail image consumes 1,000+ tokens. A 10-page document with 3 images per page = 30,000+ tokens just for images, before any text. Set strict limits: maximum 3 images per query, usedetail: "low" for initial retrieval, and only switch to detail: "high" for the final answer if the image is relevant.
Edge case — screenshots of code: Vision models can read code from screenshots, but they make subtle errors (confusing 0 and O, missing indentation). For code-heavy documents, prefer OCR followed by syntax-aware cleanup over direct vision analysis.
Key Takeaways
Vision is Powerful
GPT-4o can analyze images, charts, screenshots, and documents
Audio is Easy
Whisper + TTS = complete audio pipeline in a few lines
Real-time is Here
Build voice assistants with the Realtime API
Combine Modalities
Multimodal RAG unlocks powerful document understanding
What’s Next
DSPy Framework
Learn declarative AI programming with Stanford’s DSPy framework
Interview Deep-Dive
You are designing a document understanding system that processes PDFs containing text, tables, charts, and images. How do you architect a multimodal RAG pipeline for this?
You are designing a document understanding system that processes PDFs containing text, tables, charts, and images. How do you architect a multimodal RAG pipeline for this?
Strong Answer:
- The core challenge is that traditional text-only RAG is blind to visual content, and a significant portion of information in real-world documents lives in charts, diagrams, tables, and images. A financial report where the key trend is only visible in a line chart, or a technical manual where the architecture diagram is the most information-dense element on the page — text-only RAG misses all of this.
- My architecture has four stages. Stage one is multimodal extraction: I use a library like PyMuPDF or pdf2image to extract both the raw text and the images from each page. For tables, I extract them as images rather than trying to parse the table structure from text, because vision models are now better at reading tables from screenshots than most table-parsing libraries are at reconstructing them from PDF internals.
- Stage two is dual-track embedding. The text chunks get embedded with a text embedding model (text-embedding-3-small). The images get described by a vision model — I send each image to GPT-4o with a prompt like “Describe this chart/diagram in detail, including all data points, labels, trends, and key takeaways” and embed the resulting description. This way, both text and visual content live in the same vector space and can be retrieved by the same query.
- Stage three is retrieval with multimodal context. When a user asks a question, I retrieve the top-k most relevant chunks (both text and image descriptions). For any retrieved image description, I also include the original image as a base64-encoded visual in the generation prompt. This is critical — the description alone loses nuance, but the description helps with retrieval while the original image provides full fidelity during generation.
- Stage four is generation with explicit grounding. The prompt instructs the model to answer based on both the text context and the visual content, and to cite which specific document, page, or figure supports each claim. This grounding instruction reduces hallucination and gives users a way to verify the answer.
- The main cost consideration is that each image in the generation prompt consumes 85-1,700 tokens depending on the detail level. I limit to 2-3 images per query and use
detail: "low"(85 tokens fixed) for simple images like logos or icons, reservingdetail: "high"for charts and tables where fine detail matters.
Compare the Realtime API approach to voice (single-pass audio-to-audio) versus the traditional pipeline approach (Whisper STT, then LLM, then TTS). What are the trade-offs?
Compare the Realtime API approach to voice (single-pass audio-to-audio) versus the traditional pipeline approach (Whisper STT, then LLM, then TTS). What are the trade-offs?
Strong Answer:
- The traditional pipeline has three sequential steps: Whisper transcribes speech to text (200-500ms), the LLM generates a response from that text (500-2000ms), and TTS converts the response to speech (200-500ms). Total round-trip is 900-3000ms minimum. Each step introduces latency and each is a separate API call with its own failure surface. The advantages are: complete control over each step (you can modify the transcript before sending to the LLM, you can post-process the text before TTS), ability to use different providers for each component (Whisper for STT, Claude for reasoning, ElevenLabs for high-quality voice), and lower cost because you only pay for the compute you use.
- The Realtime API collapses all three steps into a single WebSocket connection. Audio goes in, audio comes out, with sub-500ms round-trip latency. This latency difference is transformative — 500ms feels like a natural conversational pause, while 2000ms feels like talking to someone with a bad phone connection. The Realtime API also maintains conversation state natively, handles voice activity detection (knowing when the user has stopped speaking), and supports interruption (the user can speak while the model is still responding, and it will stop and process the new input).
- The trade-offs: the Realtime API is more expensive — roughly 0.24 per minute for audio output at current pricing. A 10-minute conversation costs about 0.50-0.80. You also lose visibility into the intermediate steps — you cannot inspect the transcript or the text response, which makes debugging harder. And you are locked into OpenAI’s ecosystem; the pipeline approach lets you swap any component.
- My recommendation is: use the Realtime API for interactive voice experiences where latency is the primary UX differentiator — phone agents, voice assistants, real-time translation. Use the pipeline approach for batch processing (transcribing meetings, generating podcast summaries), for applications where you need intermediate text for logging or compliance, and for applications where cost sensitivity outweighs latency requirements.
What are the token cost implications of vision in LLM applications, and how do you optimize for cost when processing thousands of images?
What are the token cost implications of vision in LLM applications, and how do you optimize for cost when processing thousands of images?
Strong Answer:
- The cost of vision is primarily driven by the
detailparameter and the image resolution. Atdetail: "low", every image costs a fixed 85 tokens regardless of resolution — roughly $0.0002 per image at GPT-4o pricing. Atdetail: "high", the cost scales with resolution: a 1024x1024 image costs roughly 765 tokens, a 2048x2048 image costs roughly 1,700 tokens. The difference is 20x in token consumption. - For batch processing thousands of images, the optimization strategy is a two-stage pipeline. Stage one is triage: run every image through the model at
detail: "low"with a simple classification prompt — “Does this image contain text, charts, or detailed data? Answer yes or no.” Images that answer “no” (simple photos, logos, decorative images) are processed entirely at low detail. Images that answer “yes” are promoted to high detail for a second pass. In a typical document processing workload, 60-70% of images can be fully processed at low detail, saving 85% of token costs on those images. - The second optimization is resolution management. Images larger than 2048x2048 are automatically resized by the API, but you are still charged for the upload bandwidth and the API’s resizing is not always optimal. I pre-resize images to the minimum resolution that preserves the information I need. For OCR tasks, 1024px on the long edge is usually sufficient. For chart reading, 1536px preserves axis labels and data points. For simple classification (“is this a dog or a cat?”), 512px is plenty.
- The third optimization is batching related images. Instead of sending 10 images as 10 separate API calls (each with its own system prompt and overhead), I send them as a single message with 10 image content blocks. The system prompt tokens are paid once instead of 10 times. For a 500-token system prompt, this saves 4,500 tokens across 10 images.
- At scale — say 100,000 images per month — the difference between naive high-detail processing (400/month) is an order of magnitude. I have seen teams blow their entire AI budget on vision costs because they defaulted to
detail: "high"for everything.