Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Why Structured Outputs?
LLMs naturally output free-form text, but your application code needs structured data it can actually parse and act on. Without structured outputs, you end up writing brittle regex parsers that break whenever the model rephrases slightly. Structured outputs turn the LLM from a “text generator you hope returns JSON” into a “reliable function that always returns the exact schema you need.”| Method | Reliability | Flexibility | Best For |
|---|---|---|---|
| JSON Mode | High | Medium | Simple structures |
| Function Calling | Very High | High | Tool integration |
| Structured Outputs | Guaranteed | High | Complex schemas |
OpenAI JSON Mode
The simplest way to get JSON:JSON Mode with Schema Instructions
OpenAI Structured Outputs (Guaranteed Schema)
JSON mode says “return JSON” but doesn’t enforce a specific shape. Structured outputs go further: they constrain the model’s token generation to only produce valid JSON matching your exact Pydantic schema. This isn’t just prompting — it’s constrained decoding at the token level, meaning the output literally cannot deviate from your schema. No more “sometimes it adds an extra field” or “occasionally it uses a string instead of an integer.”Complex Nested Schemas
Instructor: Pydantic + LLMs
Instructor wraps the OpenAI client to add automatic retries on validation failure, streaming of partial objects, and support for complex nested Pydantic models. It is the most practical library for production structured outputs — think of it as “Pydantic’s best friend for LLMs.” When the model returns invalid data, Instructor automatically feeds the validation error back to the model and asks it to fix its response. This retry loop is what makes extraction reliable in production.Instructor with Validation and Retries
Streaming with Instructor
Function Calling for Structured Output
Function calling serves double duty: it gives the model tools to interact with the world, AND it produces guaranteed-structured JSON arguments. If you need the model to both decide what action to take and provide structured data for that action, function calling is the right choice. Thestrict: true flag enables constrained decoding so the arguments always match your schema exactly:
Building a Structured Output Pipeline
Choosing Your Structured Output Method
| Method | Schema Guarantee | Streaming | Retry on Failure | Best For |
|---|---|---|---|---|
| JSON mode | JSON only, no schema enforcement | Yes | No (manual) | Quick prototypes, simple extractions |
| Structured Outputs | Exact schema via constrained decoding | Yes (partial) | No (manual) | Guaranteed schema compliance without external deps |
| Function Calling (strict) | Exact schema via constrained decoding | Yes (fragments) | No (manual) | When the model also decides which action to take |
| Instructor | Pydantic validation + auto-retry | Yes (create_partial) | Yes (automatic) | Production extraction with validation logic |
- “I just need valid JSON” — use JSON mode with schema instructions in the prompt. Simplest, but the model can return unexpected field names or types.
- “I need exact schema compliance, no exceptions” — use Structured Outputs or function calling with
strict: true. Constrained decoding at the token level means the output literally cannot deviate. - “I need validation beyond types (email format, ranges, business rules)” — use Instructor. Pydantic validators catch domain-specific violations and auto-retry lets the model self-correct.
- “The model chooses from multiple possible actions” — use function calling. The model selects the function AND provides structured arguments.
Handling Edge Cases
Partial Extraction
Real-world text is messy. A contact card might have an email but no phone number. A resume might list skills but omit graduation dates. Your schema needs to handle missing data gracefully rather than failing or hallucinating values. Use Optional fields with None defaults, and add explicit fields for the model to report what it found and what it couldn’t.Union Types for Variable Outputs
Key Takeaways
Use Structured Outputs
Pydantic is Essential
Instructor for Production
Plan for Edge Cases
What’s Next
LLM Caching
Interview Deep-Dive
Explain the difference between JSON mode, function calling with strict mode, and OpenAI structured outputs. When would you use each, and what guarantees does each actually provide?
Explain the difference between JSON mode, function calling with strict mode, and OpenAI structured outputs. When would you use each, and what guarantees does each actually provide?
- JSON mode (
response_format: {"type": "json_object"}) is the simplest. It constrains the model to produce valid JSON, but it does not enforce any specific schema. The model might return{"name": "John"}or{"full_name": "John", "extra_field": true}— both are valid JSON, but your downstream code might break on unexpected keys or missing fields. JSON mode also requires you to mention “JSON” in your prompt, or it silently ignores the setting and returns plain text. This is a production footgun that has bitten many teams. - Function calling with
strict: trueconstrains the model to produce JSON matching a specific JSON Schema you define in the tool definition. The model’s output arguments will always conform to your schema — correct types, required fields present, enum values respected. The mechanism is constrained decoding: at each token generation step, the model can only select tokens that would produce valid JSON matching your schema. This is not prompt engineering — it is a hard constraint on the token sampling process. - OpenAI structured outputs (
response_format: YourPydanticModel) provide the same constrained-decoding guarantee but for the response body itself, not tool arguments. You define a Pydantic model, and the model’s entire response is guaranteed to parse into that model. It is function calling’s strict mode applied to the response rather than to tool parameters. - When to use each: JSON mode when you need quick prototyping and the schema is simple enough that the model rarely deviates. Function calling when the model needs to both decide what action to take and provide structured arguments. Structured outputs when you need guaranteed schema compliance for extraction, classification, or any task where the output is structured data rather than a tool invocation.
- The key insight most people miss: JSON mode is a prompt-level hint, while strict function calling and structured outputs are token-level constraints. The difference in reliability is not 95% versus 99% — it is “usually works” versus “mathematically cannot produce an invalid output.” For production pipelines where a schema violation crashes a downstream system, only the constrained-decoding approaches are acceptable.
rating: float with ge=1, le=5, the model will always produce a number between 1 and 5. But if the review says “terrible product, complete waste of money” and the model outputs {"rating": 4.5}, the output is schema-valid but factually wrong. Structured outputs constrain the format, not the reasoning. Another concrete example: an email: str field will always be a string, but it might be "not_provided" instead of an actual email address — schema-valid, but useless for your downstream system expecting a real email. This is why Instructor’s retry-on-validation-failure pattern is so valuable in production: Pydantic validators can catch semantic issues (regex for email format, range checks, cross-field consistency) and feed errors back to the model for self-correction. The constrained decoding handles the structural guarantees; Pydantic validators handle the semantic ones.You are building an extraction pipeline that processes 10,000 customer support emails per day, extracting structured data (sentiment, category, urgency, entities) into a Pydantic model. How do you design this for reliability, and what happens when extraction fails?
You are building an extraction pipeline that processes 10,000 customer support emails per day, extracting structured data (sentiment, category, urgency, entities) into a Pydantic model. How do you design this for reliability, and what happens when extraction fails?
- The pipeline has three layers of defense: schema design, retry logic, and graceful degradation. Starting with schema design: every field that might not be present in the source email should be
Optionalwith aNonedefault. If you makecustomer_phone: strrequired, the pipeline will struggle on emails that do not mention a phone number — the model either hallucinates one or the validation fails. Use Optional liberally and add anextraction_confidence: floatfield so the model can signal uncertainty. - For retry logic, use Instructor with
max_retries=3. When Pydantic validation fails (say the model returns “high” for anurgency: intfield), Instructor sends the validation error message back to the model: “urgency must be an integer, got string ‘high’.” The model self-corrects on the next attempt. In my experience, 95% of failures are resolved on the first retry. The 5% that fail all three retries are genuinely ambiguous inputs that probably need human review anyway. - For the 10,000 emails/day throughput, batch processing with async concurrency is essential. Use
asyncio.gatherto process 20-50 emails concurrently. At 1 second per extraction, sequential processing takes ~3 hours. With 50-way concurrency, it takes ~3.5 minutes. But watch your rate limits — 50 concurrent requests togpt-4ocan hit the TPM (tokens per minute) limit quickly. Implement a semaphore or rate-limiting wrapper. - Graceful degradation: not every extraction needs to succeed. Build a dead-letter queue for emails that fail all retries. Route them to human review. Track the failure rate — if it is under 2%, you are in good shape. If it spikes above 5%, something changed (new email format, language shift, adversarial content) and you need to investigate.
- Cost optimization: use
gpt-4o-minifor the initial extraction pass. For the 2-5% that fail validation, retry withgpt-4o. This gives you the cost efficiency of the small model for the 95% easy cases and the capability of the large model for the hard cases. At 10K emails/day, this can save $50-100/day versus running everything throughgpt-4o.
urgency: int = Field(description="Urgency 1-5"), use urgency: int = Field(description="Urgency 1-5 where 5=critical system outage or revenue loss, 4=customer-impacting issue, 3=important but not time-sensitive, 2=minor request, 1=informational only"). Explicit scale definitions with examples dramatically improve calibration. Second, add few-shot examples in the system prompt showing correctly labeled emails at each urgency level, especially for the boundaries between 3 and 4 where the model struggles. Third, if prompt engineering is insufficient, add a post-processing rule layer: if the extracted text contains specific keywords (“outage”, “system down”, “revenue impact”, “SLA breach”), override the model’s urgency to 5 regardless of what it predicted. This hybrid of LLM extraction plus rule-based overrides is common in production systems where certain signals are too important to leave to model judgment.Instructor adds automatic retries when Pydantic validation fails. Walk me through exactly what happens during a retry -- what data is sent back to the model, and why does this work?
Instructor adds automatic retries when Pydantic validation fails. Walk me through exactly what happens during a retry -- what data is sent back to the model, and why does this work?
- When the model’s first response fails Pydantic validation, Instructor constructs a new message in the conversation that includes two things: the model’s original (invalid) response and the specific validation error message from Pydantic. For example, if the model returned
{"email": "not-an-email"}and the Pydantic validator requires an ”@” symbol, the retry message would be something like: “Your previous response failed validation: Value error, Invalid email format — must contain ’@’. Please fix the response and try again.” - The model sees its own failed attempt plus the error message, and it self-corrects. This works because the model can reason about its mistakes when given explicit error feedback. It is analogous to a human coder seeing a compiler error and fixing the code — the error message tells them exactly what is wrong.
- Under the hood, Instructor appends the failed assistant message and a new user message with the validation error to the conversation history, then makes another API call. This means each retry costs additional tokens — both the input tokens for the growing conversation history and the output tokens for the new response. Three retries on a complex schema can cost 3-4x the original request.
- The reason this is effective is that most validation failures are not reasoning failures but formatting failures. The model knew the email was “john@example.com” but returned it in the wrong field, or it knew the urgency was 5 but output “5” as a string instead of an integer. The validation error points to the exact formatting issue, and the model fixes it trivially.
- Where this breaks down: if the validation failure is a genuine reasoning error (the model truly does not know the correct value), retries will not help. The model will either hallucinate a plausible-looking value that passes validation, or it will flip-flop between different wrong values. For semantic correctness, you need better prompts or a more capable model, not more retries.
Your structured output pipeline works perfectly in development but fails on 20% of production inputs. The Pydantic models are identical. What is different about production data, and how do you diagnose it?
Your structured output pipeline works perfectly in development but fails on 20% of production inputs. The Pydantic models are identical. What is different about production data, and how do you diagnose it?
- The most common reason structured output works in dev but fails in production is input distribution mismatch. Development test cases are clean, well-formed, and representative of the happy path. Production data is messy: emails with HTML artifacts, documents with tables that got mangled during text extraction, inputs in unexpected languages, text that is too long and gets truncated by token limits, or adversarial inputs from users testing the system.
- My diagnostic process: First, pull the failing 20% and compare them to the passing 80%. Look for patterns in input length (are the failures systematically longer?), language (are non-English inputs failing?), format (do inputs with bullet points or tables fail more?), and content (are certain topics harder to extract?). This segmentation usually reveals the root cause within an hour.
- Second most common cause: token limit truncation. In development, your test inputs are short. In production, a customer email might include a forwarded thread with 50 messages. The model’s context window fills up, the system prompt and schema instructions get pushed out, and the model loses the extraction instructions. Fix: truncate or summarize long inputs before extraction, and always put your schema instructions in the system prompt (which the model prioritizes) rather than the user prompt.
- Third cause: the production inputs contain characters or formatting that confuse the model. Markdown tables, code blocks, HTML tags, Unicode special characters, or extremely long single-line strings without whitespace. The model’s tokenizer handles these differently, and the extraction quality degrades. Pre-processing the input to strip HTML, normalize whitespace, and break extremely long lines resolves most of these issues.
- Fourth cause: schema rigidity. Your schema requires specific enum values like
category: Literal["billing", "technical", "general"], but production users write about topics that do not cleanly fit any category. The model forces a classification and often picks wrong. Add an “other” category, or switch from strict enums to a string field with description guidance and validate downstream.
classification_reasoning: str field to the schema. When the model must explain why it chose “other,” it forces itself to consider whether one of the real categories actually fits. Often the reasoning reveals the correct category, and the model self-corrects. Operationally, monitor the “other” rate weekly. If it exceeds 10-15%, review the “other” examples — you likely need a new category that your original schema did not anticipate.