December 2025 Update: Master structured outputs with OpenAI’s JSON mode, Pydantic integration, and schema validation techniques.

Why Structured Outputs?

LLMs naturally output free-form text, but your application code needs structured data it can actually parse and act on. Without structured outputs, you end up writing brittle regex parsers that break whenever the model rephrases slightly. Structured outputs turn the LLM from a “text generator you hope returns JSON” into a “reliable function that always returns the exact schema you need.”
Free-form Output                 Structured Output
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"The sentiment is positive       {"sentiment": "positive",
and confidence is around          "confidence": 0.92,
92%. The key topics are           "topics": ["service", "quality"]}
service and quality."
| Method | Reliability | Flexibility | Best For |
|---|---|---|---|
| JSON Mode | High | Medium | Simple structures |
| Function Calling | Very High | High | Tool integration |
| Structured Outputs | Guaranteed | High | Complex schemas |

OpenAI JSON Mode

The simplest way to get JSON:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Extract information and return valid JSON."
        },
        {
            "role": "user", 
            "content": "John Smith, 35 years old, software engineer at Google"
        }
    ],
    response_format={"type": "json_object"}
)

import json
data = json.loads(response.choices[0].message.content)
print(data)
# {"name": "John Smith", "age": 35, "job": "software engineer", "company": "Google"}
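Pitfall: JSON mode requires the word “JSON” to appear somewhere in your messages. If it does not, the API rejects the request with an error instead of returning a completion. OpenAI’s documentation also warns that without a clear instruction the model may emit whitespace until it reaches the token limit, which leaves truncated output that crashes your json.loads() call in production. Always include format instructions like “Return valid JSON” in your system prompt.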
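Because JSON mode output can still arrive empty, truncated, or as a non-object value, it may help to wrap the parsing step. The helper below is a minimal sketch; the name parse_json_object is hypothetical and not part of the OpenAI SDK.

```python
import json
from typing import Any, Optional


def parse_json_object(raw: Optional[str]) -> dict[str, Any]:
    """Parse a model reply that is expected to be a single JSON object.

    Raises ValueError with a readable message instead of letting a bare
    JSONDecodeError (or a TypeError on None) escape into the caller.
    """
    if not raw:
        raise ValueError("Model returned an empty message")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # A common cause is a reply cut off at the token limit.
        raise ValueError(f"Model reply is not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"Expected a JSON object, got {type(data).__name__}")
    return data
```

In the example above this would replace the bare json.loads(response.choices[0].message.content) call.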

JSON Mode with Schema Instructions

def extract_with_schema(text: str, schema: dict) -> dict:
    """Extract structured data following a schema"""
    
    schema_str = json.dumps(schema, indent=2)
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": f"""Extract information from the text and return JSON matching this schema:

{schema_str}

Rules:
- Return ONLY valid JSON
- Include all required fields
- Use null for missing optional fields
- Match the exact field names and types"""
            },
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"}
    )
    
    return json.loads(response.choices[0].message.content)

# Usage
schema = {
    "name": "string",
    "email": "string", 
    "age": "integer",
    "interests": ["string"]
}

result = extract_with_schema(
    "Jane Doe (jane@example.com) is 28 and loves hiking and photography",
    schema
)
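Since JSON mode does not enforce the schema you describe in the prompt, a light check on the returned dict can catch renamed or mistyped fields early. This is a sketch that assumes the informal schema notation used above ("string", "integer", and a one-element list for arrays); find_schema_violations and TYPE_NAMES are hypothetical names.

```python
from typing import Any

# Maps the informal type names used in the prompt schema to Python types.
TYPE_NAMES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def _matches(value: Any, type_name: str) -> bool:
    # bool is a subclass of int in Python, so reject it for numeric fields.
    if isinstance(value, bool) and type_name != "boolean":
        return False
    return isinstance(value, TYPE_NAMES[type_name])


def find_schema_violations(data: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the dict fits."""
    problems = []
    for field, expected in schema.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif data[field] is None:
            continue  # the prompt allows null for missing optional fields
        elif isinstance(expected, list):
            value = data[field]
            if not (isinstance(value, list) and all(_matches(v, expected[0]) for v in value)):
                problems.append(f"{field}: expected list of {expected[0]}")
        elif not _matches(data[field], expected):
            problems.append(f"{field}: expected {expected}")
    problems.extend(f"unexpected field: {key}" for key in data if key not in schema)
    return problems
```

Calling find_schema_violations(result, schema) after extract_with_schema gives you a list you can log or use to trigger a retry.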

OpenAI Structured Outputs (Guaranteed Schema)

JSON mode says “return JSON” but doesn’t enforce a specific shape. Structured outputs go further: they constrain the model’s token generation so it can only produce JSON matching the JSON Schema derived from your Pydantic model. This isn’t just prompting; it’s constrained decoding at the token level, so a completed response cannot deviate from your schema. No more “sometimes it adds an extra field” or “occasionally it uses a string instead of an integer.” The two cases you still need to handle are a refusal, where the model declines to answer, and a response cut off at the token limit.
from pydantic import BaseModel
from typing import Optional, List

class Person(BaseModel):
    name: str
    age: int
    email: Optional[str] = None
    skills: List[str]

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract person information."},
        {"role": "user", "content": "Bob is 30, knows Python and JavaScript, email: bob@test.com"}
    ],
    response_format=Person
)

person = response.choices[0].message.parsed
print(f"Name: {person.name}, Age: {person.age}")
print(f"Skills: {', '.join(person.skills)}")
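One case the snippet above does not cover is a refusal: the message then carries a refusal string and parsed is None, so person.name would raise an AttributeError. A minimal sketch of a guard follows; unwrap_parsed is a hypothetical helper, and it assumes only the choices[0].message.parsed and .refusal attributes used by the SDK response.

```python
from typing import Any


def unwrap_parsed(response: Any) -> Any:
    """Return the parsed object from a structured-output response.

    Raises RuntimeError when the model refused or no object was parsed,
    so callers never dereference None by accident.
    """
    message = response.choices[0].message
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise RuntimeError(f"Model refused the request: {refusal}")
    if message.parsed is None:
        raise RuntimeError("Response contained no parsed object")
    return message.parsed
```

With this in place, person = unwrap_parsed(response) either yields a Person or fails loudly at the call site.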

Complex Nested Schemas

from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class Task(BaseModel):
    title: str = Field(description="Short task title")
    description: str = Field(description="Detailed task description")
    priority: Priority
    estimated_hours: float = Field(ge=0, le=100)

class ProjectPlan(BaseModel):
    project_name: str
    objective: str
    tasks: List[Task]
    total_hours: float
    risks: List[str]
    dependencies: Optional[List[str]] = None

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": "You are a project planning assistant. Create detailed project plans."
        },
        {
            "role": "user",
            "content": "Plan a website redesign project for a small business"
        }
    ],
    response_format=ProjectPlan
)

plan = response.choices[0].message.parsed
print(f"Project: {plan.project_name}")
for task in plan.tasks:
    print(f"  - [{task.priority.value}] {task.title}: {task.estimated_hours}h")

Instructor: Pydantic + LLMs

Instructor wraps the OpenAI client to add automatic retries on validation failure, streaming of partial objects, and support for complex nested Pydantic models. It is the most practical library for production structured outputs — think of it as “Pydantic’s best friend for LLMs.” When the model returns invalid data, Instructor automatically feeds the validation error back to the model and asks it to fix its response. This retry loop is what makes extraction reliable in production.
pip install instructor

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List

# Patch OpenAI client
client = instructor.from_openai(OpenAI())

class ExtractedEntity(BaseModel):
    name: str
    entity_type: str = Field(description="person, organization, location, etc.")
    context: str = Field(description="How this entity is mentioned")

class DocumentAnalysis(BaseModel):
    summary: str = Field(description="One paragraph summary")
    entities: List[ExtractedEntity]
    sentiment: str = Field(description="positive, negative, or neutral")
    key_topics: List[str]
    language: str

# Use like normal, but with response_model
analysis = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Analyze this article: Apple announced..."}
    ],
    response_model=DocumentAnalysis
)

print(analysis.summary)
for entity in analysis.entities:
    print(f"  {entity.entity_type}: {entity.name}")

Instructor with Validation and Retries

from pydantic import BaseModel, Field, field_validator
from typing import List
import instructor

class ValidatedExtraction(BaseModel):
    email: str
    phone: str
    website: str
    
    @field_validator("email")
    @classmethod
    def validate_email(cls, v):
        if "@" not in v:
            raise ValueError("Invalid email format")
        return v.lower()
    
    @field_validator("phone")
    @classmethod
    def validate_phone(cls, v):
        digits = "".join(c for c in v if c.isdigit())
        if len(digits) < 10:
            raise ValueError("Phone must have at least 10 digits")
        return digits

client = instructor.from_openai(OpenAI())

# Instructor will retry if validation fails -- the key production feature.
# When the model returns "555-123" (too few digits), the validator rejects it,
# Instructor sends the error message back to the model ("Phone must have at 
# least 10 digits"), and the model self-corrects on the next attempt.
result = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Contact: john@example.com, 555-123-4567, www.example.com"}
    ],
    response_model=ValidatedExtraction,
    max_retries=3  # Usually succeeds on retry 1; 3 is a safe ceiling
)

Streaming with Instructor

from pydantic import BaseModel
from typing import List
import instructor

class StoryChapter(BaseModel):
    title: str
    content: str
    characters: List[str]

class Story(BaseModel):
    title: str
    chapters: List[StoryChapter]

client = instructor.from_openai(OpenAI())

# Stream partial objects
for partial in client.chat.completions.create_partial(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Write a 3 chapter short story about a robot"}
    ],
    response_model=Story
):
    # Partial object updates as it streams
    if partial.title:
        print(f"Title: {partial.title}")
    if partial.chapters:
        print(f"Chapters so far: {len(partial.chapters)}")

Function Calling for Structured Output

Function calling serves double duty: it gives the model tools to interact with the world, AND it produces guaranteed-structured JSON arguments. If you need the model to both decide what action to take and provide structured data for that action, function calling is the right choice. The strict: true flag enables constrained decoding so the arguments always match your schema exactly:
from openai import OpenAI
import json

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "create_calendar_event",
            "description": "Create a calendar event",
            "strict": True,  # Enable strict mode
            "parameters": {
                "type": "object",
                "properties": {
                    "title": {
                        "type": "string",
                        "description": "Event title"
                    },
                    "start_time": {
                        "type": "string",
                        "description": "Start time in ISO format"
                    },
                    "end_time": {
                        "type": "string", 
                        "description": "End time in ISO format"
                    },
                    "attendees": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "List of attendee emails"
                    },
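                    "location": {
                        # Nullable instead of optional: strict mode has no optional fields
                        "type": ["string", "null"],
                        "description": "Event location, or null if not given"
                    }
                },
                # Strict mode requires every property to be listed in "required"
                "required": ["title", "start_time", "end_time", "attendees", "location"],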
                "additionalProperties": False
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Schedule a team meeting tomorrow at 2pm for 1 hour with alice@co.com and bob@co.com"}
    ],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "create_calendar_event"}}
)

# Parse the structured function call
tool_call = response.choices[0].message.tool_calls[0]
event_data = json.loads(tool_call.function.arguments)
print(event_data)
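Strict mode guarantees that start_time and end_time are strings, not that they are valid or consistent timestamps. A small post-check can cover that gap. This is a sketch with a hypothetical helper name, and it assumes both values use the same ISO 8601 style (datetime.fromisoformat on older Python versions does not accept a trailing "Z").

```python
from datetime import datetime


def check_event_times(event: dict) -> tuple[datetime, datetime]:
    """Parse and sanity-check the time fields of a calendar event dict.

    Raises ValueError if either value is not ISO formatted or if the
    event would end before (or exactly when) it starts.
    """
    start = datetime.fromisoformat(event["start_time"])
    end = datetime.fromisoformat(event["end_time"])
    if end <= start:
        raise ValueError("end_time must be after start_time")
    return start, end
```

Running check_event_times(event_data) before creating the event turns a silently wrong calendar entry into an explicit error you can retry on.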

Building a Structured Output Pipeline

from pydantic import BaseModel, Field
from typing import List, Optional, Any
from openai import OpenAI
import instructor
from enum import Enum

class OutputFormat(str, Enum):
    JSON = "json"
    PYDANTIC = "pydantic"
    FUNCTION_CALL = "function_call"

class StructuredOutputPipeline:
    """Unified pipeline for structured outputs"""
    
    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        self.raw_client = OpenAI()
        self.instructor_client = instructor.from_openai(OpenAI())
    
    def extract(
        self,
        text: str,
        schema: type[BaseModel],
        system_prompt: str = "Extract information accurately.",
        max_retries: int = 2
    ) -> BaseModel:
        """Extract structured data using Instructor"""
        return self.instructor_client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": text}
            ],
            response_model=schema,
            max_retries=max_retries
        )
    
    def extract_batch(
        self,
        texts: List[str],
        schema: type[BaseModel],
        system_prompt: str = "Extract information accurately."
    ) -> List[BaseModel]:
        """Extract from multiple texts"""
        import asyncio
        from openai import AsyncOpenAI
        
        async_client = instructor.from_openai(AsyncOpenAI())
        
        async def extract_one(text: str) -> BaseModel:
            return await async_client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": text}
                ],
                response_model=schema
            )
        
        async def extract_all():
            return await asyncio.gather(*[extract_one(t) for t in texts])
        
        return asyncio.run(extract_all())
    
    def extract_with_confidence(
        self,
        text: str,
        schema: type[BaseModel]
    ) -> tuple[BaseModel, float]:
        """Extract with confidence score"""
        
        class WithConfidence(BaseModel):
            data: schema
            confidence: float = Field(ge=0, le=1, description="Confidence in extraction accuracy")
            reasoning: str = Field(description="Why this confidence level")
        
        result = self.instructor_client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system",
                    "content": "Extract information and rate your confidence (0-1) in the extraction accuracy."
                },
                {"role": "user", "content": text}
            ],
            response_model=WithConfidence
        )
        
        return result.data, result.confidence

# Usage
pipeline = StructuredOutputPipeline()

class ProductReview(BaseModel):
    product_name: str
    rating: float = Field(ge=1, le=5)
    pros: List[str]
    cons: List[str]
    recommendation: bool

review, confidence = pipeline.extract_with_confidence(
    "The new iPhone 15 is amazing! Great camera, fast processor. "
    "Battery could be better though. 4.5 stars, definitely recommend!",
    ProductReview
)

print(f"Product: {review.product_name}")
print(f"Rating: {review.rating}/5")
print(f"Confidence: {confidence:.0%}")

Choosing Your Structured Output Method

| Method | Schema Guarantee | Streaming | Retry on Failure | Best For |
|---|---|---|---|---|
| JSON mode | JSON only, no schema enforcement | Yes | No (manual) | Quick prototypes, simple extractions |
| Structured Outputs | Exact schema via constrained decoding | Yes (partial) | No (manual) | Guaranteed schema compliance without external deps |
| Function Calling (strict) | Exact schema via constrained decoding | Yes (fragments) | No (manual) | When the model also decides which action to take |
| Instructor | Pydantic validation + auto-retry | Yes (create_partial) | Yes (automatic) | Production extraction with validation logic |

Decision framework:
  • “I just need valid JSON” — use JSON mode with schema instructions in the prompt. Simplest, but the model can return unexpected field names or types.
  • “I need exact schema compliance, no exceptions” — use Structured Outputs or function calling with strict: true. Constrained decoding at the token level means the output literally cannot deviate.
  • “I need validation beyond types (email format, ranges, business rules)” — use Instructor. Pydantic validators catch domain-specific violations and auto-retry lets the model self-correct.
  • “The model chooses from multiple possible actions” — use function calling. The model selects the function AND provides structured arguments.

Handling Edge Cases

Partial Extraction

Real-world text is messy. A contact card might have an email but no phone number. A resume might list skills but omit graduation dates. Your schema needs to handle missing data gracefully rather than failing or hallucinating values. Use Optional fields with None defaults, and add explicit fields for the model to report what it found and what it couldn’t.
from pydantic import BaseModel, Field
from typing import Optional, List

class FlexibleExtraction(BaseModel):
    """Schema that handles missing data gracefully -- prefer Optional fields
    over required fields when the source text may not contain the data."""
    
    name: Optional[str] = Field(None, description="Person's name if mentioned")
    email: Optional[str] = Field(None, description="Email if found")
    phone: Optional[str] = Field(None, description="Phone if found")
    
    extracted_fields: List[str] = Field(
        default_factory=list,
        description="List of fields that were successfully extracted"
    )
    missing_fields: List[str] = Field(
        default_factory=list,
        description="List of fields that could not be found"
    )
    extraction_notes: str = Field(
        "",
        description="Any notes about ambiguity or uncertainty"
    )

Union Types for Variable Outputs

from pydantic import BaseModel
from typing import Union, Literal

class SuccessResponse(BaseModel):
    status: Literal["success"]
    data: dict
    message: str

class ErrorResponse(BaseModel):
    status: Literal["error"]
    error_code: str
    error_message: str

class APIResponse(BaseModel):
    response: Union[SuccessResponse, ErrorResponse]
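One way to make the union above resolve deterministically is to tell Pydantic which field distinguishes the variants. The sketch below assumes Pydantic v2 and repeats the two models so it runs on its own; the sample payload values are made up for illustration.

```python
from typing import Literal, Union

from pydantic import BaseModel, Field, ValidationError


class SuccessResponse(BaseModel):
    status: Literal["success"]
    data: dict
    message: str


class ErrorResponse(BaseModel):
    status: Literal["error"]
    error_code: str
    error_message: str


class APIResponse(BaseModel):
    # "status" selects the variant, so Pydantic validates against exactly
    # one model instead of trying each member of the Union in turn.
    response: Union[SuccessResponse, ErrorResponse] = Field(discriminator="status")


parsed = APIResponse.model_validate(
    {"response": {"status": "error", "error_code": "E42", "error_message": "not found"}}
)
print(type(parsed.response).__name__)
```

A payload whose status is neither "success" nor "error" fails validation with a single error naming the discriminator, which tends to be easier to feed back to the model than a list of errors from every variant.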

Key Takeaways

  • Use Structured Outputs: OpenAI’s structured outputs guarantee schema compliance.
  • Pydantic is Essential: Pydantic provides validation, type safety, and clear schemas.
  • Instructor for Production: Instructor adds retries, streaming, and validation on top of OpenAI.
  • Plan for Edge Cases: Use Optional fields and flexible schemas for real-world data.

What’s Next

LLM Caching: Learn caching strategies to reduce costs and latency

Interview Deep-Dive

Strong Answer:
  • JSON mode (response_format: {"type": "json_object"}) is the simplest. It constrains the model to produce syntactically valid JSON, but it does not enforce any specific schema. The model might return {"name": "John"} or {"full_name": "John", "extra_field": true}; both are valid JSON, but your downstream code might break on unexpected keys or missing fields. JSON mode also requires the word “JSON” to appear in your messages: without it the API rejects the request with an error, and with only a vague instruction the model can pad the output with whitespace until it hits the token limit, leaving truncated JSON. This is a production footgun that has bitten many teams.
  • Function calling with strict: true constrains the model to produce JSON matching a specific JSON Schema you define in the tool definition. The model’s output arguments will always conform to your schema — correct types, required fields present, enum values respected. The mechanism is constrained decoding: at each token generation step, the model can only select tokens that would produce valid JSON matching your schema. This is not prompt engineering — it is a hard constraint on the token sampling process.
  • OpenAI structured outputs (response_format: YourPydanticModel) provide the same constrained-decoding guarantee but for the response body itself, not tool arguments. You define a Pydantic model, and the model’s entire response is guaranteed to parse into that model. It is function calling’s strict mode applied to the response rather than to tool parameters.
  • When to use each: JSON mode when you need quick prototyping and the schema is simple enough that the model rarely deviates. Function calling when the model needs to both decide what action to take and provide structured arguments. Structured outputs when you need guaranteed schema compliance for extraction, classification, or any task where the output is structured data rather than a tool invocation.
  • The key insight most people miss: JSON mode guarantees only that the output parses as JSON, so the schema itself is still just a prompt-level hint, while strict function calling and structured outputs are token-level constraints on the schema. The difference in reliability is not 95% versus 99%; it is “usually works” versus “cannot produce a schema-invalid output” (refusals and replies truncated at the token limit aside). For production pipelines where a schema violation crashes a downstream system, only the constrained-decoding approaches are acceptable.
Follow-up: Structured outputs guarantee schema compliance, but can the data inside the schema still be wrong? Give me a concrete example of a failure mode that structured outputs do NOT protect against.
Absolutely. Structured outputs guarantee syntactic validity, not semantic correctness. If your schema has rating: float with ge=1, le=5, the model will always produce a number between 1 and 5. But if the review says “terrible product, complete waste of money” and the model outputs {"rating": 4.5}, the output is schema-valid but factually wrong. Structured outputs constrain the format, not the reasoning. Another concrete example: an email: str field will always be a string, but it might be "not_provided" instead of an actual email address, which is schema-valid but useless for a downstream system expecting a real email. This is why Instructor’s retry-on-validation-failure pattern is so valuable in production: Pydantic validators can catch semantic issues (regex for email format, range checks, cross-field consistency) and feed errors back to the model for self-correction. The constrained decoding handles the structural guarantees; Pydantic validators handle the semantic ones.
Strong Answer:
  • The pipeline has three layers of defense: schema design, retry logic, and graceful degradation. Starting with schema design: every field that might not be present in the source email should be Optional with a None default. If you make customer_phone: str required, the pipeline will struggle on emails that do not mention a phone number — the model either hallucinates one or the validation fails. Use Optional liberally and add an extraction_confidence: float field so the model can signal uncertainty.
  • For retry logic, use Instructor with max_retries=3. When Pydantic validation fails (say the model returns “high” for an urgency: int field), Instructor sends the validation error message back to the model: “urgency must be an integer, got string ‘high’.” The model self-corrects on the next attempt. In my experience, 95% of failures are resolved on the first retry. The 5% that fail all three retries are genuinely ambiguous inputs that probably need human review anyway.
  • For the 10,000 emails/day throughput, batch processing with async concurrency is essential. Use asyncio.gather to process 20-50 emails concurrently. At 1 second per extraction, sequential processing takes 10,000 seconds, or ~2.8 hours. With 50-way concurrency, it takes ~200 seconds, a little over 3 minutes. But watch your rate limits: 50 concurrent requests to gpt-4o can hit the TPM (tokens per minute) limit quickly. Implement a semaphore or rate-limiting wrapper.
  • Graceful degradation: not every extraction needs to succeed. Build a dead-letter queue for emails that fail all retries. Route them to human review. Track the failure rate — if it is under 2%, you are in good shape. If it spikes above 5%, something changed (new email format, language shift, adversarial content) and you need to investigate.
  • Cost optimization: use gpt-4o-mini for the initial extraction pass. For the 2-5% that fail validation, retry with gpt-4o. This gives you the cost efficiency of the small model for the 95% easy cases and the capability of the large model for the hard cases. At 10K emails/day, this can save $50-100/day versus running everything through gpt-4o.
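The concurrency, cascade, and dead-letter bullets above can be sketched together. This is an illustrative outline, not a production implementation: run_batch is a hypothetical name, the two extractors are passed in as plain async callables (for example one backed by gpt-4o-mini and one by gpt-4o), and a ValueError stands in for "validation failed after retries".

```python
import asyncio
from typing import Awaitable, Callable

# Any async function that takes the email text and returns extracted fields.
Extractor = Callable[[str], Awaitable[dict]]


async def run_batch(
    texts: list[str],
    cheap: Extractor,
    strong: Extractor,
    max_concurrency: int = 20,
) -> tuple[dict[int, dict], list[int]]:
    """Extract from many texts with bounded concurrency.

    Each text is tried on the cheap extractor first, then the strong one.
    Indexes that fail both end up in the dead-letter list for human review.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    results: dict[int, dict] = {}
    dead_letter: list[int] = []

    async def process(index: int, text: str) -> None:
        async with semaphore:  # at most max_concurrency requests in flight
            for extractor in (cheap, strong):
                try:
                    results[index] = await extractor(text)
                    return
                except ValueError:
                    continue  # escalate to the next extractor
            dead_letter.append(index)

    await asyncio.gather(*(process(i, t) for i, t in enumerate(texts)))
    return results, dead_letter
```

The failure rate mentioned above is then len(dead_letter) / len(texts). The semaphore caps requests in flight but not tokens per minute, so a token-aware limiter may still be needed on top.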
Follow-up: You notice that the model consistently extracts “urgency: 3” (medium) for emails that are clearly urgent based on keywords like “system down” and “revenue impact.” How do you fix this?
This is a calibration problem: the model’s interpretation of your urgency scale does not match your business definition. Three approaches, in order of increasing effort and effectiveness:
  • First, improve the field description in the Pydantic model. Instead of urgency: int = Field(description="Urgency 1-5"), use urgency: int = Field(description="Urgency 1-5 where 5=critical system outage or revenue loss, 4=customer-impacting issue, 3=important but not time-sensitive, 2=minor request, 1=informational only"). Explicit scale definitions with examples dramatically improve calibration.
  • Second, add few-shot examples in the system prompt showing correctly labeled emails at each urgency level, especially for the boundaries between 3 and 4 where the model struggles.
  • Third, if prompt engineering is insufficient, add a post-processing rule layer: if the extracted text contains specific keywords (“outage”, “system down”, “revenue impact”, “SLA breach”), override the model’s urgency to 5 regardless of what it predicted. This hybrid of LLM extraction plus rule-based overrides is common in production systems where certain signals are too important to leave to model judgment.
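The rule layer from the third approach can be as small as one regular expression. A minimal sketch follows, with a hypothetical function name and a keyword list taken from the answer above; a real deployment would tune the patterns against labeled emails.

```python
import re

# Phrases that should always force the highest urgency.
CRITICAL_PATTERNS = [
    r"\boutage\b",
    r"\bsystem (is )?down\b",
    r"\brevenue impact\b",
    r"\bSLA breach\b",
]
_CRITICAL_RE = re.compile("|".join(CRITICAL_PATTERNS), re.IGNORECASE)


def apply_urgency_override(text: str, model_urgency: int) -> int:
    """Return 5 if the text contains a critical phrase, else the model's value."""
    if _CRITICAL_RE.search(text):
        return 5
    return model_urgency
```

Logging every case where the override changes the model's value gives you a ready-made set of examples for the few-shot prompt.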
Strong Answer:
  • When the model’s first response fails Pydantic validation, Instructor constructs a new message in the conversation that includes two things: the model’s original (invalid) response and the specific validation error message from Pydantic. For example, if the model returned {"email": "not-an-email"} and the Pydantic validator requires an “@” symbol, the retry message would be something like: “Your previous response failed validation: Value error, Invalid email format, must contain ‘@’. Please fix the response and try again.”
  • The model sees its own failed attempt plus the error message, and it self-corrects. This works because the model can reason about its mistakes when given explicit error feedback. It is analogous to a human coder seeing a compiler error and fixing the code — the error message tells them exactly what is wrong.
  • Under the hood, Instructor appends the failed assistant message and a new user message with the validation error to the conversation history, then makes another API call. This means each retry costs additional tokens — both the input tokens for the growing conversation history and the output tokens for the new response. Three retries on a complex schema can cost 3-4x the original request.
  • The reason this is effective is that most validation failures are not reasoning failures but formatting failures. The model knew the email was “john@example.com” but returned it in the wrong field, or it knew the urgency was 5 but output “5” as a string instead of an integer. The validation error points to the exact formatting issue, and the model fixes it trivially.
  • Where this breaks down: if the validation failure is a genuine reasoning error (the model truly does not know the correct value), retries will not help. The model will either hallucinate a plausible-looking value that passes validation, or it will flip-flop between different wrong values. For semantic correctness, you need better prompts or a more capable model, not more retries.
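The feedback loop described in the bullets above can be sketched in a few lines. This is not Instructor's actual implementation, only an outline of the idea: call_model and validate are hypothetical callables you supply, and the wording of the correction message is an assumption.

```python
from typing import Any, Callable

Message = dict[str, str]


def extract_with_retries(
    call_model: Callable[[list[Message]], str],
    validate: Callable[[str], Any],
    messages: list[Message],
    max_retries: int = 3,
) -> tuple[Any, int]:
    """Call the model, and on validation failure feed the error back.

    Returns the validated value and the number of retries that were needed.
    """
    history = list(messages)  # copy so the caller's list is not mutated
    for attempt in range(max_retries + 1):
        raw = call_model(history)
        try:
            return validate(raw), attempt
        except ValueError as exc:
            # Show the model its own failed answer plus the exact error.
            history.append({"role": "assistant", "content": raw})
            history.append({
                "role": "user",
                "content": f"Your previous response failed validation: {exc}. "
                           "Please fix it and respond again.",
            })
    raise RuntimeError(f"Validation still failing after {max_retries} retries")
```

Note how history grows by two messages per failed attempt, which is where the extra token cost of each retry comes from.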
Follow-up: You are using Instructor with max_retries=3 in a latency-sensitive application. Each retry adds 500ms. How do you balance reliability with latency?
The key insight is that most requests succeed on the first try. With well-designed schemas and good prompts, the first-attempt success rate is typically 95-98%. So the average latency overhead is negligible: only 2-5% of requests incur retry latency. But the p99 latency is significantly worse (up to 1.5 seconds of retry overhead if all three retries run). If your SLA cares about p99, consider two strategies.
  • First, use structured outputs (constrained decoding) instead of Instructor retries for schema compliance. Constrained decoding guarantees schema compliance on the first attempt with zero retry overhead. Reserve Instructor retries for semantic validation (custom Pydantic validators) where constrained decoding cannot help.
  • Second, implement a timeout with fallback: if the first retry does not succeed within 1 second, return a partial result with a flag indicating low confidence, and queue the full extraction for async processing. The user gets a fast (possibly incomplete) response immediately, and the complete extraction arrives later. This is the pattern used by real-time extraction features in products like email clients: show what you can extract quickly, backfill the rest.
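The timeout-with-fallback strategy might look like the sketch below. The function name and the shape of the returned dict are assumptions made for illustration; the key detail is asyncio.shield, which lets the full extraction keep running after the deadline passes.

```python
import asyncio
from typing import Any, Coroutine


async def extract_with_deadline(
    full_extraction: Coroutine[Any, Any, dict],
    quick_result: dict,
    deadline_s: float = 1.0,
) -> dict:
    """Return the full extraction if it finishes in time, else a fallback.

    On timeout the extraction task is not cancelled; it is handed back
    under "pending" so the caller can await it later and backfill.
    """
    task = asyncio.create_task(full_extraction)
    try:
        # shield() means a timeout cancels only the wait, not the task.
        data = await asyncio.wait_for(asyncio.shield(task), timeout=deadline_s)
    except asyncio.TimeoutError:
        return {"data": quick_result, "complete": False, "pending": task}
    return {"data": data, "complete": True, "pending": None}
```

The caller shows the "data" it gets immediately and, when "complete" is False, awaits or enqueues the "pending" task to fill in the rest.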
Strong Answer:
  • The most common reason structured output works in dev but fails in production is input distribution mismatch. Development test cases are clean, well-formed, and representative of the happy path. Production data is messy: emails with HTML artifacts, documents with tables that got mangled during text extraction, inputs in unexpected languages, text that is too long and gets truncated by token limits, or adversarial inputs from users testing the system.
  • My diagnostic process: First, pull the failing 20% and compare them to the passing 80%. Look for patterns in input length (are the failures systematically longer?), language (are non-English inputs failing?), format (do inputs with bullet points or tables fail more?), and content (are certain topics harder to extract?). This segmentation usually reveals the root cause within an hour.
  • Second most common cause: token limit truncation. In development, your test inputs are short. In production, a customer email might include a forwarded thread with 50 messages. Either the request exceeds the context window and is rejected outright, or whatever trimming your pipeline applies to make it fit cuts away part of the content (sometimes the instructions themselves), or the output runs into max_tokens and the JSON is cut off mid-object. Fix: truncate or summarize long inputs before extraction, budget enough output tokens for the largest expected object, and always put your schema instructions in the system prompt (which the model prioritizes) rather than the user prompt.
  • Third cause: the production inputs contain characters or formatting that confuse the model. Markdown tables, code blocks, HTML tags, Unicode special characters, or extremely long single-line strings without whitespace. The model’s tokenizer handles these differently, and the extraction quality degrades. Pre-processing the input to strip HTML, normalize whitespace, and break extremely long lines resolves most of these issues.
  • Fourth cause: schema rigidity. Your schema requires specific enum values like category: Literal["billing", "technical", "general"], but production users write about topics that do not cleanly fit any category. The model forces a classification and often picks wrong. Add an “other” category, or switch from strict enums to a string field with description guidance and validate downstream.
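The pre-processing step mentioned in the third cause can be sketched with the standard library alone. The function name and the default limits below are assumptions, and a regex-based tag stripper is only a rough approximation; heavily nested or malformed HTML may call for a real parser.

```python
import html
import re


def clean_for_extraction(text: str, max_chars: int = 8000, max_line: int = 500) -> str:
    """Strip HTML, normalize whitespace, break long lines, cap total length."""
    # Remove script/style blocks entirely, then any remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", text)
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode entities such as &amp; and &nbsp;.
    text = html.unescape(text)
    # Collapse runs of spaces, tabs, and non-breaking spaces.
    text = re.sub(r"[ \t\u00a0]+", " ", text)

    wrapped = []
    for line in (raw.strip() for raw in text.splitlines()):
        # Hard-wrap extremely long lines that have no natural breaks.
        while len(line) > max_line:
            wrapped.append(line[:max_line])
            line = line[max_line:]
        wrapped.append(line)

    text = re.sub(r"\n{3,}", "\n\n", "\n".join(wrapped)).strip()
    return text[:max_chars]
```

Running this before every extraction call also makes the failing 20% easier to compare against the passing 80%, since both groups are then in the same normalized form.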
Follow-up: You add an “other” category and the model starts classifying 30% of inputs as “other” because it is the easy escape hatch. How do you prevent this while still handling genuine edge cases?
This is a classic precision-recall trade-off in classification. Two approaches that work well together:
  • First, make “other” costly in the prompt. Add to the field description: “Use ‘other’ ONLY if the input genuinely does not fit any of the defined categories. If you are unsure between two categories, pick the closest match rather than defaulting to ‘other’.” This prompt pressure reduces spurious “other” classifications by 50-70% in my experience.
  • Second, add a classification_reasoning: str field to the schema. When the model must explain why it chose “other,” it forces itself to consider whether one of the real categories actually fits. Often the reasoning reveals the correct category, and the model self-corrects.
Operationally, monitor the “other” rate weekly. If it exceeds 10-15%, review the “other” examples; you likely need a new category that your original schema did not anticipate.
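The weekly monitoring step is simple enough to automate. A minimal sketch follows, assuming you log the predicted category for every input; other_rate is a hypothetical name and the 15% default mirrors the upper end of the range mentioned above.

```python
from collections import Counter


def other_rate(categories: list[str], alert_threshold: float = 0.15) -> tuple[float, bool]:
    """Return the share of "other" predictions and whether it warrants review."""
    if not categories:
        return 0.0, False
    rate = Counter(categories)["other"] / len(categories)
    return rate, rate > alert_threshold
```

When the second value comes back True, pulling a sample of the "other" inputs for manual review is usually the quickest way to spot a missing category.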