Text classification is a fundamental NLP task with broad applications. This chapter covers implementing classification systems using LLMs, from zero-shot to production-grade solutions. The mental model is simple: classification is sorting mail. You have a pile of incoming text and a set of labeled bins. The challenge is not building a sorter — it is building one that handles the letter that could go in three bins, the package with no return address, and the ten-thousand-piece batch that needs to be sorted before lunch. Each pattern below handles a progressively harder version of that problem.

Zero-Shot Classification

Zero-shot means “no examples needed” — you hand the model a piece of text and a list of labels, and it classifies based purely on its understanding of the words. This is remarkably powerful for prototyping: you can go from idea to working classifier in five minutes. The trade-off is that without examples, the model is guessing based on its general knowledge, which works well for obvious categories (“sports” vs. “politics”) but breaks down for domain-specific or ambiguous labels.

Basic Zero-Shot

from openai import OpenAI
import json


class ZeroShotClassifier:
    """Classify text without training examples."""
    
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model
    
    def classify(
        self,
        text: str,
        labels: list[str],
        label_descriptions: dict[str, str] = None
    ) -> dict:
        """Classify text into one of the provided labels."""
        descriptions = ""
        if label_descriptions:
            descriptions = "\n".join([
                f"- {label}: {desc}"
                for label, desc in label_descriptions.items()
            ])
            descriptions = f"\nLabel descriptions:\n{descriptions}"
        
        prompt = f"""Classify the following text into exactly one of these categories: {', '.join(labels)}
{descriptions}

Text: {text}

Return JSON:
{{
    "label": "chosen_label",
    "confidence": 0.0-1.0,
    "reasoning": "brief explanation"
}}"""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)
    
    def classify_multi_label(
        self,
        text: str,
        labels: list[str],
        max_labels: int = None
    ) -> dict:
        """Classify text into multiple applicable labels."""
        limit = f"Select up to {max_labels} labels." if max_labels else "Select all applicable labels."
        
        prompt = f"""Classify the following text into applicable categories.
{limit}

Available labels: {', '.join(labels)}

Text: {text}

Return JSON:
{{
    "labels": ["label1", "label2"],
    "confidences": {{"label1": 0.9, "label2": 0.7}},
    "reasoning": "brief explanation"
}}"""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)


# Usage
classifier = ZeroShotClassifier()

# Single-label classification
text = "The new iPhone 15 features a titanium design and improved camera system."

result = classifier.classify(
    text,
    labels=["technology", "sports", "politics", "entertainment", "business"],
    label_descriptions={
        "technology": "Products, software, hardware, tech companies",
        "business": "Markets, economics, corporate news"
    }
)

print(f"Label: {result['label']} (confidence: {result['confidence']:.0%})")
print(f"Reasoning: {result['reasoning']}")

# Multi-label classification
text = "Apple announced record iPhone sales, beating analyst expectations."

result = classifier.classify_multi_label(
    text,
    labels=["technology", "business", "finance", "product_launch", "earnings"],
    max_labels=3
)

print(f"Labels: {result['labels']}")
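
Nothing in `ZeroShotClassifier` guarantees that the returned `label` is one of the labels you passed in, or that `confidence` is a number between 0 and 1. The sketch below shows one possible post-check. `normalize_label` and `validate_result` are hypothetical helper names (not part of the OpenAI SDK), and the case-insensitive match is an assumption about how strict you want to be.

```python
from typing import Optional


def normalize_label(raw_label: object, allowed: list[str]) -> Optional[str]:
    """Map a model-returned label onto the allowed set, or return None."""
    if not isinstance(raw_label, str):
        return None
    cleaned = raw_label.strip().lower()
    # Case-insensitive lookup that returns the canonical spelling
    by_lower = {label.lower(): label for label in allowed}
    return by_lower.get(cleaned)


def validate_result(result: dict, allowed: list[str]) -> dict:
    """Return a copy of a classifier result with a checked label and a clamped confidence."""
    label = normalize_label(result.get("label"), allowed)
    try:
        confidence = float(result.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    confidence = max(0.0, min(1.0, confidence))
    return {
        **result,
        "label": label,
        # An out-of-set label carries no usable confidence
        "confidence": confidence if label is not None else 0.0,
        "valid": label is not None,
    }
```

A caller could wrap the classifier output, for example `validate_result(classifier.classify(text, labels), labels)`, and treat `valid == False` as a signal to retry or route to review.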

Few-Shot Classification

Few-shot classification is the “show, don’t tell” approach. Instead of describing what “positive” means, you show the model three positive reviews and let it generalize. This is the sweet spot for most production use cases: significantly more accurate than zero-shot, but without the cost and complexity of fine-tuning a model. The key insight is that example selection matters more than example quantity — three well-chosen, diverse examples per class typically outperform ten similar ones.
from openai import OpenAI
import json
from dataclasses import dataclass


@dataclass
class Example:
    """A labeled example for few-shot learning."""
    text: str
    label: str


class FewShotClassifier:
    """Classify text using in-context examples."""
    
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model
        self.examples: dict[str, list[Example]] = {}
    
    def add_examples(self, label: str, examples: list[str]):
        """Add examples for a label."""
        if label not in self.examples:
            self.examples[label] = []
        
        for text in examples:
            self.examples[label].append(Example(text=text, label=label))
    
    def _build_examples_prompt(self, k_per_class: int = 2) -> str:
        """Build examples section of prompt."""
        lines = ["Examples:"]
        
        for label, exs in self.examples.items():
            for ex in exs[:k_per_class]:
                lines.append(f"Text: {ex.text}")
                lines.append(f"Label: {ex.label}")
                lines.append("")
        
        return "\n".join(lines)
    
    def classify(
        self,
        text: str,
        k_per_class: int = 2
    ) -> dict:
        """Classify text using few-shot examples."""
        labels = list(self.examples.keys())
        examples_prompt = self._build_examples_prompt(k_per_class)
        
        prompt = f"""Classify the text into one of these categories: {', '.join(labels)}

{examples_prompt}

Now classify this text:
Text: {text}

Return JSON:
{{
    "label": "category",
    "confidence": 0.0-1.0
}}"""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)
    
    def classify_batch(
        self,
        texts: list[str],
        k_per_class: int = 2
    ) -> list[dict]:
        """Classify multiple texts efficiently."""
        labels = list(self.examples.keys())
        examples_prompt = self._build_examples_prompt(k_per_class)
        
        texts_formatted = "\n".join([
            f"{i+1}. {text}" for i, text in enumerate(texts)
        ])
        
        prompt = f"""Classify each text into one of these categories: {', '.join(labels)}

{examples_prompt}

Texts to classify:
{texts_formatted}

Return JSON:
{{
    "classifications": [
        {{"index": 1, "label": "category", "confidence": 0.9}}
    ]
}}"""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        data = json.loads(response.choices[0].message.content)
        return data.get("classifications", [])


# Usage
classifier = FewShotClassifier()

# Add training examples
classifier.add_examples("positive", [
    "This product is amazing! Best purchase ever.",
    "Exceeded all my expectations. Highly recommend!",
    "Love it! Great quality and fast shipping."
])

classifier.add_examples("negative", [
    "Terrible quality. Broke after one use.",
    "Waste of money. Don't buy this.",
    "Very disappointed. Nothing like the pictures."
])

classifier.add_examples("neutral", [
    "It's okay. Does what it's supposed to do.",
    "Average product. Nothing special.",
    "Decent for the price. Not great, not bad."
])

# Classify new text
result = classifier.classify(
    "Pretty good product but shipping took forever.",
    k_per_class=2
)

print(f"Label: {result['label']} ({result['confidence']:.0%})")

# Batch classification
texts = [
    "Absolutely love this!",
    "Meh, it's fine I guess",
    "Returning immediately"
]

results = classifier.classify_batch(texts)
for r in results:
    print(f"{r['index']}: {r['label']}")
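
Because `_build_examples_prompt` simply takes the first `k_per_class` examples for each label, the order and variety of what you pass to `add_examples` decides what the model sees. The sketch below picks a diverse subset with greedy farthest-point selection. It assumes token-overlap (Jaccard) distance as a cheap stand-in for embedding distance, and `select_diverse` is a hypothetical helper, not something the classifier above provides.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased word tokens of a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def _jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |intersection| / |union|; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def select_diverse(examples: list[str], k: int) -> list[str]:
    """Greedy farthest-point selection.

    Starts from the first example, then repeatedly adds the example whose
    nearest already-chosen neighbour is farthest away.
    """
    if k <= 0 or not examples:
        return []
    token_sets = [_tokens(e) for e in examples]
    chosen = [0]
    while len(chosen) < min(k, len(examples)):
        best_idx, best_dist = -1, -1.0
        for i in range(len(examples)):
            if i in chosen:
                continue
            dist = min(_jaccard_distance(token_sets[i], token_sets[j]) for j in chosen)
            if dist > best_dist:
                best_idx, best_dist = i, dist
        chosen.append(best_idx)
    return [examples[i] for i in chosen]
```

One way to use it is `classifier.add_examples("positive", select_diverse(candidates, k=3))`, so that the first examples in each class are not near-duplicates of each other.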

Hierarchical Classification

Real-world classification rarely has flat categories. An article about the Apple Watch is not just “Technology” — it is Technology, then Hardware, then Wearables. Hierarchical classification walks down a taxonomy tree, getting more specific at each level. This mirrors how humans categorize: “Is this tech or business? OK, tech — is it hardware or software? Hardware — what kind?” The advantage is that you can fall back to a parent category when the model is uncertain about the leaf, rather than forcing a bad specific answer.
from openai import OpenAI
import json
from dataclasses import dataclass


@dataclass
class TaxonomyNode:
    """A node in classification taxonomy.
    
    Tip: Keep your taxonomy no deeper than 3-4 levels. Each level
    multiplies the chance of misclassification, and most business
    use cases don't need more granularity than that.
    """
    name: str
    description: str = ""
    children: list = None
    
    def __post_init__(self):
        if self.children is None:
            self.children = []


class HierarchicalClassifier:
    """Classify text using hierarchical taxonomy."""
    
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model
        # The taxonomy is stored as nested dicts: {name: {"description": ..., "children": {...}}}
        self.taxonomy: dict[str, dict] = {}
    
    def set_taxonomy(self, taxonomy: dict):
        """Set the classification taxonomy."""
        self.taxonomy = taxonomy
    
    def _taxonomy_to_text(self, node: dict, level: int = 0) -> str:
        """Convert taxonomy to text representation."""
        indent = "  " * level
        lines = []
        
        for name, data in node.items():
            desc = data.get("description", "")
            lines.append(f"{indent}- {name}: {desc}")
            
            if "children" in data:
                lines.append(self._taxonomy_to_text(data["children"], level + 1))
        
        return "\n".join(lines)
    
    def classify_hierarchical(self, text: str) -> dict:
        """Classify text through taxonomy hierarchy."""
        taxonomy_text = self._taxonomy_to_text(self.taxonomy)
        
        prompt = f"""Classify this text through the following taxonomy hierarchy.
Select the most specific applicable category at each level.

Taxonomy:
{taxonomy_text}

Text: {text}

Return JSON:
{{
    "path": ["level1", "level2", "level3"],
    "confidence": 0.0-1.0,
    "reasoning": "explanation"
}}"""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)
    
    def classify_with_fallback(self, text: str) -> dict:
        """Classify with fallback to parent category if uncertain."""
        result = self.classify_hierarchical(text)
        
        # If low confidence on deepest level, truncate path
        if result["confidence"] < 0.7 and len(result["path"]) > 1:
            prompt = f"""The classification "{result['path']}" has low confidence.
Should we use a broader category?

Text: {text}
Current path: {result['path']}
Confidence: {result['confidence']}

Return JSON:
{{
    "use_parent": true/false,
    "final_path": ["category", "subcategory"],
    "confidence": 0.0-1.0
}}"""
            
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                response_format={"type": "json_object"}
            )
            
            fallback = json.loads(response.choices[0].message.content)
            if fallback.get("use_parent"):
                result["path"] = fallback["final_path"]
                result["confidence"] = fallback["confidence"]
                result["used_fallback"] = True
        
        return result


# Usage
classifier = HierarchicalClassifier()

classifier.set_taxonomy({
    "Technology": {
        "description": "Tech-related content",
        "children": {
            "Software": {
                "description": "Software and applications",
                "children": {
                    "Mobile Apps": {"description": "Smartphone applications"},
                    "Web Apps": {"description": "Browser-based applications"},
                    "Desktop Software": {"description": "Computer programs"}
                }
            },
            "Hardware": {
                "description": "Physical devices",
                "children": {
                    "Smartphones": {"description": "Mobile phones"},
                    "Computers": {"description": "PCs and laptops"},
                    "Wearables": {"description": "Smartwatches, fitness trackers"}
                }
            }
        }
    },
    "Business": {
        "description": "Business and finance",
        "children": {
            "Startups": {"description": "New companies and ventures"},
            "Enterprise": {"description": "Large corporations"},
            "Markets": {"description": "Stock markets and trading"}
        }
    }
})

text = "The new Apple Watch Series 9 features an improved heart rate sensor."

result = classifier.classify_hierarchical(text)
print(f"Path: {' > '.join(result['path'])}")
print(f"Confidence: {result['confidence']:.0%}")
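
The model can return a path that does not exist in your taxonomy, and `classify_with_fallback` spends a second API call to decide whether to back off. A deterministic alternative is sketched below, assuming the same nested-dict taxonomy format passed to `set_taxonomy`. The helper names `longest_valid_path` and `apply_parent_fallback` are hypothetical, and the 0.7 default simply mirrors the threshold used above.

```python
def longest_valid_path(path: list[str], taxonomy: dict) -> list[str]:
    """Return the longest prefix of `path` that actually exists in the taxonomy."""
    valid = []
    level = taxonomy
    for name in path:
        if not isinstance(level, dict) or name not in level:
            break
        valid.append(name)
        # Leaf nodes have no "children" key, so descend into an empty dict
        level = level[name].get("children", {})
    return valid


def apply_parent_fallback(result: dict, taxonomy: dict, min_confidence: float = 0.7) -> dict:
    """Drop invalid path segments, then drop the leaf if confidence is low."""
    original = result.get("path", [])
    path = longest_valid_path(original, taxonomy)
    used_fallback = len(path) < len(original)
    if result.get("confidence", 0.0) < min_confidence and len(path) > 1:
        path = path[:-1]  # fall back to the parent category
        used_fallback = True
    return {**result, "path": path, "used_fallback": used_fallback}
```

This trades the nuance of a second model opinion for predictability: the same result always produces the same fallback.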

Confidence Calibration

Here is a dirty secret about LLM confidence scores: they are often poorly calibrated. When a model says “90% confident,” it might be right only 70% of the time. Calibration fixes this gap between stated confidence and actual accuracy. This matters enormously in production — if you are auto-routing support tickets based on classification, a miscalibrated confidence score means either too many tickets get sent to the wrong team (threshold too low) or too many get queued for human review (threshold too high). Think of it like a weather forecast. An uncalibrated model is the forecaster who always says “90% chance of rain” regardless of the actual probability. Calibration adjusts those numbers so that when the system says 90%, it really does rain 90% of the time.
from openai import OpenAI
import json
from dataclasses import dataclass


@dataclass
class CalibrationResult:
    """Calibrated classification result."""
    label: str
    raw_confidence: float       # What the LLM claims
    calibrated_confidence: float # Adjusted to match real accuracy
    uncertainty_type: str        # Helps decide: retry, abstain, or escalate


class CalibratedClassifier:
    """Classifier with calibrated confidence scores."""
    
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model
        # Heuristic shrink factor, not a fitted value: tune it against labeled data
        self.calibration_factor = 0.85
    
    def classify_with_uncertainty(
        self,
        text: str,
        labels: list[str]
    ) -> CalibrationResult:
        """Classify with uncertainty quantification."""
        prompt = f"""Classify this text and assess your uncertainty.

Labels: {', '.join(labels)}
Text: {text}

Consider:
1. Epistemic uncertainty: Lack of knowledge about correct answer
2. Aleatoric uncertainty: Inherent ambiguity in the text

Return JSON:
{{
    "label": "chosen_label",
    "confidence": 0.0-1.0,
    "uncertainty_type": "low|epistemic|aleatoric|both",
    "alternative_labels": ["other possible labels"],
    "reasoning": "explanation of choice and uncertainty"
}}"""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        data = json.loads(response.choices[0].message.content)
        
        # Apply calibration
        raw_conf = data.get("confidence", 0.5)
        calibrated = self._calibrate_confidence(
            raw_conf,
            data.get("uncertainty_type", "low"),
            len(data.get("alternative_labels", []))
        )
        
        return CalibrationResult(
            label=data["label"],
            raw_confidence=raw_conf,
            calibrated_confidence=calibrated,
            uncertainty_type=data.get("uncertainty_type", "low")
        )
    
    def _calibrate_confidence(
        self,
        raw: float,
        uncertainty_type: str,
        num_alternatives: int
    ) -> float:
        """Apply calibration to raw confidence.
        
        Two types of uncertainty to understand:
        - Epistemic: The model doesn't know enough (fixable with more data/examples)
        - Aleatoric: The text itself is genuinely ambiguous (not fixable -- the
          review "it's complicated" really IS hard to classify as positive/negative)
        """
        # Reduce confidence based on uncertainty type
        uncertainty_penalty = {
            "low": 0.0,
            "epistemic": 0.15,   # "I don't know enough"
            "aleatoric": 0.10,   # "This is genuinely ambiguous"
            "both": 0.25         # Double trouble
        }.get(uncertainty_type, 0.1)
        
        # Reduce for alternatives
        alt_penalty = min(0.1 * num_alternatives, 0.2)
        
        calibrated = raw * self.calibration_factor - uncertainty_penalty - alt_penalty
        return max(0.0, min(1.0, calibrated))
    
    def classify_with_abstention(
        self,
        text: str,
        labels: list[str],
        min_confidence: float = 0.6
    ) -> dict:
        """Classify or abstain if uncertain."""
        result = self.classify_with_uncertainty(text, labels)
        
        if result.calibrated_confidence < min_confidence:
            return {
                "label": None,
                "abstained": True,
                "reason": f"Confidence {result.calibrated_confidence:.0%} below threshold",
                "uncertainty_type": result.uncertainty_type,
                "suggested_label": result.label
            }
        
        return {
            "label": result.label,
            "abstained": False,
            "confidence": result.calibrated_confidence,
            "uncertainty_type": result.uncertainty_type
        }


# Usage
classifier = CalibratedClassifier()

# Classification with uncertainty
labels = ["positive", "negative", "neutral"]
text = "The product is okay but the customer service was rude."

result = classifier.classify_with_uncertainty(text, labels)
print(f"Label: {result.label}")
print(f"Raw confidence: {result.raw_confidence:.0%}")
print(f"Calibrated: {result.calibrated_confidence:.0%}")
print(f"Uncertainty: {result.uncertainty_type}")

# With abstention
result = classifier.classify_with_abstention(
    "It's complicated...",
    labels,
    min_confidence=0.6
)

if result["abstained"]:
    print(f"Abstained: {result['reason']}")
    print(f"Suggestion: {result['suggested_label']}")
else:
    print(f"Label: {result['label']} ({result['confidence']:.0%})")
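
The `min_confidence=0.6` value above is a placeholder. If you have a labeled evaluation set, you can choose the abstention threshold from data instead. The sketch below assumes you have already collected `(confidence, was_correct)` pairs from your classifier. `pick_threshold` is a hypothetical helper that returns the lowest threshold whose accepted predictions meet a target accuracy, along with the share of traffic that would be auto-handled.

```python
from typing import Optional


def pick_threshold(
    records: list[tuple[float, bool]],
    target_accuracy: float,
) -> Optional[dict]:
    """Find the lowest confidence threshold that meets the target accuracy.

    records: (confidence, was_correct) pairs from a labeled evaluation run.
    Returns None if no threshold reaches the target.
    """
    if not records:
        return None
    # Only observed confidence values can change which predictions are accepted
    for threshold in sorted({conf for conf, _ in records}):
        accepted = [ok for conf, ok in records if conf >= threshold]
        accuracy = sum(accepted) / len(accepted)
        if accuracy >= target_accuracy:
            return {
                "threshold": threshold,
                "accuracy": accuracy,
                "coverage": len(accepted) / len(records),
            }
    return None
```

Scanning from the lowest threshold upward keeps coverage as high as possible for the accuracy you asked for; everything below the chosen threshold goes to the abstention path.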

Intent Classification

from openai import OpenAI
import json
from dataclasses import dataclass


@dataclass
class Intent:
    """A classified intent with entities."""
    name: str
    confidence: float
    entities: dict
    slots: dict


class IntentClassifier:
    """Classify user intents with entity extraction."""
    
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model
        self.intents = {}
    
    def register_intent(
        self,
        name: str,
        description: str,
        examples: list[str],
        slots: list[dict] = None
    ):
        """Register an intent with examples and slots."""
        self.intents[name] = {
            "description": description,
            "examples": examples,
            "slots": slots or []
        }
    
    def classify(self, text: str) -> Intent:
        """Classify text into an intent and extract entities."""
        intents_desc = []
        for name, data in self.intents.items():
            examples = ", ".join(f'"{e}"' for e in data["examples"][:2])
            slots = [s["name"] for s in data.get("slots", [])]
            intents_desc.append(
                f"- {name}: {data['description']}\n"
                f"  Examples: {examples}\n"
                f"  Slots: {slots if slots else 'none'}"
            )
        
        prompt = f"""Classify this user message into an intent and extract relevant entities.

Available intents:
{chr(10).join(intents_desc)}

User message: "{text}"

Return JSON:
{{
    "intent": "intent_name",
    "confidence": 0.0-1.0,
    "entities": {{"entity_type": "extracted_value"}},
    "slots": {{"slot_name": "value"}}
}}"""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        data = json.loads(response.choices[0].message.content)
        
        return Intent(
            name=data["intent"],
            confidence=data.get("confidence", 0),
            entities=data.get("entities", {}),
            slots=data.get("slots", {})
        )


# Usage
classifier = IntentClassifier()

# Register intents
classifier.register_intent(
    name="book_flight",
    description="User wants to book a flight",
    examples=[
        "I want to fly to New York",
        "Book me a flight to London"
    ],
    slots=[
        {"name": "origin", "type": "city"},
        {"name": "destination", "type": "city"},
        {"name": "date", "type": "date"}
    ]
)

classifier.register_intent(
    name="check_status",
    description="User wants to check booking status",
    examples=[
        "Where is my booking?",
        "Status of my flight"
    ],
    slots=[
        {"name": "booking_id", "type": "string"}
    ]
)

classifier.register_intent(
    name="cancel_booking",
    description="User wants to cancel a booking",
    examples=[
        "Cancel my reservation",
        "I need to cancel"
    ],
    slots=[
        {"name": "booking_id", "type": "string"}
    ]
)

# Classify
text = "I need to book a flight from Boston to Miami next Friday"

intent = classifier.classify(text)
print(f"Intent: {intent.name} ({intent.confidence:.0%})")
print(f"Slots: {intent.slots}")
print(f"Entities: {intent.entities}")
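
An intent is only actionable once its slots are filled, and the model may omit slots the user never mentioned. The sketch below checks a classified intent against the registered slot definitions. It assumes the registry has the same shape as `IntentClassifier.intents` and treats every registered slot as required, which is a simplification; `missing_slots` is a hypothetical helper.

```python
def missing_slots(intent_name: str, filled: dict, registry: dict) -> list[str]:
    """List registered slot names that are absent or empty in `filled`.

    registry mirrors IntentClassifier.intents:
        {intent_name: {"slots": [{"name": ..., "type": ...}, ...]}}
    """
    expected = [slot["name"] for slot in registry.get(intent_name, {}).get("slots", [])]
    return [name for name in expected if filled.get(name) in (None, "", [])]
```

With the classifier above, `missing_slots(intent.name, intent.slots, classifier.intents)` returns the slots a dialogue system would still need to ask the user about.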

Production Classification Pipeline

A production classifier has to handle the realities that demos skip. What happens when the API is down? Should the 500th identical support ticket of the day really cost another API call? How do you classify a batch of 10,000 reviews by morning? The pipeline below adds caching (so texts that are identical after lowercasing and trimming whitespace don’t cost you twice), automatic fallback to a second model when the primary call fails, and batch processing for throughput.
from openai import OpenAI
import json
from dataclasses import dataclass, field
from typing import Optional
import time


@dataclass
class ClassificationResult:
    """Complete classification result."""
    text: str
    primary_label: str
    confidence: float
    all_labels: dict
    processing_time_ms: float
    model_used: str
    metadata: dict = field(default_factory=dict)


class ProductionClassifier:
    """Production-ready classification pipeline."""
    
    def __init__(
        self,
        labels: list[str],
        model: str = "gpt-4o-mini",
        fallback_model: str = "gpt-3.5-turbo"
    ):
        self.client = OpenAI()
        self.labels = labels
        self.model = model
        self.fallback_model = fallback_model
        self.cache = {}
    
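    def _get_cache_key(self, text: str) -> str:
        """Generate a cache key: the text normalized for case and surrounding whitespace."""
        # Using the normalized text itself avoids hash collisions between different texts
        return text.lower().strip()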
    
    def _classify_single(
        self,
        text: str,
        model: str,
        temperature: float = 0
    ) -> dict:
        """Perform single classification call."""
        prompt = f"""Classify this text into one of these categories: {', '.join(self.labels)}

Text: {text}

Return JSON with all label probabilities:
{{
    "primary_label": "most_likely_label",
    "probabilities": {{"label1": 0.8, "label2": 0.15, "label3": 0.05}}
}}"""
        
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=temperature
        )
        
        return json.loads(response.choices[0].message.content)
    
    def classify(
        self,
        text: str,
        use_cache: bool = True,
        require_confidence: float = 0.5
    ) -> ClassificationResult:
        """Classify text with caching and fallback."""
        start_time = time.time()
        
        # Check cache
        cache_key = self._get_cache_key(text)
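        if use_cache and cache_key in self.cache:
            cached = self.cache[cache_key]
            # Return a copy so the cached entry (and the first caller's result) is not mutated
            return ClassificationResult(
                **{**vars(cached), "metadata": {**cached.metadata, "from_cache": True}}
            )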
        
        model_used = self.model
        
        try:
            result = self._classify_single(text, self.model)
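        except Exception:
            # Primary call failed. The broad catch keeps the example short;
            # in real use, narrow it to transient API errors. Fall back to the second model.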
            model_used = self.fallback_model
            result = self._classify_single(text, self.fallback_model)
        
        primary = result.get("primary_label", self.labels[0])
        probs = result.get("probabilities", {})
        confidence = probs.get(primary, 0)
        
        # Handle low confidence
        if confidence < require_confidence:
            # Try with higher temperature for diversity
            result2 = self._classify_single(text, model_used, temperature=0.3)
            if result2.get("probabilities", {}).get(result2.get("primary_label"), 0) > confidence:
                result = result2
                primary = result["primary_label"]
                probs = result["probabilities"]
                confidence = probs.get(primary, 0)
        
        processing_time = (time.time() - start_time) * 1000
        
        classification = ClassificationResult(
            text=text,
            primary_label=primary,
            confidence=confidence,
            all_labels=probs,
            processing_time_ms=processing_time,
            model_used=model_used,
            metadata={"from_cache": False}
        )
        
        # Cache result
        if use_cache:
            self.cache[cache_key] = classification
        
        return classification
    
    def classify_batch(
        self,
        texts: list[str],
        batch_size: int = 10
    ) -> list[ClassificationResult]:
        """Classify texts in batches."""
        results = []
        
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            
            # Batch classification prompt
            texts_formatted = "\n".join([
                f"{j+1}. {t}" for j, t in enumerate(batch)
            ])
            
            prompt = f"""Classify each text into one of: {', '.join(self.labels)}

Texts:
{texts_formatted}

Return JSON:
{{
    "classifications": [
        {{"index": 1, "label": "category", "confidence": 0.9}}
    ]
}}"""
            
            start_time = time.time()
            
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                response_format={"type": "json_object"}
            )
            
            processing_time = (time.time() - start_time) * 1000
            
            data = json.loads(response.choices[0].message.content)
            
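            for j, classification in enumerate(data.get("classifications", [])):
                # Prefer the model's 1-based "index"; fall back to list position if it is missing
                idx = int(classification.get("index", j + 1)) - 1
                results.append(ClassificationResult(
                    text=batch[idx] if 0 <= idx < len(batch) else "",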
                    primary_label=classification["label"],
                    confidence=classification.get("confidence", 0),
                    all_labels={},
                    processing_time_ms=processing_time / len(batch),
                    model_used=self.model
                ))
        
        return results


# Usage
classifier = ProductionClassifier(
    labels=["positive", "negative", "neutral"],
    model="gpt-4o-mini"
)

# Single classification
result = classifier.classify("Great product, highly recommend!")
print(f"Label: {result.primary_label}")
print(f"Confidence: {result.confidence:.0%}")
print(f"Processing time: {result.processing_time_ms:.0f}ms")

# Batch classification
texts = [
    "Excellent service!",
    "Terrible experience",
    "It was okay",
    "Love it!",
    "Not worth the money"
]

results = classifier.classify_batch(texts)
for r in results:
    print(f"{r.primary_label}: {r.text[:30]}...")
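
The pipeline above tries the primary model once and then switches models. For transient failures, a few retries with exponential backoff before falling back is a common alternative. The sketch below is generic and not tied to the OpenAI SDK. `call_with_retry` is a hypothetical helper, the delay values are illustrative assumptions, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponential backoff plus jitter.

    The last exception is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** (attempt - 1))
            # Jitter spreads out retries from many concurrent callers
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")
```

Inside `classify`, the primary call could become `call_with_retry(lambda: self._classify_single(text, self.model))`, keeping the existing `except` branch as the final fallback.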

Choosing a Classification Approach

The right approach depends on how much labeled data you have, how fast it needs to run, and how much accuracy you need. This table provides the decision framework.

| Approach | Labeled Data Needed | Latency | Cost per 1K Items | Accuracy (typical) | Best For |
|---|---|---|---|---|---|
| Zero-shot | None | 200-500ms | $0.01-0.05 | 70-85% | Prototyping, new categories that change frequently |
| Few-shot (3-5 per class) | 15-25 examples | 300-600ms | $0.02-0.08 | 80-92% | Production with limited data, evolving taxonomies |
| Fine-tuned LLM | 500+ examples | 100-300ms | $0.001-0.01 | 88-96% | High-volume production, stable categories |
| Traditional ML (logistic regression, SVM) | 1000+ examples | 1-5ms | ~$0 (self-hosted) | 85-95% | Extreme throughput (100K+/min), offline batch |
| Embedding + nearest-neighbor | 50-100 per class | 50-100ms | $0.001-0.005 | 82-90% | No LLM dependency, add categories without retraining |

Decision flowchart:
  1. Do you have fewer than 20 labeled examples total? Start with zero-shot. Get a baseline, then invest in labeling.
  2. Do you have 20-500 examples? Use few-shot. Select diverse examples that cover boundary cases.
  3. Do you have 500+ examples and stable categories? Fine-tune, or train a traditional classifier if latency matters.
  4. Do your categories change monthly? Stay with few-shot or zero-shot — fine-tuned models become stale.
  5. Do you need sub-10ms latency at scale? Train a small dedicated model (DistilBERT, or logistic regression on embeddings).
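
The flowchart above can be written down as a small function, which makes the precedence between the questions explicit. The ordering used here (latency first, then category churn, then data volume) is an assumption, since the list does not say which question wins when several apply. `choose_approach` is a hypothetical name.

```python
def choose_approach(
    n_labeled: int,
    categories_change_monthly: bool = False,
    needs_sub_10ms: bool = False,
) -> str:
    """Suggest a classification approach from the decision questions above."""
    if needs_sub_10ms:
        # Step 5: LLM calls cannot meet this latency budget
        return "small dedicated model"
    if categories_change_monthly:
        # Step 4: fine-tuned models go stale when labels churn
        return "zero-shot" if n_labeled < 20 else "few-shot"
    if n_labeled < 20:
        return "zero-shot"      # Step 1
    if n_labeled < 500:
        return "few-shot"       # Step 2
    return "fine-tuned"         # Step 3
```

Note that the latency branch still presumes enough labeled data to train on; with very few labels you would need to collect more first.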

Edge Cases in Classification

Texts that belong to no category. A user sends “asdfghjkl” or an empty string. Without explicit “unknown/other” handling, the classifier will confidently assign a label anyway. Always include an abstention path: if the highest confidence is below your threshold, route to human review rather than forcing a wrong label.

Texts that belong to multiple categories equally. “The battery life is amazing but the screen cracked after one drop” is both positive and negative. Single-label classification will oscillate between runs. Use multi-label classification when your domain has overlapping categories, and set a per-label threshold rather than picking the argmax.

Adversarial or sarcastic input. “Oh great, another product that breaks. Just what I needed.” Sarcasm inverts the surface sentiment. LLMs handle sarcasm better than traditional models, but they still struggle with subtle cases. If sarcasm is common in your domain (product reviews, social media), include explicitly labeled sarcastic examples in your few-shot demonstrations.

Extremely long texts. A 10,000-word document shoved into a classification prompt either gets truncated (losing important content at the end) or exceeds the context window. For long texts, either classify on a summary, classify each section independently and vote, or extract the most relevant passages first.

Label drift over time. What “urgent” means to your support team in January may differ from July. Production classifiers need periodic re-evaluation against fresh labeled data. Build a feedback loop: sample 50-100 classifications per week, have humans verify, and retrain or adjust prompts when accuracy drops below your threshold.

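For the long-text case, the "classify each section independently and vote" option can be sketched as follows. Splitting on a fixed word count is an assumption (splitting on headings or paragraphs may suit your documents better), and `classify_fn` stands in for any single-label classifier that returns a label string; both helper names are hypothetical.

```python
from collections import Counter
from typing import Callable


def split_into_chunks(text: str, max_words: int = 300) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def classify_long_text(
    text: str,
    classify_fn: Callable[[str], str],
    max_words: int = 300,
) -> dict:
    """Classify each chunk independently and return the majority label."""
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        return {"label": None, "votes": {}, "agreement": 0.0}
    votes = Counter(classify_fn(chunk) for chunk in chunks)
    label, count = votes.most_common(1)[0]
    # Agreement is the share of chunks that voted for the winning label
    return {"label": label, "votes": dict(votes), "agreement": count / len(chunks)}
```

A low `agreement` value is a useful signal in its own right: it suggests the document mixes categories and may be better handled as multi-label.
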
Classification Best Practices
  • Label descriptions are your biggest lever — “positive” is vague; “customer expresses satisfaction with the product or service” is actionable. Clear descriptions can improve accuracy by 10-20%.
  • Few-shot examples should cover edge cases — Don’t pick three obviously positive reviews. Include the borderline case (“decent for the price”) that defines where your categories blur.
  • Calibrate before you trust — Run your classifier on 100+ labeled examples and compare predicted vs. actual confidence. The gap will surprise you.
  • Abstention beats wrong answers — In support routing, sending a ticket to “unknown/human review” is infinitely better than routing it to the wrong team.
  • Cache aggressively — In many production systems, 20-30% of inputs are near-duplicates. Semantic caching pays for itself in days.
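
As a rough illustration of the caching advice, the sketch below reuses a stored label when a new input is close enough to one already classified. It is a simplified stand-in, not a production design: `SemanticCache` and its methods are hypothetical names, the `embed` function is supplied by the caller (any embedding model could be plugged in), and the linear scan would normally be replaced by a vector index.

```python
import math
from typing import Callable, Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Reuse a label when a new text is near-identical to a cached one."""

    def __init__(self, embed: Callable[[str], list[float]], threshold: float = 0.95):
        self.embed = embed          # caller-supplied embedding function (assumption)
        self.threshold = threshold  # similarity needed to count as a cache hit
        self.entries: list[tuple[list[float], str]] = []

    def lookup(self, text: str) -> Optional[str]:
        """Return the cached label of the most similar entry, or None on a miss."""
        vector = self.embed(text)
        best_label, best_similarity = None, 0.0
        for cached_vector, label in self.entries:
            similarity = cosine(vector, cached_vector)
            if similarity > best_similarity:
                best_label, best_similarity = label, similarity
        return best_label if best_similarity >= self.threshold else None

    def store(self, text: str, label: str) -> None:
        """Record a freshly classified text so later near-duplicates can reuse it."""
        self.entries.append((self.embed(text), label))
```

A typical call pattern is `lookup` first, then classify and `store` only on a miss. The threshold is worth tuning on real data, since a value that is too low silently reuses labels across texts that only look alike.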

Practice Exercise

Build a classification system that:
  1. Supports zero-shot, few-shot, and hierarchical classification
  2. Provides calibrated confidence scores
  3. Handles multi-label classification
  4. Implements abstention for uncertain cases
  5. Includes batch processing for efficiency
Focus on:
  • Accurate label assignment
  • Well-calibrated confidence estimates
  • Graceful handling of edge cases
  • Production-ready error handling

Interview Deep-Dive

Strong Answer:
  • This is a classic confidence calibration problem. LLMs are notoriously overconfident — when they say “90% sure,” they are often right only 70% of the time. The gap between stated confidence and actual accuracy is called the calibration error, and it is one of the most common production issues with LLM classifiers.
  • The fix is a calibration pipeline. Step one: collect a labeled evaluation set of 500+ tickets with ground-truth team assignments. Step two: run your classifier on all of them and record (predicted_label, stated_confidence, actual_label) tuples. Step three: compute reliability diagrams — bin predictions by confidence range and compare to actual accuracy in each bin. You will see that the 90%+ bin actually has 70% accuracy.
  • The practical correction is a calibration function. The simplest approach is Platt scaling: fit a logistic regression that maps raw confidence to the probability of being correct, using your eval set. A lower-parameter alternative is temperature scaling, which learns a single scalar, though it needs access to logits or token log-probabilities rather than a self-reported number. After calibration, a “90% confident” prediction should correspond to roughly 90% accuracy on data that resembles your eval set.
  • The production implication is routing policy. With calibrated scores, you can set meaningful thresholds: above 0.85 calibrated confidence routes automatically, between 0.6-0.85 goes to a human review queue, below 0.6 gets flagged as unclassifiable. Before calibration, these thresholds are meaningless because the raw scores are inflated.
Follow-up: Your calibration works well on English tickets, but accuracy drops to 55% on Spanish tickets even though you trained calibration on both languages. What do you investigate?
This is almost certainly a data distribution issue. The embedding space for Spanish text differs from English, so confidence scores do not transfer across languages. First check whether the label descriptions and few-shot examples in the prompt are English-only. If so, the model classifies Spanish text using English-language guidance, degrading performance. The fix is language-specific classification pipelines: detect the language first, then route to a prompt with descriptions and examples in that language. Also check for label imbalance: maybe “billing” tickets are 40% of English volume but only 5% of Spanish, so the classifier has far fewer Spanish billing examples. Per-language calibration curves are essential for multilingual systems.
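
The reliability-diagram step described in the answer above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are made up for this example, the records are the (predicted_label, stated_confidence, actual_label) tuples from the answer, and equal-width bins are one common choice among several.

```python
def reliability_bins(
    records: list[tuple[str, float, str]], n_bins: int = 10
) -> list[dict]:
    """Bin predictions by stated confidence and report accuracy per bin.

    Each record is (predicted_label, stated_confidence, actual_label),
    with confidence in [0, 1]. Empty bins are omitted.
    """
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for predicted, confidence, actual in records:
        index = min(int(confidence * n_bins), n_bins - 1)  # 1.0 goes in the top bin
        bins[index].append((confidence, predicted == actual))

    rows = []
    for index, items in enumerate(bins):
        if not items:
            continue
        rows.append({
            "bin": index,
            "count": len(items),
            "mean_confidence": sum(c for c, _ in items) / len(items),
            "accuracy": sum(1 for _, correct in items if correct) / len(items),
        })
    return rows


def expected_calibration_error(
    records: list[tuple[str, float, str]], n_bins: int = 10
) -> float:
    """Weighted average gap between stated confidence and accuracy across bins."""
    total = len(records)
    if total == 0:
        return 0.0
    return sum(
        row["count"] / total * abs(row["mean_confidence"] - row["accuracy"])
        for row in reliability_bins(records, n_bins)
    )
```

Running the same functions separately on English and Spanish records is one way to get the per-language calibration curves the follow-up calls for.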
Strong Answer:
  • At $0.003 per review, 50,000 reviews cost $150. That is manageable, but the real constraint is throughput: sequential calls would take hours. The production approach is batch classification.
  • First, batch multiple reviews into a single prompt. Instead of one review per API call, send 10-20 reviews per call with numbered outputs. This reduces HTTP overhead, amortizes system prompt tokens, and cuts costs by 30-50%. At 15 reviews per call, that is 3,333 API calls instead of 50,000.
  • Second, use async parallelism. Fire 50-100 concurrent API calls using asyncio. With batching at 15 reviews per call and 50 concurrent requests, you process 750 reviews per round, so the 3,333 calls take about 67 rounds. At 200ms per call that is well under a minute, and even if batched calls take a few seconds each, the entire dataset finishes in under 10 minutes.
  • Third, implement a semantic cache. A 50,000-review dataset almost certainly contains near-duplicates. Embed each review, check similarity against already-classified reviews, and if similarity exceeds 0.95, reuse the classification. If 20-30% of inputs are near-duplicates, this removes a similar share of API calls.
  • Fourth, quality control. Sample 200 reviews across all labels after classification, have a human verify, and compute per-label accuracy. If any label falls below 90%, investigate whether label descriptions need refinement.
Follow-up: Midway through processing, the API starts returning 429 rate-limit errors on 30% of requests. How do you handle this without losing progress?
The immediate response is exponential backoff with jitter, but the strategic response is a persistent progress tracker. Every successfully classified batch gets written to a results file immediately, not held in memory. If the process crashes, you check which review IDs are already classified and skip them. For the rate limiting, reduce concurrency from 50 to 10, increase backoff delays, and respect the Retry-After header. If limits persist, fall back to OpenAI’s Batch API endpoint, which is designed for exactly this use case: it processes requests within 24 hours at a 50% discount and delivers results in a single response file.
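
The batching, backoff, and progress-tracking ideas above fit together roughly as in the sketch below. It is deliberately sequential and API-agnostic, so it shows the control flow rather than the asyncio concurrency. Everything named here is an assumption for illustration: `RateLimited` stands in for whatever 429 error your client raises, `call` is a caller-supplied function that classifies one batch, and results are stored as JSON Lines.

```python
import json
import random
import time
from pathlib import Path
from typing import Callable, Optional


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 raised by the API client."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds from a Retry-After header, if any


def make_batches(items: list[dict], size: int = 15) -> list[list[dict]]:
    """Split reviews into fixed-size batches (the last one may be shorter)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def load_done_ids(path: Path) -> set[str]:
    """Read the IDs already written to the results file, if it exists."""
    if not path.exists():
        return set()
    with path.open() as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def call_with_backoff(call, batch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry on RateLimited with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call(batch)
        except RateLimited as err:
            if attempt == max_retries:
                raise
            # Prefer the server's Retry-After hint; otherwise double the delay each time.
            delay = err.retry_after if err.retry_after is not None else base_delay * 2 ** attempt
            sleep(delay + random.uniform(0, delay * 0.25))


def classify_all(
    reviews: list[dict],
    call: Callable[[list[dict]], list[str]],
    results_path: Path,
    batch_size: int = 15,
    sleep=time.sleep,
) -> int:
    """Classify reviews not yet in the results file; return how many were pending."""
    done = load_done_ids(results_path)
    pending = [r for r in reviews if r["id"] not in done]
    for batch in make_batches(pending, batch_size):
        labels = call_with_backoff(call, batch, sleep=sleep)
        # Append immediately so a crash loses at most the batch in flight.
        with results_path.open("a") as f:
            for review, label in zip(batch, labels):
                f.write(json.dumps({"id": review["id"], "label": label}) + "\n")
    return len(pending)
```

Because finished IDs are read back from disk on every start, rerunning `classify_all` after a crash only processes the reviews that are still missing.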
Strong Answer:
  • The confusion between “billing inquiry” and “refund request” is almost certainly a label description problem, not a model capability problem. These categories genuinely overlap — a refund request often starts as a billing inquiry. The model is not wrong to be confused; the categories are ambiguous.
  • Step one: diagnose by examining 50 confused examples. Look for patterns in the misclassified tickets. Most likely, they mention both billing and refunds, and the distinguishing signal is user intent (informational vs. action-seeking).
  • Step two: rewrite label descriptions to encode the distinguishing signal. Instead of “billing inquiry: questions about billing,” write “billing inquiry: user wants to understand a charge, check payment history, or update payment methods — no request for money back.” For refund: “refund request: user explicitly wants money returned to their account.” Make the distinction explicit, not implicit.
  • Step three: add few-shot examples covering boundary cases. Include a “billing inquiry” example that mentions money (“Why was I charged $50?”) and a “refund request” that mentions billing (“My bill shows a charge I need refunded”). These boundary examples teach the model where the line is.
  • Step four: if categories are genuinely inseparable, merge them. “Billing and refunds” routed to the same team is better than two categories with 70% accuracy.
Follow-up: After fixing descriptions, accuracy improves to 88%. But now a third category, ‘account cancellation,’ pulls in refund-adjacent tickets. How do you prevent this whack-a-mole problem?
The structural fix is hierarchical classification. First classify into top-level buckets (“billing/payments” vs. “account management”), then sub-classify within each. “Refund” and “cancellation” never compete at the same level because they live in different branches. The alternative is multi-label classification: allow a ticket to carry both “refund” and “cancellation” tags if it genuinely is both. This is more honest about the data but requires routing logic that handles multi-labeled tickets, which many teams are not set up for.
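
Steps two and three of the answer above (explicit label descriptions plus boundary-case examples) might be assembled into a prompt along these lines. The descriptions and examples are taken from the answer. The `build_prompt` helper, the constant names, and the prompt layout are assumptions made for this sketch.

```python
LABEL_DESCRIPTIONS = {
    "billing inquiry": (
        "User wants to understand a charge, check payment history, "
        "or update payment methods. No request for money back."
    ),
    "refund request": "User explicitly wants money returned to their account.",
}

# Boundary cases: each mentions the other category's vocabulary on purpose.
BOUNDARY_EXAMPLES = [
    ("Why was I charged $50?", "billing inquiry"),
    ("My bill shows a charge I need refunded", "refund request"),
]


def build_prompt(
    text: str,
    descriptions: dict[str, str],
    examples: list[tuple[str, str]],
) -> str:
    """Build a few-shot classification prompt from descriptions and examples."""
    unknown = {label for _, label in examples if label not in descriptions}
    if unknown:
        raise ValueError(f"Examples use undefined labels: {sorted(unknown)}")

    lines = ["Classify the support ticket into exactly one label.", "", "Labels:"]
    lines += [f"- {label}: {desc}" for label, desc in descriptions.items()]
    lines += ["", "Examples:"]
    for example_text, label in examples:
        lines += [f'Ticket: "{example_text}"', f"Label: {label}", ""]
    lines += [f'Ticket: "{text}"', "Label:"]
    return "\n".join(lines)
```

Keeping descriptions and examples as data rather than hard-coded prompt text makes it cheap to rerun the 50 confused examples after each wording change.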
Strong Answer:
  • Zero-shot is for prototyping and categories that are self-explanatory from their names. If “positive, negative, neutral” is understood without examples, zero-shot suffices. The moment accuracy drops below 85%, add few-shot examples.
  • Few-shot is the production sweet spot. The right 3-5 examples per class improve accuracy by 10-20% over zero-shot with no training infrastructure. The key insight: example selection matters more than quantity. Three diverse, boundary-case examples per class outperform ten obvious ones. Include the sarcastic review, the ticket that could go either way — not just easy wins.
  • Skip LLM classification when: (1) latency must be under 50ms (LLM calls take 200ms minimum), (2) volume exceeds 100K per day and cost becomes prohibitive, or (3) you have 10,000+ labeled examples and can train a DistilBERT or logistic regression matching LLM accuracy at 1/100th the cost. At a previous project, moving from GPT-4o-mini ($900/month) to a fine-tuned DistilBERT ($15/month) maintained identical accuracy on a 12-class problem.
  • The decision framework: start zero-shot, add few-shot when accuracy is insufficient, graduate to fine-tuned when volume, latency, or cost forces your hand. LLM-based classification is always the right starting point because you iterate on categories in minutes, not days.
Follow-up: You decide to fine-tune DistilBERT. How do you generate training data without manually labeling 10,000 examples?
Use the LLM as a teacher. Run your few-shot GPT-4o classifier on 20,000 unlabeled examples, keep only those where confidence exceeds 0.9, and use them as silver labels. This is “LLM distillation.” If confident predictions are correct about 95% of the time (verify this on your own labeled sample), the training data carries roughly 5% label noise, which DistilBERT typically tolerates. To improve further, have a human verify a random 500-example sample, correct errors, and up-weight those verified examples during training. You get 10,000+ training examples at the cost of 500 manual labels.
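
The data-preparation side of this distillation recipe can be sketched as below. It covers only filtering, sampling, and weighting, not the fine-tuning itself. The function names, the dictionary fields, and the 3.0 weight for verified examples are assumptions chosen for illustration. The 0.9 cutoff and 500-example review sample come from the answer above.

```python
import random


def select_silver_labels(predictions: list[dict], min_confidence: float = 0.9) -> list[dict]:
    """Keep only LLM predictions confident enough to serve as silver labels.

    Each prediction is assumed to have "id", "text", "label", and "confidence".
    """
    return [p for p in predictions if p["confidence"] >= min_confidence]


def sample_for_review(silver: list[dict], k: int = 500, seed: int = 0) -> list[dict]:
    """Draw a reproducible random sample for human verification."""
    rng = random.Random(seed)
    return rng.sample(silver, min(k, len(silver)))


def build_training_set(
    silver: list[dict],
    human_labels: dict[str, str],
    verified_weight: float = 3.0,  # assumed up-weighting factor, tune as needed
) -> list[dict]:
    """Merge silver labels with human corrections.

    A human label overrides the LLM label for the same ID, and the
    verified example gets a larger sample weight during training.
    """
    rows = []
    for example in silver:
        verified = example["id"] in human_labels
        rows.append({
            "text": example["text"],
            "label": human_labels[example["id"]] if verified else example["label"],
            "weight": verified_weight if verified else 1.0,
        })
    return rows
```

The error rate observed in the reviewed sample doubles as an estimate of label noise in the rest of the silver set, which is the number to check before trusting the distilled model.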