Text classification is a fundamental NLP task with broad applications. This chapter covers implementing classification systems using LLMs, from zero-shot to production-grade solutions. The mental model is simple: classification is sorting mail. You have a pile of incoming text and a set of labeled bins. The challenge is not building a sorter; it is building one that handles the letter that could go in three bins, the package with no return address, and the ten-thousand-piece batch that needs to be sorted before lunch. Each pattern below handles a progressively harder version of that problem.
Zero-Shot Classification
Zero-shot means “no examples needed”: you hand the model a piece of text and a list of labels, and it classifies based purely on its understanding of the words. This is remarkably powerful for prototyping: you can go from idea to working classifier in five minutes. The trade-off is that without examples, the model is guessing based on its general knowledge, which works well for obvious categories (“sports” vs. “politics”) but breaks down for domain-specific or ambiguous labels.
Basic Zero-Shot
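A minimal sketch of the basic zero-shot flow, assuming a hypothetical `complete` callable that sends a prompt to whichever LLM you use and returns its text reply. The function names and prompt wording are illustrative choices rather than a fixed API, and the stub at the bottom only stands in for a real model call.

```python
from typing import Callable, Sequence


def build_zero_shot_prompt(text: str, labels: Sequence[str]) -> str:
    """Ask for exactly one label from a fixed list, with no examples."""
    label_list = "\n".join(f"- {label}" for label in labels)
    return (
        "Classify the text into exactly one of these labels:\n"
        f"{label_list}\n\n"
        f"Text: {text}\n"
        "Answer with the label only."
    )


def parse_label(raw: str, labels: Sequence[str], fallback: str = "unknown") -> str:
    """Map the model's raw reply onto a known label, or return the fallback."""
    cleaned = raw.strip().strip(".\"'").lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    return fallback


def classify_zero_shot(
    text: str, labels: Sequence[str], complete: Callable[[str], str]
) -> str:
    """`complete` is an assumed prompt -> reply function (your LLM client)."""
    return parse_label(complete(build_zero_shot_prompt(text, labels)), labels)


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call, so the sketch runs offline."""
    return " Sports.\n"


print(classify_zero_shot("The striker scored twice.", ["sports", "politics"], fake_llm))
```

Parsing the reply against the known label list matters more than it looks: a reply that is not one of the labels falls through to `unknown` instead of leaking free text into downstream code.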
Few-Shot Classification
Few-shot classification is the “show, don’t tell” approach. Instead of describing what “positive” means, you show the model three positive reviews and let it generalize. This is the sweet spot for most production use cases: significantly more accurate than zero-shot, but without the cost and complexity of fine-tuning a model. The key insight is that example selection matters more than example quantity: three well-chosen, diverse examples per class typically outperform ten similar ones.
Hierarchical Classification
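The taxonomy walk this section covers can be sketched as follows. This is a rough illustration, assuming a hypothetical `pick` callable that returns a `(label, confidence)` pair for one level of the tree (in practice, an LLM call). The nested-dict taxonomy, the stub scores, and the 0.6 threshold are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Nested dict taxonomy; an empty dict marks a leaf. Illustrative data only.
TAXONOMY: Dict[str, dict] = {
    "Technology": {
        "Hardware": {"Wearables": {}, "Laptops": {}},
        "Software": {},
    },
    "Business": {},
}

Picker = Callable[[str, List[str]], Tuple[str, float]]


def classify_path(
    text: str, taxonomy: Dict[str, dict], pick: Picker, min_confidence: float = 0.6
) -> List[str]:
    """Walk down the tree, stopping at the parent when the next level is uncertain."""
    path: List[str] = []
    node = taxonomy
    while node:  # an empty dict means we reached a leaf
        label, confidence = pick(text, list(node))
        if label not in node or confidence < min_confidence:
            break  # fall back to the deepest confident category
        path.append(label)
        node = node[label]
    return path


def fake_pick(text: str, options: List[str]) -> Tuple[str, float]:
    """Stub for an LLM call: fixed scores so the sketch runs offline."""
    scores = {"Technology": 0.95, "Hardware": 0.9, "Wearables": 0.4}
    best = max(options, key=lambda option: scores.get(option, 0.0))
    return best, scores.get(best, 0.0)


print(classify_path("Apple Watch review", TAXONOMY, fake_pick))
```

With the stub scores, the walk stops at the Hardware level because the leaf choice falls below the threshold, which is the parent fallback behavior in action.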
Real-world classification rarely has flat categories. An article about the Apple Watch is not just “Technology”; it is Technology, then Hardware, then Wearables. Hierarchical classification walks down a taxonomy tree, getting more specific at each level. This mirrors how humans categorize: “Is this tech or business? OK, tech. Is it hardware or software? Hardware. What kind?” The advantage is that you can fall back to a parent category when the model is uncertain about the leaf, rather than forcing a bad specific answer.
Confidence Calibration
Here is a dirty secret about LLM confidence scores: they are often poorly calibrated. When a model says “90% confident,” it might be right only 70% of the time. Calibration closes this gap between stated confidence and actual accuracy. This matters enormously in production: if you are auto-routing support tickets based on classification, a miscalibrated confidence score means either too many tickets get sent to the wrong team (threshold too low) or too many get queued for human review (threshold too high).
Think of it like a weather forecast. An uncalibrated model is the forecaster who always says “90% chance of rain” regardless of the actual probability. Calibration adjusts those numbers so that when the system says 90%, it really does rain 90% of the time.
Intent Classification
Production Classification Pipeline
A production classifier needs to handle the realities that demos skip: what happens when the API is down? What about the 500th identical support ticket today? Should you really pay for an API call each time? How do you handle a batch of 10,000 reviews that need to be classified by morning? The pipeline below adds caching (so identical or very similar texts don’t cost you twice), automatic fallback to a cheaper model when the primary is unavailable, and batch processing for throughput.
Choosing a Classification Approach
The right approach depends on how much labeled data you have, how fast it needs to run, and how much accuracy you need. This table provides the decision framework.

| Approach | Labeled Data Needed | Latency | Cost per 1K Items | Accuracy (typical) | Best For |
|---|---|---|---|---|---|
| Zero-shot | None | 200-500ms | $0.01-0.05 | 70-85% | Prototyping, new categories that change frequently |
| Few-shot (3-5 per class) | 15-25 examples | 300-600ms | $0.02-0.08 | 80-92% | Production with limited data, evolving taxonomies |
| Fine-tuned LLM | 500+ examples | 100-300ms | $0.001-0.01 | 88-96% | High-volume production, stable categories |
| Traditional ML (logistic regression, SVM) | 1000+ examples | 1-5ms | ~$0 (self-hosted) | 85-95% | Extreme throughput (100K+/min), offline batch |
| Embedding + nearest-neighbor | 50-100 per class | 50-100ms | $0.001-0.005 | 82-90% | No LLM dependency, add categories without retraining |
- Do you have fewer than 20 labeled examples total? Start with zero-shot. Get a baseline, then invest in labeling.
- Do you have 20-500 examples? Use few-shot. Select diverse examples that cover boundary cases.
- Do you have 500+ examples and stable categories? Fine-tune, or train a traditional classifier if latency matters.
- Do your categories change monthly? Stay with few-shot or zero-shot — fine-tuned models become stale.
- Do you need sub-10ms latency at scale? Train a small dedicated model (DistilBERT, or logistic regression on embeddings).
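The checklist above can be read as a small decision function. The sketch below is one possible encoding; the parameter names, the returned strings, and the order in which the questions are checked are assumptions, since the list does not say which rule wins when several apply.

```python
def choose_approach(
    labeled_examples: int,
    categories_change_monthly: bool = False,
    needs_sub_10ms: bool = False,
) -> str:
    """Return a starting approach; precedence of the rules is an assumption."""
    if needs_sub_10ms:
        # Latency is treated as a hard constraint, so it is checked first.
        return "small dedicated model"
    if categories_change_monthly:
        # Fine-tuned models go stale when the taxonomy keeps moving.
        return "zero-shot" if labeled_examples < 20 else "few-shot"
    if labeled_examples < 20:
        return "zero-shot"
    if labeled_examples < 500:
        return "few-shot"
    return "fine-tune"


for count in (5, 100, 2000):
    print(count, "->", choose_approach(count))
```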
Edge Cases in Classification
Texts that belong to no category. A user sends “asdfghjkl” or an empty string. Without explicit “unknown/other” handling, the classifier will confidently assign a label anyway. Always include an abstention path: if the highest confidence is below your threshold, route to human review rather than forcing a wrong label.
Texts that belong to multiple categories equally. “The battery life is amazing but the screen cracked after one drop” is both positive and negative. Single-label classification may flip between labels from one run to the next. Use multi-label classification when your domain has overlapping categories, and set a per-label threshold rather than picking the argmax.
Adversarial or sarcastic input. “Oh great, another product that breaks. Just what I needed.” Sarcasm inverts the surface sentiment. LLMs handle sarcasm better than traditional models, but they still struggle with subtle cases. If sarcasm is common in your domain (product reviews, social media), include explicitly labeled sarcastic examples in your few-shot demonstrations.
Extremely long texts. A 10,000-word document shoved into a classification prompt either gets truncated (losing important content at the end) or blows the context window. For long texts, classify a summary, classify each section independently and vote, or extract the most relevant passages first.
Label drift over time. What “urgent” means to your support team in January may differ from July. Production classifiers need periodic re-evaluation against fresh labeled data. Build a feedback loop: sample 50-100 classifications per week, have humans verify, and retrain or adjust prompts when accuracy drops below your threshold.
- Label descriptions are your biggest lever: “positive” is vague; “customer expresses satisfaction with the product or service” is actionable. Clear descriptions can improve accuracy by 10-20%.
- Few-shot examples should cover edge cases: Don’t pick three obviously positive reviews. Include the borderline case (“decent for the price”) that defines where your categories blur.
- Calibrate before you trust: Run your classifier on 100+ labeled examples and compare predicted vs. actual confidence. The gap will surprise you.
- Abstention beats wrong answers: In support routing, sending a ticket to “unknown/human review” is far better than routing it to the wrong team.
- Cache aggressively: In many production systems, 20-30% of inputs are near-duplicates. Semantic caching pays for itself in days.
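Two of the edge cases above, abstention and per-label thresholds, come down to a few lines of post-processing on the classifier's scores. A minimal sketch, assuming the classifier already returns a `{label: confidence}` dictionary; the function names, the `needs_human_review` label, and the threshold values are placeholders.

```python
from typing import Dict, List


def pick_or_abstain(
    scores: Dict[str, float],
    threshold: float = 0.6,
    abstain_label: str = "needs_human_review",
) -> str:
    """Single-label: return the top label only if it clears the threshold."""
    if not scores:
        return abstain_label
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else abstain_label


def multi_label(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,
) -> List[str]:
    """Multi-label: keep every label that clears its own threshold (no argmax)."""
    return [
        label
        for label, score in scores.items()
        if score >= thresholds.get(label, default_threshold)
    ]


# Illustrative scores, not real model output.
print(pick_or_abstain({"billing": 0.55, "technical": 0.30}))
print(multi_label({"positive": 0.8, "negative": 0.7, "neutral": 0.1}, {"negative": 0.6}))
```

The thresholds only mean something if the scores are calibrated, so in practice you would tune them on a labeled sample rather than pick them by feel.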
Practice Exercise
Build a classification system that:
- Supports zero-shot, few-shot, and hierarchical classification
- Provides calibrated confidence scores
- Handles multi-label classification
- Implements abstention for uncertain cases
- Includes batch processing for efficiency
- Accurate label assignment
- Well-calibrated confidence estimates
- Graceful handling of edge cases
- Production-ready error handling
Interview Deep-Dive
You are building a support ticket classifier that routes tickets to the correct team. After deploying, you notice the LLM reports 90% confidence on most tickets, but actual routing accuracy is only 72%. What is going on, and how do you fix it?
- This is a classic confidence calibration problem. LLMs are notoriously overconfident — when they say “90% sure,” they are often right only 70% of the time. The gap between stated confidence and actual accuracy is called the calibration error, and it is one of the most common production issues with LLM classifiers.
- The fix is a calibration pipeline. Step one: collect a labeled evaluation set of 500+ tickets with ground-truth team assignments. Step two: run your classifier on all of them and record (predicted_label, stated_confidence, actual_label) tuples. Step three: compute reliability diagrams — bin predictions by confidence range and compare to actual accuracy in each bin. You will see that the 90%+ bin actually has 70% accuracy.
- The practical correction is a calibration function. The simplest approach is Platt scaling: fit a logistic regression on your eval set that maps raw confidence to whether the prediction was correct. Temperature scaling is an alternative that learns a single scalar to rescale the model's logits, so it requires access to logits or token log-probabilities rather than a self-reported confidence number. After calibration, a “90% confident” prediction should correspond to roughly 90% accuracy on data that resembles your eval set.
- The production implication is routing policy. With calibrated scores, you can set meaningful thresholds: above 0.85 calibrated confidence routes automatically, between 0.6-0.85 goes to a human review queue, below 0.6 gets flagged as unclassifiable. Before calibration, these thresholds are meaningless because the raw scores are inflated.
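Steps two and three above can be sketched in plain Python. The example assumes you already have `(stated_confidence, was_correct)` pairs from your evaluation set; the function names are illustrative. For the correction step it uses histogram binning (replace each score with the observed accuracy of its bin), which is a simpler stand-in for Platt scaling and not the same method.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[float, bool]  # (stated_confidence, was_correct)


def bin_index(confidence: float, n_bins: int) -> int:
    """Map a confidence in [0, 1] to a bin; 1.0 goes into the last bin."""
    return min(int(confidence * n_bins), n_bins - 1)


def reliability_table(
    records: List[Record], n_bins: int = 10
) -> Dict[int, Tuple[float, float, int]]:
    """Return {bin: (mean_confidence, accuracy, count)} for non-empty bins."""
    grouped: Dict[int, List[Record]] = {}
    for confidence, correct in records:
        grouped.setdefault(bin_index(confidence, n_bins), []).append((confidence, correct))
    table = {}
    for idx, items in sorted(grouped.items()):
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        table[idx] = (mean_conf, accuracy, len(items))
    return table


def expected_calibration_error(records: List[Record], n_bins: int = 10) -> float:
    """Count-weighted average gap between stated confidence and accuracy."""
    total = len(records)
    return sum(
        count / total * abs(mean_conf - accuracy)
        for mean_conf, accuracy, count in reliability_table(records, n_bins).values()
    )


def fit_histogram_calibrator(
    records: List[Record], n_bins: int = 10
) -> Callable[[float], float]:
    """Histogram binning: map a raw score to its bin's observed accuracy."""
    table = reliability_table(records, n_bins)

    def calibrate(confidence: float) -> float:
        row = table.get(bin_index(confidence, n_bins))
        return row[1] if row else confidence  # leave unseen bins unchanged

    return calibrate


# Toy eval set: the model says 0.9 but is right 7 times out of 10.
eval_records = [(0.9, True)] * 7 + [(0.9, False)] * 3
calibrate = fit_histogram_calibrator(eval_records)
print(reliability_table(eval_records))
print(calibrate(0.92))
```

Routing thresholds such as the 0.85 and 0.6 cutoffs above would then be applied to `calibrate(raw_confidence)` rather than to the raw number.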
You need to classify 50,000 customer reviews overnight. Each review costs $0.003 with GPT-4o-mini. Walk me through your approach to minimize cost and maximize throughput while maintaining quality.
- At $0.003 per review, 50,000 reviews cost $150. That is manageable, but the real constraint is throughput: sequential calls would take hours. The production approach is batch classification.
- First, batch multiple reviews into a single prompt. Instead of one review per API call, send 10-20 reviews per call with numbered outputs. This reduces HTTP overhead, amortizes system prompt tokens, and cuts costs by 30-50%. At 15 reviews per call, that is about 3,334 API calls instead of 50,000.
- Second, use async parallelism. Fire 50-100 concurrent API calls using asyncio. With batching at 15 reviews per call and 50 concurrent requests, you process 750 reviews per round, so the dataset takes about 67 rounds. At 200ms per call that is well under a minute of request time, and even at several seconds per call the entire dataset finishes in under 10 minutes.
- Third, implement a semantic cache. A 50,000-review dataset will almost certainly contain near-duplicates. Embed each review, check similarity against already-classified reviews, and if similarity exceeds 0.95, reuse the classification. This can eliminate 20-30% of calls.
- Fourth, quality control. Sample 200 reviews across all labels after classification, have a human verify, and compute per-label accuracy. If any label falls below 90%, investigate whether label descriptions need refinement.
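The batching and concurrency steps above might look roughly like this with `asyncio`. It is a sketch under stated assumptions: `classify_batch` is a hypothetical async function that takes a list of reviews and returns one label per review (your real LLM call), and the exact-match dedupe on normalized text is a much simpler stand-in for the semantic cache, which would need an embedding model.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Sequence

BatchClassifier = Callable[[List[str]], Awaitable[List[str]]]


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size`."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def normalize(text: str) -> str:
    """Cheap duplicate key: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


async def classify_all(
    reviews: Sequence[str],
    classify_batch: BatchClassifier,
    batch_size: int = 15,
    max_concurrency: int = 50,
) -> List[str]:
    # Keep the first original text for each normalized key (exact-match dedupe).
    first_seen: Dict[str, str] = {}
    for review in reviews:
        first_seen.setdefault(normalize(review), review)
    batches = chunk(list(first_seen.values()), batch_size)
    semaphore = asyncio.Semaphore(max_concurrency)  # cap in-flight requests

    async def run(batch: List[str]) -> List[str]:
        async with semaphore:
            labels = await classify_batch(batch)
        if len(labels) != len(batch):
            raise ValueError("classifier returned a different number of labels")
        return labels

    # gather preserves the order of the batches it was given.
    results = await asyncio.gather(*(run(batch) for batch in batches))
    label_for: Dict[str, str] = {}
    for batch, labels in zip(batches, results):
        for text, label in zip(batch, labels):
            label_for[normalize(text)] = label
    return [label_for[normalize(review)] for review in reviews]


async def fake_classify_batch(batch: List[str]) -> List[str]:
    """Stub standing in for one batched LLM call."""
    await asyncio.sleep(0)
    return ["positive" if "great" in text.lower() else "negative" for text in batch]


sample = ["Great phone", "great  phone", "Broke in a week"]
print(asyncio.run(classify_all(sample, fake_classify_batch, batch_size=2)))
```

A real version would also need retries and rate-limit handling around `classify_batch`, which are left out here to keep the shape of the batching visible.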
Your classification system uses zero-shot for 8 categories. It works well for 6 but consistently confuses 'billing inquiry' and 'refund request.' How do you diagnose and fix this without retraining a model?
- The confusion between “billing inquiry” and “refund request” is almost certainly a label description problem, not a model capability problem. These categories genuinely overlap — a refund request often starts as a billing inquiry. The model is not wrong to be confused; the categories are ambiguous.
- Step one: diagnose by examining 50 confused examples. Look for patterns in the misclassified tickets. Most likely, they mention both billing and refunds, and the distinguishing signal is user intent (informational vs. action-seeking).
- Step two: rewrite label descriptions to encode the distinguishing signal. Instead of “billing inquiry: questions about billing,” write “billing inquiry: user wants to understand a charge, check payment history, or update payment methods — no request for money back.” For refund: “refund request: user explicitly wants money returned to their account.” Make the distinction explicit, not implicit.
- Step three: add few-shot examples covering boundary cases. Include a “billing inquiry” example that mentions money (“Why was I charged $50?”) and a “refund request” that mentions billing (“My bill shows a charge I need refunded”). These boundary examples teach the model where the line is.
- Step four: if categories are genuinely inseparable, merge them. “Billing and refunds” routed to the same team is better than two categories with 70% accuracy.
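Steps two and three above amount to a prompt that carries both the sharpened label descriptions and the boundary examples. A minimal sketch of such a prompt builder follows; the descriptions and example tickets are taken from the steps above, while the function name and the prompt layout are assumptions.

```python
from typing import Dict, List, Tuple

LABEL_DESCRIPTIONS: Dict[str, str] = {
    "billing inquiry": (
        "user wants to understand a charge, check payment history, "
        "or update payment methods; no request for money back"
    ),
    "refund request": "user explicitly wants money returned to their account",
}

# Boundary cases: each one mentions the *other* category's vocabulary.
BOUNDARY_EXAMPLES: List[Tuple[str, str]] = [
    ("Why was I charged $50?", "billing inquiry"),
    ("My bill shows a charge I need refunded", "refund request"),
]


def build_few_shot_prompt(
    text: str, descriptions: Dict[str, str], examples: List[Tuple[str, str]]
) -> str:
    """Assemble label descriptions, labeled examples, then the new ticket."""
    unknown = [label for _, label in examples if label not in descriptions]
    if unknown:
        raise ValueError(f"examples use undefined labels: {unknown}")
    lines = ["Classify the ticket into exactly one label.", "", "Labels:"]
    lines += [f"- {label}: {desc}" for label, desc in descriptions.items()]
    lines += ["", "Examples:"]
    for example_text, label in examples:
        lines += [f"Ticket: {example_text}", f"Label: {label}", ""]
    lines += [f"Ticket: {text}", "Label:"]
    return "\n".join(lines)


print(build_few_shot_prompt(
    "I was charged twice, please send the money back",
    LABEL_DESCRIPTIONS,
    BOUNDARY_EXAMPLES,
))
```

Ending the prompt on `Label:` nudges the model to answer with the label alone, which keeps parsing simple.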
When would you choose few-shot classification over zero-shot, and when would you skip LLM classification entirely in favor of a fine-tuned traditional model?
- Zero-shot is for prototyping and categories that are self-explanatory from their names. If “positive, negative, neutral” is understood without examples, zero-shot suffices. The moment accuracy drops below 85%, add few-shot examples.
- Few-shot is the production sweet spot. The right 3-5 examples per class improve accuracy by 10-20% over zero-shot with no training infrastructure. The key insight: example selection matters more than quantity. Three diverse, boundary-case examples per class outperform ten obvious ones. Include the sarcastic review, the ticket that could go either way — not just easy wins.
- Skip LLM classification when: (1) latency must be under 50ms (LLM calls take 200ms minimum), (2) volume exceeds 100K per day and cost becomes prohibitive, or (3) you have 10,000+ labeled examples and can train a DistilBERT or logistic regression model that matches LLM accuracy at 1/100th the cost. On a previous project, moving from GPT-4o-mini to such a model ($15/month) maintained identical accuracy on a 12-class problem.
- The decision framework: start zero-shot, add few-shot when accuracy is insufficient, graduate to fine-tuned when volume, latency, or cost forces your hand. LLM-based classification is usually the right starting point because you can iterate on categories in minutes, not days.