LLMs excel at generating synthetic data for training, testing, and augmentation. The core insight: you can use a powerful model (GPT-4o) to generate training data for a smaller, cheaper model — a technique sometimes called “model distillation.” Instead of spending months collecting and labeling real data, you can bootstrap a dataset in hours. But there is a catch: synthetic data inherits the biases and limitations of the generating model, so quality filtering is not optional — it is the most important step in the pipeline. This chapter covers patterns for creating high-quality synthetic datasets.Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Synthetic Data Techniques at a Glance
| Technique | Input Required | Output Quality | Cost | Best For |
|---|---|---|---|---|
| Direct generation | Task description + 3-5 seeds | Medium (needs filtering) | Medium | Bootstrapping from scratch |
| Paraphrase augmentation | Existing examples | High (preserves meaning) | Low | Expanding a small dataset 3-5x |
| Back-translation | Existing examples | Medium-High | Low | Diversifying sentence structure |
| Context variation | Examples + context list | Medium | Medium | Training domain-robust models |
| Edge case generation | Task description + seeds | High (adversarial value) | Medium | Hardening against unusual inputs |
| Distillation (GPT-4o to small model) | Task + large model access | High | High (upfront) | Cheaper inference with good quality |
Training Data Generation
Basic Data Generation
Instruction-Following Data
Data Augmentation
Data augmentation creates variations of your existing data to increase dataset size and diversity without collecting new real data. The analogy: in computer vision, you flip and rotate images to train more robust models. For text, you paraphrase, change context, and generate edge cases. The goal is to make your model robust to the many ways users phrase the same intent.Quality Filtering
This is the most important step in synthetic data generation and the one most teams skip. Without filtering, roughly 10-30% of generated examples will be low quality: wrong labels, ambiguous inputs, near-duplicates, or examples that don’t match the task. Using an LLM to judge another LLM’s output is effective because judging is easier than generating — the same model that might generate a borderline example can reliably identify it as borderline when asked to evaluate.Test Data Generation
Evaluation Dataset Creation
Synthetic Data Guidelines
- Always validate generated data against task requirements — never use raw generated data without filtering. Plan to reject 15-30%.
- Use seed examples (3-5 real examples) to guide generation style. Without seeds, output is generic and homogeneous.
- Include difficulty stratification — easy/medium/hard distribution prevents your model from only learning the obvious cases.
- Filter and deduplicate before use. LLMs love generating near-duplicate examples with minor word changes.
- Test on held-out real data to verify effectiveness. If your model trained on synthetic data performs worse on real data, the synthetic data distribution doesn’t match reality.
- Pitfall: Generated data inherits model biases. If GPT-4o consistently gives positive sentiment to certain topics, your training data will be skewed. Audit for systematic bias.
- Pitfall: Be careful with licensing. Some model terms of service restrict using outputs for training competing models. Check before building a distillation pipeline.
Practice Exercise
Build a synthetic data pipeline that:- Generates task-specific training examples
- Creates augmented variations of existing data
- Filters for quality and removes duplicates
- Produces balanced evaluation benchmarks
- Includes adversarial test cases
- Diversity in generated examples
- Accuracy of labels and outputs
- Quality filtering at each stage
- Realistic edge case coverage
Interview Deep-Dive
Your team wants to use GPT-4o to generate training data for a smaller model. Walk me through the end-to-end pipeline and the specific failure modes at each stage.
Your team wants to use GPT-4o to generate training data for a smaller model. Walk me through the end-to-end pipeline and the specific failure modes at each stage.
Strong Answer:
- This is model distillation via synthetic data, and the pipeline has five stages, each with distinct failure modes. Stage one: define the task specification and create 10-20 seed examples by hand. These seed examples are the most important input to the entire pipeline — they define the style, complexity distribution, and quality bar that the generator will mimic. Failure mode: if your seeds are all easy or all hard, the generated data will be skewed. If your seeds are all positive sentiment, the model will under-generate negative examples. You need deliberate diversity in your seeds.
- Stage two: generate at scale. Prompt GPT-4o with the task description and seed examples, asking for batches of 20-50 examples per call. Use temperature 0.8-1.0 for diversity. Failure mode: mode collapse — despite high temperature, LLMs tend to generate variations on a theme rather than truly diverse examples. After 500 examples, you start seeing the same patterns with different nouns. Mitigation: generate in batches with different seed subsets, different prompt phrasings, and explicit diversity instructions (“generate examples that are different from these existing examples: [list]”).
- Stage three: quality filtering. Use GPT-4o (the same model or a comparable one) as a judge to score each example on relevance, accuracy, clarity, and diversity. Reject the bottom 15-30%. Failure mode: the judge model has the same biases as the generator. It might rate a subtly wrong example as correct because both models share the same misconception. Mitigation: include a small set of known-bad examples in your evaluation to calibrate the judge’s rejection rate. If the judge does not reject the known-bad examples, your judging prompt needs work.
- Stage four: deduplication. Remove near-duplicates using embedding similarity (threshold ~0.95) or n-gram overlap. Failure mode: duplicates that differ only in proper nouns or numbers pass dedup but add no training signal. Consider semantic deduplication at a lower threshold (~0.85) for the most aggressive filtering.
- Stage five: train the smaller model and evaluate on held-out real data. Failure mode: the model performs well on synthetic test data but poorly on real data. This distribution gap is the single biggest risk in the entire pipeline. The synthetic data may be systematically different from real data in ways that are hard to detect — different sentence lengths, different vocabulary, different error patterns. Always evaluate on real data, never on a synthetic held-out set.
When is synthetic data a bad idea? What are the specific scenarios where you should invest in real data collection instead?
When is synthetic data a bad idea? What are the specific scenarios where you should invest in real data collection instead?
Strong Answer:
- Synthetic data is a bad idea when the cost of a wrong label exceeds the cost of manual labeling. Three specific scenarios. First: safety-critical applications. If you are training a model to detect medical emergencies in patient messages, a synthetic dataset where GPT-4o mislabels 3% of examples could teach your smaller model to ignore real emergencies. The cost of a false negative is catastrophic. For safety-critical tasks, every training example should be human-verified, which defeats the purpose of synthetic generation. Collect real data and pay domain experts to label it.
- Second: tasks where the real-world distribution is highly specific and unpredictable. A customer support classifier for a niche SaaS product will encounter very specific jargon, product names, feature requests, and bug descriptions that GPT-4o has never seen. Synthetic data will generate plausible-sounding but off-distribution examples. The model trained on synthetic data will confidently classify real tickets into wrong categories because the real tickets look different from anything in training. Real data is irreplaceable here.
- Third: tasks where cultural, regulatory, or domain nuance matters. Legal document classification, financial compliance categorization, healthcare triage — these domains have specific terminology, edge cases, and classification rules that LLMs approximate but do not precisely replicate. A synthetic dataset might train a model that is “close enough” in general but fails on the 5% of cases that matter most (regulatory edge cases, ambiguous compliance scenarios). Domain experts catching nuance during labeling is the actual value, not just the labels themselves.
- Synthetic data shines when: you need to bootstrap before real data exists (cold start), you need to augment a small real dataset (50 examples to 500), you need diverse test data for evaluation, or the task is well-defined and the cost of errors is low (content categorization, casual chatbot responses, internal tool queries).
You are using an LLM to judge the quality of data generated by another LLM. Isn't this circular? How do you ensure the judge is actually catching bad examples?
You are using an LLM to judge the quality of data generated by another LLM. Isn't this circular? How do you ensure the judge is actually catching bad examples?
Strong Answer:
- The circularity concern is valid but overstated. The key insight is that judging is fundamentally easier than generating. An LLM that generates a borderline or incorrect example can still reliably identify that same example as borderline when explicitly asked to evaluate it. This is analogous to how a student who writes a mediocre essay can still distinguish between a good and bad essay when shown both — evaluation requires less creative effort than generation.
- However, there are real failure modes. Shared blind spots: if GPT-4o consistently gets a subtle fact wrong during generation, it will also consistently rate that wrong fact as correct during evaluation. This is the genuine circularity risk. Mitigation: use a different model family as the judge (e.g., generate with GPT-4o, judge with Claude), or use a specialized smaller model that is fine-tuned specifically for quality evaluation on your domain.
- Calibration testing: before trusting your LLM judge at scale, calibrate it on a set of examples with known quality labels. Create 50 examples where you know 25 are good and 25 are bad (including subtle errors, not just obvious ones). Run the judge on this calibration set. If it correctly identifies fewer than 80% of the known-bad examples, your judging prompt needs revision. Track the confusion matrix: false positives (bad examples rated as good) are more costly than false negatives (good examples rated as bad) because false positives pollute your training data.
- Multi-dimensional scoring beats binary accept/reject. Instead of asking “is this example good?”, score relevance, accuracy, clarity, and diversity independently on a 1-5 scale. Examples that score 5 on relevance but 2 on accuracy reveal a specific fixable issue (correct topic, wrong label). Binary scoring would just reject it without explaining why, losing the signal.
- The most robust approach is not pure LLM judging but human-calibrated LLM judging: have humans label 100 examples, train the LLM judge prompt to match human judgments on those 100, then deploy the calibrated judge at scale. Re-calibrate monthly as data distributions shift.