Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
December 2025 Update: Covers OpenAI fine-tuning, LoRA/QLoRA with Hugging Face, when to fine-tune vs. prompt, and cost analysis.
When to Fine-Tune
Think of fine-tuning like hiring a specialist versus giving better instructions to a generalist. A general-purpose model (GPT-4o) is like a brilliant consultant who can handle anything but needs detailed briefing every time. A fine-tuned model is like an employee who has internalized your company’s style, jargon, and processes — they need less instruction per task but it took investment to train them. The question is always: is the upfront training cost worth the per-request savings? Fine-tuning is not your first option. Consider this decision tree:Fine-Tune When
| Use Case | Why Fine-Tuning Helps |
|---|---|
| Specific format | Consistent JSON structure, code style |
| Domain language | Medical, legal, technical jargon |
| Brand voice | Consistent tone across all outputs |
| Latency | Smaller fine-tuned model beats large model |
| Cost | Reduce tokens by baking knowledge in |
| Classification | High-accuracy categorization |
Don’t Fine-Tune When
| Use Case | Better Alternative |
|---|---|
| Add new knowledge | Use RAG instead |
| One-off tasks | Better prompts |
| Rapidly changing info | RAG with fresh data |
| General improvement | Use a bigger model |
OpenAI Fine-Tuning
OpenAI’s hosted fine-tuning is the easiest on-ramp: you upload a JSONL file, click start, and get a custom model endpoint back. No GPUs to manage, no training infrastructure to build. The tradeoff is cost and control — you pay per training token, and you cannot inspect the model weights or run it locally.Preparing Training Data
Validating Data
Always validate before uploading. A single malformed line in your JSONL file will cause the entire fine-tuning job to fail, sometimes after you have already waited an hour in the queue. This validator catches the most common mistakes: missing roles, invalid JSON, and examples that exceed the context window.Running Fine-Tuning
The process has four steps: upload your data, create a job, monitor progress, and use the resulting model. The whole pipeline typically takes 30 minutes to a few hours depending on dataset size.Local Fine-Tuning with LoRA
Why LoRA?
Imagine you need to customize a 7-billion-parameter model. Full fine-tuning means updating all 7 billion weights, which requires 80GB+ of VRAM — that is a $10,000+ GPU. LoRA (Low-Rank Adaptation) is a clever shortcut: instead of updating all weights, it freezes the original model and injects small trainable “adapter” matrices into each layer. You end up training less than 1% of the parameters while getting 90%+ of the quality improvement. It is like adding a specialized lens to a camera instead of rebuilding the entire camera. LoRA lets you fine-tune large models on consumer hardware:| Method | VRAM Needed | Training Time | Quality |
|---|---|---|---|
| Full fine-tune | 80GB+ | Days | Best |
| LoRA | 16-24GB | Hours | Great |
| QLoRA | 8-12GB | Hours | Good |
Setup
QLoRA Fine-Tuning
Inference with LoRA Model
At inference time, you load the base model plus the tiny LoRA adapter (usually 10-50MB). For production, you can merge the adapter into the base weights withmerge_and_unload() — this eliminates the adapter overhead and gives you a single model that runs at native speed.
Fine-Tuning Best Practices
Data Quality Over Quantity
The single most impactful thing you can do for fine-tuning quality is curate better data. 50 perfect, diverse examples consistently outperform 500 mediocre ones with inconsistent formatting. Think of it like training a new employee: five well-structured onboarding sessions beat a month of watching someone else wing it.Evaluation After Fine-Tuning
Never ship a fine-tuned model without comparing it head-to-head against the baseline. Fine-tuning can improve some dimensions while regressing others (e.g., better format compliance but worse factual accuracy). Always measure both.Cost Comparison
Fine-tuning has an upfront cost (training tokens) that you amortize over inference. The key question: does the reduced prompt size (because you baked knowledge into the model) save enough per-request to offset the training investment? This function does the math. Practical tip: if your break-even is longer than 90 days, the model will likely need retraining before you recoup the investment.Key Takeaways
Exhaust Alternatives First
Try prompts, few-shot, and RAG before fine-tuning. It’s often unnecessary.
Quality Over Quantity
50 perfect examples beat 500 mediocre ones. Curate carefully.
LoRA for Local
Use QLoRA to fine-tune 7B+ models on consumer GPUs.
Always Evaluate
Compare fine-tuned model to baseline. Measure the improvement.
What’s Next
Evaluation & Testing
Learn how to properly evaluate your fine-tuned models
Interview Deep-Dive
Your team wants to fine-tune a model for customer support. Walk me through how you decide whether fine-tuning is actually the right approach.
Your team wants to fine-tune a model for customer support. Walk me through how you decide whether fine-tuning is actually the right approach.
Strong Answer:
- Before I even consider fine-tuning, I would exhaust the cheaper alternatives in order. Step one: improve the prompt. Most “the model is not good enough” complaints I have seen in production were actually “the prompt is lazy.” I would spend 2-3 days iterating on the system prompt with clear instructions, formatting examples, and explicit constraints. Step two: add few-shot examples. For customer support specifically, including 3-5 examples of ideal responses in the prompt often gets you 80% of the way there. Step three: if the issue is that the model does not know about our product, that is a knowledge problem and RAG solves it better than fine-tuning. Fine-tuning bakes knowledge into weights, which means it goes stale the moment your product changes.
- Fine-tuning becomes the right choice when the problem is behavior, not knowledge. Specific scenarios: I need the model to always respond in a very specific JSON format and it keeps drifting; I need a consistent brand voice that few-shot examples cannot fully capture across hundreds of edge cases; I need to reduce latency by using a smaller model that matches a larger model’s quality on this narrow task; or I need to reduce cost by eliminating long system prompts (fine-tuning bakes instructions into the weights so the prompt can be much shorter).
- For the customer support case specifically, I would benchmark the current prompt-engineered approach against a held-out test set of 200 real customer conversations. If accuracy is above 85% and the main issue is consistency of tone, I would try fine-tuning. If accuracy is below 70%, the problem is likely the retrieval or prompt, not the base model capability.
- The ROI calculation matters too. Fine-tuning GPT-4o-mini costs roughly 0.04 per day on input costs with the mini model. That is roughly 5-15. The payback period is about a month, which is reasonable, but only if the quality improvement is real.
Explain the difference between full fine-tuning, LoRA, and QLoRA. When would you pick each?
Explain the difference between full fine-tuning, LoRA, and QLoRA. When would you pick each?
Strong Answer:
- Full fine-tuning updates every parameter in the model. For a 7B parameter model, that means 7 billion weights getting gradient updates, which requires roughly 80GB+ of VRAM and can take days. The result is the highest quality because you have maximum capacity to adapt, but the cost and infrastructure requirements are steep. I would pick full fine-tuning only for mission-critical applications with a large training set (10K+ examples) and access to multi-GPU clusters, or when the task is extremely different from what the base model was trained on.
- LoRA (Low-Rank Adaptation) freezes the original model weights and injects small trainable matrices into the attention layers. Instead of updating 7 billion parameters, you update maybe 10-50 million (less than 1% of the model). This drops VRAM requirements to 16-24GB, training time to hours instead of days, and quality is surprisingly close to full fine-tuning for most tasks. The rank parameter
rcontrols the capacity:r=8is minimal but fast,r=16is the sweet spot for most tasks,r=64approaches full fine-tuning capacity. I use LoRA as my default choice for any fine-tuning project because the quality-to-cost ratio is unmatched. - QLoRA adds 4-bit quantization on top of LoRA. The base model weights are quantized to 4-bit precision (using NF4 quantization), and the LoRA adapters train in 16-bit on top of that. This cuts VRAM to 8-12GB, meaning you can fine-tune a 7B model on a single consumer GPU like an RTX 3090. The quality tradeoff is small but measurable — roughly 1-2% below LoRA on most benchmarks. I pick QLoRA when I am prototyping on limited hardware, when I need to iterate quickly on data quality before committing to a full LoRA run, or for hobbyist/research settings.
- The practical decision tree: if I have an A100 80GB, I use LoRA. If I have an RTX 4090 24GB, I use LoRA with gradient checkpointing. If I have an RTX 3090 or less, I use QLoRA. If I have a multi-node cluster and a six-figure compute budget, I consider full fine-tuning.
model.merge_and_unload(). This produces a standard model with zero inference overhead — no adapter routing, no extra memory for the adapter matrices. Inference speed is identical to the base model. If I am serving multiple fine-tuned variants from the same base model (for example, different customer tenants each with their own fine-tune), I keep the adapters separate and load them dynamically. Frameworks like vLLM and LoRAX support multi-adapter serving where the base model weights are shared in GPU memory and adapters are swapped per request with minimal overhead. This is dramatically more memory-efficient than loading N separate full models. At one company we served 12 different LoRA fine-tunes from a single base model on two A100s, where loading 12 full models would have required 24 GPUs.You fine-tuned a model and it scores well on your eval set but performs poorly on production traffic. What happened?
You fine-tuned a model and it scores well on your eval set but performs poorly on production traffic. What happened?
Strong Answer:
- The most likely cause is distribution mismatch between the training data and real-world queries. This is the fine-tuning equivalent of the classic ML problem of training on clean data and deploying into the wild. I would diagnose it in this order.
- First, I check for data leakage. If any production-like queries leaked into the training set, the eval set might have been too easy. I verify that the eval set was held out properly and is drawn from a different time period or source than the training data.
- Second, I analyze the production failures by category. Are they concentrated in specific intent types? If so, those intents were probably underrepresented in training data. Customer support has a long tail — the top 10 query types might cover 60% of traffic, but the remaining 40% is spread across hundreds of edge cases. If my training set only covered the top 10, the fine-tuned model might have actually gotten worse at handling anything outside that distribution because fine-tuning can cause the model to “forget” some of its general capabilities. This is called catastrophic forgetting.
- Third, I check for overfitting. If I trained for too many epochs or my training set was too small and homogeneous, the model memorized patterns instead of learning generalizable behavior. I look at the training loss curve — if it kept dropping well below the validation loss, that is a smoking gun. The fix is fewer epochs (2-3 is usually sufficient for OpenAI fine-tuning), more diverse training data, or a higher LoRA dropout rate.
- Fourth, I compare the fine-tuned model against the base model with a good prompt on the failing production queries. If the base model actually handles those queries better, the fine-tuning introduced a regression. This is why I always keep the base model available as a fallback and monitor quality metrics with automatic rollback.
r=8 instead of r=16) for narrow domains because the adapter has less capacity to overwrite the base model’s general behavior. The nuclear option is to keep two models in production: the fine-tuned model for in-domain queries and the base model for everything else, with a classifier routing between them. This is more operational complexity but guarantees you never regress on out-of-domain performance.How do you calculate whether fine-tuning is worth the investment? Walk me through the cost analysis.
How do you calculate whether fine-tuning is worth the investment? Walk me through the cost analysis.
Strong Answer:
- The ROI calculation has four components: one-time training cost, ongoing inference cost delta, quality improvement value, and maintenance cost.
- Training cost for OpenAI fine-tuning of GPT-4o-mini is roughly 0.75 in training cost. Add validation and a couple re-runs for hyperparameter tuning, and total training cost is under 100+/hour) dwarfs the compute cost by 100x.
- Inference cost delta is where the real savings come from. Fine-tuning lets me shrink the prompt because instructions are baked into the weights. If my current system prompt is 2000 tokens and fine-tuning lets me reduce it to 200 tokens, I save 1800 input tokens per call. At GPT-4o-mini rates of 27 per day or roughly 0.30 vs $0.15 for input), but the net savings from shorter prompts still comes out positive if the prompt shrinkage is significant.
- Quality improvement value is harder to quantify but often the real driver. If fine-tuning reduces customer escalation rate from 15% to 10%, and each escalation costs 2,500 per day. This dwarfs the compute costs.
- Maintenance cost is the hidden one. The fine-tuned model needs re-training when policies change, when new products launch, or when the base model gets deprecated. I budget for quarterly re-training cycles and keep the data preparation pipeline automated so re-training is a one-click operation, not a multi-week project.