Consider this: a ResNet-50 trained on ImageNet has already learned to detect edges, textures, shapes, and high-level object concepts from roughly 1.28 million labeled images (the ImageNet-1k subset of the 14-million-image ImageNet database). These features are not specific to ImageNet: edges look like edges whether you are classifying dogs or detecting tumors in medical scans. Throwing away all that learned knowledge and starting from random weights is like hiring an experienced chef and making them re-learn how to hold a knife.

Training from scratch requires:
- Massive datasets (typically 100K+ labeled examples for reasonable accuracy)

Transfer learning uses knowledge from pretrained models:

- Start with ImageNet/web-scale features (lower layers already detect edges, textures, patterns)
- Adapt to your specific task with much less data (often 100-1000 examples is enough)
- Reach strong results, often close to state of the art, with far less compute (hours instead of weeks)
An analogy: Transfer learning is like learning a new language when you already speak a related one. A Spanish speaker learning Italian starts with a massive head start — shared vocabulary, similar grammar, familiar sounds. They do not start from zero. Similarly, a model pretrained on ImageNet already “speaks the language” of visual features; fine-tuning on medical images is like learning a specialized dialect.
The simplest form of transfer learning is to treat the pretrained model as a fixed feature extractor: freeze all pretrained weights (they do not update during training) and train only a new classification head on top. This works surprisingly well when your target domain is similar to the pretraining domain and you have limited data. With as few as 50-100 examples per class, you can often reach 85-95% accuracy, though the exact figure depends heavily on the task.

Freeze the pretrained backbone and train only the new classifier:
    import torch
    import torch.nn as nn
    import torchvision.models as models

    num_classes = 10  # set this to the number of classes in your dataset

    # Load pretrained model
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

    # Freeze all parameters
    for param in backbone.parameters():
        param.requires_grad = False

    # Replace classifier (new layers are created with requires_grad=True)
    num_features = backbone.fc.in_features
    backbone.fc = nn.Sequential(
        nn.Dropout(0.2),
        nn.Linear(num_features, 256),
        nn.ReLU(),
        nn.Dropout(0.2),
        nn.Linear(256, num_classes),
    )

    # Only new layers are trainable
    trainable_params = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
    print(f"Trainable parameters: {trainable_params:,}")
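As a rough check on what the `print` above should report, the size of the new head can be worked out by hand. The sketch below assumes `num_classes = 10` (a value chosen here for illustration) and uses ResNet-50's 2048-dimensional pooled feature as `num_features`; `linear_params` is a helper introduced only for this example.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Parameter count of a fully connected layer: weights plus biases."""
    return in_features * out_features + out_features


num_features = 2048  # ResNet-50's fc.in_features
hidden = 256         # width of the hidden layer in the new head
num_classes = 10     # assumption for illustration

# Dropout and ReLU have no parameters, so only the two Linear layers count.
head_params = linear_params(num_features, hidden) + linear_params(hidden, num_classes)
print(f"Trainable parameters: {head_params:,}")  # Trainable parameters: 527,114
```

That is a small fraction of the frozen backbone, which is why this setup trains quickly even on modest hardware.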
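    def unfreeze_layers(model, num_layers):
        """Unfreeze the last N parameterized top-level modules of a ResNet."""
        # Skip parameter-free children (relu, maxpool, avgpool) so that N
        # counts only modules that actually have weights to train.
        layers = [m for m in model.children() if list(m.parameters())]

        # Freeze all
        for param in model.parameters():
            param.requires_grad = False

        # Unfreeze last N layers
        for layer in layers[-num_layers:]:
            for param in layer.parameters():
                param.requires_grad = True

    # `model` is the pretrained ResNet with its new head. `train` stands for
    # your own training loop, which should rebuild its optimizer from the
    # parameters that currently have requires_grad=True.

    # Stage 1: Train classifier only
    unfreeze_layers(model, 1)
    train(model, epochs=5, lr=1e-3)

    # Stage 2: Unfreeze last block (layer4) as well
    unfreeze_layers(model, 2)
    train(model, epochs=5, lr=1e-4)

    # Stage 3: Fine-tune more layers (layer3, layer4, and the classifier)
    unfreeze_layers(model, 3)
    train(model, epochs=10, lr=1e-5)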
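When counting layers from the end of a torchvision ResNet, it helps to remember that `model.children()` also yields modules with no learnable parameters (`relu`, `maxpool`, `avgpool`). The sketch below models the child list in plain Python to show the difference. The module names match torchvision's ResNet, while the two helper functions are hypothetical and exist only for illustration.

```python
# Top-level children of torchvision's ResNet, in order, with a flag for
# whether the module owns any learnable parameters.
RESNET_CHILDREN = [
    ("conv1", True), ("bn1", True), ("relu", False), ("maxpool", False),
    ("layer1", True), ("layer2", True), ("layer3", True), ("layer4", True),
    ("avgpool", False), ("fc", True),
]


def last_n_children(children, n):
    """Names of the last n children, counting every module."""
    return [name for name, _ in children[-n:]]


def last_n_trainable(children, n):
    """Names of the last n children that actually have parameters."""
    with_params = [name for name, has_params in children if has_params]
    return with_params[-n:]


print(last_n_children(RESNET_CHILDREN, 2))   # ['avgpool', 'fc']
print(last_n_trainable(RESNET_CHILDREN, 2))  # ['layer4', 'fc']
```

Counting every child, "the last 2 layers" adds nothing trainable beyond the classifier. Counting only parameterized children gives the intended "last block plus classifier".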
    import timm
    import torch

    # Load pretrained ViT
    model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)

    # Freeze patch embedding and first N blocks
    for param in model.patch_embed.parameters():
        param.requires_grad = False

    for i, block in enumerate(model.blocks):
        if i < 8:  # Freeze first 8 of 12 blocks
            for param in block.parameters():
                param.requires_grad = False

    # Fine-tune with smaller LR for unfrozen pretrained layers
    optimizer = torch.optim.AdamW([
        {'params': model.blocks[8:].parameters(), 'lr': 1e-5},
        {'params': model.head.parameters(), 'lr': 1e-4},
    ])
The insight behind LoRA: when you fine-tune a pretrained model, the weight updates tend to be low-rank, meaning the actual “change” to each weight matrix can be approximated by multiplying two much smaller matrices. Instead of updating the full weight matrix W (which might be 4096 x 4096, about 16.8 million parameters), LoRA freezes W and trains two small matrices B (4096 x 8) and A (8 x 4096), so the update is the product BA and the effective weight is W + BA, with only 65,536 trainable parameters. That is a 99.6% reduction.

Think of it like adjusting a building’s plumbing. Full fine-tuning rebuilds every pipe. LoRA installs small bypass valves at key junctions: same water flow, fraction of the construction cost.

Only the small rank-decomposition matrices are trained:
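The parameter arithmetic above can be checked in a few lines. This is just the counting exercise for a single 4096 x 4096 weight matrix at rank 8; the variable names are chosen here for illustration.

```python
d_out, d_in, r = 4096, 4096, 8

full_params = d_out * d_in          # every entry of the full weight matrix W
lora_params = d_out * r + r * d_in  # one factor is d_out x r, the other r x d_in
reduction = 1 - lora_params / full_params

print(f"{full_params:,}")  # 16,777,216
print(f"{lora_params:,}")  # 65,536
print(f"{reduction:.1%}")  # 99.6%
```

Doubling the rank doubles the adapter size, so even rank 64 stays at a few percent of the full matrix.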
    from peft import LoraConfig, get_peft_model

    # `model` is an already-loaded pretrained transformer whose attention
    # projections are named q_proj and v_proj (module names vary by architecture).
    config = LoraConfig(
        r=8,                                  # Rank: higher = more capacity but more params. 4-16 is typical.
        lora_alpha=32,                        # Scaling factor. Rule of thumb: lora_alpha = 2-4x r
        target_modules=["q_proj", "v_proj"],  # Which layers to adapt (attention projections)
        lora_dropout=0.05,                    # Dropout on LoRA layers for regularization
        bias="none",                          # Do not train bias terms (training them costs memory and rarely helps)
    )

    model = get_peft_model(model, config)
    model.print_trainable_parameters()
    # Example output (exact numbers depend on the base model and target modules):
    # trainable params: 294,912 || all params: 124,734,464 || trainable%: 0.24%
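To make the roles of `r` and `lora_alpha` concrete, here is a minimal dependency-free sketch of a LoRA forward pass, assuming the usual formulation in which the low-rank update is scaled by `lora_alpha / r` and B starts at zero. The toy layer is 3-dimensional with rank 1 and `lora_alpha = 4`, which gives the same alpha/r ratio of 4 as the config above. `matvec` and `lora_forward` are illustrative helpers, not part of the `peft` API.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x) without forming the product B A."""
    scale = lora_alpha / r
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen pretrained weight
A = [[0.5, 0.25, 1.0]]                                    # r x d_in, trainable
B_init = [[0.0], [0.0], [0.0]]                            # d_out x r, starts at zero
x = [2.0, 4.0, 1.0]

# With B = 0 the adapted layer reproduces the frozen layer exactly.
print(lora_forward(W, A, B_init, x, lora_alpha=4, r=1))     # [2.0, 4.0, 1.0]

# Once B has moved away from zero, the output shifts by scale * B (A x).
B_trained = [[0.5], [0.0], [-0.5]]
print(lora_forward(W, A, B_trained, x, lora_alpha=4, r=1))  # [8.0, 4.0, -5.0]
```

Starting B at zero means fine-tuning begins from the pretrained model's exact behavior, and the alpha/r scale controls how strongly the learned update is applied.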
When to use LoRA vs full fine-tuning: LoRA shines when you have limited compute, need to serve multiple task-specific models from one base model (just swap the LoRA weights, which are tiny), or want to reduce catastrophic forgetting. Full fine-tuning still wins when you have ample data and compute and the target domain is very different from pretraining. In practice, LoRA at rank 16-64 can reach roughly 95-99% of full fine-tuning quality on many tasks at a fraction of the cost, though the gap varies by task and model.
Always use a smaller learning rate (10x-100x smaller than you would use when training from scratch) when fine-tuning pretrained models! Pretrained weights are already in a good region of the loss landscape. A large learning rate will catapult them out of that region, destroying the learned features. This is called catastrophic forgetting. For BERT-style models, 2e-5 to 5e-5 is standard. For vision models, 1e-5 to 1e-4 for the backbone and 1e-3 for the new head is a good starting point.
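The backbone/head split described above can be sketched as a grouping of parameter names by learning rate. This is a dependency-free illustration: `assign_learning_rates` is a hypothetical helper, and the `fc.` prefix assumes a ResNet-style model whose head is named `fc`. In PyTorch the same idea is usually expressed as optimizer parameter groups holding the tensors themselves under a `params` key.

```python
def assign_learning_rates(param_names, head_prefix="fc.", head_lr=1e-3, backbone_lr=1e-5):
    """Split parameter names into a backbone group and a head group, each with its own LR."""
    head = [n for n in param_names if n.startswith(head_prefix)]
    backbone = [n for n in param_names if not n.startswith(head_prefix)]
    return [
        {"names": backbone, "lr": backbone_lr},  # pretrained weights: small steps
        {"names": head, "lr": head_lr},          # freshly initialized head: larger steps
    ]


names = ["conv1.weight", "layer4.0.conv1.weight", "fc.weight", "fc.bias"]
groups = assign_learning_rates(names)
for group in groups:
    print(group["lr"], group["names"])
```

Here the head learns 100x faster than the backbone, matching the 1e-3 versus 1e-5 starting point suggested above.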
Common pitfall — forgetting to adjust data normalization: Pretrained models expect inputs normalized with the pretraining dataset statistics (e.g., ImageNet mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]). If you normalize with your own dataset’s statistics, the pretrained features will receive out-of-distribution inputs and perform poorly. Always use the pretraining normalization, even if your domain is very different.
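A small worked example shows why the choice of statistics matters. The sketch below applies the per-channel formula `(value - mean) / std`, which is the same operation that torchvision's `transforms.Normalize` performs, to a single RGB pixel. `normalize_pixel` and the "custom" statistics are made up for illustration.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Normalize one RGB pixel with values in [0, 1], channel by channel."""
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))


pixel = (0.485, 0.456, 0.406)  # a pixel exactly at the ImageNet mean

# With the pretraining statistics, this pixel maps to the origin.
print(normalize_pixel(pixel))  # (0.0, 0.0, 0.0)

# With hypothetical dataset-specific statistics, the same pixel lands
# more than two standard deviations away on every channel.
custom = normalize_pixel(pixel, mean=(0.2, 0.2, 0.2), std=(0.1, 0.1, 0.1))
print(custom)
```

The pretrained filters were tuned to inputs on the first scale, so feeding them values on the second scale shifts every activation downstream.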