Transfer Learning & Fine-tuning

Why Transfer Learning?

Consider this: a ResNet-50 pretrained on ImageNet has already learned to detect edges, textures, shapes, and high-level object concepts from roughly 1.28 million labeled images (the ImageNet-1k subset of the 14-million-image ImageNet database). The low-level features are not specific to ImageNet: edges look like edges whether you are classifying dogs or detecting tumors in medical scans. Throwing away all that learned knowledge and starting from random weights is like hiring an experienced chef and making them re-learn how to hold a knife.
Training from scratch requires:
  • Massive datasets (typically 100K+ labeled examples for reasonable accuracy)
  • Enormous compute (days to weeks on GPUs)
  • Careful hyperparameter tuning (learning rate, augmentation, etc.)
Transfer learning uses knowledge from pretrained models:
  • Start with ImageNet/web-scale features (lower layers already detect edges, textures, patterns)
  • Adapt to your specific task with much less data (often 100-1000 examples is enough)
  • Reach strong, often near state-of-the-art, results with far less compute (hours instead of weeks)
An analogy: Transfer learning is like learning a new language when you already speak a related one. A Spanish speaker learning Italian starts with a massive head start — shared vocabulary, similar grammar, familiar sounds. They do not start from zero. Similarly, a model pretrained on ImageNet already “speaks the language” of visual features; fine-tuning on medical images is like learning a specialized dialect.

The Transfer Learning Spectrum

| Strategy | What Changes | When to Use |
| --- | --- | --- |
| Feature Extraction | Only classifier | Small data, similar domain |
| Fine-tuning (partial) | Classifier + top layers | Medium data |
| Fine-tuning (full) | All layers | Larger data, different domain |
| Train from scratch | Everything | Massive data, very different domain |

Feature Extraction

The simplest form of transfer learning is to treat the pretrained model as a fixed feature extractor: freeze all pretrained weights (they do not update during training) and train only a new classification head on top. This works surprisingly well when your target domain is similar to the pretraining domain and you have limited data. With as few as 50-100 examples per class, accuracy in the 85-95% range is common on tasks that resemble the pretraining data, though results vary considerably by dataset.
Freeze the pretrained backbone and train only the new classifier:
import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained model
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze all parameters
for param in backbone.parameters():
    param.requires_grad = False

# Replace classifier (num_classes is task-specific; 10 is a placeholder)
num_classes = 10
num_features = backbone.fc.in_features
backbone.fc = nn.Sequential(
    nn.Dropout(0.2),
    nn.Linear(num_features, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, num_classes)
)

# Only new layers are trainable
trainable_params = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_params:,}")
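
Freezing weights with `requires_grad = False` does not stop BatchNorm layers from updating their running statistics, so a frozen backbone is usually also put in `eval()` mode during training. The following is a minimal sketch of one training step under that assumption. It uses a tiny stand-in network instead of ResNet-50 so it runs without downloading weights; the names `train_step`, `head`, and the synthetic batch are illustrative and not part of the original example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for a pretrained backbone (a real one would be resnet50).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(8, 3)  # new classifier for 3 hypothetical classes

# Freeze the backbone weights.
for p in backbone.parameters():
    p.requires_grad = False

# Hand the optimizer only the parameters that should change.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    # eval() keeps BatchNorm running statistics fixed in the frozen backbone;
    # the head stays in train mode.
    backbone.eval()
    head.train()
    with torch.no_grad():  # no autograd graph needed for frozen layers
        features = backbone(x)
    loss = criterion(head(features), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

x = torch.randn(16, 3, 32, 32)  # synthetic batch
y = torch.randint(0, 3, (16,))

bn_mean_before = backbone[1].running_mean.clone()
conv_before = backbone[0].weight.detach().clone()
head_before = head.weight.detach().clone()
loss = train_step(x, y)
```

Because the features are computed under `torch.no_grad()`, the backward pass only touches the head, which also saves memory compared with running the whole model in training mode.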

Fine-tuning (Gradual Unfreezing)

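def unfreeze_layers(model, num_layers):
    """Freeze everything, then unfreeze the last N parameterized children of a ResNet."""
    # Skip parameter-free modules (ReLU, pooling) so that N counts real layers:
    # conv1, bn1, layer1, layer2, layer3, layer4, fc
    layers = [m for m in model.children() if any(True for _ in m.parameters())]

    # Freeze all
    for param in model.parameters():
        param.requires_grad = False

    # Unfreeze last N layers
    for layer in layers[-num_layers:]:
        for param in layer.parameters():
            param.requires_grad = True

# `train` is a placeholder for your own training loop; it should build a new
# optimizer over the currently trainable parameters at each stage.

# Stage 1: Train classifier only (fc)
unfreeze_layers(model, 1)
train(model, epochs=5, lr=1e-3)

# Stage 2: Unfreeze last block (layer4 + fc)
unfreeze_layers(model, 2)
train(model, epochs=5, lr=1e-4)

# Stage 3: Fine-tune more layers (layer3 + layer4 + fc)
unfreeze_layers(model, 3)
train(model, epochs=10, lr=1e-5)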
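
The staged schedule above calls a `train` function that is not defined in this module. One possible shape for it is sketched below, assuming it receives a data loader (an extra argument not present in the calls above) and rebuilds the optimizer at every stage so that newly unfrozen parameters are picked up. The tiny model and synthetic data are stand-ins for illustration only.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs, lr):
    """Hypothetical helper: optimizes only parameters with requires_grad=True."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    # A fresh optimizer per stage picks up parameters unfrozen since the last stage.
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return len(trainable)  # number of parameter tensors being trained

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
loader = [(torch.randn(16, 4), torch.randint(0, 2, (16,))) for _ in range(4)]

# Stage 1: only the last layer is trainable.
for p in model[0].parameters():
    p.requires_grad = False
first_layer_before = model[0].weight.detach().clone()
n_stage1 = train(model, loader, epochs=2, lr=1e-3)
first_layer_after_stage1 = model[0].weight.detach().clone()

# Stage 2: unfreeze the first layer and continue with a smaller learning rate.
for p in model[0].parameters():
    p.requires_grad = True
n_stage2 = train(model, loader, epochs=2, lr=1e-4)
```

Note that this sketch calls `model.train()` on the whole model; with a real ResNet you may want to keep still-frozen BatchNorm layers in `eval()` mode so their statistics do not drift.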

Discriminative Learning Rates

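Use different learning rates for different layers: early layers hold generic features and need only small updates, while the new classifier starts from random weights and can take larger steps.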
def get_layer_groups(model):
    """Split a ResNet into layer groups for discriminative LRs."""
    # The stem (conv1, bn1) is included so it is not silently left out of the optimizer
    early = [model.conv1, model.bn1, model.layer1, model.layer2]
    return [
        [p for m in early for p in m.parameters()],  # Early layers
        list(model.layer3.parameters()),  # Middle layers
        list(model.layer4.parameters()),  # Late layers
        list(model.fc.parameters()),  # Classifier
    ]

layer_groups = get_layer_groups(model)
base_lr = 1e-4

optimizer = torch.optim.AdamW([
    {'params': layer_groups[0], 'lr': base_lr / 100},  # Very small for early layers
    {'params': layer_groups[1], 'lr': base_lr / 10},
    {'params': layer_groups[2], 'lr': base_lr},
    {'params': layer_groups[3], 'lr': base_lr * 10},  # Largest for classifier
])
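
The hand-written list of learning rates above follows a geometric pattern: each earlier group gets a rate 10x smaller than the next. A small helper can generate the same pattern for any number of groups. This is a sketch under assumptions: `build_param_groups`, `head_lr`, and `decay` are illustrative names, and the four-layer model is a stand-in for a real backbone.

```python
import torch
import torch.nn as nn

def build_param_groups(layer_groups, head_lr, decay=10.0):
    """layer_groups is ordered from earliest to latest; the last group gets head_lr,
    and each earlier group gets a rate `decay` times smaller than the next one."""
    n = len(layer_groups)
    groups = []
    for depth, params in enumerate(layer_groups):
        trainable = [p for p in params if p.requires_grad]
        if trainable:  # skip groups that are fully frozen
            groups.append({"params": trainable, "lr": head_lr / decay ** (n - 1 - depth)})
    return groups

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 2))
layer_groups = [list(layer.parameters()) for layer in model]

optimizer = torch.optim.AdamW(build_param_groups(layer_groups, head_lr=1e-3, decay=10.0))
lrs = [group["lr"] for group in optimizer.param_groups]
print(lrs)  # smallest rate for the earliest group, largest for the classifier
```

With `head_lr=1e-3` and `decay=10.0` this reproduces the four rates used above for `base_lr = 1e-4`. A smaller decay (for example 2-3) gives a gentler gradient between layers, which may suit deeper models with many groups.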

Transfer Learning for Vision Transformers

import timm
import torch

# Load pretrained ViT
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)

# Freeze patch embedding and first N blocks
for param in model.patch_embed.parameters():
    param.requires_grad = False

for i, block in enumerate(model.blocks):
    if i < 8:  # Freeze first 8 of 12 blocks
        for param in block.parameters():
            param.requires_grad = False

# Fine-tune with smaller LR for unfrozen pretrained layers.
# Anything not passed to the optimizer (cls_token, pos_embed, final norm) also stays fixed.
optimizer = torch.optim.AdamW([
    {'params': model.blocks[8:].parameters(), 'lr': 1e-5},
    {'params': model.head.parameters(), 'lr': 1e-4},
])

Transfer Learning for NLP

Using Hugging Face Transformers

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

# Load pretrained model
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Fine-tuning configuration
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # Small LR for fine-tuning
    warmup_ratio=0.1,
    weight_decay=0.01,
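    fp16=True,  # Mixed precision; requires a CUDA GPU
)

# train_dataset and eval_dataset are assumed to be already tokenized with
# `tokenizer` and to include a label column.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)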

trainer.train()
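
The `warmup_ratio=0.1` setting means the learning rate ramps up over the first 10% of training steps before decaying. The sketch below illustrates the resulting curve in plain Python, assuming the scheduler is linear warmup followed by linear decay to zero (which, to my knowledge, is the `Trainer` default). The function name `linear_warmup_lr` is illustrative, not a library API.

```python
import math

def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given step for linear warmup then linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp from 0 up to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the final step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, total))
```

Warmup matters for fine-tuning because the optimizer's moment estimates are unreliable in the first few steps; starting at the full rate can disturb the pretrained weights before those estimates settle.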

Parameter-Efficient Fine-tuning (PEFT)

LoRA (Low-Rank Adaptation)

The insight behind LoRA: when you fine-tune a pretrained model, the weight updates tend to be low-rank, meaning the actual “change” to each weight matrix can be approximated by multiplying two much smaller matrices. Instead of updating the full weight matrix $W$ (which might be 4096 x 4096, about 16.8 million parameters), LoRA freezes $W$ and trains two small matrices $B$ (4096 x 8) and $A$ (8 x 4096), so the effective weight is $W + BA$ with only 65,536 trainable parameters. That is a 99.6% reduction.
Think of it like adjusting a building’s plumbing. Full fine-tuning rebuilds every pipe. LoRA installs small bypass valves at key junctions: same water flow, fraction of the construction cost.
Only the small rank-decomposition matrices are trained:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,              # Rank -- higher = more capacity but more params. 4-16 is typical.
    lora_alpha=32,    # Scaling factor. Rule of thumb: lora_alpha = 2-4x r
    target_modules=["q_proj", "v_proj"],  # Attention projections to adapt; names must match the base model (e.g. "query", "value" for BERT)
    lora_dropout=0.05,  # Dropout on LoRA layers for regularization
    bias="none",       # Do not train bias terms (saves memory, rarely helps)
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
# Example output (exact numbers depend on the base model):
# trainable params: 294,912 || all params: 124,734,464 || trainable%: 0.24%
When to use LoRA vs full fine-tuning: LoRA shines when you have limited compute, need to serve multiple task-specific models from one base model (just swap the LoRA weights, which are tiny), or want to reduce catastrophic forgetting. Full fine-tuning still wins when you have ample data and compute, and the target domain is very different from pretraining. In practice, LoRA at rank 16-64 often comes close to full fine-tuning quality at a fraction of the cost, though the gap depends on the task and the base model.
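
To make the mechanics concrete, here is a minimal from-scratch sketch of a LoRA-wrapped linear layer. It is not the `peft` implementation; `LoRALinear` and `merged_weight` are hypothetical names, and the 64-dimensional layer is chosen only to keep the example small. It shows that the update starts at zero, that only the two small matrices are trainable, and that the update can be folded back into a single weight matrix for serving.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Hypothetical minimal LoRA wrapper around a frozen nn.Linear."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # W (and bias) stay frozen
        self.scale = alpha / r
        # A starts random and B starts at zero, so the update BA is zero at init.
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))

    def forward(self, x):
        # Base output plus the scaled low-rank path x A^T B^T
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merged_weight(self):
        # W + scale * BA, usable as a plain weight matrix at inference time
        return self.base.weight + self.scale * (self.B @ self.A)

torch.manual_seed(0)
layer = LoRALinear(nn.Linear(64, 64), r=8, alpha=32)
x = torch.randn(4, 64)

base_out = layer.base(x)
out_init = layer(x)  # identical to the base output because B is zero
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)

# Simulate a trained adapter by giving B non-zero values.
with torch.no_grad():
    layer.B.normal_(std=0.1)
out_adapted = layer(x)
out_merged = F.linear(x, layer.merged_weight(), layer.base.bias)

# Parameter arithmetic for the 4096 x 4096 example with rank 8
full_params = 4096 * 4096
lora_params = 2 * 4096 * 8
```

Merging is what makes LoRA free at inference time: after training, the low-rank product can be added into the base weight once, so the adapted layer costs the same as the original.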

Adapter Layers

class Adapter(nn.Module):
    """Bottleneck adapter for efficient fine-tuning."""
    
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()
    
    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

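# Insert adapters into transformer blocks.
# A helper function gives each block its own closure; a lambda defined directly
# in the loop would capture only the last block. The adapter already adds its
# own residual, so it is applied to the block output rather than summed with it.
def add_adapter(block, dim):
    block.adapter = Adapter(dim)
    original_forward = block.forward

    def forward_with_adapter(x, *args, **kwargs):
        return block.adapter(original_forward(x, *args, **kwargs))

    block.forward = forward_with_adapter

# model.embed_dim is the token width on timm ViT models (768 for ViT-Base)
for block in model.blocks:
    add_adapter(block, model.embed_dim)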

Domain Adaptation Techniques

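When the target domain differs significantly from the pretraining domain:
# Domain adaptation via strong augmentation and self-training
import torch
from torchvision import transforms

# Strong augmentation bridges domain gap
source_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
    # Keep the pretraining (ImageNet) normalization
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])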

# Self-training on unlabeled target data
def pseudo_label(model, target_loader, threshold=0.9):
    model.eval()
    pseudo_labels = []
    
    with torch.no_grad():
        for x, _ in target_loader:
            probs = torch.softmax(model(x), dim=1)
            max_probs, preds = probs.max(dim=1)
            
            # Only keep high-confidence predictions
            mask = max_probs > threshold
            pseudo_labels.extend(zip(x[mask], preds[mask]))
    
    return pseudo_labels
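
The confidence filter inside `pseudo_label` can be checked on a toy batch, and the surviving samples can then be mixed with the labeled set for the next training round. The sketch below assumes hand-written logits in place of a real model and uses random tensors as stand-in images; all variable names are illustrative.

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

torch.manual_seed(0)

# Hypothetical logits for 4 unlabeled target samples and 3 classes.
logits = torch.tensor([
    [6.0, 0.0, 0.0],  # confident, class 0
    [1.0, 1.1, 0.9],  # uncertain, dropped
    [0.0, 0.0, 5.0],  # confident, class 2
    [2.0, 1.5, 0.0],  # uncertain, dropped
])
inputs = torch.randn(4, 3, 8, 8)  # stand-in images

probs = torch.softmax(logits, dim=1)
max_probs, preds = probs.max(dim=1)
mask = max_probs > 0.9  # keep only high-confidence predictions

# Wrap the kept samples as a dataset and combine with the labeled data.
pseudo_set = TensorDataset(inputs[mask], preds[mask])
labeled_set = TensorDataset(torch.randn(5, 3, 8, 8), torch.randint(0, 3, (5,)))
combined = ConcatDataset([labeled_set, pseudo_set])
```

A high threshold trades coverage for label quality. Since wrong pseudo-labels reinforce the model's own mistakes, it is usually safer to start strict and re-run the labeling step after each round of training.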

Best Practices

| Scenario | Strategy |
| --- | --- |
| < 1k samples | Feature extraction only |
| 1k-10k samples | Gradual unfreezing |
| 10k-100k samples | Full fine-tuning with discriminative LR |
| Similar domain | Lower learning rates |
| Different domain | More aggressive fine-tuning |
| Limited compute | LoRA / Adapters |
Always use a smaller learning rate (10x-100x smaller than you would use when training from scratch) when fine-tuning pretrained models! Pretrained weights are already in a good region of the loss landscape. A large learning rate will catapult them out of that region, destroying the learned features; this is called catastrophic forgetting. For BERT-style models, 2e-5 to 5e-5 is standard. For vision models, 1e-5 to 1e-4 for the backbone and 1e-3 for the new head is a good starting point.
Common pitfall — forgetting to adjust data normalization: Pretrained models expect inputs normalized with the pretraining dataset statistics (e.g., ImageNet mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]). If you normalize with your own dataset’s statistics, the pretrained features will receive out-of-distribution inputs and perform poorly. Always use the pretraining normalization, even if your domain is very different.
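
As a concrete illustration of the normalization point, the sketch below applies the ImageNet statistics quoted above to a batch of images using plain tensor arithmetic. The function name `normalize_like_imagenet` is illustrative; in a torchvision pipeline the same effect comes from `transforms.Normalize(mean, std)`.

```python
import torch

# Per-channel ImageNet statistics, shaped to broadcast over (N, 3, H, W)
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def normalize_like_imagenet(images):
    """images: float tensor of shape (N, 3, H, W) with values scaled to [0, 1]."""
    return (images - IMAGENET_MEAN) / IMAGENET_STD

batch = torch.full((2, 3, 4, 4), 0.5)  # a mid-gray synthetic batch
normalized = normalize_like_imagenet(batch)
```

The input must already be scaled to [0, 1] (as `ToTensor` does for 8-bit images); applying these statistics to raw 0-255 pixel values is another common source of silently poor accuracy.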

Exercises

Compare accuracy on a 500-sample dataset using frozen backbone vs full fine-tuning.
Implement layer-wise learning rates and compare with uniform learning rate.
Add LoRA adapters to a ViT model. Compare trainable parameters and final accuracy.

What’s Next

Module 21: Model Deployment

Export models to ONNX, TorchScript, and deploy to production.