
Transfer Learning & Fine-tuning

Why Transfer Learning?

Training from scratch requires:
  • Massive datasets
  • Enormous compute
  • Weeks of training
Transfer learning reuses knowledge from pretrained models:
  • Start from features learned on ImageNet or web-scale data
  • Adapt them to your specific task
  • Reach strong, often near-SOTA, results with far less labeled data

The Transfer Learning Spectrum

Strategy               | What Changes             | When to Use
Feature Extraction     | Only classifier          | Small data, similar domain
Fine-tuning (partial)  | Classifier + top layers  | Medium data
Fine-tuning (full)     | All layers               | Larger data, different domain
Train from scratch     | Everything               | Massive data, very different domain

Feature Extraction

Freeze the pretrained backbone and train only the new classifier head:
import torch
import torch.nn as nn
import torchvision.models as models

# Load pretrained model
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze all parameters
for param in backbone.parameters():
    param.requires_grad = False

# Replace the classifier head
num_classes = 10  # example: set this to the number of classes in your task
num_features = backbone.fc.in_features
backbone.fc = nn.Sequential(
    nn.Dropout(0.2),
    nn.Linear(num_features, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, num_classes)
)

# Only new layers are trainable
trainable_params = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_params:,}")

Fine-tuning (Gradual Unfreezing)

def unfreeze_layers(model, num_layers):
    """Unfreeze the last N top-level children of a ResNet (fc is the last)."""
    layers = list(model.children())
    
    # Freeze everything first
    for param in model.parameters():
        param.requires_grad = False
    
    # Unfreeze the last N children
    for layer in layers[-num_layers:]:
        for param in layer.parameters():
            param.requires_grad = True

# Stage 1: Train the classifier (fc) only
unfreeze_layers(model, 1)
train(model, epochs=5, lr=1e-3)

# Stage 2: Also unfreeze the last residual block (layer4; avgpool has no parameters)
unfreeze_layers(model, 3)
train(model, epochs=5, lr=1e-4)

# Stage 3: Unfreeze layer3 as well, at a smaller learning rate
unfreeze_layers(model, 4)
train(model, epochs=10, lr=1e-5)
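
The train(model, epochs, lr) calls above are placeholders for your own training loop. A minimal sketch of what such a helper might look like, assuming a train_loader DataLoader and a device are already defined:
def train(model, epochs, lr):
    """Minimal loop that only updates the currently unfrozen parameters."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    model.to(device).train()
    for epoch in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()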

Discriminative Learning Rates

Assign different learning rates to different depths of the network; early layers encode generic features and should change the least:
def get_layer_groups(model):
    """Split model into layer groups for discriminative LRs."""
    return [
        list(model.layer1.parameters()) + list(model.layer2.parameters()),  # Early layers
        list(model.layer3.parameters()),  # Middle layers
        list(model.layer4.parameters()),  # Late layers
        list(model.fc.parameters()),  # Classifier
    ]

layer_groups = get_layer_groups(model)
base_lr = 1e-4

optimizer = torch.optim.AdamW([
    {'params': layer_groups[0], 'lr': base_lr / 100},  # Very small for early layers
    {'params': layer_groups[1], 'lr': base_lr / 10},
    {'params': layer_groups[2], 'lr': base_lr},
    {'params': layer_groups[3], 'lr': base_lr * 10},  # Largest for classifier
])
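
The same schedule can be generated instead of hard-coded. A sketch that scales each group geometrically from base_lr, matching the factors above:
decay = 10.0  # each earlier group gets a 10x smaller learning rate
optimizer = torch.optim.AdamW([
    {'params': params, 'lr': base_lr * 10 / decay ** (len(layer_groups) - 1 - i)}
    for i, params in enumerate(layer_groups)
])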

Transfer Learning for Vision Transformers

import timm

# Load pretrained ViT
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)

# Freeze patch embedding and first N blocks
for param in model.patch_embed.parameters():
    param.requires_grad = False

for i, block in enumerate(model.blocks):
    if i < 8:  # Freeze first 8 of 12 blocks
        for param in block.parameters():
            param.requires_grad = False

# Fine-tune with smaller LR for unfrozen pretrained layers
optimizer = torch.optim.AdamW([
    {'params': model.blocks[8:].parameters(), 'lr': 1e-5},
    {'params': model.head.parameters(), 'lr': 1e-4},
])
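
A quick sanity check that the freezing behaves as intended before launching training:
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Frozen: {frozen:,} | Trainable: {trainable:,}")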

Transfer Learning for NLP

Using Hugging Face Transformers

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

# Load pretrained model
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Fine-tuning configuration
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # Small LR for fine-tuning
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
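
train_dataset and eval_dataset above are assumed to be already tokenized. One way to build them with the datasets library, using the SST-2 split of GLUE purely as an illustrative choice:
from datasets import load_dataset

raw = load_dataset("glue", "sst2")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

# Tokenize once up front; the Trainer picks up the "label" column automatically
tokenized = raw.map(tokenize, batched=True)
train_dataset = tokenized["train"]
eval_dataset = tokenized["validation"]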

Parameter-Efficient Fine-tuning (PEFT)

LoRA (Low-Rank Adaptation)

Instead of updating full weight matrices, LoRA trains small low-rank update matrices added to selected layers:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,  # Rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # which modules to adapt (names depend on the base model, e.g. "query"/"value" for BERT)
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
# trainable params: 294,912 || all params: 124,734,464 || trainable%: 0.24%
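
After training, LoRA's low-rank updates can be folded back into the base weights so inference pays no extra cost; PEFT provides merge_and_unload for this:
# Merge the LoRA updates into the base model and drop the adapter wrappers
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./lora-merged")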

Adapter Layers

class Adapter(nn.Module):
    """Bottleneck adapter for parameter-efficient fine-tuning."""
    
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()
    
    def forward(self, x):
        # Residual bottleneck: x + up(act(down(x)))
        return x + self.up(self.act(self.down(x)))

# Insert an adapter after each transformer block.
# Bind per-block state inside a helper so each wrapper closes over its own
# block (a bare lambda in the loop would capture only the last iteration).
def attach_adapter(block, dim):
    block.adapter = Adapter(dim)
    original_forward = block.forward
    def forward_with_adapter(x):
        return block.adapter(original_forward(x))
    block.forward = forward_with_adapter

for block in model.blocks:
    attach_adapter(block, model.embed_dim)  # embed_dim = 768 for vit_base
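
For the adapters to stay parameter-efficient, everything except the adapters and the classification head should remain frozen. A sketch, assuming the timm ViT from earlier, whose classifier module is named head:
# Train only the adapter and classifier parameters
for name, param in model.named_parameters():
    param.requires_grad = ("adapter" in name) or name.startswith("head")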

Domain Adaptation Techniques

When the target domain differs significantly from the pretraining data:
# Gradual domain adaptation
from torchvision import transforms

# Strong augmentation bridges domain gap
source_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

# Self-training on unlabeled target data
def pseudo_label(model, target_loader, threshold=0.9):
    model.eval()
    pseudo_labels = []
    
    with torch.no_grad():
        for x, _ in target_loader:
            probs = torch.softmax(model(x), dim=1)
            max_probs, preds = probs.max(dim=1)
            
            # Only keep high-confidence predictions
            mask = max_probs > threshold
            pseudo_labels.extend(zip(x[mask], preds[mask]))
    
    return pseudo_labels
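
The high-confidence pseudo-labels can then be mixed back into training. A minimal sketch, assuming a labeled source_dataset is available and the tensor shapes are compatible:
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

pairs = pseudo_label(model, target_loader, threshold=0.9)
if pairs:
    xs, ys = zip(*pairs)
    pseudo_dataset = TensorDataset(torch.stack(xs), torch.stack(ys))
    combined_loader = DataLoader(
        ConcatDataset([source_dataset, pseudo_dataset]),
        batch_size=64,
        shuffle=True,
    )
    # Continue fine-tuning on source + pseudo-labeled target batches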

Best Practices

Scenario          | Strategy
< 1k samples      | Feature extraction only
1k-10k samples    | Gradual unfreezing
10k-100k samples  | Full fine-tuning with discriminative LRs
Similar domain    | Lower learning rates
Different domain  | More aggressive fine-tuning
Limited compute   | LoRA / Adapters
Always fine-tune pretrained weights with a learning rate 10x-100x smaller than you would use when training from scratch!

Exercises

1. Compare accuracy on a 500-sample dataset using a frozen backbone vs. full fine-tuning.
2. Implement layer-wise learning rates and compare with a uniform learning rate.
3. Add LoRA adapters to a ViT model. Compare trainable parameters and final accuracy.

What’s Next