Self-Supervised Learning: Learning Without Labels
The Promise of Self-Supervision
Here is the fundamental asymmetry of modern AI: labeling a single ImageNet image takes about 60 seconds of human effort; generating an unlabeled image takes a camera shutter click. The internet produces billions of unlabeled images and text documents daily, while labeled datasets require expensive human annotation campaigns. Self-supervised learning bridges this gap by creating pretext tasks from unlabeled data: clever problems where the “answer” is already embedded in the data itself.
Pretext Tasks
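As an illustration of a task whose labels come for free, here is a minimal NumPy sketch of rotation prediction, one well-known pretext task. The function name `make_rotation_batch` and the toy image shapes are assumptions made for this example; a real pipeline would feed the rotated images to a classifier that predicts the rotation label.

```python
import numpy as np

def make_rotation_batch(images, rng):
    """Turn unlabeled square images (N, H, W, C) into (rotated image, label) pairs.

    The label is the number of 90-degree turns applied (0, 1, 2 or 3),
    so no human annotation is needed.
    """
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack([
        np.rot90(img, k=int(k), axes=(0, 1)) for img, k in zip(images, labels)
    ])
    return rotated, labels

rng = np.random.default_rng(0)
unlabeled = rng.random((8, 32, 32, 3))      # stand-in for real unlabeled images
x, y = make_rotation_batch(unlabeled, rng)  # x: model inputs, y: free labels
```

To predict the rotation correctly, a model has to recognize object orientation and shape, which is the kind of feature learning the pretext task is designed to force.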
Classic Pretext Tasks
Classic pretext tasks (for example rotation prediction, jigsaw puzzle solving, and colorization) were the first generation of self-supervised methods. They have been largely superseded by contrastive and masked approaches, but understanding them builds intuition for the core principle: design a task where the labels come for free, and where solving the task forces useful feature learning.
Contrastive Learning
The Contrastive Learning Framework
Goal: Pull together representations of similar samples (positives), push apart representations of dissimilar samples (negatives). The analogy is organizing a photo album: pictures of the same person should be grouped together regardless of pose, lighting, or background, while pictures of different people should be clearly separated.
The mathematical insight: if your model maps two augmentations of the same image to nearby points in embedding space, it must have learned features that are invariant to the augmentation (crops, rotations, color changes) but sensitive to the actual content (objects, structure, semantics). And those invariant, semantic features are exactly what you need for downstream tasks like classification.
SimCLR (Simple Contrastive Learning)
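The pull-together, push-apart objective described above is what SimCLR optimizes, in the form of the NT-Xent (normalized temperature-scaled cross-entropy) loss. The following is a minimal NumPy sketch of that loss, not a training-ready implementation: it has no autograd, and the function name and the temperature value are illustrative assumptions.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss for a batch of paired embeddings.

    z1[i] and z2[i] are embeddings of two augmented views of image i,
    each of shape (N, D). Every other embedding in the batch is a negative.
    """
    n = len(z1)
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature                        # (2N, 2N)
    np.fill_diagonal(sim, -np.inf)                     # a view is not its own negative
    # Row i's positive is the other view of the same image.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 32))
aligned = nt_xent_loss(a, a + 0.05 * rng.standard_normal((16, 32)))
unrelated = nt_xent_loss(a, rng.standard_normal((16, 32)))
```

In this toy setup the loss for nearly identical views (`aligned`) is lower than for unrelated pairs (`unrelated`), which is the behavior the objective is meant to reward.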
SimCLR (Chen et al., 2020) showed that contrastive learning could match supervised pretraining with a surprisingly simple recipe: strong augmentation + large batches + a projection head. The augmentation policy is critical: SimCLR found that the combination of random cropping and color distortion is especially powerful because it forces the model to match views that share neither spatial location nor color information, leaving only semantic content.
MoCo (Momentum Contrast)
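A minimal NumPy sketch of the two mechanisms MoCo uses to avoid SimCLR's large-batch requirement: a momentum (exponential moving average) update of the key encoder, and a fixed-size queue of past key embeddings used as negatives. The class and function names are hypothetical, encoder parameters are represented as plain arrays, and the queue here is far smaller than one used in practice.

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    """Key encoder slowly tracks the query encoder: key <- m*key + (1-m)*query."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

class KeyQueue:
    """Fixed-size FIFO buffer of past key embeddings (the negative pool)."""

    def __init__(self, size, dim, rng):
        keys = rng.standard_normal((size, dim))
        self.keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        self.ptr = 0

    def enqueue(self, new_keys):
        """Overwrite the oldest entries with the newest batch of keys."""
        idx = (self.ptr + np.arange(len(new_keys))) % len(self.keys)
        self.keys[idx] = new_keys
        self.ptr = (self.ptr + len(new_keys)) % len(self.keys)

def moco_logits(q, k, queue, temperature=0.07):
    """Column 0 holds the positive logit; the rest are negatives from the queue."""
    l_pos = np.sum(q * k, axis=1, keepdims=True)   # (N, 1)
    l_neg = q @ queue.keys.T                       # (N, K)
    return np.concatenate([l_pos, l_neg], axis=1) / temperature

rng = np.random.default_rng(0)
queue = KeyQueue(size=1024, dim=16, rng=rng)
q = rng.standard_normal((8, 16)); q /= np.linalg.norm(q, axis=1, keepdims=True)
k = rng.standard_normal((8, 16)); k /= np.linalg.norm(k, axis=1, keepdims=True)
logits = moco_logits(q, k, queue)   # cross-entropy target is class 0 for every row
queue.enqueue(k)                    # current keys become future negatives
```

A batch of 8 queries is contrasted against 1024 negatives here, so the number of negatives is decoupled from the batch size.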
SimCLR has a practical limitation: it needs very large batch sizes (4096+) to have enough negative samples, which requires multiple GPUs with huge memory. MoCo solves this elegantly with two innovations: (1) a momentum-updated encoder that provides consistent key representations, and (2) a queue of past key embeddings that serves as a large, diverse negative pool without needing a massive batch.
Think of it like this: SimCLR compares every image to every other image in the current batch. MoCo compares each image to a much larger “memory bank” of past representations, giving it many more negatives to contrast against without the GPU memory cost.
Non-Contrastive Methods
BYOL (Bootstrap Your Own Latent)
BYOL dropped a bombshell in 2020: you do not need negative samples at all. This was shocking because the entire field assumed negatives were essential to prevent “collapse” (where the model maps everything to the same point). BYOL uses an asymmetric architecture with a predictor head and a momentum-updated target network, which creates enough asymmetry to prevent collapse without any negatives. The mechanism is still not fully understood theoretically, which makes it one of the more intriguing results in recent representation learning.
SimSiam (Simple Siamese)
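The collapse problem mentioned above is easy to see numerically. SimSiam minimizes the negative cosine similarity between a prediction from one view and a target from the other view (BYOL uses an equivalent normalized mean-squared error). The sketch below is a toy illustration with assumed names: it uses an identity predictor, and because NumPy has no autograd the stop-gradient appears only as a comment.

```python
import numpy as np

def neg_cosine(p, z):
    """Negative cosine similarity averaged over the batch.

    In real training z is treated as a constant (stop-gradient), so
    gradients only flow through the predictor output p.
    """
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return float(-np.mean(np.sum(p * z, axis=1)))

def symmetric_loss(p1, z2, p2, z1):
    """Each view predicts the other; the two terms are averaged."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

def output_std(z):
    """Collapse diagnostic: per-dimension std of L2-normalized outputs."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return float(z.std(axis=0).mean())

rng = np.random.default_rng(0)
z1 = rng.standard_normal((64, 16))
z2 = z1 + 0.5 * rng.standard_normal((64, 16))            # a noisy second view
healthy = symmetric_loss(z1, z2, z2, z1)

collapsed_z = np.tile(rng.standard_normal((1, 16)), (64, 1))  # same vector for every input
collapsed = symmetric_loss(collapsed_z, collapsed_z, collapsed_z, collapsed_z)
```

The collapsed embeddings reach the best possible loss of -1 while carrying no information, and their `output_std` is zero. This is why the loss alone cannot be trusted and why the architectural asymmetry matters.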
SimSiam distills the non-contrastive approach to its absolute minimum: no negative samples, no momentum encoder, no large batches. The only thing preventing collapse is the stop-gradient operation on one branch and the asymmetric predictor head on the other. The paper by Xinlei Chen and Kaiming He (2021) is remarkable for showing that this minimal recipe works, and for providing an analysis suggesting that SimSiam implicitly performs an alternating optimization similar to Expectation-Maximization.
Masked Modeling
Masked Autoencoders (MAE)
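A minimal NumPy sketch of the random patch masking step at the heart of MAE, using the 75% mask ratio discussed in this section. The function name, return values, and patch dimensions are assumptions for illustration; the encoder and decoder are omitted.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches per image.

    patches: (N, L, D) array of L patch embeddings per image.
    Returns the visible patches (N, L_keep, D), a binary mask (N, L)
    where 1 marks a masked patch, and ids_restore for un-shuffling
    decoder outputs back into the original patch order.
    """
    rng = rng or np.random.default_rng()
    n, l, _ = patches.shape
    len_keep = int(l * (1 - mask_ratio))
    noise = rng.random((n, l))
    ids_shuffle = np.argsort(noise, axis=1)        # random permutation per image
    ids_restore = np.argsort(ids_shuffle, axis=1)  # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]
    visible = patches[np.arange(n)[:, None], ids_keep]
    mask = np.ones((n, l))
    mask[:, :len_keep] = 0
    mask = np.take_along_axis(mask, ids_restore, axis=1)
    return visible, mask, ids_restore

rng = np.random.default_rng(0)
patches = rng.standard_normal((2, 196, 8))   # e.g. a 14x14 grid of patches
visible, mask, ids_restore = random_masking(patches, rng=rng)
```

Only `visible` (49 of 196 patches here) would be passed through the encoder, which is where the compute saving comes from.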
Masked Autoencoders brought the “masked language modeling” paradigm from NLP (BERT) to vision, and the results were striking. The core idea: mask 75% of image patches (far more aggressive than BERT’s 15% token masking) and train the model to reconstruct the missing pixels. Why such aggressive masking? Because images have high spatial redundancy: neighboring patches are highly correlated, so the model can “cheat” by interpolating from nearby visible patches unless you force it to reason about large missing regions.
The efficiency benefit is also remarkable: since the encoder only processes the 25% of patches that remain visible, MAE pretraining can be roughly 3-4x faster than contrastive methods that must process full images through two augmented views.
BERT-style Masked Language Modeling
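A minimal sketch of BERT-style token masking using only the standard library. About 15% of positions are selected as prediction targets; of those, BERT replaces 80% with a mask token, 10% with a random token, and leaves 10% unchanged. The function name, toy vocabulary, and the use of `None` as an "ignore" label are assumptions for this example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] is the original token at selected positions and None elsewhere,
    meaning the loss is computed only on selected positions.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # position not selected
        labels[i] = tok                   # model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK              # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.choice(vocab) # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 10
inputs, labels = mask_tokens(tokens, vocab, seed=0)
```

As with the image case, the labels are simply the original tokens, so no human annotation is involved.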
Evaluation & Downstream Tasks
Linear Probing
Linear probing is the standard litmus test for representation quality: freeze the pretrained encoder entirely and train only a single linear layer on top for a downstream classification task. The idea is that if a linear classifier can achieve high accuracy, the representations must already be linearly separable, meaning the encoder has organized its feature space in a semantically meaningful way without any task-specific supervision.
Fine-tuning
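Fine-tuning differs from the linear probe described above in one respect: the encoder weights are updated along with the classification head. The toy NumPy sketch below makes that difference explicit with a `freeze_encoder` flag. The one-layer ReLU "encoder", the random stand-in for pretrained weights, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(x, y, w_enc, n_classes, freeze_encoder, lr=0.1, steps=200, seed=0):
    """Train a linear head on top of a one-layer ReLU encoder.

    freeze_encoder=True  -> linear probe (only the head is updated)
    freeze_encoder=False -> fine-tuning (encoder and head are updated)
    """
    rng = np.random.default_rng(seed)
    w_enc = w_enc.copy()
    w_head = 0.01 * rng.standard_normal((w_enc.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        pre = x @ w_enc
        h = np.maximum(pre, 0.0)                    # encoder features
        d_logits = (softmax(h @ w_head) - onehot) / len(x)
        grad_head = h.T @ d_logits
        if not freeze_encoder:
            d_h = d_logits @ w_head.T               # backprop into the encoder
            w_enc -= lr * (x.T @ (d_h * (pre > 0)))
        w_head -= lr * grad_head
    preds = np.argmax(np.maximum(x @ w_enc, 0.0) @ w_head, axis=1)
    return w_enc, w_head, float((preds == y).mean())

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, (50, 4)), rng.normal(1, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
w_pretrained = 0.5 * rng.standard_normal((4, 16))   # stand-in for a pretrained encoder

w_probe, _, probe_acc = train(x, y, w_pretrained, 2, freeze_encoder=True)
w_ft, _, ft_acc = train(x, y, w_pretrained, 2, freeze_encoder=False)
```

After the probe run the encoder weights are identical to `w_pretrained`, while after fine-tuning they have moved. In practice fine-tuning usually uses a smaller learning rate for the encoder than for the head, which this sketch does not model.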
Practical Considerations
Data Augmentation for SSL
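A minimal NumPy sketch of a two-view augmentation pipeline in the spirit of SimCLR's random crop plus color distortion. All function names and parameter values are illustrative assumptions; the crop is square with nearest-neighbour resizing, and the color distortion covers only brightness, contrast, and random grayscale.

```python
import numpy as np

def random_resized_crop(img, rng, out_size=32, scale=(0.2, 1.0)):
    """Crop a random square covering `scale` of the area, then resize."""
    h, w, _ = img.shape
    side = max(1, int(round(np.sqrt(rng.uniform(*scale)) * min(h, w))))
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    crop = img[top:top + side, left:left + side]
    idx = np.arange(out_size) * side // out_size   # nearest-neighbour resize
    return crop[idx][:, idx]

def color_distort(img, rng, strength=0.4, gray_prob=0.2):
    """Random brightness and contrast, plus occasional grayscale."""
    img = img * rng.uniform(1 - strength, 1 + strength)
    mean = img.mean()
    img = (img - mean) * rng.uniform(1 - strength, 1 + strength) + mean
    if rng.random() < gray_prob:
        img = np.repeat(img.mean(axis=2, keepdims=True), 3, axis=2)
    return np.clip(img, 0.0, 1.0)

def two_views(img, rng):
    """Produce two independently augmented views of the same image."""
    def one_view():
        v = random_resized_crop(img, rng)
        if rng.random() < 0.5:
            v = v[:, ::-1]                          # horizontal flip
        return color_distort(v, rng)
    return one_view(), one_view()

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))                     # stand-in for a real RGB image in [0, 1]
view_a, view_b = two_views(image, rng)
```

The two views form a positive pair for a contrastive or non-contrastive loss. Masked methods such as MAE typically need much lighter augmentation, since the masking itself provides the training signal.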
Training Tips
Exercises
Exercise 1: Implement SwAV
Implement Swapping Assignments between Views:
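As a possible starting point, here is a NumPy sketch of the two pieces SwAV is built on: Sinkhorn-Knopp normalization, which turns prototype scores into balanced soft cluster assignments, and the swapped prediction loss, where each view predicts the other view's assignment. Function names and hyperparameter values are assumptions; the encoder, the prototype layer, and gradients are left to the exercise.

```python
import numpy as np

def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Balanced soft assignments from (B, K) embedding-to-prototype scores.

    Rows of the result sum to 1, and prototype usage is pushed toward
    uniform across the batch, which discourages a single-cluster collapse.
    """
    q = np.exp((scores - scores.max()) / epsilon).T   # (K, B)
    q /= q.sum()
    k, b = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True); q /= k     # equalize prototype mass
        q /= q.sum(axis=0, keepdims=True); q /= b     # equalize sample mass
    return (q * b).T                                  # (B, K)

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def swapped_loss(scores_a, scores_b, temperature=0.1):
    """View A predicts view B's assignment and vice versa."""
    q_a, q_b = sinkhorn(scores_a), sinkhorn(scores_b)  # treated as fixed targets
    log_p_a = log_softmax(scores_a / temperature)
    log_p_b = log_softmax(scores_b / temperature)
    loss_ab = -np.mean(np.sum(q_b * log_p_a, axis=1))
    loss_ba = -np.mean(np.sum(q_a * log_p_b, axis=1))
    return float(0.5 * (loss_ab + loss_ba))

rng = np.random.default_rng(0)
scores_a = rng.uniform(-1, 1, (32, 10))   # stand-in for normalized z @ prototypes.T
scores_b = rng.uniform(-1, 1, (32, 10))
assignments = sinkhorn(scores_a)
loss = swapped_loss(scores_a, scores_b)
```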
Exercise 2: Add Multi-Crop
Implement DINO-style multi-crop augmentation:
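As a possible starting point, the sketch below generates two large "global" crops and several small, low-resolution "local" crops from one image. The crop sizes and scale ranges are assumed defaults to tune rather than prescribed values, the helper names are hypothetical, and color augmentation is left out.

```python
import numpy as np

def resized_crop(img, rng, scale, out_size):
    """Random square crop covering `scale` of the area, resized to out_size."""
    h, w, _ = img.shape
    side = max(1, int(round(np.sqrt(rng.uniform(*scale)) * min(h, w))))
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    crop = img[top:top + side, left:left + side]
    idx = np.arange(out_size) * side // out_size   # nearest-neighbour resize
    return crop[idx][:, idx]

def multi_crop(img, rng, n_local=6, global_size=224, local_size=96,
               global_scale=(0.4, 1.0), local_scale=(0.05, 0.4)):
    """Return (global_crops, local_crops) for one image."""
    global_crops = [resized_crop(img, rng, global_scale, global_size) for _ in range(2)]
    local_crops = [resized_crop(img, rng, local_scale, local_size) for _ in range(n_local)]
    return global_crops, local_crops

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))
global_crops, local_crops = multi_crop(image, rng)
```

In DINO the teacher network sees only the global crops while the student sees all crops, which encourages local-to-global correspondence.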
Exercise 3: Compare Methods
Train SimCLR, BYOL, and MAE on CIFAR-10 and compare:
- Training time
- Linear probe accuracy
- Fine-tuning accuracy
- Representation quality (t-SNE)
What’s Next?
Reinforcement Learning for DL
RLHF, PPO, and preference optimization
Neural Architecture Search
Automated architecture discovery
Interview Deep-Dive
Compare contrastive learning (SimCLR) with masked image modeling (MAE). When would you choose one over the other?
Strong Answer:
SimCLR learns representations by pulling together two augmented views of the same image while pushing apart views of different images. It requires large batch sizes (4096-8192) for sufficient negatives, and captures primarily high-level semantic features (object identity and category) but can miss low-level visual details like texture and spatial structure.
MAE masks 75% of image patches and trains the model to reconstruct the missing pixels. It needs no negatives or large batches. MAE representations are richer in spatial and structural information because pixel reconstruction forces understanding of geometry, texture, and local context.
Choose SimCLR when your downstream task is classification or retrieval where discriminative semantic embeddings matter. SimCLR representations transfer better to linear probing because they are more discriminative by construction.
Choose MAE when your downstream task requires dense prediction (segmentation, detection, depth estimation) or when pretraining a ViT from scratch on smaller datasets. MAE’s reconstruction objective provides a very strong learning signal that overcomes ViT’s lack of inductive bias, while SimCLR on small datasets produces weaker representations due to insufficient negative diversity.
The current trend: DINOv2 combines self-distillation and masked image modeling, getting the best of both approaches.
Follow-up: Why does SimCLR need such large batch sizes, and how does BYOL avoid this requirement?
SimCLR’s contrastive loss needs negatives to define what to push apart. With small batches, most negatives are “easy” (very different images). Large batches increase the probability of “hard” negatives that force finer-grained representations.
BYOL sidesteps negatives entirely using two networks: an online network and a target network (an exponential moving average of the online network). The online network predicts the target’s representation of a different view. The asymmetry (a predictor MLP exists only in the online network, and the target updates slowly via EMA) prevents collapse without needing negatives. The BYOL paper reports that accuracy stays stable as the batch size is reduced from 4096 to 256, which makes it accessible without massive GPU clusters.
How is self-supervised pretraining used in large language models, and why is next-token prediction such an effective objective?
Strong Answer:
Every major LLM (GPT, LLaMA, Mistral) uses causal language modeling: predicting the next token given all previous tokens. The “labels” are just the next word in existing text, requiring no human annotation.
Why it works so well: next-token prediction is an extraordinarily rich objective that implicitly requires learning syntax, semantics, world knowledge, reasoning, and theory of mind. To predict the next word after “The capital of France is,” the model must encode geographic knowledge. To complete code, it must understand programming logic. Many downstream tasks can be framed as a special case of next-token prediction.
The information-theoretic argument: English text is estimated to carry roughly 1-2 bits of entropy per character. A model achieving near-human perplexity has necessarily compressed vast world knowledge into its parameters, because that knowledge is required for accurate prediction. This is not just pattern matching; it is implicit multi-task learning.
The practical insight: data quality matters enormously. A model trained on high-quality data with fewer tokens typically outperforms one trained on low-quality data with more tokens. Teams like Meta invest months in data curation, deduplication, and quality filtering before training begins.
Follow-up: BERT uses masked language modeling instead of causal language modeling. Why did GPT-style causal models win for generative tasks?
Causal models are autoregressive: they generate tokens left-to-right, matching how text generation works in practice. BERT is bidirectional (it sees context from both sides of masked tokens), giving stronger understanding but making it unsuitable for generation, since you cannot condition on future tokens that do not exist yet.
The deeper reason causal models won: generation is more commercially valuable and more general than understanding. A model that generates coherent text implicitly understands context, but an understanding model cannot easily generate. This asymmetry drove industry convergence on autoregressive architectures despite BERT’s early dominance on language-understanding benchmarks.
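To make the "labels come from the text itself" point concrete, here is a small standard-library sketch. It builds next-token training pairs by shifting a token sequence by one position, and converts the probabilities a model assigned to the true next tokens into cross-entropy, bits per token, and perplexity. The function names and the probability values are hypothetical.

```python
import math

def next_token_pairs(tokens):
    """Inputs are all tokens but the last; targets are the same sequence shifted by one."""
    return tokens[:-1], tokens[1:]

def cross_entropy_stats(target_probs):
    """target_probs[i] is the probability the model gave to the true next token."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return {
        "nats_per_token": nll,
        "bits_per_token": nll / math.log(2),
        "perplexity": math.exp(nll),
    }

tokens = ["The", "capital", "of", "France", "is", "Paris"]
inputs, targets = next_token_pairs(tokens)
# At position i the model sees inputs[:i + 1] and is scored on targets[i].

# Hypothetical model that is always torn between 4 equally likely tokens:
stats = cross_entropy_stats([0.25] * len(targets))
```

A model that spreads its probability evenly over four candidates has a perplexity of 4, which corresponds to 2 bits per token. Lower perplexity means the model has captured more of the structure needed to predict the text.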
You have 10M unlabeled domain-specific images and 5K labeled images. Design a pretraining and fine-tuning strategy.
Strong Answer:
Phase one: self-supervised pretraining on the 10M unlabeled images using MAE. MAE works well without huge batch sizes, and for domain-specific data, the spatial and structural features from reconstruction are more useful than SimCLR’s purely semantic features. Pretrain a ViT-Base for 200-400 epochs; MAE’s 75% masking makes each epoch cheap.
Phase two: supervised fine-tuning on the 5K labeled images. Use aggressive augmentation (RandAugment, CutMix, MixUp), a low learning rate (1e-4 with cosine decay), label smoothing (0.1), and weight decay (0.05) to prevent overfitting. Fine-tune the full model, not just a linear probe, since 5K images is sufficient for end-to-end tuning.
Phase three: semi-supervised refinement. Use the fine-tuned model to pseudo-label the most confident 50% of the 10M unlabeled images. Retraining on real labels plus pseudo-labels can yield a few additional points of accuracy.
If the domain is very different from ImageNet (microscopy, satellite, sonar), domain-specific pretraining significantly outperforms ImageNet pretraining. If it is similar to natural images, the gap narrows but domain pretraining often still wins.
Follow-up: How do you validate that self-supervised pretraining learned useful representations before investing in expensive fine-tuning?
Three quick checks, under one hour in total. First, linear probing: freeze the backbone, attach a linear classifier, and train on 500 of the 5K labels. Accuracy significantly above random indicates useful features. Second, nearest-neighbor retrieval: embed all 5K labeled images and check whether nearest neighbors share labels. High label agreement at k=5 indicates good semantic organization. Third, t-SNE visualization of the labeled-set embeddings: clusters corresponding to meaningful categories confirm structural learning. These checks provide high confidence before committing to full fine-tuning.
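Two of the steps above reduce to a few lines of array code. The NumPy sketch below shows the nearest-neighbor label-agreement check and the selection of the most confident pseudo-labels. Function names, the cosine-similarity choice, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def knn_label_agreement(embeddings, labels, k=5):
    """Fraction of each sample's k nearest neighbors (cosine) that share its label."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)                 # a sample is not its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    return float((labels[neighbors] == labels[:, None]).mean())

def select_confident(probs, keep_fraction=0.5):
    """Keep the most confident predictions and return (indices, pseudo_labels)."""
    confidence = probs.max(axis=1)
    n_keep = int(len(probs) * keep_fraction)
    keep = np.argsort(-confidence)[:n_keep]
    return keep, probs[keep].argmax(axis=1)

rng = np.random.default_rng(0)
# Synthetic embeddings: two well-separated clusters standing in for two classes.
emb = np.concatenate([
    np.array([5.0, 0.0]) + 0.1 * rng.standard_normal((20, 2)),
    np.array([0.0, 5.0]) + 0.1 * rng.standard_normal((20, 2)),
])
lab = np.array([0] * 20 + [1] * 20)
agreement = knn_label_agreement(emb, lab, k=5)

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.5, 0.5]])
keep, pseudo = select_confident(probs, keep_fraction=0.5)
```

For 10M images the full similarity matrix would not fit in memory, so the pairwise computation would need batching or an approximate nearest-neighbor index. The sketch is meant for the 5K labeled set.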