Model Interpretability: Opening the Black Box
Why Interpretability Matters
Deep learning models are often “black boxes”: they work, but we do not know why. A model that classifies skin lesions with 95% accuracy is of little use in a clinic if the dermatologist cannot understand which features it relies on. Is it looking at the lesion’s border irregularity (good) or at the presence of a ruler in the image (bad, and a known confound because rulers are common in training images of confirmed melanomas)? This is not a theoretical concern. Interpretability matters for:
- Trust: A radiologist will not act on a model’s prediction without understanding its reasoning
- Debugging: When a model fails on production data, you need to know why to fix it
- Fairness: You must verify the model is not using protected attributes like race or gender as proxies
- Compliance: Regulations such as the GDPR (often read as implying a “right to explanation”) and the EU AI Act impose transparency obligations on automated decision systems
- Science: Understanding what a model has learned can reveal genuine scientific insights about the data
Gradient-Based Methods
Vanilla Gradients
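A minimal sketch of a vanilla gradient saliency map, under loud assumptions: the "network" is a hypothetical two-layer tanh model defined inline, and its gradient is derived by hand rather than obtained from an autograd framework. The names `class_score` and `vanilla_gradient` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 4                        # tiny "image" size, chosen only for illustration
W1 = rng.normal(size=(8, H * W))   # hidden-layer weights of the toy model (assumed)
b1 = rng.normal(size=8)            # hidden-layer biases
w2 = rng.normal(size=8)            # output weights for a single class score

def class_score(x):
    """Class score of the toy model for a flattened image x."""
    return w2 @ np.tanh(W1 @ x + b1)

def vanilla_gradient(x):
    """d(class_score)/dx, derived by hand (chain rule) for this toy model."""
    h = np.tanh(W1 @ x + b1)
    return W1.T @ (w2 * (1.0 - h ** 2))

image = rng.uniform(size=(H, W))
# The saliency map is usually plotted as the gradient magnitude per pixel.
saliency_map = np.abs(vanilla_gradient(image.ravel())).reshape(H, W)
most_sensitive_pixel = np.unravel_index(saliency_map.argmax(), saliency_map.shape)
print("most sensitive pixel (row, col):", most_sensitive_pixel)
```

With a real network the hand-written gradient would be replaced by one backward pass from the class score to the input.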
The simplest approach: compute the gradient of the output class score with respect to the input pixels. The intuition is direct: pixels with large gradient magnitudes are the pixels where a small change would most affect the class score. If the gradient at pixel (100, 200) is large for the “cat” class, then a small change to that pixel significantly affects how strongly the model believes the image contains a cat.
Gradient × Input
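A small sketch of gradient × input, assuming a hypothetical bias-free ReLU model defined inline. That choice is deliberate: for a bias-free, piecewise-linear model the elementwise product of gradient and input sums exactly to the score, which makes the attribution easy to check. With biases or other nonlinearities that equality is no longer guaranteed.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(6, 5))   # toy hidden-layer weights (assumed, no biases)
w2 = rng.normal(size=6)        # toy output weights

def score(x):
    """Score of a bias-free ReLU model."""
    return w2 @ np.maximum(W1 @ x, 0.0)

def gradient(x):
    """d(score)/dx: only the hidden units that are active pass gradient."""
    active = (W1 @ x > 0).astype(float)
    return W1.T @ (w2 * active)

x = rng.normal(size=5)
attribution = gradient(x) * x   # gradient x input, one value per input feature

print("score:              ", score(x))
print("sum of attributions:", attribution.sum())
```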
Multiply the gradient elementwise by the input to obtain sharper attributions.
Integrated Gradients
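A minimal sketch of Integrated Gradients using a midpoint Riemann sum along the straight line from baseline to input. The logistic toy model, its hand-derived `grad_fn`, and the step count are assumptions made for illustration. In practice the gradient would come from an autograd framework.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    grad_fn(z) must return d(output)/dz with the same shape as z.
    """
    alphas = (np.arange(steps) + 0.5) / steps          # midpoints in (0, 1)
    total = np.zeros_like(x, dtype=float)
    for alpha in alphas:
        total += grad_fn(baseline + alpha * (x - baseline))
    return (x - baseline) * total / steps

# Toy model (assumed): a logistic unit, sigmoid(w . z).
w = np.array([2.0, -1.0, 0.5])

def model(z):
    return 1.0 / (1.0 + np.exp(-(w @ z)))

def grad_fn(z):
    p = model(z)
    return p * (1.0 - p) * w

x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(grad_fn, x, baseline)

# Completeness: the attributions should sum to model(x) - model(baseline).
print(attributions.sum(), model(x) - model(baseline))
```

The number of steps trades accuracy for compute. Comparing the two printed values is a cheap way to judge whether the approximation has converged.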
Vanilla gradients have a fundamental flaw: they capture only local sensitivity, not total attribution. A pixel might have a small gradient (the model is locally flat, for example because a unit has saturated) yet still be critically important for the prediction. Integrated Gradients addresses this by accumulating gradients along a straight-line path from a baseline (typically a black image) to the actual input, then scaling the result by the difference between the input and the baseline. This satisfies important theoretical properties such as “completeness”: the attributions sum to the difference between the model’s output at the input and at the baseline.
SmoothGrad
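A minimal sketch of SmoothGrad, assuming Gaussian noise and a caller-supplied `grad_fn`. The toy function with a rapidly oscillating gradient is a hypothetical stand-in for the noisy gradients of a real network, and the noise level and sample count are illustrative rather than recommended values.

```python
import numpy as np

def smoothgrad(grad_fn, x, noise_std=0.1, n_samples=50, seed=0):
    """Average grad_fn over n_samples Gaussian-perturbed copies of x."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        noisy_x = x + rng.normal(scale=noise_std, size=x.shape)
        total += grad_fn(noisy_x)
    return total / n_samples

# Toy function (assumed): f(x) = sum(x + 0.1 * sin(50 x)).
# Its gradient, 1 + 5 cos(50 x), swings wildly between nearby inputs.
def grad_fn(x):
    return 1.0 + 5.0 * np.cos(50.0 * x)

x = np.array([0.2, 0.5, 0.9])
raw = grad_fn(x)                                         # dominated by the oscillation
smooth = smoothgrad(grad_fn, x, noise_std=0.1, n_samples=2000)

print("raw gradient:   ", raw)
print("smoothed (~1.0):", smooth)
```

Averaging removes the oscillating term and leaves the underlying trend, which is the effect SmoothGrad aims for on image saliency maps.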
Average the gradients over several noisy copies of the input to suppress noise in the saliency map.
Class Activation Mapping (CAM)
Grad-CAM
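A minimal sketch of the Grad-CAM computation itself, assuming the feature maps of a convolutional layer and the gradients of the class score with respect to them are already available as arrays. In a real framework these would typically be captured with hooks. The toy arrays below are hypothetical and mimic a score that is a linear function of globally averaged feature maps.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer.

    activations, gradients: arrays of shape (K, H, W), where gradients holds
    d(class score)/d(activations).
    """
    weights = gradients.mean(axis=(1, 2))               # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)    # weighted channel sum -> (H, W)
    cam = np.maximum(cam, 0.0)                          # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                           # scale to [0, 1] for display
    return cam

# Toy data (assumed): channel 0 fires at (1, 2), channel 1 fires at (3, 0).
K, H, W = 2, 4, 4
activations = np.zeros((K, H, W))
activations[0, 1, 2] = 3.0
activations[1, 3, 0] = 2.0
# Class score = 1 * mean(channel 0) - 1 * mean(channel 1), so gradients are constant.
v = np.array([1.0, -1.0])
gradients = np.broadcast_to(v[:, None, None] / (H * W), (K, H, W))

heatmap = grad_cam(activations, gradients)
print(heatmap)
```

For display, the coarse heatmap is normally upsampled to the input resolution and overlaid on the image. That step is omitted here.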
Grad-CAM (Gradient-weighted Class Activation Mapping) is one of the most widely used interpretability methods in practice, and for good reason: it produces intuitive, coarse-grained heatmaps that show which regions of the image were most important for a particular class prediction. Unlike the pixel-level gradient methods above, which tend to produce noisy, hard-to-interpret saliency maps, Grad-CAM works at the feature-map level: it weights the feature maps of a convolutional layer (usually the last one) by the gradients of the class score, producing smooth heatmaps that align with human-understandable regions.
Grad-CAM++
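Grad-CAM++ is a refinement of Grad-CAM aimed at better localization, particularly when several instances of a class appear in the same image.
Attention Visualization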
Visualizing Transformer Attention
SHAP (SHapley Additive exPlanations)
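A minimal sketch of exact Shapley values by brute-force enumeration of feature subsets, using only the standard library. It assumes that an "absent" feature is replaced by a baseline value of zero, which is one of several possible conventions. The toy model and the names `shapley_values` and `value_fn` are hypothetical. This is not how the SHAP library computes its approximations.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values. The cost grows as 2**n_features model evaluations."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Probability that exactly this many features precede i in a random ordering.
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Toy model (assumed): f(x) = 3*x0 + 2*x1*x2, explained at x = (1, 1, 1).
def value_fn(present):
    x = [1.0 if j in present else 0.0 for j in range(3)]   # absent -> baseline 0
    return 3.0 * x[0] + 2.0 * x[1] * x[2]

phi = shapley_values(value_fn, 3)
print(phi)   # approximately [3.0, 1.0, 1.0]; the interaction term is split equally
```

The attributions sum to `value_fn` of all features minus `value_fn` of none, and the two interacting features receive equal credit.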
SHAP brings cooperative game theory to model interpretability. If a group of players (features) collectively achieves some payoff (the model’s prediction), how do you fairly distribute the credit? Shapley values are the unique attribution that satisfies a set of desirable fairness axioms, including symmetry (features that contribute equally get equal credit), efficiency (the attributions sum to the difference between the prediction and the baseline output), and monotonicity (a feature whose marginal contribution is never negative never gets negative credit). The catch: computing exact Shapley values requires evaluating the model on every subset of features, which is exponential in the number of features. DeepSHAP and GradientSHAP provide efficient approximations that leverage the network’s structure.
Feature Visualization
Activation Maximization
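A toy sketch of activation maximization: gradient ascent on the input, with a projection step that keeps the input bounded. The single tanh "neuron", its hand-written gradient, the step size, and the unit-ball constraint are all assumptions for illustration. Real feature visualization runs the same loop through a deep network with autograd and adds regularizers to keep the image natural-looking.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=16)                  # weights of one hypothetical neuron

def activation(x):
    return np.tanh(w @ x)

def activation_grad(x):
    """d(activation)/dx for the toy neuron."""
    return (1.0 - np.tanh(w @ x) ** 2) * w

def maximize_activation(x_start, steps=200, lr=0.1):
    x = x_start.copy()
    for _ in range(steps):
        x = x + lr * activation_grad(x)          # gradient ascent on the input
        x = x / max(1.0, np.linalg.norm(x))      # project back onto the unit ball
    return x

x0 = 0.01 * rng.normal(size=16)          # start from small random noise
x_opt = maximize_activation(x0)

# For this neuron the best bounded input points along w, so the cosine should be near 1.
cosine = (x_opt @ w) / (np.linalg.norm(x_opt) * np.linalg.norm(w))
print("activation before/after:", activation(x0), activation(x_opt))
print("cosine(x_opt, w):", cosine)
```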
While attribution methods answer “which inputs matter?”, feature visualization answers “what does this neuron detect?” By starting from random noise and iteratively modifying the image to maximize a specific neuron’s activation, we can see an idealized version of the pattern that neuron responds to. Early layers reveal edge and texture detectors; deeper layers reveal complex pattern detectors such as dog faces, wheels, or building facades. This technique, popularized by the “Feature Visualization” article published in Distill, is both scientifically illuminating and occasionally surreal.
Concept-Based Explanations
TCAV (Testing with Concept Activation Vectors)
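A toy sketch of the TCAV pipeline: fit a linear classifier that separates concept activations from random activations, take its normal vector as the concept activation vector (CAV), and report the fraction of class examples whose logit increases along that direction. Everything here is assumed for illustration: the activations are synthetic, a least-squares classifier stands in for whatever linear classifier is used in practice, and the class logit is a hand-written function of one unit.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5                                            # size of the toy activation vector

# Synthetic layer activations: concept examples are shifted along unit 0.
concept_acts = rng.normal(size=(100, dim)) + np.array([2.0, 0.0, 0.0, 0.0, 0.0])
random_acts = rng.normal(size=(100, dim))

def fit_cav(concept, random):
    """Least-squares linear classifier; the CAV is its unit normal vector."""
    X = np.vstack([concept, random])
    X = np.hstack([X, np.ones((len(X), 1))])       # bias column
    y = np.concatenate([np.ones(len(concept)), -np.ones(len(random))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    v = coef[:-1]                                  # drop the bias term
    return v / np.linalg.norm(v)

def logit_grad(acts):
    """Gradient of a toy class logit, 2 * tanh(a_0), w.r.t. the activations."""
    g = np.zeros_like(acts)
    g[:, 0] = 2.0 * (1.0 - np.tanh(acts[:, 0]) ** 2)
    return g

cav = fit_cav(concept_acts, random_acts)
class_acts = rng.normal(size=(50, dim))            # activations of class examples
sensitivities = logit_grad(class_acts) @ cav       # directional derivatives along the CAV
tcav_score = float(np.mean(sensitivities > 0))     # fraction positively influenced

print("CAV:", np.round(cav, 2))
print("TCAV score:", tcav_score)
```

Because the toy logit depends only on the unit that the concept shifts, every class example is positively sensitive to the concept. A real analysis would also repeat the fit against several random sets to check that the score is statistically meaningful.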
TCAV moves beyond pixel-level explanations to concept-level ones, which is what humans actually care about. Instead of asking “which pixels matter?”, TCAV asks “does the concept of stripes influence the model’s prediction of zebra?” You define a concept by providing example images (images with stripes vs. random images), TCAV learns a direction in the model’s activation space that corresponds to that concept, and then it measures how much pushing the representation along that direction affects the prediction. This approach speaks the language of domain experts: a doctor can ask “does the presence of calcification influence the model’s diagnosis?” without needing to understand gradients or attention maps.
Practical Considerations
Exercises
Exercise 1: Compare Attribution Methods
Compare Grad-CAM, Integrated Gradients, and SHAP on the same images:
Exercise 2: Sanity Checks
Implement sanity checks for explanation methods:
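One possible starting point, assuming the exercise has the model-parameter randomization check in mind: compute an attribution with the trained weights, recompute it after re-initializing the weights, and compare the two. An explanation that barely changes is not really explaining the model. The toy network, the random "trained" weights, and the helper names below are all hypothetical.

```python
import numpy as np

def saliency(params, x):
    """Vanilla gradient of a toy tanh network's score with respect to x."""
    W1, w2 = params
    h = np.tanh(W1 @ x)
    return W1.T @ (w2 * (1.0 - h ** 2))

def rank_correlation(a, b):
    """Spearman rank correlation (assumes no ties, which holds for continuous values)."""
    ranks_a = np.argsort(np.argsort(a)).astype(float)
    ranks_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

def random_params(rng, n_hidden, n_inputs):
    return (rng.normal(size=(n_hidden, n_inputs)) / np.sqrt(n_inputs),
            rng.normal(size=n_hidden))

rng = np.random.default_rng(0)
n_inputs, n_hidden = 64, 16
trained = random_params(rng, n_hidden, n_inputs)      # stands in for trained weights
randomized = random_params(rng, n_hidden, n_inputs)   # weights after re-initialization
x = rng.normal(size=n_inputs)

similarity = rank_correlation(saliency(trained, x), saliency(randomized, x))
print("rank correlation after randomization:", similarity)
```

A correlation close to 1 after randomization would be a warning sign for the explanation method. To extend the exercise, swap `saliency` for each attribution method under test.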
Exercise 3: Build TCAV for Custom Concepts
Create CAVs for your own concepts:
What’s Next?
Adversarial Robustness
Defend against adversarial attacks
Knowledge Distillation
Transfer knowledge between models