mechanistic interpretability is, at its core, reverse engineering a trained DL model to peek into what it is looking at in order to perform the task at hand

GitHub Repo: Tickling-Vision-Models

core concepts:

feature visualization (activation maximization)

using Gradient Ascent to find images / samples that maximize the activation of a selected circuit (neuron / conv channel / layer / …)

$$ x^* = \arg\max_{x} a_i(x) - \lambda \mathrm{R}(x) $$

where $\mathrm{R}(x)$ is a regularization term; common choices include an L2 penalty on the input, total variation, or robustness to small transformations (jitter, scaling, rotation)
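
a minimal sketch of the gradient-ascent loop in PyTorch, assuming a recent torchvision ResNet-18; `layer3`, channel 42, the step count, and the L2 weight are arbitrary choices standing in for the general recipe:

```python
import torch
import torchvision.models as models

# assumptions: torchvision ResNet-18, layer3 as the probed layer, channel 42 as the target unit
model = models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)

target_channel = 42
lam = 1e-3  # weight of the regularizer R(x) = mean ||x||^2

activations = {}
def hook(module, inp, out):
    activations["feat"] = out
model.layer3.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
opt = torch.optim.Adam([x], lr=0.05)

for step in range(256):
    opt.zero_grad()
    model(x)
    a = activations["feat"][0, target_channel].mean()  # a_i(x): mean channel activation
    loss = -a + lam * x.pow(2).mean()                  # maximize a_i(x) - lambda * R(x)
    loss.backward()
    opt.step()

x_star = x.detach()  # the visualization (clip / normalize before viewing)
```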


Linear Probing

train a linear classifier on frozen activations to check what the learned representation encodes and to get an intuition of the manifold's shape
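
a minimal probing sketch with scikit-learn; the `.npy` file names are hypothetical placeholders for activations already extracted from a frozen layer and the labels of the property being probed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# assumption: acts is (N, d) frozen activations, labels is (N,) the probed property
acts = np.load("activations.npy")   # hypothetical file
labels = np.load("labels.npy")      # hypothetical file

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)

# high linear-probe accuracy => the property is (close to) linearly decodable at this layer
print("probe accuracy:", probe.score(X_te, y_te))
```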


SAE decomposition

$$ z = E(h), \quad \hat{h} = D(z) \\ \mathcal{L} = \frac{1}{N} \sum\|h_n - D(E(h_n))\|_2^2 + \lambda \|E(h_n)\|_1 $$

learning a disentangled latent representation by encouraging sparsity, so that each activation is explained by a small set of active latents
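
a minimal SAE sketch in PyTorch matching the loss above; the dictionary size and the L1 weight are arbitrary:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # d_model: activation dimension, d_latent: dictionary size
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))   # sparse code z = E(h)
        h_hat = self.decoder(z)           # reconstruction h_hat = D(z)
        return h_hat, z

def sae_loss(h, h_hat, z, lam=1e-3):
    recon = (h - h_hat).pow(2).sum(dim=-1).mean()   # ||h_n - D(E(h_n))||_2^2
    sparsity = z.abs().sum(dim=-1).mean()           # ||E(h_n)||_1
    return recon + lam * sparsity

# usage sketch: sae = SparseAutoencoder(512, 4096); h_hat, z = sae(h_batch)
```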


Feature Manifold & Geometry

manifold cluster for a feature $\mathrm{F}$: the subset of activation points sharing $\mathrm{F}$

$$ M_F = \{\, a_L(x) \mid x \text{ has feature } F \,\}, \quad a_L \text{ is the activation of layer } L $$

$$ \dim(M_F) \le d \\ \dim(M_F) \approx \mathrm{PCA}_k, \quad \text{top } k \text{ eigenvectors preserving 95\% of the variance} $$

local dimension via PCA:

$$ \tilde{A} = A - \mathbf{1} \mu^T, \quad \mu = \frac{1}{n} \sum a_i \\ \Sigma = \frac{1}{n-1} \tilde{A}^T\tilde{A} $$

$$ \text{Given eigenvalues } \lambda_1, \lambda_2, \dots, \lambda_d \\ \text{intrinsic dimension: } k = \min\left\{ m \,\middle|\, \frac{\sum_{i=1}^m \lambda_i}{\sum_{i=1}^d \lambda_i} \ge 0.95 \right\} $$
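
a small numpy sketch of this local-PCA estimate:

```python
import numpy as np

def intrinsic_dim(A: np.ndarray, var_threshold: float = 0.95) -> int:
    """A: (n, d) activations of points sharing feature F (a local neighborhood of M_F)."""
    A_tilde = A - A.mean(axis=0, keepdims=True)          # center: A - 1 mu^T
    cov = A_tilde.T @ A_tilde / (A.shape[0] - 1)          # covariance Sigma
    eigvals = np.linalg.eigvalsh(cov)[::-1]               # eigenvalues, descending
    ratio = np.cumsum(eigvals) / eigvals.sum()
    return int(np.argmax(ratio >= var_threshold)) + 1     # smallest k reaching the threshold

# usage: k = intrinsic_dim(acts_with_feature_F)
```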

curvature: flat manifold → smooth travel on the surface; curved manifold → bumpy ride (a short step might jump to a semantically different region)

$$ d_G(a, b) = \inf_{\gamma} \int_0^1 \sqrt{g_{\gamma(t)} \big(\dot{\gamma}(t), \dot{\gamma}(t)\big)}\, dt, \quad \gamma(0) = a,\ \gamma(1) = b $$

$$ d_G: \text{ geodesic distance, } d_E: \text{ Euclidean distance} \\ \text{curvature index: } \quad \kappa(p, q) = \frac{d_G(p, q) - d_E(p, q)}{d_E(p, q)} $$

Geodesic distance: walking on the manifold

Euclidean distance: walking in a straight line (cutting through the ambient space rather than along the manifold)
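
a sketch of the curvature index, approximating the geodesic distance with shortest paths on a k-NN graph (Isomap-style); the neighborhood size `k` is an arbitrary choice:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def curvature_index(acts: np.ndarray, p: int, q: int, k: int = 10) -> float:
    """acts: (n, d) activations sampled from the manifold; p, q: indices of two points."""
    graph = kneighbors_graph(acts, n_neighbors=k, mode="distance")
    d_G = shortest_path(graph, method="D", directed=False)[p, q]   # graph geodesic
    d_E = np.linalg.norm(acts[p] - acts[q])                        # straight-line distance
    return (d_G - d_E) / d_E                                        # ~0: flat, >>0: curved

# usage: kappa = curvature_index(acts, 0, 1)
```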

connectedness: all x containing F are reachable from one another through a smooth shift / interpolation

test: interpolate between two samples sharing F and track the effect on classification / feature identification & activation
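
a minimal version of that test in PyTorch, interpolating in input space and tracking the probability of the class associated with F (the class index in the usage line is hypothetical):

```python
import torch

@torch.no_grad()
def interpolation_trace(model, x_a, x_b, target_class: int, steps: int = 20):
    """Linearly interpolate between two inputs that share feature F and track the
    model's confidence for the class associated with F along the path."""
    trace = []
    for alpha in torch.linspace(0, 1, steps):
        x = (1 - alpha) * x_a + alpha * x_b
        prob = torch.softmax(model(x), dim=-1)[0, target_class].item()
        trace.append(prob)
    return trace  # a dip along the path suggests the two samples sit in disconnected regions

# usage: trace = interpolation_trace(model, img1, img2, target_class=207)
```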


Adversarial Examples (FGSM & PGD) & mechanistic view

$$ \text{FGSM: } \quad x^\prime = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(f(x), y)\big) \\ \text{PGD: } \quad x_{t+1} = \Pi_\epsilon\Big(x_t + \lambda \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(f(x_t), y)\big)\Big), \quad \Pi_\epsilon \text{ projects back onto the } \epsilon\text{-ball around } x $$
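
minimal PyTorch sketches of both attacks, assuming inputs scaled to [0, 1]; the ε, step size, and iteration count are typical but arbitrary values:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8/255):
    """One-step attack: move along the sign of the input gradient of the loss."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps=8/255, step=2/255, iters=10):
    """Iterated FGSM with projection back into the eps-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # project onto the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```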

adversarial perturbation introduces a change in activation space, $\Delta a = a(x^\prime) - a(x)$, that can be decomposed using an SAE ⇒ identify hijacked circuits via the learned sparse representation

$$ a \approx W_{decode} \cdot h, \quad h \approx \mathrm{ReLU}(W_{encode} \cdot a) \\ \Delta h = h(x^\prime) - h(x) $$
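
a sketch of the comparison, assuming an SAE (e.g. the `SparseAutoencoder` sketch above) already trained on the probed layer, whose forward pass returns the reconstruction and the sparse code:

```python
import torch

@torch.no_grad()
def hijacked_latents(sae, a_clean, a_adv, top_k=10):
    """a_clean, a_adv: activations a(x), a(x') at the probed layer, shape (d_model,).
    Returns the SAE latents whose sparse codes change the most under the attack."""
    _, h_clean = sae(a_clean.unsqueeze(0))
    _, h_adv = sae(a_adv.unsqueeze(0))
    delta_h = (h_adv - h_clean).squeeze(0)                  # Delta h = h(x') - h(x)
    vals, idx = delta_h.abs().topk(top_k)
    return list(zip(idx.tolist(), delta_h[idx].tolist()))   # (latent index, signed change)
```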

Adversarial Path:

An adversarial path is a continuous trajectory through input space (or representation space) that starts at one label and ends at another, while staying imperceptible or minimally different to a human observer.

formal definition:

$$ \min_{x(t)} \int_0^1 \big\| \nabla_x \mathcal{L}\big(f(x(t)), y(t)\big) \big\| \, dt $$


clustering activation atlases:

looking for fractured concepts: a single concept whose activations split into multiple clusters
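
a minimal clustering sketch with scikit-learn over pooled activations of concept-bearing inputs; the file name and the cluster count are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# assumption: acts is (N, d) spatially pooled activations for inputs that all contain the concept
acts = np.load("activations.npy")   # hypothetical file

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(acts)
sizes = np.bincount(kmeans.labels_)
print("cluster sizes:", sizes)   # several well-populated clusters => the concept is fractured
```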


Circuit analysis techniques:

ablation:

silence a circuit (zero / mean its activations) and measure the downstream effect

$$ \Delta \text{logit} = \text{logit}_{\text{original}} - \text{logit}_{\text{ablated}} $$
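
a sketch of zero-ablation via a forward hook in PyTorch; the layer, channel, and class index in the usage line are hypothetical:

```python
import torch

def ablate_channel(model, x, layer, channel: int, target_class: int):
    """Zero-ablate one channel of `layer` and measure the change in the target logit."""
    with torch.no_grad():
        logit_orig = model(x)[0, target_class].item()

    def zero_hook(module, inp, out):
        out = out.clone()
        out[:, channel] = 0.0        # silence the circuit
        return out

    handle = layer.register_forward_hook(zero_hook)
    with torch.no_grad():
        logit_abl = model(x)[0, target_class].item()
    handle.remove()

    return logit_orig - logit_abl    # Delta logit

# usage: ablate_channel(model, img, model.layer3, channel=42, target_class=207)
```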

patching:

replacing a subset of the target image's activations with the corresponding activations from a donor image
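
a sketch of activation patching with forward hooks in PyTorch: cache the donor's activations, then splice the chosen channels into the target's forward pass:

```python
import torch

def patch_channels(model, x_target, x_donor, layer, channels, target_class: int):
    """Run the donor image, cache its activations at `layer`, then replace those channels
    in the target image's forward pass and read off the effect on the target logit."""
    cache = {}

    def save_hook(module, inp, out):
        cache["donor"] = out.detach()

    h = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(x_donor)
    h.remove()

    def patch_hook(module, inp, out):
        out = out.clone()
        out[:, channels] = cache["donor"][:, channels]   # splice in donor activations
        return out

    h = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        logit_patched = model(x_target)[0, target_class].item()
    h.remove()
    return logit_patched

# usage: patch_channels(model, img_target, img_donor, model.layer3, [12, 42], target_class=207)
```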

feature visualization in the learned sparse representation (maximize an SAE latent instead of a raw neuron / channel)


Metrics:

to check

TCAV: Testing with Concept Activation Vectors

drawing a linear decision boundary in the latent space of a selected layer of the model / network, separating samples with feature F from samples without it

CAV formally:

$$ p_c(x) = \sigma\big(a_\ell(x) \cdot W + b\big) \\ \text{Concept Activation Vector (normal to the boundary): } \quad V_c = \frac{W}{\|W\|} $$

given $f_i(x)$: logit for class $i$, $c$: concept, $\ell$: layer

$$ \mathrm{TCAV}_{c, i} = \frac{1}{\mid X \mid} \sum_{x \in X} \mathbf{1} \!\left[\frac{\partial f_i(x)}{\partial \mathbf{a}_\ell(x)} \cdot \mathbf{V}_c > 0\right] $$

how often the gradient aligns with $\mathbf{V}_c$ reflects the concept's contribution / correspondence to that class
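
a minimal sketch of fitting a CAV and computing the TCAV score with scikit-learn / numpy, assuming the concept and random activations and the per-example gradients $\partial f_i / \partial a_\ell$ have already been collected:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(acts_concept: np.ndarray, acts_random: np.ndarray) -> np.ndarray:
    """Fit a linear boundary between concept and non-concept activations at layer l;
    the CAV V_c is the unit normal of that boundary."""
    X = np.concatenate([acts_concept, acts_random])
    y = np.concatenate([np.ones(len(acts_concept)), np.zeros(len(acts_random))])
    clf = LogisticRegression(max_iter=2000).fit(X, y)
    w = clf.coef_[0]
    return w / np.linalg.norm(w)

def tcav_score(grads: np.ndarray, cav: np.ndarray) -> float:
    """grads: (N, d) gradients of the class-i logit w.r.t. a_l(x), one row per input x.
    TCAV_{c,i} = fraction of inputs with a positive directional derivative along V_c."""
    return float((grads @ cav > 0).mean())
```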


Integrated Gradients

let $x^\prime$ be a baseline input (mean / zero image)

$$ x_\alpha = x^\prime + \alpha (x - x^\prime), \quad \mathrm{IG}_i(x) = (x_i - x_i^\prime ) \int_0^1 \frac{\partial f(x_\alpha)}{\partial x_i}\, d\alpha $$
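
a Riemann-sum sketch of IG in PyTorch, assuming `x` is a single image batched as (1, C, H, W) and a zero baseline:

```python
import torch

def integrated_gradients(model, x, baseline, target_class: int, steps: int = 64):
    """Riemann-sum approximation of IG along the straight-line path from baseline to x."""
    alphas = torch.linspace(0, 1, steps).view(-1, 1, 1, 1)
    path = baseline + alphas * (x - baseline)         # x_alpha for each alpha: (steps, C, H, W)
    path.requires_grad_(True)
    logits = model(path)[:, target_class].sum()
    grads = torch.autograd.grad(logits, path)[0]      # df(x_alpha)/dx at each alpha
    avg_grad = grads.mean(dim=0)                      # approximates the integral over alpha
    return (x - baseline).squeeze(0) * avg_grad       # (x_i - x'_i) * integral

# usage: attr = integrated_gradients(model, img, torch.zeros_like(img), target_class=207)
```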

Layer Integrated Gradients

let $z = f_\ell(x), \quad z^\prime = f_\ell(x^\prime)$

$$ x_\alpha = x^\prime + \alpha (x - x^\prime), \quad \mathrm{LIG}_i(x) = (z_i - z_i^\prime ) \int_0^1 \frac{\partial f(x_\alpha)}{\partial z_i}\, d\alpha $$

Generalized Integrated Gradients

let $\gamma(\alpha)$ be a smooth path with $\gamma(0) = x^\prime$ and $\gamma(1) = x$

$$ \mathrm{GIG}_i(x) = \int_0^1 \frac{\partial f\big(\gamma(\alpha)\big)}{\partial \gamma_i(\alpha)} \frac{\partial \gamma_i(\alpha)}{\partial \alpha} d\alpha $$

Discretized Integrated Gradients:

GIG with a discrete path whose interpolation points are real data points (e.g. word embeddings) rather than a continuous straight line: common in NLP

Concept localization in hidden layers

Concept Flow throughout the network

concept score through layers:

$$ s_\ell(x) = a_\ell(x) \cdot \mathbf{V}_c^{(\ell)}, \quad \text{one CAV fit per layer} \\ S(x) = [s_1(x), s_2(x), \dots, s_L(x)] $$
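
a sketch of the layer-wise trajectory, assuming convolutional layers (outputs of shape (1, C, H, W), pooled with GAP) and one CAV fit per probed layer:

```python
import torch

@torch.no_grad()
def concept_trajectory(model, x, layers, cavs):
    """layers: modules to probe, listed in forward-pass order; cavs: one CAV per probed
    layer, each a tensor of shape (C_l,). Returns S(x) = [s_1(x), ..., s_L(x)]."""
    acts = []
    handles = [layer.register_forward_hook(lambda m, i, o: acts.append(o)) for layer in layers]
    model(x)
    for h in handles:
        h.remove()

    scores = []
    for a, cav in zip(acts, cavs):
        pooled = a.mean(dim=(2, 3)).squeeze(0)       # GAP over spatial dims -> (C_l,)
        scores.append(torch.dot(pooled, cav).item()) # s_l(x) = a_l(x) . V_c^(l)
    return scores
```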

PFV: Pointwise Feature Vector

apply the CAV at each spatial position of the feature map (no GAP) → a per-location concept map


Dimensionality reduction & high-dimensional projection (intrinsic dimension extraction):

(to elaborate on/ note)