mechanistic interpretability is mainly reverse engineering a trained DL model to peek into what it's looking at in order to perform the task at hand
Github Repo: Tickling-Vision-Models
using Gradient Ascent to identify images / samples that maximize a selected circuit (neuron / conv channel / layer / …)
$$ x^* = \arg\max_{x} a_i(x) - \lambda \mathrm{R}(x) $$
where $R(x)$ is a regularization term that keeps $x^*$ looking natural; common choices are an L2 penalty or total variation on the image
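a minimal PyTorch sketch of this gradient-ascent loop, assuming a pretrained torchvision ResNet, an arbitrary target layer/channel, and a plain L2 regularizer (all placeholders):

```python
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

# grab the activation a_i(x) of an assumed target layer via a forward hook
activation = {}
target_layer = model.layer3                      # placeholder layer choice
target_layer.register_forward_hook(lambda m, i, o: activation.update(a=o))

x = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from noise
opt = torch.optim.Adam([x], lr=0.05)
lam, channel = 1e-4, 7                           # placeholder regularizer weight / channel

for _ in range(200):
    opt.zero_grad()
    model(x)
    a = activation["a"][0, channel].mean()       # mean activation of the chosen channel
    loss = -(a - lam * x.pow(2).sum())           # maximize a_i(x) - lambda * R(x)
    loss.backward()
    opt.step()
```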
train a linear classifier (probe) on activations to get an intuition of the shape of the learned representation / manifold
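a linear-probe sketch with scikit-learn; `acts` / `labels` are placeholders for hooked activations and a has-feature-F flag:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

acts = np.random.randn(1000, 512)            # placeholder: layer activations (N x D)
labels = np.random.randint(0, 2, 1000)       # placeholder: sample has feature F or not

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))   # compare against the chance baseline
```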
$$ z = E(h), \quad \hat{h} = D(z) \\ \mathcal{L} = \frac{1}{N} \sum\|h_n - D(E(h_n))\|_2^2 + \lambda \|E(h_n)\|_1 $$
learning a disentangled latent representation by encouraging sparsity on a lower-dimensional manifold
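a sketch of that loss as a PyTorch module (structure assumed, widths are placeholders):

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, d_in, d_latent):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)
        self.dec = nn.Linear(d_latent, d_in)

    def forward(self, h):
        z = torch.relu(self.enc(h))     # z = E(h)
        return z, self.dec(z)           # (z, D(z))

sae = SAE(d_in=512, d_latent=4096)      # placeholder dimensions
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
lam = 1e-3

h = torch.randn(256, 512)               # placeholder batch of activations h_n
z, h_hat = sae(h)
loss = ((h - h_hat) ** 2).sum(dim=1).mean() + lam * z.abs().sum(dim=1).mean()
loss.backward()
opt.step()
```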
manifold cluster based on feature $\mathrm{F}$: subset of points sharing $\mathrm{F}$
$$ M_F = \{\, a_L(x) \mid x \text{ has feature } F \,\}, \quad a_L \text{ is the activation of layer } L $$
$$ \dim(M_F) \le d \\ \dim(M_F) \approx \mathrm{PCA}_k, \quad \text{top } k \text{ eigenvectors preserving 95\% of variance} $$
local dimension via PCA:
$$ \tilde{A} = A - \mathbf{1} \mu^T, \quad \mu = \frac{1}{n} \sum a_i \\ \Sigma = \frac{1}{n-1} \tilde{A}^T\tilde{A} $$
$$ \text{Given eigenvalues } \lambda_1, \lambda_2, \dots, \lambda_d \\ \text{intrinsic dimension: } k = \min\left\{ m \,\middle|\, \frac{\sum_{i=1}^m \lambda_i}{\sum_{i=1}^d \lambda_i} \ge 0.95 \right\} $$
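a NumPy sketch of the 95%-variance rule above; `A` stands in for the activations $a_L(x)$ of points carrying feature F:

```python
import numpy as np

def intrinsic_dim(A, threshold=0.95):
    A_tilde = A - A.mean(axis=0)                 # center: A - 1 mu^T
    cov = A_tilde.T @ A_tilde / (len(A) - 1)     # covariance Sigma
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigenvalues, descending
    ratios = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratios, threshold) + 1)   # smallest m reaching the threshold

A = np.random.randn(500, 64)                     # placeholder sample from M_F
print("intrinsic dimension:", intrinsic_dim(A))
```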
curvature: flat manifold → smooth travel on surface
curved manifold → bumpy ride
(a short step might jump to a semantically different region)
$$ d_G(a, b) = \int_0^1 \sqrt{g_{\gamma(t)} \big(\dot{\gamma}(t), \dot{\gamma}(t)\big)}\, dt, \quad \gamma(0) = a,\ \gamma(1) = b $$
$$ d_G: \text{ geodesic distance}, \quad d_E: \text{ Euclidean distance} \\ \text{curvature index: } \quad \kappa(p, q) = \frac{d_G(p, q) - d_E(p, q)}{d_E(p, q)} $$
Geodesic distance: walking on the manifold
Euclidean distance: walking in a straight path (through the manifold)
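a sketch of the curvature index with the geodesic distance approximated Isomap-style (shortest paths on a k-NN graph); the activation matrix and the pair (p, q) are placeholders:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

A = np.random.randn(300, 64)                      # placeholder activations on the manifold
G = kneighbors_graph(A, n_neighbors=10, mode="distance")
d_geo = shortest_path(G, directed=False)          # approximate d_G for all pairs

p, q = 0, 42                                      # placeholder pair of points
d_euc = np.linalg.norm(A[p] - A[q])               # d_E: straight line through the manifold
kappa = (d_geo[p, q] - d_euc) / d_euc             # curvature index k(p, q)
print("curvature index:", kappa)
```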
connectedness: all samples x containing F are reachable from one another through a smooth shift / interpolation
test: interpolation effect on classification / feature identification & activation
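a small sketch of that test: interpolate between two inputs that both carry F and watch the prediction along the path (`model`, `x_a`, `x_b` are placeholders, batch of one assumed):

```python
import torch

@torch.no_grad()
def interpolation_test(model, x_a, x_b, steps=20):
    preds = []
    for alpha in torch.linspace(0, 1, steps):
        x_t = (1 - alpha) * x_a + alpha * x_b       # linear path between the two inputs
        preds.append(model(x_t).argmax(dim=-1).item())
    return preds    # a label flip mid-path hints at a disconnected / fractured region
```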
$$ \text{FGSM: } \quad x^\prime = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}\big(f(x), y\big)\big) \\
\text{PGD: } \quad x_{t+1} = \Pi_{\epsilon}\Big(x_t + \lambda \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}\big(f(x_t), y\big)\big)\Big), \quad \Pi_{\epsilon} \text{ projects back onto the } \epsilon\text{-ball around } x $$
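minimal sketches of both attacks, assuming a differentiable `model` and a labeled batch `(x, y)` (names are placeholders):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()        # single signed-gradient step

def pgd(model, x, y, eps=0.03, step=0.01, iters=10):
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        F.cross_entropy(model(x_adv), y).backward()
        x_adv = x_adv + step * x_adv.grad.sign()
        x_adv = torch.clamp(x_adv, x - eps, x + eps).detach()   # project back onto the eps-ball
    return x_adv
```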
adversarial perturbation introduces a change in activation space: $\Delta a = a(x^\prime) - a(x)$ that can be decomposed using an SAE ⇒ determine hijacked circuits by learning the sparse representation
$$ h = \mathrm{ReLU}(W_{\mathrm{encode}} \cdot a), \quad a \approx W_{\mathrm{decode}} \cdot h \\ \Delta h = h(x^\prime) - h(x) $$
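a sketch of that decomposition, reusing the SAE sketch above; `model_up_to_layer` is a placeholder for running the network up to the probed layer and flattening to (N, d_in):

```python
import torch

a_clean = model_up_to_layer(x)                 # a(x), assumed shape (N, d_in)
a_adv = model_up_to_layer(x_adv)               # a(x')
h_clean, _ = sae(a_clean)                      # sparse code h(x)
h_adv, _ = sae(a_adv)                          # sparse code h(x')
delta_h = (h_adv - h_clean).abs().mean(dim=0)  # |Δh| averaged over the batch
hijacked = torch.topk(delta_h, k=10).indices   # latents (candidate circuits) that moved the most
```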
An adversarial path is a continuous trajectory through input space (or representation space) that starts at one label and ends at another, while staying imperceptible or minimally different to a human observer.
formal definition:
$$ \min_{x(t)} \int_0^1 \big\| \nabla_x \mathcal{L}\big(h(x(t)), y(t)\big) \big\|\, dt $$
looking for fractures: concepts whose examples split into multiple disconnected clusters
circuit silencing (ablation) and its downstream effect:
$$ \Delta \mathrm{logit} = \mathrm{logit}_{\mathrm{original}} - \mathrm{logit}_{\mathrm{ablated}} $$
activation patching: replacing a subset of a target image's activations with those from a donor image
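a sketch of both interventions with forward hooks: zero-ablate a channel, or patch in the donor's activation, then compare logits (`model`, the layer, the channel, and the images are placeholders):

```python
import torch

def run_with_patch(model, layer, x_target, x_donor=None, channel=0):
    donor_act = {}
    if x_donor is not None:                            # record the donor's activation first
        h = layer.register_forward_hook(lambda m, i, o: donor_act.update(a=o.detach()))
        model(x_donor)
        h.remove()

    def patch(module, inputs, output):                 # returning a tensor overrides the layer output
        out = output.clone()
        out[:, channel] = 0 if x_donor is None else donor_act["a"][:, channel]
        return out

    h = layer.register_forward_hook(patch)
    logits = model(x_target)
    h.remove()
    return logits

orig_logits = model(x_target)
ablated_logits = run_with_patch(model, model.layer3, x_target)          # circuit silencing
patched_logits = run_with_patch(model, model.layer3, x_target, x_donor) # activation patching
delta_logit = orig_logits - ablated_logits
```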

feature visualization in learned sparse representation
probe accuracy vs chance baseline
reconstruction loss: MSE on SAE
adversarial success rate
effect size in ablation/patching
clustering purity
$$ \mathrm{purity} = \frac{1}{N} \sum_{c \in C} \max_{\ell} \big|\{x \in c: \mathrm{label}(x)=\ell\}\big| $$
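a NumPy sketch of that purity score, assuming integer-coded `clusters` and `labels` of equal length:

```python
import numpy as np

def purity(clusters, labels):
    total = 0
    for c in np.unique(clusters):
        members = labels[clusters == c]
        total += np.bincount(members).max()   # count of the majority label in cluster c
    return total / len(labels)
```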
drawing a linear decision boundary in the latent space of a selected layer of the model / network, separating samples with feature F from samples without it
CAV formally:
$$ h(x) = \sigma(X \cdot W + b) \\ \text{Concept Vector (normal to the boundary): } \quad V = \frac{W}{\|W\|} $$
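a sketch of fitting a CAV with scikit-learn; `acts_concept` / `acts_random` are placeholder activation matrices at layer $\ell$ for concept examples and random counterexamples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.vstack([acts_concept, acts_random])            # placeholder activation sets
y = np.concatenate([np.ones(len(acts_concept)), np.zeros(len(acts_random))])
clf = LogisticRegression(max_iter=1000).fit(X, y)
v_c = clf.coef_.ravel()
v_c = v_c / np.linalg.norm(v_c)                       # V = W / ||W||
```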
given $f_i(x)$: the logit for class $i$, $c$: concept, $\ell$: layer
$$ \mathrm{TCAV}_{c, i} = \frac{1}{|X|} \sum_{x \in X} \mathbf{1} \!\left[\frac{\partial f_i(x)}{\partial \mathbf{a}_\ell(x)} \cdot \mathbf{V}_c > 0\right] $$
how well $V_c$ aligns with this gradient reflects the concept's contribution / correspondence to that class
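a sketch of that score: the fraction of inputs whose class-logit gradient at layer $\ell$ points the same way as the CAV (`model_tail` is an assumed function mapping layer-$\ell$ activations to logits; `acts` and `v_c` are placeholder tensors):

```python
import torch

def tcav_score(model_tail, acts, v_c, class_idx):
    # v_c: torch tensor of the CAV, matching the flattened activation dimension
    acts = acts.clone().requires_grad_(True)
    logits = model_tail(acts)
    grads = torch.autograd.grad(logits[:, class_idx].sum(), acts)[0]   # df_i / da_l per sample
    return (grads.flatten(1) @ v_c > 0).float().mean().item()          # fraction with positive alignment
```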
let $x^\prime$ be a baseline input (mean / zero image)
$$ x_\alpha = x^\prime + \alpha (x - x^\prime), \quad \mathrm{IG}_i(x) = (x_i - x_i^\prime) \int_0^1 \frac{\partial f(x_\alpha)}{\partial x_i}\, d\alpha $$
let $z = f_\ell(x), \quad z^\prime = f_\ell(x^\prime)$
$$ x_\alpha = x^\prime + \alpha (x - x^\prime), \quad \mathrm{LIG}_i(x) = (z_i - z_i^\prime) \int_0^1 \frac{\partial f(x_\alpha)}{\partial z_i}\, d\alpha $$
let $\gamma(\alpha)$ be a smooth path from $x^\prime$ to $x$
$$ \mathrm{GIG}_i(x) = \int_0^1 \frac{\partial f\big(\gamma(\alpha)\big)}{\partial \gamma_i(\alpha)} \frac{\partial \gamma_i(\alpha)}{\partial \alpha} d\alpha $$
GIG using a discrete path (?): common in NLP
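an IG sketch with the usual Riemann-sum approximation of the path integral (`model`, `x`, `baseline`, and the class index are placeholders); LIG/GIG follow the same pattern with the gradient taken w.r.t. layer activations / along a different path:

```python
import torch

def integrated_gradients(model, x, baseline, class_idx, steps=50):
    total_grads = torch.zeros_like(x)
    for alpha in torch.linspace(0, 1, steps):
        x_alpha = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        out = model(x_alpha)[:, class_idx].sum()
        total_grads += torch.autograd.grad(out, x_alpha)[0]
    return (x - baseline) * total_grads / steps        # (x_i - x'_i) * average gradient
```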
concept score through layers:
$$ s_\ell(x) = a_\ell(x) \cdot \mathbf{V_c} \\ S(x) = [s_1(x), s_2(x), ..., s_L(x)] $$
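a sketch of that trajectory with one hook per probed layer, assuming one CAV per layer (`cavs`) with matching flattened dimensions and a batch of one:

```python
import torch

def concept_trajectory(model, x, layers, cavs):
    acts, hooks = {}, []
    for idx, layer in enumerate(layers):
        hooks.append(layer.register_forward_hook(
            lambda m, i, o, idx=idx: acts.update({idx: o.flatten(1)})))
    model(x)
    for h in hooks:
        h.remove()
    return [float(acts[idx][0] @ cavs[idx]) for idx in range(len(layers))]   # [s_1, ..., s_L]
```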
CAV per spatial location (no global average pooling)
(to elaborate on/ note)