GitHub repo: GANs-Gallery

Generator vs Discriminator



FID: Fréchet Inception Distance

measures the distance between the real and generated image distributions in Inception feature space, with each modeled as a Gaussian

$$

P_r \sim \mathcal{N}(\mu_r, \Sigma_r), P_g \sim \mathcal{N}(\mu_g, \Sigma_g) \\

\mathrm{FID}(P_r, P_g) = \| \mu_r - \mu_g \|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)

$$
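
a minimal sketch of the FID computation, assuming InceptionV3 pool features have already been extracted for both image sets:

```python
# Minimal FID sketch: real_feats / fake_feats are (N, 2048) numpy arrays
# of InceptionV3 pool features for the real and generated image sets.
import numpy as np
from scipy import linalg

def fid_from_features(real_feats, fake_feats):
    mu_r, mu_g = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(fake_feats, rowvar=False)

    diff = mu_r - mu_g
    # matrix square root of Sigma_r @ Sigma_g; drop the tiny imaginary part
    # that can appear from numerical error
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```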

KID: Kernel Inception Distance

squared MMD between real and generated Inception features, computed with a polynomial kernel

$$ \mathrm{KID}(P_r, P_g) = \mathbb{E}_{x,x' \sim P_r}[k(x,x')] + \mathbb{E}_{y,y' \sim P_g}[k(y,y')] - 2\, \mathbb{E}_{x \sim P_r,\, y \sim P_g}[k(x,y)] $$
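
a minimal sketch of the unbiased KID estimate with the usual polynomial kernel $k(x,y) = (x^\top y / d + 1)^3$, again assuming precomputed Inception features:

```python
# Minimal KID sketch: unbiased MMD^2 with the polynomial kernel
# k(x, y) = (x.y / d + 1)^3, over precomputed Inception feature arrays.
import numpy as np

def kid_from_features(real_feats, fake_feats):
    d = real_feats.shape[1]
    k_rr = (real_feats @ real_feats.T / d + 1.0) ** 3
    k_gg = (fake_feats @ fake_feats.T / d + 1.0) ** 3
    k_rg = (real_feats @ fake_feats.T / d + 1.0) ** 3

    n, m = real_feats.shape[0], fake_feats.shape[0]
    # the unbiased estimate drops the diagonal of the within-set kernel matrices
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (n * (n - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (m * (m - 1))
    return float(term_rr + term_gg - 2.0 * k_rg.mean())
```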

IS: Inception Score

given a set of generated images $\{x_1, x_2, \dots, x_N\}$, the Inception Score is defined as:

$$ \mathrm{IS} = \exp\!\Big(\mathbb{E}_{x \sim G}\big[D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\big]\Big) $$

where $p(y) = \frac{1}{N}\sum_{i=1}^N p(y|x_i)$

high score ⇒ individual samples are sharply classified (peaky $p(y|x)$) + high class diversity (flat $p(y)$)

low score ⇒ blurry images / low diversity

(flat $p(y|x)$ and/or peaky $p(y)$)
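
a minimal sketch, assuming `probs` holds the classifier's softmax outputs $p(y|x_i)$ for the generated images:

```python
# Minimal Inception Score sketch: probs is an (N, num_classes) array of
# softmax outputs p(y|x_i) from a pretrained classifier (e.g. InceptionV3).
import numpy as np

def inception_score(probs, eps=1e-12):
    p_y = probs.mean(axis=0, keepdims=True)                   # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))    # KL(p(y|x) || p(y)) per sample
    return float(np.exp(kl.sum(axis=1).mean()))
```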

LPIPS: Learned Perceptual Image Patch Similarity

weighted average of distances between unit-normalized feature maps of the two images, taken from several layers of a pretrained vision network

$$ x, x' \in \mathbb{R}^{3 \times H \times W}, \quad y_l = f_l(x), \quad \hat{y}_l = \frac{y_l}{\| y_l \|_2} $$

$$ d_l = \frac{1}{H_l W_l} \sum_{h=1}^{H_l} \sum_{w=1}^{W_l} \| \hat{y}_{l,h,w} - \hat{y}'_{l,h,w} \|_2^2 $$

$$ \text{LPIPS}(x, x') = \sum_{l \in L} w_l \times d_l $$
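
a rough sketch over precomputed feature maps (the real LPIPS metric uses learned per-channel weights on top of AlexNet/VGG features; the per-layer `weights` below are just placeholders):

```python
# Rough LPIPS-style distance over precomputed feature maps. feats_x[l] and
# feats_y[l] are (C_l, H_l, W_l) tensors from layer l of some vision network;
# weights[l] stands in for the learned per-layer weighting (placeholder values).
import torch

def lpips_like(feats_x, feats_y, weights):
    dist = 0.0
    for y_l, y_l_prime, w_l in zip(feats_x, feats_y, weights):
        # unit-normalize along the channel dimension at every spatial position
        y_l = y_l / (y_l.norm(dim=0, keepdim=True) + 1e-10)
        y_l_prime = y_l_prime / (y_l_prime.norm(dim=0, keepdim=True) + 1e-10)
        d_l = ((y_l - y_l_prime) ** 2).sum(dim=0).mean()   # average over H_l x W_l
        dist = dist + w_l * d_l
    return dist
```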


G_EMA:

exponential moving average of the Generator weights
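
one common way to maintain it in PyTorch (the decay value here is just a typical choice):

```python
# After every generator update, blend the EMA copy's parameters towards the
# live generator and sync its buffers.
import torch

@torch.no_grad()
def update_g_ema(g_ema, g, decay=0.999):
    for p_ema, p in zip(g_ema.parameters(), g.parameters()):
        p_ema.lerp_(p, 1.0 - decay)     # p_ema = decay * p_ema + (1 - decay) * p
    for b_ema, b in zip(g_ema.buffers(), g.buffers()):
        b_ema.copy_(b)

# g_ema is typically a deep copy of g made once at the start of training and
# only ever used for evaluation / sampling.
```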

overfitting heuristics:

$$ r_v = \frac{\mathbb{E}[D_\text{train}] - \mathbb{E}[D_\text{val}]}{\mathbb{E}[D_\text{train}] - \mathbb{E}[D_\text{gen}]} $$

$$ r_t = \mathbb{E}[\mathrm{sign}(D_\text{train})] $$

0 ⇒ not overfitting

1 ⇒ overfitting
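
a minimal sketch of the $r_t$ heuristic, assuming recent discriminator outputs on real training images are collected in a tensor:

```python
# d_train_logits holds raw discriminator outputs on real training images
# collected over the last few minibatches.
import torch

def overfitting_rt(d_train_logits):
    # near 0: D is unsure about the training set; near 1: D has memorized it
    return d_train_logits.sign().mean().item()
```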


(figures: GAN FID evaluation, GAN training loss)

Losses

non-saturating loss:

$$ \mathcal{L}_D = -\mathbb{E}_{x \sim p_{\text{data}}} [\log D(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] $$

$$ \mathcal{L}_G = -\mathbb{E}_{z \sim p_z}[\log D(G(z))] $$
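
a minimal PyTorch sketch, assuming `d_real` / `d_fake` are raw discriminator logits:

```python
# The sigmoid lives inside binary_cross_entropy_with_logits.
import torch
import torch.nn.functional as F

def d_loss_nonsat(d_real, d_fake):
    loss_real = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    loss_fake = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return loss_real + loss_fake        # -E[log D(x)] - E[log(1 - D(G(z)))]

def g_loss_nonsat(d_fake):
    # -E[log D(G(z))] instead of the saturating E[log(1 - D(G(z)))]
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
```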

LSGAN: Least Squares GAN

minimizes Pearson’s $\chi^2$ divergence

$$ L_D = \frac{1}{2} \mathbb{E}_{x \sim p_\text{data}}\big[(D(x)-1)^2\big] + \frac{1}{2} \mathbb{E}_{z \sim p_z}\big[D(G(z))^2\big] \\ L_G = \frac{1}{2} \mathbb{E}_{z\sim p_z}\big[(D(G(z))-1)^2\big] $$
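
a minimal sketch with real label 1 and fake label 0:

```python
# d_real / d_fake are raw (unsquashed) discriminator outputs on real and
# generated batches.
def d_loss_lsgan(d_real, d_fake):
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def g_loss_lsgan(d_fake):
    return 0.5 * ((d_fake - 1.0) ** 2).mean()
```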

Wasserstein GAN

minimizes the 1-Wasserstein distance

$$ L_D = -\mathbb{E}_{x \sim p_\text{data}}\big[D(x)\big] + \mathbb{E}_{z \sim p_z}\big[D(G(z))\big] \\ L_G = - \mathbb{E}_{z\sim p_z}\big[D(G(z))\big] $$

enforcing $D$ to be 1-Lipschitz through weight clipping
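
a minimal sketch of the critic/generator losses plus the original weight-clipping step:

```python
import torch

def d_loss_wgan(d_real, d_fake):
    return -d_real.mean() + d_fake.mean()

def g_loss_wgan(d_fake):
    return -d_fake.mean()

@torch.no_grad()
def clip_critic_weights(critic, c=0.01):
    # crude 1-Lipschitz enforcement from the original WGAN paper
    for p in critic.parameters():
        p.clamp_(-c, c)
```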

RaGAN: Relativistic average GAN

$$ \mathcal{L}_D = -\mathbb{E}_{x} \left[ \log \sigma \left(C(x) - \mathbb{E}_{z}[C(G(z))] \right) \right] - \mathbb{E}_{z} \left[ \log \sigma \left( \mathbb{E}_{x}[C(x)] - C(G(z)) \right) \right] $$

$$ \mathcal{L}_G = -\mathbb{E}_{z} \left[ \log \sigma \left(C(G(z)) - \mathbb{E}_{x}[C(x)] \right) \right] - \mathbb{E}_{x} \left[ \log \sigma \left( \mathbb{E}_{z}[C(G(z))] - C(x) \right) \right] $$
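
a minimal sketch, writing $-\log \sigma(t)$ as $\mathrm{softplus}(-t)$ for numerical stability:

```python
# c_real / c_fake are raw critic outputs C(x) and C(G(z)) on a batch.
import torch.nn.functional as F

def d_loss_ragan(c_real, c_fake):
    real_rel = c_real - c_fake.mean()       # C(x) - E_z[C(G(z))]
    fake_rel = c_fake - c_real.mean()       # C(G(z)) - E_x[C(x)]
    return F.softplus(-real_rel).mean() + F.softplus(fake_rel).mean()

def g_loss_ragan(c_real, c_fake):
    real_rel = c_real - c_fake.mean()
    fake_rel = c_fake - c_real.mean()
    return F.softplus(-fake_rel).mean() + F.softplus(real_rel).mean()
```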

Regularizers

WGAN-GP

WGAN + a gradient penalty (a soft way of enforcing 1-Lipschitz)

$\hat x$ is sampled by interpolating on straight lines between paired real and fake samples

$$ \lambda \times \mathbb{E}_{\hat x \sim P_{\hat x}} \big[\big(\|\nabla_{\hat x} f_w(\hat x) \|_2 - 1\big)^2\big] $$
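
a minimal sketch of the penalty term (image-shaped inputs assumed):

```python
# x_hat interpolates between real and fake samples; the critic's gradient norm
# at x_hat is pushed towards 1.
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    d_hat = critic(x_hat)
    grads = torch.autograd.grad(d_hat.sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```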

R1, R2

$$ L_\text{R1} = \frac{\gamma}{2} \mathbb{E}_{x\sim P_\text{real}}\big[\|\nabla_x D(x)\|^2\big] $$

$$ L_\text{R2} = \frac{\gamma}{2} \mathbb{E}_{z\sim P_z}\big[\|\nabla_{G(z)} D(G(z))\|^2\big] $$

in practice, R1 has stuck around, while R2 turned out to be less stable
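
a minimal R1 sketch (R2 would apply the same expression to generated samples instead):

```python
# `real` must require grad before the discriminator forward pass that
# produced d_real, so the gradient wrt the input images can be taken.
import torch

def r1_penalty(d_real, real, gamma=10.0):
    grads = torch.autograd.grad(d_real.sum(), real, create_graph=True)[0]
    return (gamma / 2.0) * grads.flatten(start_dim=1).pow(2).sum(dim=1).mean()
```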

path length penalty

$$ \mathbf{v} \sim \mathcal{N}(0, \mathbf{I}), \quad \mathcal{J} = \frac{\partial G(w)}{\partial w} $$

$$ \mathcal{L}_\text{PLP} = \mathbb{E}_{w, \mathbf{v}}\big[(\|\mathcal{J}^T \mathbf{v}\|_{2} - a)^2\big] $$

measures how much the generated image moves under perturbations of the intermediate latent $w$, and pushes that magnitude towards a target $a$ maintained as an EMA of the observed path lengths

efficient estimate / alternative: directional derivative

$$ s = \big\langle G(w), \hat{y} \big\rangle = \sum_{i=1}^{n} G(w)_i \, \hat{y}_i $$

$$ \mathcal{L} = \Big\|\frac{\partial s}{\partial w}\Big\|_2 = \|\mathcal{J}^T \hat{y} \|_2 $$

$$ \mathcal{L}_\text{path length} = \lambda_\text{path length} \times (\mathcal{L} - \mathrm{EMA})^2 $$
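
a rough sketch of the penalty per training step, assuming `images = G(w)` comes from a synthesis pass where `w` requires grad and `ema_target` is a persistent scalar tensor carried across steps:

```python
# Project G(w) onto random noise, take the gradient of that scalar wrt w
# (i.e. J^T y_hat), and pull its norm towards a running EMA target.
import math
import torch

def path_length_penalty(images, w, ema_target, ema_decay=0.01):
    y_hat = torch.randn_like(images) / math.sqrt(images.shape[2] * images.shape[3])
    s = (images * y_hat).sum()                                 # <G(w), y_hat>
    grads = torch.autograd.grad(s, w, create_graph=True)[0]    # J^T y_hat
    path_lengths = grads.flatten(start_dim=1).norm(2, dim=1)

    # EMA of observed path lengths plays the role of the target a
    ema_target.mul_(1.0 - ema_decay).add_(path_lengths.mean().detach(), alpha=ema_decay)
    return ((path_lengths - ema_target) ** 2).mean()
```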

Training GANs with limited data (ADA): paper

  1. overfitting in GANs:
  2. Adaptive Discriminator Augmentation
  3. Evaluation

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium: paper

  1. main points

Wasserstein GAN w/ Gradient Penalty

Wasserstein distance:

1-Wasserstein distance (a.k.a. earth mover's distance, how dramatic)

$$ \mathcal{W_1}(P_r, P_f) = \inf_{\gamma \in \Pi(P_r, P_f)} \mathbb{E}_{(x, y)\sim\gamma}\big[\|x - y\|\big] $$

Kantorovich-Rubinstein dual form

$$ \mathcal{W}_1(P_r, P_f) = \sup_{\|f\|_L \le 1}\big\{\mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{y\sim P_f}[f(y)]\big\} $$

parametrizing $f$ as a neural net

$$ \mathcal{L}_\text{critic}(w) = \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim P_z}[f_w(G_\theta(z))] \\[1ex] \text{subject to } \|f_w\|_L \le 1 $$

enforcing 1-Lipschitzness on $f$ via a gradient penalty

$$ \lambda \times \mathbb{E}_{\hat x \sim P_{\hat x}} \big[\big(\|\nabla_{\hat x} f_w(\hat x) \|_2 - 1\big)^2\big] $$

Generator objective function

$$ L_\text{G}(\theta) = - \mathbb{E}_{z \sim P_z}[f_w(G_\theta(z))] $$

$D$ is called the critic here (making it sound fancy)

Unpaired Image-to-Image Translation using CycleGAN: paper

StyleGAN's core innovations (super duper cool):

Generator: