Denoising Diffusion Probabilistic Models

1. Forward Diffusion Process

All vectors are column vectors; multi-dimensional tensors can be flattened into column vectors.

In the forward process, we gradually transform the data distribution $q(\mathbf{x}_0)$ into a distribution which is close to the standard normal $\mathcal{N}(\mathbf{0}, \mathbf{I})$.

Given noise schedule $0 < \beta_1 < \beta_2 < \cdots < \beta_T < 1$,

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right),$$

or equivalently, let $\alpha_t = 1 - \beta_t$, we have

$$\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_t, \qquad \boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$

1.1. Reparameterization

By mathematical induction, we can prove that, with $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$,

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$

That is, multiple noise additions can be expressed as a single noise addition. Since $\bar{\alpha}_t \to 0$ as the number of noise additions increases, the distribution of the data approaches a standard normal distribution.

import torch

n_steps = 500
betas = torch.linspace(0.0001, 0.02, n_steps)   # linear noise schedule
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)   # \bar{\alpha}_t
expectation = alphas_cumprod.sqrt()             # coefficient of x_0 in x_t
variance = 1 - alphas_cumprod                   # variance of the accumulated noise
# Print every 10th step, rounded for readability
expectation_rounded = [round(x, 3) for x in expectation[::10].tolist()]
variance_rounded = [round(x, 3) for x in variance[::10].tolist()]
print(f"expectation: {expectation_rounded}")
print(f"variance: {variance_rounded}")
expectation: [1.0, 0.998, 0.995, 0.989, 0.982, 0.972, 0.961, 0.948, 0.934, 0.917, 0.9, 0.88, 0.86, 0.838, 0.815, 0.791, 0.767, 0.742, 0.716, 0.689, 0.662, 0.635, 0.608, 0.581, 0.554, 0.527, 0.501, 0.474, 0.449, 0.423, 0.399, 0.375, 0.352, 0.329, 0.308, 0.287, 0.267, 0.248, 0.23, 0.213, 0.196, 0.181, 0.166, 0.153, 0.14, 0.128, 0.116, 0.106, 0.096, 0.087]
variance: [0.0, 0.003, 0.01, 0.021, 0.036, 0.054, 0.076, 0.101, 0.128, 0.159, 0.191, 0.225, 0.261, 0.298, 0.335, 0.374, 0.412, 0.45, 0.488, 0.525, 0.561, 0.596, 0.63, 0.662, 0.693, 0.722, 0.749, 0.775, 0.799, 0.821, 0.841, 0.859, 0.876, 0.891, 0.905, 0.918, 0.929, 0.938, 0.947, 0.955, 0.961, 0.967, 0.972, 0.977, 0.98, 0.984, 0.986, 0.989, 0.991, 0.992]
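The closed form can be checked numerically: iterating the per-step update $t$ times should match a single application of the reparameterized formula in distribution. A minimal sketch with scalar data, comparing empirical mean and variance:

```python
import torch

torch.manual_seed(0)
n_steps = 500
betas = torch.linspace(0.0001, 0.02, n_steps)
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

x0 = torch.ones(100_000)  # many copies of a scalar data point
t = 200

# Iterative noising: apply the per-step update t times
x = x0.clone()
for s in range(t):
    eps = torch.randn_like(x)
    x = alphas[s].sqrt() * x + (1 - alphas[s]).sqrt() * eps

# One-shot reparameterization
eps = torch.randn_like(x0)
x_direct = alphas_cumprod[t - 1].sqrt() * x0 + (1 - alphas_cumprod[t - 1]).sqrt() * eps

# The two should agree in mean and variance (up to Monte Carlo error)
print(x.mean().item(), x_direct.mean().item())
print(x.var().item(), x_direct.var().item())
```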

2. Training Process

Train a neural network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ to predict the noise $\boldsymbol{\epsilon}$ in

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}.$$

To minimize:

$$L(\theta) = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\ t\right)\right\|^2\right].$$
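One training step can be sketched as follows. `TinyEpsNet` is a hypothetical stand-in for the noise-prediction network (real DDPMs use a U-Net), and the batch of data is random for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical toy noise-prediction network: takes the noisy sample x_t
# and the timestep t, and predicts the noise epsilon.
class TinyEpsNet(nn.Module):
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x, t):
        # Append the (normalized) timestep as an extra input feature
        return self.net(torch.cat([x, t.float().unsqueeze(-1) / 500], dim=-1))

n_steps = 500
betas = torch.linspace(0.0001, 0.02, n_steps)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

model = TinyEpsNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x0 = torch.randn(128, 2)               # a batch of (toy) data
t = torch.randint(0, n_steps, (128,))  # random timesteps
eps = torch.randn_like(x0)             # target noise
a_bar = alphas_cumprod[t].unsqueeze(-1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # forward diffusion in one shot
loss = ((eps - model(x_t, t)) ** 2).mean()          # simple (unweighted) DDPM loss
loss.backward()
opt.step()
print(loss.item())
```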

3. Sampling Process

First, estimate the clean data:

$$\hat{\mathbf{x}}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right).$$

Then, we can use the following conditional distribution to sample $\mathbf{x}_{t-1}$ (the data at the previous time step):

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\ \tilde{\beta}_t \mathbf{I}\right),$$

where

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t.$$

In practice, $\mathbf{x}_0$ is usually set to $\hat{\mathbf{x}}_0$.

Substituting $\mathbf{x}_0 = \hat{\mathbf{x}}_0$ into the formula above, we have

$$\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right).$$

Therefore,

$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t\,\mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

with $\sigma_t^2 = \tilde{\beta}_t$ (or simply $\beta_t$).
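The full sampling loop can then be sketched as below. `eps_theta` is a placeholder for a trained noise-prediction network, and we use the simple choice $\sigma_t^2 = \beta_t$:

```python
import torch

torch.manual_seed(0)
n_steps = 500
betas = torch.linspace(0.0001, 0.02, n_steps)
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def eps_theta(x, t):
    # Placeholder for a trained noise-prediction network.
    return torch.zeros_like(x)

x = torch.randn(16, 2)  # start from pure noise x_T
for t in reversed(range(n_steps)):
    # No noise is added at the final step (t == 0)
    z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    coef = (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt()
    sigma = betas[t].sqrt()  # sigma_t^2 = beta_t
    x = (x - coef * eps_theta(x, t)) / alphas[t].sqrt() + sigma * z
print(x.shape)
```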

4. Useful Formulas

By mathematical induction, we can prove the following closed form for a general linear Gaussian recursion.

If

$$\mathbf{x}_t = a_t\,\mathbf{x}_{t-1} + b_t\,\boldsymbol{\epsilon}_t, \qquad \boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \text{ i.i.d.},$$

then

$$\mathbf{x}_t = \bar{a}_t\,\mathbf{x}_0 + \bar{b}_t\,\bar{\boldsymbol{\epsilon}}, \qquad \bar{\boldsymbol{\epsilon}} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

where $\bar{a}_t = \prod_{s=1}^{t} a_s$ and $\bar{b}_t^2 = \sum_{s=1}^{t} \left(\prod_{r=s+1}^{t} a_r^2\right) b_s^2$. The induction step merges the two independent Gaussian noise terms, using the fact that $u\,\boldsymbol{\epsilon}_1 + v\,\boldsymbol{\epsilon}_2 \sim \mathcal{N}(\mathbf{0}, (u^2+v^2)\mathbf{I})$ for independent $\boldsymbol{\epsilon}_1, \boldsymbol{\epsilon}_2 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.

In particular, for DDPM, we have $a_t = \sqrt{\alpha_t}$ and $b_t = \sqrt{1-\alpha_t}$, which give $\bar{a}_t = \sqrt{\bar{\alpha}_t}$ and $\bar{b}_t^2 = 1-\bar{\alpha}_t$.
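For DDPM's coefficients, this closed form can be verified numerically against `torch.cumprod`:

```python
import torch

n_steps = 500
betas = torch.linspace(0.0001, 0.02, n_steps)
alphas = 1 - betas
a = alphas.sqrt()        # a_t = sqrt(alpha_t)
b = (1 - alphas).sqrt()  # b_t = sqrt(1 - alpha_t)

# Accumulate the recursion x_t = a_t x_{t-1} + b_t eps_t:
# track the coefficient of x_0 and the total noise variance.
coef, var = 1.0, 0.0
for t in range(n_steps):
    coef = a[t] * coef
    var = a[t] ** 2 * var + b[t] ** 2

alphas_cumprod = torch.cumprod(alphas, dim=0)
print(coef.item(), alphas_cumprod[-1].sqrt().item())  # should match: sqrt(abar_T)
print(var.item(), (1 - alphas_cumprod[-1]).item())    # should match: 1 - abar_T
```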