Diffusion Model In Essence

[Figure: diffusion_model_teaser — samples generated by a diffusion model]

The Essence of Diffusion Models

The essence of diffusion model learning is to learn a mapping between an unknown data distribution and a known distribution, i.e., \(\mathbf{T}: R^d \leftrightarrow R^d\), where both distributions have the same dimension \(d\). From a known distribution, we can easily generate unlimited samples and map them back to the unknown distribution that actually interests us, for example, the distribution of human language, of an image set, or even of the universe.

[Figure: diffusion_model_mapping]

Gaussian Is an Option. In practice, we can choose any known distribution we like to learn the mapping. In general, though, the standard Gaussian distribution is used, because it is easy to understand and simple to implement. Theoretically, any distribution can be mapped to a Gaussian distribution given a sufficiently large number of diffusion steps.

Learn the mapping in reduced dimension. Regarding the data space in which the mapping is learned, one option is to keep the original high-dimensional data space, for example, the millions of dimensions of a high-resolution image. The other option is to transform the original data space into a low-dimensional latent space and learn the mapping there, which is exactly what Stable Diffusion proposed. Generally speaking, the higher the dimension, the larger the capacity for capturing a complex data distribution. However, many data distributions are highly correlated across dimensions and therefore redundant, so applying a PCA-like transformation to compress the dimensionality is often beneficial; it also has the advantage of keeping only the high-level semantic information in the latent space. Choosing the dimension of the latent space is therefore non-trivial: it depends on how much essential information we want to capture with the lowest possible dimensionality. Additionally, reducing to a latent space makes the mapping function easier to train and sampling more efficient.
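As a rough sketch of this latent-space idea (not the actual Stable Diffusion architecture; the Autoencoder class, its layers, and all dimensions below are illustrative stand-ins), the diffusion mapping is learned on compressed codes, and generated latents are decoded back to pixels:

```python
import torch

# Hypothetical autoencoder: layers and dimensions are illustrative
# stand-ins, not the actual Stable Diffusion components.
class Autoencoder(torch.nn.Module):
    def __init__(self, image_dim=3 * 256 * 256, latent_dim=4 * 32 * 32):
        super().__init__()
        self.encoder = torch.nn.Linear(image_dim, latent_dim)
        self.decoder = torch.nn.Linear(latent_dim, image_dim)

    def encode(self, x):   # pixel space -> compact latent space
        return self.encoder(x)

    def decode(self, z):   # latent space -> pixel space
        return self.decoder(z)

ae = Autoencoder()
x0 = torch.randn(8, 3 * 256 * 256)  # a batch of flattened images
z0 = ae.encode(x0)                  # the diffusion mapping is learned on z0
x_rec = ae.decode(z0)               # generated latents are decoded back
```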

Probabilistic Modeling

Knowing the mapping from a data distribution of interest, e.g., an image set, to a Gaussian distribution is useless by itself; the reverse mapping is what we actually want. The reverse mapping from a Gaussian sample (e.g., \(y \sim \mathbf{N}(\mathbf{0},\Sigma)\)) to a sample in the image set is not deterministic, but probabilistic. Given a Gaussian sample \(y\), learning the reverse mapping means uncovering the conditional probability \(p(x\|y)\), where \(x\) is a sample in the image set. Mathematically, knowing \(p(x\|y)\) is equivalent to knowing \(p(x, y)\), since \(p(x, y) = p(y) p(x\|y)\) and \(p(y)\) is already known.

Hard Things in Small Steps. The transform from any distribution to a Gaussian distribution can be easily modelled as a diffusion process, as in physics, where noise is added to the data gradually and repeatedly along a self-defined Markov chain. The basic intuition is that every distribution can be approximated piecewise by Gaussians. Similarly, we can learn the reverse process gradually in many small steps, since learning a single-step mapping \(p(x\|y)\) is almost intractable when the distribution of \(x\) is complicated.

Once we decompose the mapping into a series of small steps, we obtain as many hidden variables as there are steps. Let’s denote the variable of the image set distribution as \(x_0\), and the variable of the final Gaussian distribution as \(x_T\). We then have \(T-1\) intermediate hidden variables \(x_1, x_2, ..., x_{T-1}\) transitioning from \(x_0\) to \(x_T\). The number of steps depends on how far the original distribution of \(x_0\) is from a Gaussian distribution.

The benefit of these hidden variables is that they make it easier to learn a mapping between two very different distributions. On the one hand, small substeps are easier to train than a single huge step, because regularizing the hidden variable distributions makes the optimization more stable. On the other hand, if one step is learned poorly, the other steps can compensate and still give good overall performance. This leads to the graphical model below:

[Figure: diffusion_model_graphical_model]

Forward Diffusion In the forward diffusion process, the hidden variable \(x_t\) is conditioned only on the previous hidden variable \(x_{t-1}\) due to the Markov property, i.e., \(q(x_t \| x_{t-1}, ..., x_0) = q(x_t \| x_{t-1})\). In particular, we use a Gaussian diffusion process here, so we have \(x_t \sim \mathbf{N}(\sqrt{1-\beta_t} x_{t-1}, \beta_t \mathbf{I})\), where \(\beta_t\) is a pre-defined constant reflecting the amount of noise. In practice, we let \(\beta_t\) grow as \(t\) increases, which means more noise is added in the later diffusion steps than in the early steps. This can be interpreted as filtering out higher-frequency signals before lower-frequency ones. In the reverse process, it is then natural that lower-frequency data is generated earlier than higher-frequency data, because lower frequencies generally carry the more predominant features.

After a large finite number of steps \(T\), we have \(x_T \sim \mathbf{N}(\mathbf{0}, \mathbf{I})\). With the forward diffusion process defined, we can express the distribution of the diffused sample at any timestep \(t\) in closed form:

\[q(x_t \| x_0) = \mathbf{N}(\sqrt{\hat{\alpha}_t} x_0, (1 - \hat{\alpha}_t)\mathbf{I}),\]

where \(\alpha_t = 1 - \beta_t\) and \(\hat{\alpha}_t = \prod_{s=1}^t \alpha_s\).
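Here is a minimal PyTorch sketch of the forward process above, assuming the linear \(\beta_t\) schedule used in the DDPM paper (the schedule values and function names are illustrative choices):

```python
import torch

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; alpha_hat[t] = prod_{s<=t} (1 - beta_s)."""
    betas = torch.linspace(beta_start, beta_end, T)
    alpha_hat = torch.cumprod(1.0 - betas, dim=0)
    return betas, alpha_hat

def q_sample(x0, t, alpha_hat):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_hat_t) x0, (1 - alpha_hat_t) I)
    in a single shot, skipping the t intermediate steps.
    t is a LongTensor of per-sample timestep indices, shape (batch,)."""
    noise = torch.randn_like(x0)
    a = alpha_hat[t].sqrt().view(-1, *([1] * (x0.dim() - 1)))
    s = (1 - alpha_hat[t]).sqrt().view(-1, *([1] * (x0.dim() - 1)))
    return a * x0 + s * noise, noise
```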

Reverse Diffusion Although the forward conditionals \(q(x_t \| x_{t-1})\) are clearly predefined, the reverse conditionals \(p_{\theta}(x_{t-1} \| x_t)\) are not. For ease of modelling, we choose to express \(p_{\theta}(x_{t-1} \| x_t)\) in Gaussian form as well:

\[p_{\theta}(x_{t-1} \| x_t) = \mathbf{N} (\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)),\]

where the mean and variance are produced by neural networks with parameters \(\theta\) shared across timesteps. In practice, \(\Sigma_\theta(x_t, t)\) is set to time-dependent constants \(\sigma_t^2 \mathbf{I}\), where \(\sigma_t\) is related to \(\beta_t\).
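A minimal sampling sketch under this parameterization, assuming a hypothetical network mu_model that predicts \(\mu_\theta(x_t, t)\) and a precomputed array sigma of the fixed per-step standard deviations:

```python
import torch

@torch.no_grad()
def p_sample_step(mu_model, x_t, t, sigma):
    """One reverse step: draw x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I)."""
    mean = mu_model(x_t, t)                  # network predicts mu_theta
    if t == 0:
        return mean                          # final step: return the mean
    return mean + sigma[t] * torch.randn_like(x_t)

@torch.no_grad()
def p_sample_loop(mu_model, shape, T, sigma):
    """Full reverse diffusion: start from x_T ~ N(0, I) and walk back."""
    x = torch.randn(shape)                   # x_T from the known Gaussian
    for t in reversed(range(T)):
        x = p_sample_step(mu_model, x, t, sigma)
    return x
```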

In this way, we obtain all the probabilistic relationships of \(\{x_t \}_{t=0}^T\) in both the forward and backward directions.

Loss Functions

Maximum Likelihood Estimation Among all the variables \(\{x_t \}_{t=0}^T\) for which we can derive probabilistic expressions from the model above, \(x_0\) is the only one of which we have observations, namely the data samples from the training dataset. Therefore, the objective for optimizing the parameters \(\theta\) is built on the likelihood of the observed samples of \(x_0\). Intuitively, our goal is to find the estimate of \(\theta\) under which the data samples are most likely to be drawn from the modelled distribution of \(x_0\). That is,

\[\mathbf{L} = -\int q(x_0) \log p_{\theta}(x_0) \, d x_0,\]

where \(q(x_0)\) is the probability of sample \(x_0\) under the underlying data distribution.

Now let’s look at the derivation of \(p_{\theta}(x_0)\). Since \(x_0\) can be seen as generated by the reverse diffusion process shown above, we can introduce the hidden variables into its expression:

\[\begin{split} p_{\theta}(x_0) &= \int p_{\theta}(x_{0:T}) d x_{1:T} \\ &= \int p_{\theta}(x_{0:T}) \frac{q(x_{1:T} \| x_0)}{q(x_{1:T} \| x_0)} d x_{1:T} \\ &= \int q(x_{1:T} \| x_0) \frac{p_{\theta}(x_{0:T})}{q(x_{1:T} \| x_0)} d x_{1:T} \\ &= \int q(x_{1:T} \| x_0) p(x_T) \frac{p_{\theta}(x_{0:T-1} \| x_T)}{q(x_{1:T} \| x_0)} d x_{1:T} \\ &= \int q(x_{1:T} \| x_0) p(x_T) \prod_{t=1}^T \frac{p_{\theta}(x_{t-1} \| x_t)}{q(x_t \| x_{t-1})} d x_{1:T}\end{split}\]

Substituting this expression for \(p_{\theta}(x_0)\) into the loss function, we have

\[\mathbf{L} = - \int q(x_0) \log \left( \int q(x_{1:T} \| x_0) p(x_T) \prod_{t=1}^T \frac{p_{\theta}(x_{t-1} \| x_t)}{q(x_t \| x_{t-1})} d x_{1:T} \right) d x_0.\]

According to Jensen’s inequality (the logarithm is concave, so the log of an expectation is at least the expectation of the log), we have

\[\mathbf{L} \leq - \int q(x_{0:T}) \log \left( p(x_T) \prod_{t=1}^T \frac{p_{\theta}(x_{t-1} \| x_t)}{q(x_t \| x_{t-1})}\right) d x_{0:T},\]

which constitutes an upper bound on the negative log-likelihood loss \(\mathbf{L}\).

Since \(q(x_{1:T} \| x_0) = \prod_{t=1}^T q(x_t \| x_{t-1})\) by the Markov property, and Bayes’ rule gives \(q(x_t \| x_{t-1}) = q(x_t \| x_{t-1}, x_0) = \frac{q(x_{t-1} \| x_t, x_0)\, q(x_t \| x_0)}{q(x_{t-1} \| x_0)}\) for \(t \geq 2\), the upper bound can be rearranged into

\[\int q(x_{0:T}) \left( D_{KL}(q(x_T \| x_0) \parallel p(x_T)) + \sum_{t=2}^T D_{KL}(q(x_{t-1} \| x_t, x_0) \parallel p_{\theta}(x_{t-1} \| x_t)) - \log p_{\theta}(x_0 \| x_1) \right) d x_{0:T},\]

where \(D_{KL}(\cdot \parallel \cdot)\) denotes the KL-divergence, and \(q(x_{0:T})\) and \(q(x_{t-1} \| x_t, x_0)\) can be computed in closed form.

What is actually learned for denoising? Among the terms in the loss function above, \(\int q(x_{0:T}) \sum_{t=2}^T D_{KL}(q(x_{t-1} \| x_t, x_0) \parallel p_{\theta}(x_{t-1} \| x_t))\) is actually the most important one. \(q(x_{t-1} \| x_t, x_0)\) is a posterior distribution known in closed form from the forward diffusion process:

\[q(x_{t-1} \| x_t, x_0) = \mathbf{N}\left( \frac{\sqrt{\hat{\alpha}_{t-1}} \beta_t}{1 - \hat{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \hat{\alpha}_{t-1})}{1 - \hat{\alpha}_t} x_t, \; \frac{1 - \hat{\alpha}_{t-1}}{1 - \hat{\alpha}_t} \beta_t \mathbf{I} \right).\]

Since both arguments of the KL term are Gaussians with fixed variances, the KL-divergence reduces to a weighted squared error between this posterior mean and \(\mu_\theta(x_t, t)\). Using the closed form \(x_t = \sqrt{\hat{\alpha}_t} x_0 + \sqrt{1 - \hat{\alpha}_t}\, \epsilon\) with \(\epsilon \sim \mathbf{N}(\mathbf{0}, \mathbf{I})\), DDPM reparameterizes \(\mu_\theta\) so that the network \(\epsilon_\theta(x_t, t)\) predicts the injected noise \(\epsilon\), and the term simplifies (up to per-step weights) to \(\|\epsilon - \epsilon_\theta(x_t, t)\|^2\). In other words, what the network actually learns is to estimate the noise contained in \(x_t\), i.e., to denoise.
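A minimal sketch of this simplified noise-prediction objective in PyTorch, assuming a hypothetical network eps_model taking \((x_t, t)\) and the alpha_hat schedule computed earlier:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alpha_hat):
    """Simplified DDPM objective: pick a random timestep per sample,
    noise x0 in closed form, and regress the injected noise."""
    t = torch.randint(0, alpha_hat.shape[0], (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alpha_hat[t].sqrt().view(-1, *([1] * (x0.dim() - 1)))
    s = (1 - alpha_hat[t]).sqrt().view(-1, *([1] * (x0.dim() - 1)))
    x_t = a * x0 + s * noise                 # x_t ~ q(x_t | x_0)
    return F.mse_loss(eps_model(x_t, t), noise)
```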

Training Algorithm

The integration over \(x_{0:T}\) can be approximated by drawing samples from the joint distribution \(q(x_{0:T})\). Since \(q(x_{0:T}) = q(x_0) \prod_{t=1}^T q(x_t \| x_{t-1})\), we can first sample \(x_0\) from the training dataset according to \(q(x_0)\), and then sample \(x_{1:T}\) from the Markov chain. A gradient descent step is computed from the upper bound of the loss function above, and we then re-sample \(x_{0:T}\) for the next iteration. Thanks to the closed form of \(q(x_t \| x_0)\), in practice each iteration only needs a single random timestep \(t\) and a single \(x_t\) per data sample, rather than the whole chain, as sketched below.
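Putting the pieces together, a minimal training loop sketch; the EpsModel network and dataloader are hypothetical stand-ins, while make_schedule and ddpm_loss refer to the sketches above:

```python
import torch

# Tiny stand-in network for eps_theta(x_t, t); a real model would be
# far larger. The timestep is appended as an extra input feature.
class EpsModel(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Linear(dim + 1, dim)

    def forward(self, x_t, t):
        tt = t.float().unsqueeze(1) / 1000.0   # normalized timestep
        return self.net(torch.cat([x_t, tt], dim=1))

eps_model = EpsModel(dim=64)
betas, alpha_hat = make_schedule(T=1000)       # fixed forward schedule
opt = torch.optim.Adam(eps_model.parameters(), lr=2e-4)

# dataloader is assumed to yield batches x0 ~ q(x_0), i.i.d. samples
for x0 in dataloader:
    loss = ddpm_loss(eps_model, x0, alpha_hat)
    opt.zero_grad()
    loss.backward()
    opt.step()
```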

As we sample \(x_0\) randomly from the training dataset, we implicitly assume that the samples in the training set are independently and identically distributed; otherwise, the sampling probability differs from \(q(x_0)\). It is therefore important to collect data samples with good density, coverage, and diversity, or the maximum likelihood estimation loss will be biased.

[Figure: diffusion_model_training]

References

Sohl-Dickstein et al., “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”, 2015.

Ho et al., “Denoising Diffusion Probabilistic Models”, 2020.

https://ayandas.me/blog-tut/2021/12/04/diffusion-prob-models.html