colorlinks=true, linkcolor=mydarkblue, citecolor=mydarkblue, filecolor=mydarkblue, urlcolor=mydarkblue, pdfview=FitH, *latexMarginpar on page 0 moved *captionThe option ‘hypcap=true’ will be ignored
StepbyStep Diffusion: An Elementary Tutorial
Abstract
We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms.
Preface
There are many existing resources for learning diffusion models. Why did we write another? Our goal was to teach diffusion as simply as possible, with minimal mathematical and machine learning prerequisites, but in enough detail to reason about its correctness. Unlike most tutorials on this subject, we take neither a Variational Auto Encoder (VAE) nor an Stochastic Differential Equations (SDE) approach. In fact, for the core ideas we will not need any SDEs, EvidenceBasedLowerBounds (ELBOs), Langevin dynamics, or even the notion of a score. The reader need only be familiar with basic probability, calculus, linear algebra, and multivariate Gaussians. The intended audience for this tutorial is technical readers at the level of at least advanced undergraduate or graduate students, who are learning diffusion for the first time and want a mathematical understanding of the subject.
This tutorial has five parts, each relatively selfcontained, but covering closely related topics. Section 1 presents the fundamentals of diffusion: the problem we are trying to solve and an overview of the basic approach. Sections 2 and 3 show how to construct a stochastic and deterministic diffusion sampler, respectively, and give intuitive derivations for why these samplers correctly reverse the forward diffusion process. Section 4 covers the closelyrelated topic of Flow Matching, which can be thought of as a generalization of diffusion that offers additional flexibility (including what are called rectified flows or linear flows). Finally, in Section 5 we return to diffusion and connect this tutorial to the broader literature while highlighting some of the design choices that matter most in practice, including samplers, noise schedules, and parametrizations.
Acknowledgements
We are grateful for helpful feedback and suggestions from many people, in particular: Josh Susskind, Eugene Ndiaye, Dan Busbridge, Sam Power, De Wang, Russ Webb, Sitan Chen, Vimal Thilak, Etai Littwin, Chenyang Yuan, Alex Schwing, and Miguel Angel Bautista Martin.
1 Fundamentals of Diffusion
The goal of generative modeling is: given i.i.d. samples from some unknown distribution ${p}^{\ast}\left(x\right)$, construct a sampler for (approximately) the same distribution. For example, given a training set of dog images from some underlying distribution ${p}_{\text{dog}}$, we want a method of producing new images of dogs from this distribution.
One way to solve this problem, at a high level, is to learn a transformation from some easytosample distribution (such as Gaussian noise) to our target distribution ${p}^{\ast}$. Diffusion models offer a general framework for learning such transformations. The clever trick of diffusion is to reduce the problem of sampling from distribution ${p}^{\ast}\left(x\right)$ into to a sequence of easier sampling problems.
This idea is best explained via the following Gaussian diffusion example. We’ll sketch the main ideas now, and in later sections we will use this setup to derive what are commonly known as the DDPM and DDIM samplers^{1}^{1}1These stand for Denoising Diffusion Probabilistic Models (DDPM) and Denoising Diffusion Implicit Models (DDIM), following Ho et al. (2020) and Song et al. (2021)., and reason about their correctness.
1.1 Gaussian Diffusion
For Gaussian diffusion, let ${x}_{0}$ be a random variable in ${\mathbb{R}}^{d}$ distributed according to the target distribution ${p}^{\ast}$ (e.g., images of dogs). Then construct a sequence of random variables ${x}_{1},{x}_{2},\mathrm{\dots},{x}_{T}$, by successively adding independent Gaussian noise with some small scale $\sigma $:
$${x}_{t+1}:={x}_{t}+{\eta}_{t},{\eta}_{t}\sim \mathcal{N}(0,{\sigma}^{2}).$$  (1) 
This is called the forward process^{2}^{2}2 One benefit of using this particular forward process is computational: we can directly sample ${x}_{t}$ given ${x}_{0}$ in constant time., which transforms the data distribution into a noise distribution. Equation (1) defines a joint distribution over all $({x}_{0},{x}_{1},\mathrm{\dots},{x}_{T})$, and we let ${\left\{{p}_{t}\right\}}_{t\in \left[T\right]}$ denote the marginal distributions of each ${x}_{t}$. Notice that at large step count $T$, the distribution ${p}_{T}$ is nearly Gaussian^{3}^{3}3Formally, ${p}_{T}$ is close in KL divergence to $\mathcal{N}(0,T{\sigma}^{2})$, assuming ${p}_{0}$ has bounded moments., so we can approximately sample from ${p}_{T}$ by just sampling a Gaussian.
Now, suppose we can solve the following subproblem:
“Given a sample marginally distributed as ${\mathrm{p}}_{\mathrm{t}}$, produce a sample marginally distributed as ${\mathrm{p}}_{\mathrm{t}1}$”.
We will call a method that does this a reverse sampler^{4}^{4}4Reverse samplers will be formally defined in Section 1.2 below., since it tells us how to sample from ${p}_{t1}$ assuming we can already sample from ${p}_{t}$. If we had a reverse sampler, we could sample from our target ${p}_{0}$ by simply starting with a Gaussian sample from ${p}_{T}$, and iteratively applying the reverse sampling procedure to get samples from ${p}_{T1},{p}_{T2},\mathbf{\dots}$ and finally ${p}_{0}={p}^{\ast}$.
The key insight of diffusion is, learning to reverse each intermediate step can be easier than learning to sample from the target distribution in one step^{5}^{5}5 Intuitively this is because the distributions $({p}_{t1},{p}_{t})$ are already quite close, so the reverse sampler does not need to do much.. There are many ways to construct reverse samplers, but for concreteness let us first see the standard diffusion sampler which we will call the DDPM sampler^{6}^{6}6This is the sampling strategy originally proposed in SohlDickstein et al. (2015)..k
The Ideal DDPM sampler uses the obvious strategy: At time $t$, given input $z$ (which is promised to be a sample from ${p}_{t}$), we output a sample from the conditional distribution
$$p\left({x}_{t1}\mid {x}_{t}=z\right).$$  (2) 
This is clearly a correct reverse sampler. The problem is, it requires learning a generative model for the conditional distribution $p\left({x}_{t1}\mid {x}_{t}\right)$ for every ${x}_{t}$, which could be complicated. But if the perstep noise $\sigma $ is sufficiently small, then it turns out this conditional distribution becomes simple:
Fact 1 (Diffusion Reverse Process).
For small $\sigma $, and the Gaussian diffusion process defined in (1), the conditional distribution $p({x}_{t1}\mid {x}_{t})$ is itself close to Gaussian. That is, for all times $t$ and conditionings $z\in {\mathbb{R}}^{d}$, there exists some mean parameter $\mu \in {\mathbb{R}}^{d}$ such that
$$p\left({x}_{t1}\mid {x}_{t}=z\right)\approx \mathcal{N}({x}_{t1};\mu ,{\sigma}^{2}).$$  (3) 
This is not an obvious fact; we will derive it in Section 2.1. This fact enables a drastic simplification: instead of having to learn an arbitrary distribution $p\left({x}_{t1}\mid {x}_{t}\right)$ from scratch, we now know everything about this distribution except its mean, which we denote^{7}^{7}7 We denote the mean as a function ${\mu}_{t1}:{\mathbb{R}}^{d}\to {\mathbb{R}}^{d}$ because the mean of $p({x}_{t1}\mid {x}_{t})$ depends on the time $t$ as well as the conditioning ${x}_{t}$, as described in Fact 1. ${\mu}_{t1}\left({x}_{t}\right)$. The fact that we can approximate the posterior distribution as Gaussian when $\sigma $ is sufficiently small is illustrated in Fig 2. This is an important point, so to reiterate: for a given time $t$ and conditioning value ${x}_{t}$, learning the mean of $p\left({x}_{t1}\mid {x}_{t}\right)$ is sufficient to learn the full conditional distribution $p\left({x}_{t1}\mid {x}_{t}\right)$.
Learning the mean of $p\left({x}_{t1}\mid {x}_{t}\right)$ is a much simpler problem than learning the full conditional distribution, because we can solve it by regression. To elaborate, we have a joint distribution $({x}_{t1},{x}_{t})$ from which we can easily sample, and we would like to estimate $\mathbb{E}\left[{x}_{t1}\mid {x}_{t}\right]$. This can be done by optimizing a standard regression loss^{8}^{8}8 Recall the generic fact that for any distribution over $(x,y)$, we have: ${argmin}_{f}\mathbb{E}{\Vert f(x)y\Vert}^{2}=\mathbb{E}[y\mid x]$ :
${\mu}_{t1}\left(z\right)$  $:=\mathbb{E}\left[{x}_{t1}\mid {x}_{t}=z\right]$  (4)  
$\u27f9{\mu}_{t1}$  $=\underset{f:{\mathbb{R}}^{d}\to {\mathbb{R}}^{d}}{argmin}\underset{{x}_{t},{x}_{t1}}{\mathbb{E}}{\Vert f\left({x}_{t}\right){x}_{t1}\Vert}_{2}^{2}$  (5)  
$=\underset{f:{\mathbb{R}}^{d}\to {\mathbb{R}}^{d}}{argmin}\underset{{x}_{t1},\eta}{\mathbb{E}}\left\rightf({x}_{t1}+{\eta}_{t}){x}_{t1}\left)\right{}_{2}{}^{2},$  (6) 
where the expectation is taken over samples ${x}_{0}$ from our target distribution ${p}^{\ast}$.^{9}^{9}9Notice that we simulate samples of $({x}_{t1},{x}_{t})$ by adding noise to the samples of ${x}_{0}$, as defined in Equation 1. This particular regression problem is wellstudied in certain settings. For example, when the target ${p}^{\ast}$ is a distribution on images, then the corresponding regression problem (Equation 6) is exactly an image denoising objective, which can be approached with familiar methods (e.g. convolutional neural networks).
Stepping back, we have seen something remarkable: we have reduced the problem of learning to sample from an arbitrary distribution to the standard problem of regression.
1.2 Diffusions in the Abstract
Let us now abstract away the Gaussian setting, to define diffusionlike models in a way that will capture their many instantiations (including deterministic samplers, discrete domains, and flowmatching).
Abstractly, here is how to construct a diffusionlike generative model: We start with our target distribution ${p}^{\ast}$, and we pick some base distribution $q\left(x\right)$ which is easy to sample from, e.g. a standard Gaussian or i.i.d bits. We then try to construct a sequence of distributions which interpolate between our target ${p}^{\ast}$ and the base distribution $q$. That is, we construct distributions
$${p}_{0},{p}_{1},{p}_{2},\mathrm{\dots},{p}_{T},$$  (7) 
such that ${p}_{0}={p}^{\ast}$ is our target, ${p}_{T}=q$ the base distribution, and adjacent distributions $({p}_{t1},{p}_{t})$ are marginally “close” in some appropriate sense. Then, we learn a reverse sampler which transforms distributions ${p}_{t}$ to ${p}_{t1}$. This is the key learning step, which presumably is made easier by the fact that adjacent distributions are “close.” Formally, reverse samplers are defined below.
Definition 1 (Reverse Sampler).
Given a sequence of marginal distributions ${p}_{t}$, a reverse sampler for step $t$ is a potentially stochastic function ${F}_{t}$ such that if ${x}_{t}\sim {p}_{t}$, then the marginal distribution of ${F}_{t}({x}_{t})$ is exactly ${p}_{t1}$:
$$\{{F}_{t}\left(z\right):z\sim {p}_{t}\}\equiv {p}_{t1}.$$  (8) 
There are many possible reverse samplers^{10}^{10}10Notice that none of this abstraction is specific to the case of Gaussian noise— in fact, it does not even require the concept of “adding noise”. It is even possible to instantiate in discrete settings, where we consider distributions ${p}^{\ast}$ over a finite set, and define corresponding “interpolating distributions” and reverse samplers., and it is even possible to construct reverse samplers which are deterministic. In the remainder of this tutorial we will see three popular reverse samplers more formally: the DDPM sampler discussed above (Section 2.1), the DDIM sampler (Section 3), which is deterministic, and the family of flowmatching models (Section 4), which can be thought of as a generalization of DDIM.^{11}^{11}11 Given a set of marginal distributions $\{{p}_{t}\}$, there are many possible joint distributions consistent with these marginals (such joint distributions are called couplings). There is therefore no canonical reverse sampler for a given set of marginals $\{{p}_{t}\}$ — we are free to chose whichever coupling is most convenient.
1.3 Discretization
Before we proceed further, we need to be more precise about what we mean by adjacent distributions ${p}_{t},{p}_{t1}$ being “close”. We want to think of the sequence ${p}_{0},{p}_{1},\mathrm{\dots},{p}_{T}$ as the discretization of some (wellbehaved) timeevolving function $p(x,t)$, that starts from the target distribution ${p}_{0}$ at time $t=0$ and ends at the noisy distribution ${p}_{T}$ at time $t=1$:
$$p(x,k\mathrm{\Delta}t)={p}_{k}\left(x\right),\text{where}\mathrm{\Delta}t=\frac{1}{T}.$$  (9) 
The number of steps $T$ controls the fineness of the discretization (hence the closeness of adjacent distributions).^{12}^{12}12This naturally suggests taking the continuoustime limit, which we discuss in Section 2.4, though it is not needed for most of our arguments.
In order to ensure that the variance of the final distribution, ${p}_{T}$, is independent of the number of discretization steps, we also need to be more specific about the variance of each increment. Note that if ${x}_{k}={x}_{k1}+\mathcal{N}(0,{\sigma}^{2})$, then ${x}_{T}\sim \mathcal{N}({x}_{0},T{\sigma}^{2})$. Therefore, we need to scale the variance of each increment by $\mathrm{\Delta}t=1/T$, that is, choose
$\sigma ={\sigma}_{q}\sqrt{\mathrm{\Delta}t},$  (10) 
where ${\sigma}_{q}^{2}$ is the desired terminal variance. This choice ensures that the variance of ${p}_{T}$ is always ${\sigma}_{q}^{2}$, regardless of $T$. (The $\sqrt{\mathrm{\Delta}t}$ scaling will turn out to be important in our arguments for the correctness of our reverse solvers in the next chapter, and also connects to the SDE formulation in Section 2.4.)
At this point, it is convenient to adjust our notation. From here on, $t$ will represent a continuousvalue in the interval $[0,1]$ (specifically, taking one of the values $0,\mathrm{\Delta}t,2\mathrm{\Delta}t,\mathrm{\dots},T\mathrm{\Delta}t=1$). Subscripts will indicate time rather than index, so for example ${x}_{t}$ will now denote $x$ at a discretized time $t$. That is, Equation 1 becomes:
$${x}_{t+\mathrm{\Delta}t}:={x}_{t}+{\eta}_{t},{\eta}_{t}\sim \mathcal{N}(0,{\sigma}_{q}^{2}\mathrm{\Delta}t),$$  (11) 
which also implies that
$${x}_{t}\sim \mathcal{N}({x}_{0},{\sigma}_{t}^{2}),\text{where}{\sigma}_{t}:={\sigma}_{q}\sqrt{t},$$  (12) 
since the total noise added up to time $t$ (i.e. ${\sum}_{\tau \in \{0,\mathrm{\Delta}t,2\mathrm{\Delta}t,\mathrm{\dots},t\mathrm{\Delta}t\}}{\eta}_{\tau}$) is also Gaussian with mean zero and variance ${\sum}_{\tau}{\sigma}_{q}^{2}\mathrm{\Delta}t={\sigma}_{q}^{2}t$.
2 Stochastic Sampling: DDPM
In this section we review the DDPMlike reverse sampler discussed in Section 1, and heuristically prove its correctness. This sampler is conceptually the same as the sampler popularized in Denoising Diffusion Probabilistic Models (DDPM) by Ho et al. (2020) and originally introduced by SohlDickstein et al. (2015), when adapted to our simplified setting. However, a word of warning for the reader familiar with Ho et al. (2020): Although the overall strategy of our sampler is identical to Ho et al. (2020), certain technical details (like constants, etc) are slightly different^{13}^{13}13 For the experts, the main difference is we use the “Variance Exploding” diffusion forward process. We also use a constant noise schedule, and we do not discuss how to parameterize the predictor (“predicting ${x}_{0}$ vs. ${x}_{t1}$ vs. noise $\eta $”). We elaborate on the latter point in Section 2.3. .
We consider the setup from Section 1.3, with some target distribution ${p}^{\ast}$ and the joint distribution of noisy samples $({x}_{0},{x}_{\mathrm{\Delta}t},\mathrm{\dots},{x}_{1})$ defined by Equation (11). The DDPM sampler will require estimates of the following conditional expectations:
${\mu}_{t}\left(z\right):=\mathbb{E}\left[{x}_{t}\mid {x}_{t+\mathrm{\Delta}t}=z\right].$  (13) 
This is a set of functions $\left\{{\mu}_{t}\right\}$, one for every time step $t\in \{0,\mathrm{\Delta}t,\mathrm{\dots},1\mathrm{\Delta}t\}$. In the training phase, we estimate these functions from i.i.d. samples of ${x}_{0}$, by optimizing the denoising regression objective
${\mu}_{t}$  $=\underset{f:{\mathbb{R}}^{d}\to {\mathbb{R}}^{d}}{argmin}\underset{{x}_{t},{x}_{t+\mathrm{\Delta}t}}{\mathbb{E}}{\Vert f\left({x}_{t+\mathrm{\Delta}t}\right){x}_{t}\Vert}_{2}^{2},$  (14) 
typically with a neuralnetwork^{14}^{14}14 In practice, it is common to share parameters when learning the different regression functions ${\{{\mu}_{t}\}}_{t}$, instead of learning a separate function for each timestep independently. This is usually implemented by training a model ${f}_{\theta}$ that accepts the time $t$ as an additional argument, such that ${f}_{\theta}({x}_{t},t)\approx {\mu}_{t}({x}_{t})$. parameterizing $f$. Then, in the inference phase, we use the estimated functions in the following reverse sampler.
[nobreak=true]
Algorithm 1: Stochastic Reverse Sampler (DDPMlike)
For input sample ${x}_{t}$, and timestep $t$, output:
${\widehat{x}}_{t\mathrm{\Delta}t}\leftarrow {\mu}_{t\mathrm{\Delta}t}\left({x}_{t}\right)+\mathcal{N}(0,{\sigma}_{q}^{2}\mathrm{\Delta}t)$  (15) 
To actually generate a sample, we first sample ${x}_{1}$ as an isotropic Gaussian ${x}_{1}\sim \mathcal{N}(0,{\sigma}_{q}^{2})$, and then run the iteration of Algorithm 1 down to $t=0$, to produce a generated sample ${\widehat{x}}_{0}$. (Recall that in our discretized notation (12), ${x}_{1}$ is the fullynoised terminal distribution, and the iteration takes steps of size $\mathrm{\Delta}t$.) Explicit pseudocode for these algorithms are given in Section 2.2.
We want to reason about correctness of this entire procedure: why does iterating Algorithm 1 produce a sample from [approximately] our target distribution ${p}^{\ast}$? The key missing piece is, we need to prove some version of Fact 1: that the true conditional $p\left({x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right)$ can be wellapproximated by a Gaussian, and this approximation gets better as we scale $\mathrm{\Delta}t\to 0$.
2.1 Correctness of DDPM
Here is a more precise version of Fact 1, along with a heuristic derivation. This will complete the argument that Algorithm 1 is correct— i.e. that it approximates a valid reverse sampler in the sense of Definition 1.
Claim 1 (Informal).
Let ${p}_{t\mathrm{\Delta}t}(x)$ be an arbitrary, sufficientlysmooth density over ${\mathbb{R}}^{d}$. Consider the joint distribution of $({x}_{t\mathrm{\Delta}t},{x}_{t})$, where ${x}_{t\mathrm{\Delta}t}\sim {p}_{t\mathrm{\Delta}t}$ and ${x}_{t}\sim {x}_{t\mathrm{\Delta}t}+\mathcal{N}(0,{\sigma}_{q}^{2}\mathrm{\Delta}t)$. Then, for sufficiently small $\mathrm{\Delta}t$, the following holds. For all conditionings $z\in {\mathbb{R}}^{d}$, there exists ${\mu}_{z}$ such that:
$p\left({x}_{t\mathrm{\Delta}t}\mid {x}_{t}=z\right)\approx \mathcal{N}({x}_{t\mathrm{\Delta}t};{\mu}_{z},{\sigma}_{q}^{2}\mathrm{\Delta}t).$  (16) 
for some constant ${\mu}_{z}$ depending only on $z$. Moreover, it suffices to take^{15}^{15}15Experts will recognize this mean as related to the score. In fact, Tweedie’s formula implies that this mean is exactly correct even for large $\mathrm{\Delta}t$, with no approximation required. That is, $\mathbb{E}\left[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}=z\right]=z+{\sigma}_{q}^{2}\mathrm{\Delta}t\nabla \mathrm{log}{p}_{t}\left(z\right)$. The distribution $p\left({x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right)$ may deviate from Gaussian, however, for larger $\sigma $.
${\mu}_{z}$  $:=\underset{({x}_{t\mathrm{\Delta}t},{x}_{t})}{\mathbb{E}}\left[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}=z\right]$  (17)  
$=z+\left({\sigma}_{q}^{2}\mathrm{\Delta}t\right)\nabla \mathrm{log}{p}_{t}\left(z\right),$  (18) 
where ${p}_{t}$ is the marginal distribution of ${x}_{t}$.
Before we see the derivation, a few remarks: Claim 1 implies that to sample from ${x}_{t\mathrm{\Delta}t}$, it suffices to first sample from ${x}_{t}$, then sample from a Gaussian distribution centered around $\mathbb{E}\left[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right]$. This is exactly what DDPM does, in Equation (15). Finally, in these notes we will not actually need the expression for ${\mu}_{z}$ in Equation (18); it is enough for us know that such a ${\mu}_{z}$ exists, so we can learn it from samples.
Proof of Claim 1 (Informal).
Here is a heuristic argument for why the score appears in the reverse process. We will essentially just apply Bayes rule and then Taylor expand appropriately. We start with Bayes rule:
$p\left({x}_{t\mathrm{\Delta}t}{x}_{t}\right)$  $=p\left({x}_{t}{x}_{t\mathrm{\Delta}t}\right){p}_{t\mathrm{\Delta}t}\left({x}_{t\mathrm{\Delta}t}\right)/{p}_{t}\left({x}_{t}\right)$  (19) 
Then take logs of both sizes. Throughout, we will drop any additive constants in the log (which translate to normalizing factors), and drop all terms of order $\mathcal{O}(\mathrm{\Delta}t)$ ^{†}^{†}[1cm]Note that ${x}_{t+1}{x}_{t}\sim \mathcal{O}(\sqrt{\mathrm{\Delta}t})$. Dropping $\mathcal{O}(\mathrm{\Delta}t)$ terms means dropping ${({x}_{t+1}{x}_{t})}^{2}\sim \mathcal{O}(\mathrm{\Delta}t)$ in the expansion of ${p}_{t}({x}_{t})$, but keeping $\frac{1}{2{\sigma}_{q}^{2}\mathrm{\Delta}t}{({x}_{t+1}{x}_{t})}^{2}\sim \mathcal{O}(1)$ in $p({x}_{t}{x}_{t+1})$.. Note that we should think of ${x}_{t}$ as a constant in this derivation, since we want to understand the conditional probability as a function of ${x}_{t\mathrm{\Delta}t}$. Now:
$\mathrm{log}p\left({x}_{t\mathrm{\Delta}t}{x}_{t}\right)=\mathrm{log}p\left({x}_{t}{x}_{t\mathrm{\Delta}t}\right)+\mathrm{log}{p}_{t\mathrm{\Delta}t}\left({x}_{t\mathrm{\Delta}t}\right)\overline{)\mathrm{log}{p}_{t}\left({x}_{t}\right)}$  Drop constants involving only ${x}_{t}$.  
$=\mathrm{log}p\left({x}_{t}{x}_{t\mathrm{\Delta}t}\right)+\mathrm{log}{p}_{t}\left({x}_{t\mathrm{\Delta}t}\right)+\mathcal{O}\left(\mathrm{\Delta}t\right)$  Since ${p}_{t\mathrm{\Delta}t}(\cdot )={p}_{t}(\cdot )+\mathrm{\Delta}t\frac{\partial}{\partial t}{p}_{t}(\cdot )$.  
$={\displaystyle \frac{1}{2{\sigma}_{q}^{2}\mathrm{\Delta}t}}{\Vert {x}_{t\mathrm{\Delta}t}{x}_{t}\Vert}_{2}^{2}+\mathrm{log}{p}_{t}\left({x}_{t\mathrm{\Delta}t}\right)$  Definition of $\mathrm{log}p({x}_{t}{x}_{t\mathrm{\Delta}t})$.  
$={\displaystyle \frac{1}{2{\sigma}_{q}^{2}\mathrm{\Delta}t}}{\Vert {x}_{t\mathrm{\Delta}t}{x}_{t}\Vert}_{2}^{2}$  
$+\overline{)\mathrm{log}{p}_{t}\left({x}_{t}\right)}+\u27e8{\nabla}_{x}\mathrm{log}{p}_{t}\left({x}_{t}\right),\left({x}_{t\mathrm{\Delta}t}{x}_{t}\right)\u27e9+\mathcal{O}\left(\mathrm{\Delta}t\right)$  Taylor expand around ${x}_{t}$ and drop constants.  
$={\displaystyle \frac{1}{2{\sigma}_{q}^{2}\mathrm{\Delta}t}}\left({\Vert {x}_{t\mathrm{\Delta}t}{x}_{t}\Vert}_{2}^{2}2{\sigma}_{q}^{2}\mathrm{\Delta}t\u27e8{\nabla}_{x}\mathrm{log}{p}_{t}\left({x}_{t}\right),\left({x}_{t\mathrm{\Delta}t}{x}_{t}\right)\u27e9\right)$  
$={\displaystyle \frac{1}{2{\sigma}_{q}^{2}\mathrm{\Delta}t}}{\Vert {x}_{t\mathrm{\Delta}t}{x}_{t}{\sigma}_{q}^{2}\mathrm{\Delta}t{\nabla}_{x}\mathrm{log}{p}_{t}\left({x}_{t}\right)\Vert}_{2}^{2}+C$  Complete the square in $({x}_{t\mathrm{\Delta}t}{x}_{t})$, and drop constant $C$ involving only ${x}_{t}$.  
$={\displaystyle \frac{1}{2{\sigma}_{q}^{2}\mathrm{\Delta}t}}{\Vert {x}_{t\mathrm{\Delta}t}\mu \Vert}_{2}^{2}$  For $\mu :={x}_{t}+({\sigma}_{q}^{2}\mathrm{\Delta}t){\nabla}_{x}\mathrm{log}{p}_{t}({x}_{t})$. 
This is identical, up to additive factors, to the logdensity of a Normal distribution with mean $\mu $ and variance ${\sigma}_{q}^{2}\mathrm{\Delta}t$. Therefore,
$$p\left({x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right)\approx \mathcal{N}({x}_{t\mathrm{\Delta}t};\mu ,{\sigma}_{q}^{2}\mathrm{\Delta}t).$$  (20) 
∎
Reflecting on this derivation, the main idea was that for small enough $\mathrm{\Delta}t$, the Bayesrule expansion of the reverse process $p\left({x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right)$ is dominated by the term $p\left({x}_{t}\mid {x}_{t\mathrm{\Delta}t}\right)$, from the forward process. This is intuitively why the reverse process and the forward process have the same functional form (both are Gaussian here)^{16}^{16}16This general relationship between forward and reverse processes holds somewhat more generally than just Gaussian diffusion; see e.g. the discussion in SohlDickstein et al. (2015)..
Technical Details [Optional].
The meticulous reader may notice that Claim 1 is not obviously sufficient to imply correctness of the entire DDPM algorithm. The issue is: as we scale down $\mathrm{\Delta}t$, the error in our perstep approximation (Equation 16) decreases, but the number of total steps required increases. So if the perstep error does not decrease fast enough (as a function of $\mathrm{\Delta}t$), then these errors could accumulate to a nonnegligible error by the final step. Thus, we need to quantify how fast the perstep error decays. Lemma 1 below is one way of quantifying this: it states that if the stepsize (i.e. variance of the perstep noise) is ${\sigma}^{2}$, then the KL error of the perstep Gaussian approximation is $\mathcal{O}\left({\sigma}^{4}\right)$. This decay rate is fast enough, because the number of steps only grows as^{17}^{17}17 The chain rule for KL implies that we can add up these perstep errors: the approximation error for the final sample is bounded by the sum of all the perstep errors. $\mathrm{\Omega}\left(1/{\sigma}^{2}\right)$.
Lemma 1.
Let $p(x)$ be an arbitrary density over $\mathbb{R}$, with bounded 1st to 4th order derivatives. Consider the joint distribution $({x}_{0},{x}_{1})$, where ${x}_{0}\sim p$ and ${x}_{1}\sim {x}_{0}+\mathcal{N}(0,{\sigma}^{2})$. Then, for any conditioning $z\in \mathbb{R}$, we have
$$\mathrm{KL}\left(\mathcal{N}({\mu}_{z},{\sigma}^{2})\right\left{p}_{{x}_{0}\mid {x}_{1}}(\cdot \mid {x}_{1}=z)\right)\le O\left({\sigma}^{4}\right),$$  (21) 
where
$${\mu}_{z}:=z+{\sigma}^{2}\nabla \mathrm{log}p\left(z\right).$$  (22) 
2.2 Algorithms
Pseudocode listings 1 and 2 give the explicit DDPM train loss and sampling code. To train^{18}^{18}18Note that the training procedure optimizes ${f}_{\theta}$ for all timesteps $t$ simultaneously, by sampling $t\in [0,1]$ uniformly in Line 2. the network ${f}_{\theta}$, we must minimize the expected loss ${L}_{\theta}$ output by Pseudocode 1, typically by backpropagation.
Pseudocode 3 describes the closelyrelated DDIM sampler, which will be discussed later in Section 3.
2.3 Variance Reduction: Predicting ${x}_{0}$
Thus far, our diffusion models have been trained to predict $\mathbb{E}[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}]$: this is what Algorithm 1 requires, and what the training procedure of Pseudocode 1 produces. However, many practical diffusion implementations actually train to predict $\mathbb{E}\left[{x}_{0}\mid {x}_{t}\right]$, i.e. to predict the expectation of the initial point ${x}_{0}$ instead of the previous point ${x}_{t\mathrm{\Delta}t}$. This difference turns out to be just a variance reduction trick, which estimates the same quantity in expectation. Formally, the two quantities can be related as follows:
Claim 2.
For the Gaussian diffusion setting of Section 1.3 , we have:
$\mathbb{E}\left[\left({x}_{t\mathrm{\Delta}t}{x}_{t}\right)\mid {x}_{t}\right]={\displaystyle \frac{\mathrm{\Delta}t}{t}}\mathbb{E}\left[\left({x}_{0}{x}_{t}\right)\mid {x}_{t}\right].$  (23) 
Or equivalently:
$$\mathbb{E}\left[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right]=\left(\frac{\mathrm{\Delta}t}{t}\right)\mathbb{E}\left[{x}_{0}\mid {x}_{t}\right]+\left(1\frac{\mathrm{\Delta}t}{t}\right){x}_{t}.$$  (24) 
This claim implies that if we want to estimate $\mathbb{E}\left[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right]$, we can instead estimate $\mathbb{E}\left[{x}_{0}\mid {x}_{t}\right]$ and then then essentially divide by $\left(t/\mathrm{\Delta}t\right)$, which is the number of steps taken thus far. The variancereduced versions of the DDPM training and sampling algorithms do exactly this; we include them in Appendix B.9.
The intuition behind Claim 2 is illustrated in Figure 2: first, observe that predicting ${x}_{t\mathrm{\Delta}t}$ given ${x}_{t}$ is equivalent to predicting the last noise step, which is ${\eta}_{t\mathrm{\Delta}t}=\left({x}_{t}{x}_{t\mathrm{\Delta}t}\right)$ in the forward process of Equation (11). But, if we are only given the final ${x}_{t}$, then all of the previous noise steps $$ intuitively “look the same”— we cannot distinguish between noise that was added at the last step from noise that was added at the 5th step, for example. By this symmetry, we can conclude that all of the individual noise steps are distributed identically (though not independently) given ${x}_{t}$. Thus, instead of estimating a single noise step, we can equivalently estimate the average of all prior noise steps, which has much lower variance. There are $\left(t/\mathrm{\Delta}t\right)$ elapsed noise steps by time $t$, so we divide the total noise by this quantity in Equation 23 to compute the average. See Appendix B.8 for a formal proof.
Word of warning: Diffusion models should always be trained to estimate expectations. In particular, when we train a model to predict $\mathbb{E}\left[{x}_{0}\mid {x}_{t}\right]$, we should not think of this as trying to learn “how to sample from the distribution $p\left({x}_{0}\mid {x}_{t}\right)$”. For example, if we are training an image diffusion model, then the optimal model will output $\mathbb{E}\left[{x}_{0}\mid {x}_{t}\right]$ which will look like a blurry mix of images (e.g. Figure 1b in Karras et al. (2022))— it will not look like an actual image sample. It is good to keep in mind that when diffusion papers colloquially discuss models “predicting ${x}_{0}$”, they do not mean producing something that looks like an actual sample of ${x}_{0}$.
2.4 Diffusions as SDEs [Optional]
In this section, we connect the discretetime processes we have discussed so far to stochastic differential equations (SDEs). In the continuous limit, as $\mathrm{\Delta}t\to 0$, our discrete diffusion process turns into a stochastic differential equation. SDEs can also represent many other diffusion variants (corresponding to different drift and diffusion terms), offering flexibility in design choices, like scaling and noisescheduling. The SDE perspective is powerful because existing theory provides a general closedform solution for the timereversed SDE. Discretization of the reversetime SDE for our particular diffusion immediately yields the sampler we derived in this section, but reversetime SDEs for other diffusion variants are also available automatically (and can then be solved with any offtheshelf or custom SDE solver), enabling better training and sampling strategies as we will discuss further in Section 5. Though we mention these connections only briefly here, the SDE perspective has had significant impact on the field. For a more detailed discussion, we recommend Yang Song’s blog post (Song, 2021).
The Limiting SDE
Recall our discrete update rule:
${x}_{t+\mathrm{\Delta}t}$  $={x}_{t}+{\sigma}_{q}\sqrt{\mathrm{\Delta}t}\xi ,\xi \sim \mathcal{N}(0,1).$ 
In this limit as $\mathrm{\Delta}t\to 0$, this corresponds to a zerodrift SDE:
$dx$  $={\sigma}_{q}dw,$  (25) 
where $w$ is a Brownian motion. A Brownian motion is a stochastic process with i.i.d. Gaussian increments whose variance scales with $\mathrm{\Delta}t$.^{19}^{19}19See Eldan (2024) for a highlevel overview of Brownian motions and Itô’s formula. See also Evans (2012) for a gentle introductory textbook, and Kloeden and Platen (2011) for numerical methods. Very heuristically, we can think of $dw\sim {lim}_{\mathrm{\Delta}t\to 0}\sqrt{\mathrm{\Delta}t}\mathcal{N}(0,1),$ and thus “derive” (25) by
$dx$  $=\underset{\mathrm{\Delta}t\to 0}{lim}\left({x}_{t+\mathrm{\Delta}t}{x}_{t}\right)={\sigma}_{q}\underset{\mathrm{\Delta}t\to 0}{lim}\sqrt{\mathrm{\Delta}t}\xi ={\sigma}_{q}dw.$ 
More generally, different variants of diffusion are equivalent to SDEs with different choices of drift and diffusion terms:
$dx$  $=f(x,t)dt+g\left(t\right)dw.$  (26) 
The SDE (25) simply has $f=0$ and $g={\sigma}_{q}$. This formulation encompasses many other possibilities, though, corresponding to different choices of $f$, $g$ in the SDE. As we will revisit in Section 5, this flexibility is important for developing effective algorithms. Two important choices made in practice are tuning the noise schedule and scaling ${x}_{t}$; together these can help to control the variance of ${x}_{t}$, and control how much we focus on different noise levels. Adopting a flexible noise schedule $\left\{{\sigma}_{t}\right\}$ in place of the fixed schedule ${\sigma}_{t}\equiv {\sigma}_{q}\sqrt{t}$ corresponds to the SDE (Song et al., 2020)
${x}_{t}\sim \mathcal{N}({x}_{0},{\sigma}_{t}^{2})\iff {x}_{t}={x}_{t\mathrm{\Delta}t}+\sqrt{{\sigma}_{t}^{2}{\sigma}_{t\mathrm{\Delta}t}^{2}}{z}_{t\mathrm{\Delta}t}\iff dx=\sqrt{{\displaystyle \frac{d}{dt}}{\sigma}^{2}\left(t\right)}dw.$ 
If we also wish to scale each ${x}_{t}$ by a factor $s\left(t\right)$, Karras et al. (2022) show that this corresponds to the SDE ^{20}^{20}20As a sketch of how $f$ arises, let’s ignore the noise and note that: ${x}_{t}$ $=s(t){x}_{0}$ $\iff {x}_{t+\mathrm{\Delta}t}$ $={\displaystyle \frac{s(t+\mathrm{\Delta}t)}{s(t)}}{x}_{t}$ $={x}_{t}+{\displaystyle \frac{s(t)s(t+\mathrm{\Delta}t)}{s(t)}}{x}_{t}$ $\iff dx/dt$ $={\displaystyle \frac{\dot{s}}{s}}x$
${x}_{t}\sim \mathcal{N}(s\left(t\right){x}_{0},s{\left(t\right)}^{2}\sigma {\left(t\right)}^{2})\iff f\left(x\right)={\displaystyle \frac{\dot{s}\left(t\right)}{s\left(t\right)}}x,g\left(t\right)=s\left(t\right)\sqrt{2\dot{\sigma}\left(t\right)\sigma \left(t\right)}.$ 
These are only a few examples of the rich and useful design space enabled by the flexible SDE (26).
ReverseTime SDE
The timereversal of an SDE runs the process backward in time. Reversetime SDEs are the continuoustime analog of samplers like DDPM. A deep result due to Anderson (1982) (and nicely rederived in Winkler (2021)) states that the timereversal of SDE (26) is given by:
$dx$  $=\left(f(x,t)g{\left(t\right)}^{2}{\nabla}_{x}\mathrm{log}{p}_{t}\left(x\right)\right)dt+g\left(t\right)d\overline{w}$  (27) 
That is, SDE (27) tells us how to run any SDE of the form (26) backward in time! This means that we don’t have to rederive the reversal in each case, and we can choose any SDE solver to yield a practical sampler. But nothing is free: we sill cannot use (27) directly to sample backward, since the term ${\nabla}_{x}\mathrm{log}{p}_{t}\left(x\right)$ – which is in fact the score that previously appeared in equation 18 – is unknown in general, since it depends on ${p}_{t}$. However, if we can learn the score, then we can solve the reverse SDE. This is analogous to discrete diffusion, where the forward process is easy to model (it just adds noise), while the reverse process must be learned.
Let us take a moment to discuss the score, ${\nabla}_{x}\mathrm{log}{p}_{t}\left(x\right)$, which plays a central role. Intuitively, since the score “points toward higher probability”, it helps to reverse the diffusion process, which “flattens out” the probability as it runs forward. The score is also related to the conditional expectation of ${x}_{0}$ given ${x}_{t}$. Recall that in the discrete case
$${\sigma}_{q}^{2}\mathrm{\Delta}t\nabla \mathrm{log}{p}_{t}\left({x}_{t}\right)=\mathbb{E}\left[{x}_{t\mathrm{\Delta}t}{x}_{t}\mid {x}_{t}\right]=\frac{\mathrm{\Delta}t}{t}\mathbb{E}\left[{x}_{0}{x}_{t}\mid {x}_{t}\right],$$ 
Similarly, in the continuous case we have ^{21}^{21}21We can see this directly by applying Tweedie’s formula, which states: $$\mathbb{E}[{\mu}_{z}z]=z+{\sigma}_{z}^{2}\nabla \mathrm{log}p(z)\text{for}z\sim \mathcal{N}({\mu}_{z},{\sigma}_{z}^{2}).$$ Since ${x}_{t}\sim \mathcal{N}({x}_{0},t{\sigma}_{q}^{2}),$ Tweedie with $z\equiv {x}_{t},$ ${\mu}_{z}\equiv {x}_{0}$ gives: $$\mathbb{E}[{x}_{0}{x}_{t}]={x}_{t}+t{\sigma}_{q}^{2}\nabla \mathrm{log}p({x}_{t}).$$
${\sigma}_{q}^{2}\nabla \mathrm{log}{p}_{t}\left({x}_{t}\right)$  $={\displaystyle \frac{1}{t}}\mathbb{E}\left[{x}_{0}{x}_{t}\mid {x}_{t}\right].$  (28) 
Returning to the reverse SDE, we can show that its discretization yields the DDPM sampler of Claim 1 as a special case. The reversal of the simple SDE (25) is:
$dx$  $={\sigma}_{q}^{2}{\nabla}_{x}\mathrm{log}{p}_{t}\left(x\right)dt+{\sigma}_{q}d\overline{w}$  (29)  
$={\displaystyle \frac{1}{t}}\mathbb{E}\left[{x}_{0}{x}_{t}\mid {x}_{t}\right]dt+{\sigma}_{q}d\overline{w}$  (30) 
The discretization is
${x}_{t\mathrm{\Delta}t}$  $={x}_{t}{\displaystyle \frac{\mathrm{\Delta}t}{t}}\mathbb{E}\left[{x}_{0}{x}_{t}\mid {x}_{t}\right]+\mathcal{N}(0,{\sigma}_{q}^{2}\mathrm{\Delta}t)$  
$={x}_{t}\mathbb{E}\left[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right]+\mathcal{N}(0,{\sigma}_{q}^{2}\mathrm{\Delta}t)\phantom{\rule{1.44em}{0ex}}\text{(by eq.}\text{23}\text{)}$ 
which is exactly the stochastic (DDPM) sampler derived in Claim 1.
3 Deterministic Sampling: DDIM
We will now show a deterministic reverse sampler for Gaussian diffusion— which appears similar to the stochastic sampler of the previous section, but is conceptually quite different. This sampler is equivalent to the DDIM^{22}^{22}22DDIM stands for Denoising Diffusion Implicit Models, which reflects a perspective used in the original derivation of Song et al. (2021). Our derivation follows a different perspective, and the “implicit” aspect will not be important to us. update of Song et al. (2021), adapted to in our simplified setting.
We consider the same Gaussian diffusion setup as the previous section, with the joint distribution $({x}_{0},{x}_{\mathrm{\Delta}t},\mathrm{\dots},{x}_{1})$ and conditional expectation function ${\mu}_{t}\left(z\right):=\mathbb{E}\left[{x}_{t}\mid {x}_{t+\mathrm{\Delta}t}=z\right].$ The reverse sampler is defined below, and listed explicitly in Pseudocode 3.
[nobreak=true]
Algorithm 2: Deterministic Reverse Sampler (DDIMlike)
For input sample ${x}_{t}$, and step index $t$, output:
${\widehat{x}}_{t\mathrm{\Delta}t}\leftarrow {x}_{t}+\lambda \left({\mu}_{t\mathrm{\Delta}t}\left({x}_{t}\right){x}_{t}\right)$  (31) 
where $\lambda :=\left(\frac{{\sigma}_{t}}{{\sigma}_{t\mathrm{\Delta}t}+{\sigma}_{t}}\right)$ and ${\sigma}_{t}\equiv {\sigma}_{q}\sqrt{t}$ from Equation (12).
How do we show that this defines a valid reverse sampler? Since Algorithm 2 is deterministic, it does not make sense to argue that it samples from $p\left({x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right)$, as we argued for the DDPMlike stochastic sampler. Instead, we will directly show that Equation (31) implements a valid transport map between the marginal distributions ${p}_{t}$ and ${p}_{t\mathrm{\Delta}t}$. That is, if we let ${F}_{t}$ be the update of Equation (31):
${F}_{t}\left(z\right):=$  $z+\lambda \left({\mu}_{t\mathrm{\Delta}t}\left(z\right)z\right)$  (32)  
$=$  $z+\lambda \left(\mathbb{E}\left[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}=z\right]z\right)$  (33) 
then we want to show that^{23}^{23}23 The notation $F\mathrm{\u266f}p$ means the distribution of ${\{F(x)\}}_{x\sim p}$. This is called the pushforward of $p$ by the function $F$.
${F}_{t}\mathrm{\u266f}{p}_{t}\approx {p}_{t\mathrm{\Delta}t}.$  (34) 
Proof overview: The usual way to prove this is to use tools from stochastic calculus, but we’ll present an elementary derivation. Our strategy will be to first show that Algorithm 2 is correct in the simplest case of a pointmass distribution, and then lift this result to full distributions by marginalizing appropriately. For the experts, this is similar to “flowmatching” proofs.
3.1 Case 1: Single Point
Let’s first understand the simple case where the target distribution ${p}_{0}$ is a single point mass in ${\mathbb{R}}^{d}$. Without loss of generality^{24}^{24}24Because we can just “shift” our coordinates to make it so. Formally, our entire setup including Equation 33 is translationsymmetric., we can assume the point is at ${x}_{0}=0$. Is Algorithm 2 correct in this case? To reason about correctness, we want to consider the distributions of ${x}_{t}$ and ${x}_{t\mathrm{\Delta}t}$ for arbitrary step $t$. According to the diffusion forward process (Equation 11), at time $t$ the relevant random variables are^{25}^{25}25We omit the Identity matrix in these covariances for notational simplicity. The reader may assume dimension $d=1$ without loss of generality.
${x}_{0}$  $=0\phantom{\rule{1.44em}{0ex}}\text{(deterministically)}$  
${x}_{t\mathrm{\Delta}t}$  $\sim \mathcal{N}({x}_{0},{\sigma}_{t\mathrm{\Delta}t}^{2})$  
${x}_{t}$  $\sim \mathcal{N}({x}_{t\mathrm{\Delta}t},{\sigma}_{t}^{2}{\sigma}_{t\mathrm{\Delta}t}^{2}).$ 
The marginal distribution of ${x}_{t\mathrm{\Delta}t}$ is ${p}_{t\mathrm{\Delta}t}=\mathcal{N}(0,{\sigma}_{t1}^{2})$, and the marginal distribution of ${x}_{t}$ is ${p}_{t}=\mathcal{N}(0,{\sigma}_{t}^{2})$.
Let us first find some deterministic function ${G}_{t}:{\mathbb{R}}^{d}\to {\mathbb{R}}^{d}$, such that ${G}_{t}\mathrm{\u266f}{p}_{t}={p}_{t\mathrm{\Delta}t}$. There are many possible functions which will work^{26}^{26}26For example, we can always add a rotation around the origin to any valid map., but this is the obvious one:
${G}_{t}\left(z\right)$  $:=\left({\displaystyle \frac{{\sigma}_{t\mathrm{\Delta}t}}{{\sigma}_{t}}}\right)z.$  (35) 
The function ${G}_{t}$ above simply rescales the Gaussian distribution of ${p}_{t}$, to match variance of the Gaussian distribution ${p}_{t\mathrm{\Delta}t}$. It turns out this ${G}_{t}$ is exactly equivalent to the step ${F}_{t}$ taken by Algorithm 2, which we will now show.
Claim 3.
When the target distribution is a point mass ${p}_{0}={\delta}_{0}$, then update ${F}_{t}$ (as defined in Equation 33) is equivalent to the scaling ${G}_{t}$ (as defined in Equation 35):
$${F}_{t}\equiv {G}_{t}.$$  (36) 
Thus Algorithm 2 defines a reverse sampler for target distribution ${p}_{0}={\delta}_{0}.$
Proof.
To apply ${F}_{t}$, we need to compute $\mathbb{E}[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}]$ for our simple distribution. Since $({x}_{t\mathrm{\Delta}t},{x}_{t})$ are jointly Gaussian, this is^{†}^{†}[2cm]Recall the conditional expectation of two jointly Gaussian random variables $(X,Y)$ is $\mathbb{E}[X\mid Y=y]={\mu}_{X}+{\mathrm{\Sigma}}_{XY}{\mathrm{\Sigma}}_{YY}^{1}(y{\mu}_{Y})$, where ${\mu}_{X},{\mu}_{Y}$ are the respective means, and ${\mathrm{\Sigma}}_{XY},{\mathrm{\Sigma}}_{YY}$ the crosscovariance of $(X,Y)$ and covariance of $Y$. Since $X={x}_{t\mathrm{\Delta}t}$ and $Y={x}_{t}$ are centered at $0$, we have ${\mu}_{X}={\mu}_{Y}=0$. For the covariance term, since ${x}_{t}={x}_{t\mathrm{\Delta}t}+\eta $ we have ${\mathrm{\Sigma}}_{XY}=\mathbb{E}[{x}_{t}{x}_{t\mathrm{\Delta}t}^{T}]=\mathbb{E}[{x}_{t\mathrm{\Delta}t}{x}_{t\mathrm{\Delta}t}^{T}]={\sigma}_{t\mathrm{\Delta}t}^{2}{I}_{d}$. Similarly, ${\mathrm{\Sigma}}_{YY}=\mathbb{E}[{x}_{t}{x}_{t}^{T}]={\sigma}_{t}^{2}{I}_{d}$.
$\mathbb{E}\left[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right]$  $=\left({\displaystyle \frac{{\sigma}_{t\mathrm{\Delta}t}^{2}}{{\sigma}_{t}^{2}}}\right){x}_{t}.$  (37) 
The rest is algebra:
${F}_{t}\left({x}_{t}\right)$  $:={x}_{t}+\lambda \left(\mathbb{E}\left[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right]{x}_{t}\right)$  by definition of ${F}_{t}$  
$={x}_{t}+\left({\displaystyle \frac{{\sigma}_{t}}{{\sigma}_{t\mathrm{\Delta}t}+{\sigma}_{t}}}\right)\left(\mathbb{E}\left[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right]{x}_{t}\right)$  by definition of $\lambda $  
$={x}_{t}+\left({\displaystyle \frac{{\sigma}_{t}}{{\sigma}_{t\mathrm{\Delta}t}+{\sigma}_{t}}}\right)\left({\displaystyle \frac{{\sigma}_{t\mathrm{\Delta}t}^{2}}{{\sigma}_{t}^{2}}}1\right){x}_{t}$  by Equation (37)  
$=\left({\displaystyle \frac{{\sigma}_{t\mathrm{\Delta}t}}{{\sigma}_{t}}}\right){x}_{t}$  
$={G}_{t}\left({x}_{t}\right).$ 
We therefore conclude that Algorithm 2 is a correct reverse sampler, since it is equivalent to ${G}_{t}$, and ${G}_{t}$ is valid. ∎
3.2 Velocity Fields and Gases
Before we move on, it will be helpful to think of the DDIM update as equivalent to a velocity field, which moves points at time $t$ to their positions at time $\left(t\mathrm{\Delta}t\right)$. Specifically, define the vector field
$${v}_{t}\left({x}_{t}\right):=\frac{\lambda}{\mathrm{\Delta}t}\left(\mathbb{E}\left[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right]{x}_{t}\right).$$  (38) 
Then the DDIM update algorithm of Equation (31) can be written as:
${\widehat{x}}_{t\mathrm{\Delta}t}:=$  ${x}_{t}+\lambda \left({\mu}_{t\mathrm{\Delta}t}\left({x}_{t}\right){x}_{t}\right)$  from Equation (31)  
$=$  ${x}_{t}+{v}_{t}\left({x}_{t}\right)\mathrm{\Delta}t.$  (39) 
The physical intuition for ${v}_{t}$ is: imagine a gas of noninteracting particles, with density field given by ${p}_{t}$. Then, suppose a particle at position $z$ moves in the direction ${v}_{t}\left(z\right)$. The resulting gas will have density field ${p}_{t\mathrm{\Delta}t}$. We write this process as
$${p}_{t}\stackrel{{v}_{t}}{\u27f6}{p}_{t\mathrm{\Delta}t}.$$  (40) 
In the limit of small stepsize $\mathrm{\Delta}t$, speaking informally, we can think of ${v}_{t}$ as a velocity field — which specifies the instantaneous velocity of particles moving according to the DDIM algorithm.
As a concrete example, if the target distribution ${p}_{0}={\delta}_{{x}_{0}}$, as in Section 3.1, then the velocity field of DDIM is ${v}_{t}\left({x}_{t}\right)=\left(\frac{{\sigma}_{t}{\sigma}_{t\mathrm{\Delta}t}}{{\sigma}_{t}}\right)\left({x}_{0}{x}_{t}\right)/\mathrm{\Delta}t$ which is a vector field that always points towards the initial point ${x}_{0}$ (see Figure 3.2).
3.3 Case 2: Two Points
Now let us show Algorithm 2 is correct when the target distribution is a mixture of two points:
$${p}_{0}:=\frac{1}{2}{\delta}_{a}+\frac{1}{2}{\delta}_{b},$$  (41) 
for some $a,b\in {\mathbb{R}}^{d}$. According to the diffusion forward process, the distribution at time $t$ will be a mixture of Gaussians^{28}^{28}28Linearity of the forward process (with respect to ${p}_{0}$) was important here. That is, roughly speaking, diffusing a distribution is equivalent to diffusing each individual point in that distribution independently; the points don’t interact. :
$${p}_{t}:=\frac{1}{2}\mathcal{N}(a,{\sigma}_{t}^{2})+\frac{1}{2}\mathcal{N}(b,{\sigma}_{t}^{2}).$$  (42) 
We want to show that with these distributions ${p}_{t}$, the DDIM velocity field ${v}_{t}$ (of Equation 38) transports ${p}_{t}\stackrel{{v}_{t}}{\u27f6}{p}_{t\mathrm{\Delta}t}$.
Let us first try to construct some velocity field ${v}_{t}^{\ast}$ such that ${p}_{t}\stackrel{{v}_{t}^{\ast}}{\u27f6}{p}_{t\mathrm{\Delta}t}$. From our result in Section 3.1 — the fact that DDIM update works for single points — we already know velocity fields which transport each mixture component $\{a,b\}$ individually. That is, we know the velocity field ${v}_{t}^{\left[a\right]}$ defined as
$${v}_{t}^{\left[a\right]}\left({x}_{t}\right):=\lambda \underset{{x}_{0}\sim {\delta}_{a}}{\mathbb{E}}\left[{x}_{t\mathrm{\Delta}t}{x}_{t}\mid {x}_{t}\right]$$  (43) 
transports^{29}^{29}29Pay careful attention to which distributions we take expectations over! The expectation in Equation (43) is w.r.t. the singlepoint distribution ${\delta}_{a}$, but our definition of the DDIM algorithm, and its vector field in Equation (38), are always w.r.t. the target distribution. In our case, the target distribution is ${p}_{0}$ of Equation (41).
$$\mathcal{N}(a,{\sigma}_{t}^{2})\stackrel{{v}_{t}^{\left[a\right]}}{\u27f6}\mathcal{N}(a,{\sigma}_{t\mathrm{\Delta}t}^{2}),$$  (44) 
and similarly for ${v}_{t}^{\left[b\right]}$.
We now want some way of combining these two velocity fields into a single velocity ${v}_{t}^{\ast}$, which transports the mixture:
$$\underset{{p}_{t}}{\underset{\u23df}{\left(\frac{1}{2}\mathcal{N}(a,{\sigma}_{t}^{2})+\frac{1}{2}\mathcal{N}(b,{\sigma}_{t}^{2})\right)}}\stackrel{{v}_{t}^{\ast}}{\u27f6}\underset{{p}_{t\mathrm{\Delta}t}}{\underset{\u23df}{\left(\frac{1}{2}\mathcal{N}(a,{\sigma}_{t\mathrm{\Delta}t}^{2})+\frac{1}{2}\mathcal{N}(b,{\sigma}_{t\mathrm{\Delta}t}^{2})\right)}}$$  (45) 
We may be tempted to just take the average velocity field $\left({v}_{t}^{\ast}=0.5{v}_{t}^{\left[a\right]}+0.5{v}_{t}^{\left[b\right]}\right)$, but this is incorrect. The correct combined velocity ${v}_{t}^{\ast}$ is a weightedaverage of the individual velocity fields, weighted by their corresponding density fields^{30}^{30}30Note that we can write the density $\mathcal{N}({x}_{t};a,{\sigma}_{t}^{2})$ as $p({x}_{t}\mid {x}_{0}=a)$..
${v}_{t}^{\ast}\left({x}_{t}\right)$  $={\displaystyle \frac{{v}_{t}^{\left[a\right]}\left({x}_{t}\right)\cdot p\left({x}_{t}\mid {x}_{0}=a\right)+{v}_{t}^{\left[b\right]}\left({x}_{t}\right)\cdot p\left({x}_{t}\mid {x}_{0}=b\right)}{p\left({x}_{t}\mid {x}_{0}=a\right)+p\left({x}_{t}\mid {x}_{0}=b\right)}}$  (46)  
$={v}_{t}^{\left[a\right]}\left({x}_{t}\right)\cdot p\left({x}_{0}=a\mid {x}_{t}\right)+{v}_{t}^{\left[b\right]}\left({x}_{t}\right)\cdot p\left({x}_{0}=b\mid {x}_{t}\right).$  (47) 
Explicitly, the weight for ${v}_{t}^{\left[a\right]}$ at a point ${x}_{t}$ is the probability that ${x}_{t}$ was generated from initial point ${x}_{0}=a$, rather than ${x}_{0}=b$.
To be intuitively convinced of this^{31}^{31}31The time step must be small enough for this analogy to hold, so the DDIM updates are essentially infinitesimal steps. Otherwise, if the step size is large, it may not be possible to combine the two transport maps with “local” (i.e. pointwise) operations alone., consider the corresponding question about gasses illustrated in Figure 3. Suppose we have two overlapping gases, a red gas with density $\mathcal{N}(a,{\sigma}^{2})$ and velocity ${v}_{t}^{\left[a\right]}$, and a blue gas with density $\mathcal{N}(b,{\sigma}^{2})$ and velocity ${v}_{t}^{\left[b\right]}$. We want to know, what is the effective velocity of the combined gas (as if we saw only in grayscale)? We should clearly take a weightedaverage of the individual gas velocities, weighted by their respective densities — just as in Equation (47).
We have now solved the main subproblem of this section: we have found one particular vector field ${v}_{t}^{\ast}$ which transports ${p}_{t}$ to ${p}_{t\mathrm{\Delta}t}$, for our twopoint distribution ${p}_{0}$. It remains to show that this ${v}_{t}^{\ast}$ is equivalent to the velocity field of Algorithm 2 (${v}_{t}$ from Equation 38).
To show this, first notice that the individual vector field ${v}_{t}^{\left[a\right]}$ can be written as a conditional expectation. Using the definition in Equation (43)^{32}^{32}32We add conditioning ${x}_{0}=a$, because we want to take expectations w.r.t the twopoint mixture distribution, not the singlepoint distribution.,
${v}_{t}^{\left[a\right]}\left({x}_{t}\right)$  $=\lambda \underset{{x}_{0}\sim {\delta}_{a}}{\mathbb{E}}\left[{x}_{t\mathrm{\Delta}t}{x}_{t}\mid {x}_{t}\right]$  (48)  
$=\lambda \underset{{x}_{0}\sim 1/2{\delta}_{a}+1/2{\delta}_{b}}{\mathbb{E}}\left[{x}_{t\mathrm{\Delta}t}{x}_{t}\mid {x}_{0}=a,{x}_{t}\right].$  (49) 
Now the entire vector field ${v}_{t}^{\ast}$ can be written as a conditional expectation:
${v}_{t}^{\ast}\left({x}_{t}\right)$  $={v}_{t}^{\left[a\right]}\left({x}_{t}\right)\cdot p\left({x}_{0}=a\mid {x}_{t}\right)+{v}_{t}^{\left[b\right]}\left({x}_{t}\right)\cdot p\left({x}_{0}=b\mid {x}_{t}\right)$  (50)  
$=\lambda \mathbb{E}\left[{x}_{t\mathrm{\Delta}t}{x}_{t}\mid {x}_{0}=a,{x}_{t}\right]\cdot p\left({x}_{0}=a\mid {x}_{t}\right)$  (51)  
$+\lambda \mathbb{E}\left[{x}_{t\mathrm{\Delta}t}{x}_{t}\mid {x}_{0}=b,{x}_{t}\right]\cdot p\left({x}_{0}=b\mid {x}_{t}\right)$  (52)  
$=\lambda \mathbb{E}\left[{x}_{t\mathrm{\Delta}t}{x}_{t}\mid {x}_{t}\right]$  (53)  
$={v}_{t}\left({x}_{t}\right)$  (from Equation 38) 
where all expectations are w.r.t. the distribution ${x}_{0}\sim 1/2{\delta}_{a}+1/2{\delta}_{b}$. Thus, the combined velocity field ${v}_{t}^{\ast}$ is exactly the velocity field ${v}_{t}$ given by the updates of Algorithm 2 — so Algorithm 2 is a correct reverse sampler for our twopoint mixture distribution.
3.4 Case 3: Arbitrary Distributions
Now that we know how to handle two points, we can generalize this idea to arbitrary distributions of ${x}_{0}$. We will not go into details here, because the general proof will be subsumed by the subsequent section.
It turns out that our overall proof strategy for Algorithm 2 can be generalized significantly to other types of diffusions, without much work. This yields the idea of flow matching, which we will see in the following section. Once we develop the machinery of flows, it is actually straightforward to derive DDIM directly from the simple singlepoint scaling algorithm of Equation (35): see Appendix B.5.
3.4.1 The Probability Flow ODE [Optional]
Finally, we generalize our discretetime deterministic sampler to an ordinary differential equation (ODE) called the probability flow ODE (Song et al., 2020). The following section builds on our discussion of SDEs as the continuous limit of diffusion in section 2.4. Just as the reversetime SDEs of section 2.4 offered a flexible continuoustime generalization of discrete stochastic samplers, so we will see that discrete deterministic samplers generalize to ODEs. The ODE formulation offers both a useful theoretical lens through which to view diffusion, as well as practical advantages, like the opportunity to choose from a variety of offtheshelf and custom ODE solvers to improve sampling (like the popular DPM++ method, as discussed in chapter 5).
Recall the general SDE (26) from section 2.4:
$$dx=f(x,t)dt+g\left(t\right)dw.$$ 
Song et al. (2020) showed that is possible to convert this SDE into a deterministic equivalent called the probability flow ODE (PFODE): ^{33}^{33}33A proof sketch is in appendix B.2. It involves rewriting the SDE noise term as the deterministic score (recall the connection between noise and score in equation (18)). Although it is deterministic, the score is unknown since it depends on ${p}_{t}$.
$\frac{dx}{dt}$  $=\stackrel{~}{f}(x,t),\text{where}\stackrel{~}{f}(x,t)=f(x,t){\displaystyle \frac{1}{2}}g{\left(t\right)}^{2}{\nabla}_{x}\mathrm{log}{p}_{t}\left(x\right)$  (54) 
SDE (26) and ODE (54) are equivalent in the sense that trajectories obtained by solving the PFODE have the same marginal distributions as the SDE trajectories at every point in time^{34}^{34}34To use a gas analogy: the SDE describes the (Brownian) motion of individual particles in a gas, while the PFODE describes the streamlines of the gas’s velocity field. That is, the PFODE describes the motion of a “test particle” being transported by the gas— like a feather in the wind.. However, note that the score appears here again, as it did in the reverse SDE (27); just as for the reverse SDE, we must learn the score to make the ODE (54) practically useful.
Just as DDPM was a (discretized) specialcase of the reversetime SDE (27), so DDIM can be seen as a (discretized) special case of the PFODE (54). Recall from section 2.4 that the simple diffusion we have been studying corresponds to the SDE (25) with $f=0$ and $g={\sigma}_{q}$. The corresponding ODE is
$\frac{dx}{dt}$  $={\displaystyle \frac{1}{2}}{\sigma}_{q}^{2}{\nabla}_{x}\mathrm{log}{p}_{t}\left(x\right)$  (55)  
$={\displaystyle \frac{1}{2t}}\mathbb{E}\left[{x}_{0}{x}_{t}\mid {x}_{t}\right]\phantom{\rule{1.44em}{0ex}}\text{(by eq.}\text{28}\text{)}$  (56) 
Reversing and discretizing yields
${x}_{t\mathrm{\Delta}t}$  $={x}_{t}+{\displaystyle \frac{\mathrm{\Delta}t}{2t}}\mathbb{E}\left[{x}_{0}{x}_{t}\mid {x}_{t}\right]$  
$={x}_{t}+{\displaystyle \frac{1}{2}}\left(\mathbb{E}\left[{x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right]{x}_{t}\right)\phantom{\rule{1.44em}{0ex}}\text{(by eq.}\text{23}\text{).}$ 
Noting that ${lim}_{\mathrm{\Delta}t\to 0}\left(\frac{{\sigma}_{t}}{{\sigma}_{t\mathrm{\Delta}t}+{\sigma}_{t}}\right)=\frac{1}{2}$, we recover the deterministic (DDIM) sampler (31).
3.5 Discussion: DDPM vs DDIM
The two reverse samplers defined above (DDPM and DDIM) are conceptually significantly different: one is deterministic, and the other stochastic. To review, these samplers use the following strategies:

1.
DDPM ideally implements a stochastic map ${F}_{t}$, such that the output ${F}_{t}\left({x}_{t}\right)$ is, pointwise, a sample from the conditional distribution $p\left({x}_{t\mathrm{\Delta}t}\mid {x}_{t}\right)$.

2.
DDIM ideally implements a deterministic map ${F}_{t}$, such that the output ${F}_{t}\left({x}_{t}\right)$ is marginally distributed as ${p}_{t\mathrm{\Delta}t}$. That is, ${F}_{t}\mathrm{\u266f}{p}_{t}={p}_{t\mathrm{\Delta}t}$.
Although they both happen to take steps in the same direction^{35}^{35}35Steps proportional to