Flow Matching and Diffusion Models
Flow and Diffusion Models
Flow Models
Ordinary differential equations (ODEs): Every ODE defines a vector field \(u\), i.e., for every time \(t\) and position \(x\), we have a vector \(u_t(x)\) specifying a velocity in space. The solution of this ODE is a trajectory \(X = (X_t)_{0\le t\le 1}\) that follows the vector field \(u\):
\[ X_0 = x_0, \qquad \frac{d}{dt}X_t = u_t(X_t). \]
A flow is a collection of maps \(\psi_t: \sR^d \rightarrow \sR^d\) that send the initial position \(x_0\) to the position \(x_t = \psi_t(x_0)\) at time \(t\), following the given vector field \(u\):
\[ \psi_0(x_0) = x_0, \qquad \frac{d}{dt}\psi_t(x_0) = u_t(\psi_t(x_0)). \]
Thus, we can say that a vector field defines an ODE whose solution is a flow.
Theorem 3: The flow exists and is unique if the vector field \(u\) is Lipschitz continuous.
Simulating an ODE
In general, it is not possible to compute the flow \(\psi_t\) exactly. We can approximate it using the Euler method: given a time step \(h\), we update
\[ X_{t+h} = X_t + h\,u_t(X_t). \]
We can also use more sophisticated methods like Heun's method:
\[ X'_{t+h} = X_t + h\,u_t(X_t), \qquad X_{t+h} = X_t + \frac{h}{2}\big(u_t(X_t) + u_{t+h}(X'_{t+h})\big). \]
Intuitively, Heun's method first guesses the next position using an Euler step, then takes the actual step using the average of the vector field at the current position and at the guessed next position. A comparison of both integrators is sketched below.
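To make the two integrators concrete, here is a minimal NumPy sketch under an assumed toy vector field \(u_t(x) = -x\) (all names are illustrative):

```python
import numpy as np

def euler_step(u, t, x, h):
    # X_{t+h} = X_t + h * u_t(X_t)
    return x + h * u(t, x)

def heun_step(u, t, x, h):
    # First guess the next position with an Euler step, then step with
    # the average of the vector field at the two endpoints.
    x_guess = x + h * u(t, x)
    return x + 0.5 * h * (u(t, x) + u(t + h, x_guess))

def simulate_ode(u, x0, n_steps=100, step=euler_step):
    # Integrate d/dt X_t = u_t(X_t) from t = 0 to t = 1.
    x, h = np.asarray(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        x = step(u, i * h, x, h)
    return x

# Toy vector field u_t(x) = -x: the exact solution is X_1 = exp(-1) * X_0.
u = lambda t, x: -x
print(simulate_ode(u, [1.0, 1.0]))                  # Euler approximation
print(simulate_ode(u, [1.0, 1.0], step=heun_step))  # Heun is more accurate
```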
Flow models
We can construct a generative model that converts a simple distribution \(p_{\mathrm{init}}\) into a complex distribution \(p_{\mathrm{data}}\) via an ODE; this is called a flow model, defined by the following equation:
\[ X_0 \sim p_{\mathrm{init}}, \qquad \frac{d}{dt}X_t = u_t^\theta(X_t). \]
Here \(u_t^\theta: \sR^d \times [0, 1] \rightarrow \sR^d\) is a neural network. Our goal is to make the endpoint \(X_1\) of the trajectory have a distribution close to \(p_{\mathrm{data}}\):
\[ X_1 \sim p_{\mathrm{data}}. \]
NOTE: Although this is called a flow model, the network itself parameterizes the vector field, not the flow.
Algorithm 1: Sampling from a flow model via Euler's method
- Sample \(X_0 \sim p_{\mathrm{init}}\), set \(t \leftarrow 0\) and step size \(h = 1/n\)
- For \(i = 0, \ldots, n-1\) do
- \(X_{t+h} = X_t + h\, u_t^\theta(X_t)\)
- Update \(t \leftarrow t+h\)
- Return \(X_1\)
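A minimal PyTorch sketch of Algorithm 1; the network interface `model(x, t)` returning \(u_t^\theta(x)\) is an assumption of this sketch:

```python
import torch

@torch.no_grad()
def sample_flow(model, n_samples, dim, n_steps=100):
    # X_0 ~ p_init; here p_init = N(0, I_d).
    x = torch.randn(n_samples, dim)
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((n_samples, 1), i * h)
        x = x + h * model(x, t)  # Euler step: X_{t+h} = X_t + h * u_t^theta(X_t)
    return x  # approximately distributed as p_data after training
```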
Diffusion Models
Brownian motion. SDEs are built on Brownian motion, defined as \(W = (W_t)_{0\leq t \leq 1}\), a stochastic process with continuous sample paths satisfying the following properties:
- \(W_0 = 0\) and \(W_t - W_s \sim \gN(0, (t-s)I_d)\) for all \(0 \leq s < t \leq 1\) (i.e., the variance of the increments grows linearly with time).
- Independent increments: for any \(0 \leq t_0 < t_1 < \cdots < t_n \leq 1\), the increments \(W_{t_1} - W_{t_0}, \ldots, W_{t_n} - W_{t_{n-1}}\) are mutually independent.
We can easily simulate a Brownian motion approximately with step size \(h\) by setting \(W_0 = 0\) and updating:
\[ W_{t+h} = W_t + \sqrt{h}\,\epsilon_t, \qquad \epsilon_t \sim \gN(0, I_d). \]
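A short NumPy sketch of this approximate simulation (the step count is an arbitrary choice):

```python
import numpy as np

def simulate_brownian(d, n_steps=1000, seed=0):
    # W_0 = 0; W_{t+h} = W_t + sqrt(h) * eps_t, eps_t ~ N(0, I_d).
    rng = np.random.default_rng(seed)
    h = 1.0 / n_steps
    increments = rng.normal(scale=np.sqrt(h), size=(n_steps, d))
    return np.vstack([np.zeros(d), np.cumsum(increments, axis=0)])

W = simulate_brownian(d=2)  # W[k] approximates W_{k/1000}
```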
From ODEs to SDEs. The idea of an SDE is to extend the deterministic dynamics of an ODE with Brownian motion. Since the resulting trajectories are stochastic and not differentiable in time, we can no longer take derivatives; instead, we rewrite the ODE via infinitesimal updates:
\[ X_{t+h} = X_t + h\,u_t(X_t) + h\,R_t(h), \]
where \(R_t(h)\) is an error term with \(\lim_{h \rightarrow 0} R_t(h) = 0\). For SDEs, we add a contribution from Brownian motion:
\[ X_{t+h} = X_t + h\,u_t(X_t) + \sigma_t\,(W_{t+h} - W_t) + h\,R_t(h), \]
where \(\sigma_t\geq 0\) is the diffusion coefficient. The above describes a stochastic differential equation (SDE), and it is common to write it in the form:
\[ dX_t = u_t(X_t)\,dt + \sigma_t\,dW_t. \]
For SDEs, there is no flow \(\psi_t\) anymore, as \(X_t\) is no longer fully determined by \(X_0\). Note that every ODE is a special case of an SDE with \(\sigma_t = 0\).
Simulating an SDE. We can simulate an SDE using the Euler-Maruyama method. Given a time step \(h\), we can approximate the SDE as follows:
\[ X_{t+h} = X_t + h\,u_t(X_t) + \sqrt{h}\,\sigma_t\,\epsilon_t, \qquad \epsilon_t \sim \gN(0, I_d). \]
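A minimal NumPy sketch of the Euler-Maruyama scheme, using an assumed Ornstein-Uhlenbeck-style toy SDE:

```python
import numpy as np

def euler_maruyama(u, sigma, x0, n_steps=1000, seed=0):
    # X_{t+h} = X_t + h * u_t(X_t) + sqrt(h) * sigma_t * eps_t
    rng = np.random.default_rng(seed)
    x, h = np.asarray(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        x = x + h * u(t, x) + np.sqrt(h) * sigma(t) * rng.normal(size=x.shape)
    return x

# Toy SDE dX_t = -X_t dt + dW_t (constant diffusion coefficient).
print(euler_maruyama(lambda t, x: -x, lambda t: 1.0, x0=[1.0, 1.0]))
```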
Diffusion models. We can construct a generative model that converts a simple distribution \(p_{\mathrm{init}}\) into a complex distribution \(p_{\mathrm{data}}\) via an SDE; this is called a diffusion model, defined by the following equation:
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = u_t^\theta(X_t)\,dt + \sigma_t\,dW_t. \]
Constructing the Training Target
We have constructed flow and diffusion models where we obtain trajectories \((X_t)_{0\leq t \leq 1}\) by simulating the ODE/SDE:
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = u_t^\theta(X_t)\,dt + \sigma_t\,dW_t \quad (\sigma_t = 0 \text{ for a flow model}). \]
To train these models, we need to find a training target \(u_t^\text{target}\) for \(u_t^\theta\) to approximate.
Conditional and Marginal Probability Path
The probability path can be considered as a gradual interpolation between noise \(p_{\mathrm{init}}\) and data \(p_{\mathrm{data}}\). Let us consider a data point \(z\in \sR^d\), represented by the Dirac delta "distribution" \(\delta_z\): sampling from \(\delta_z\) is deterministic and always gives \(z\).
A conditional (interpolating) probability path is a set of distributions \(p_t(x|z)\) over \(\sR^d\) such that:
\[ p_0(\cdot|z) = p_{\mathrm{init}} \quad \text{and} \quad p_1(\cdot|z) = \delta_z \quad \text{for all } z\in\sR^d. \]
That is to say, a conditional probability path gradually converts a simple distribution \(p_{\mathrm{init}}\) into a single data point \(z\).
Every conditional probability path induces a marginal probability path \(p_t(x)\):
\[ p_t(x) = \int p_t(x|z)\,p_{\mathrm{data}}(z)\,dz. \]
NOTE: Here we know how to sample from \(p_t(x)\) (draw \(z \sim p_{\mathrm{data}}\), then \(x \sim p_t(\cdot|z)\)), but we don't know the density value of \(p_t(x)\), as the integral is intractable.
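Sampling from \(p_t\) is straightforward even though its density is unknown; a minimal sketch, assuming the Gaussian conditional path \(p_t(\cdot|z)=\gN(\alpha_t z, \beta_t^2 I_d)\) with schedules \(\alpha_t = t\), \(\beta_t = 1-t\) introduced below, and a placeholder dataset standing in for \(p_{\mathrm{data}}\):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, size=(10_000, 2))  # placeholder samples from p_data

def sample_marginal(t, n):
    # z ~ p_data, then x ~ p_t(.|z) = N(alpha_t * z, beta_t^2 * I_d)
    z = data[rng.integers(len(data), size=n)]
    eps = rng.normal(size=z.shape)
    return t * z + (1.0 - t) * eps  # alpha_t = t, beta_t = 1 - t

x = sample_marginal(t=0.5, n=5)  # samples from p_0.5; its density stays unknown
```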
Conditional and Marginal Vector Fields
We now consider the problem of constructing a training target \(u_t^\text{target}\) for a flow model.
Continuity Equation
The divergence of a vector field \(v_t\) is defined as:
\[ \mathrm{div}(v_t)(x) = \sum_{i=1}^d \frac{\partial}{\partial x_i}\,(v_t)_i(x). \]
Intuitively, the divergence is a scalar field that measures the outward flux of a vector field per unit volume at each point. The continuity equation states that the rate of change of the density along the probability path equals the negative divergence of the vector field \(u_t^\text{target}\) scaled by the probability density:
\[ \partial_t p_t(x) = -\,\mathrm{div}\big(p_t\,u_t^{\text{target}}\big)(x). \]
Marginalization trick
Based on the continuity equation, we can derive the marginal vector field \(u_t^\text{target}\) as follows:
\[ \partial_t p_t(x) = \int \partial_t p_t(x|z)\,p_{\mathrm{data}}(z)\,dz = -\int \mathrm{div}\big(p_t(\cdot|z)\,u_t^{\text{target}}(\cdot|z)\big)(x)\,p_{\mathrm{data}}(z)\,dz = -\,\mathrm{div}\Big(p_t \int u_t^{\text{target}}(\cdot|z)\,\frac{p_t(\cdot|z)\,p_{\mathrm{data}}(z)}{p_t}\,dz\Big)(x). \]
Thus we have:
\[ u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z)\,\frac{p_t(x|z)\,p_{\mathrm{data}}(z)}{p_t(x)}\,dz. \]
Theorem: Marginalization trick
For every data point \(z\in \sR^d\), let \(u_t^\text{target}(\cdot|z)\) denote a conditional vector field, defined so that the corresponding ODE yields the conditional probability path \(p_t(\cdot|z)\), viz:
\[ X_0 \sim p_{\mathrm{init}}, \quad \frac{d}{dt}X_t = u_t^{\text{target}}(X_t|z) \;\Rightarrow\; X_t \sim p_t(\cdot|z) \quad (0 \le t \le 1). \]
Then, for the marginal probability path to satisfy:
\[ X_0 \sim p_{\mathrm{init}}, \quad \frac{d}{dt}X_t = u_t^{\text{target}}(X_t) \;\Rightarrow\; X_t \sim p_t \quad (0 \le t \le 1), \]
the marginal vector field \(u_t^\text{target}(x)\) should be defined as:
\[ u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z)\,\frac{p_t(x|z)\,p_{\mathrm{data}}(z)}{p_t(x)}\,dz. \]
Importance of the marginalization trick: This allows us to construct the marginal vector field from a conditional vector field. This simplifies the problem, as we can often find in closed form a conditional vector field \(u_t^\text{target}(\cdot|z)\) whose ODE trajectories satisfy \(X_t \sim p_t(\cdot|z)\) for each data point \(z\). Here is an example:
Target ODE for Gaussian probability path: Let \(p_t(\cdot|z) = \gN(\alpha_t z, \beta_t^2 I_d)\) for noise schedulers \(\alpha_t, \beta_t\). Here \(\alpha_t, \beta_t\) are two continuously differentiable, monotonic functions with \(\alpha_0=\beta_1=0\) and \(\alpha_1=\beta_0=1\). Let \(\dot{\alpha}_t = \partial_t\alpha_t\) and \(\dot{\beta}_t = \partial_t\beta_t\). Then the conditional Gaussian vector field, given by:
\[ u_t^{\text{target}}(x|z) = \Big(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\,\alpha_t\Big) z + \frac{\dot{\beta}_t}{\beta_t}\,x, \]
is a valid conditional vector field, with its ODE trajectories \(X_t\) satisfying \(X_t \sim p_t(\cdot|z) = \gN(\alpha_t z, \beta_t^2 I_d)\) if \(X_0\sim \gN(0, I_d)\).
Let us construct a conditional flow model \(\psi_t^{\text{target}}(x|z)\) first by defining:
\[ \psi_t^{\text{target}}(x|z) = \alpha_t z + \beta_t x. \]
We can then extract the conditional vector field \(u_t^{\text{target}}(x|z)\) from \(\psi_t^{\text{target}}(x|z)\) via the defining relation:
\[ \frac{d}{dt}\psi_t^{\text{target}}(x|z) = u_t^{\text{target}}\big(\psi_t^{\text{target}}(x|z)\,\big|\,z\big). \]
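Carrying out this extraction explicitly (a short worked derivation): differentiating the flow gives
\[ \frac{d}{dt}\psi_t^{\text{target}}(x|z) = \dot{\alpha}_t z + \dot{\beta}_t x, \]
and inverting the flow gives \(x = \frac{x' - \alpha_t z}{\beta_t}\) for \(x' = \psi_t^{\text{target}}(x|z)\). Substituting,
\[ u_t^{\text{target}}(x'|z) = \dot{\alpha}_t z + \dot{\beta}_t\,\frac{x' - \alpha_t z}{\beta_t} = \Big(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\,\alpha_t\Big) z + \frac{\dot{\beta}_t}{\beta_t}\,x', \]
which recovers the conditional Gaussian vector field above.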
Conditional and Marginal Score Functions
The marginal score function \(\nabla \log p_t(x)\) is defined as the gradient of the log-likelihood of the marginal probability path \(p_t(x)\). We can express the marginal score function via the conditional score function \(\nabla \log p_t(x|z)\):
\[ \nabla \log p_t(x) = \frac{\nabla p_t(x)}{p_t(x)} = \frac{\int \nabla p_t(x|z)\,p_{\mathrm{data}}(z)\,dz}{p_t(x)} = \int \nabla \log p_t(x|z)\,\frac{p_t(x|z)\,p_{\mathrm{data}}(z)}{p_t(x)}\,dz, \]
and the conditional score function \(\nabla \log p_t(x|z)\) can often be derived analytically. For instance, for the Gaussian path \(p_t(x|z) = \gN(\alpha_t z, \beta_t^2 I_d)\), we have:
\[ \nabla \log p_t(x|z) = -\frac{x - \alpha_t z}{\beta_t^2}. \]
Fokker-Planck equation
Let us first define the Laplacian operator \(\Delta\) via:
\[ \Delta f(x) = \sum_{i=1}^d \frac{\partial^2}{\partial x_i^2} f(x) = \mathrm{div}(\nabla f)(x). \]
Let \(p_t\) be a probability path and let us consider the SDE:
\[ X_0 \sim p_0, \qquad dX_t = u_t(X_t)\,dt + \sigma_t\,dW_t. \]
Then \(X_t\) has distribution \(p_t\) for all \(0\leq t \leq 1\) if and only if the Fokker-Planck equation holds:
\[ \partial_t p_t(x) = -\,\mathrm{div}(p_t u_t)(x) + \frac{\sigma_t^2}{2}\,\Delta p_t(x). \]
Marginalization trick for SDEs
Based on the Fokker-Planck equation and the identity \(\nabla p_t = p_t \nabla\log p_t\) (so that \(\Delta p_t = \mathrm{div}(p_t \nabla\log p_t)\)), we can have:
\[ \partial_t p_t(x) = -\,\mathrm{div}\big(p_t u_t^{\text{target}}\big)(x) = -\,\mathrm{div}\Big(p_t\Big[u_t^{\text{target}} + \frac{\sigma_t^2}{2}\nabla\log p_t\Big]\Big)(x) + \frac{\sigma_t^2}{2}\,\Delta p_t(x). \]
SDE extension trick
Based on the proof above, we can construct an SDE as follows:
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = \Big[u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2}\,\nabla\log p_t(X_t)\Big]dt + \sigma_t\,dW_t. \]
Here \(X_t \sim p_t\) for all \(t\), and in particular \(X_1\sim p_{\text{data}}\) for this SDE. The same identity holds if we replace the marginal probability path \(p_t(x)\) and vector field \(u_t^{\text{target}}(x)\) with the conditional probability path \(p_t(x|z)\) and vector field \(u_t^{\text{target}}(x|z)\), respectively.
Langevin dynamics
The above construction has a special case when the probability path is static, i.e., \(p_t=p\) for a fixed distribution \(p\). In this case, the corresponding \(u_t^{\text{target}}=0\), and we have:
\[ dX_t = \frac{\sigma_t^2}{2}\,\nabla\log p(X_t)\,dt + \sigma_t\,dW_t. \]
This is commonly known as Langevin dynamics. The fact that \(p_t\) is static implies that \(\partial_t p_t(x) = 0\), and one checks that this equation satisfies the Fokker-Planck equation. Thus, we have:
\[ X_0 \sim p \;\Rightarrow\; X_t \sim p \quad \text{for all } t, \]
i.e., \(p\) is a stationary distribution of these dynamics. Moreover, these dynamics converge to the stationary distribution \(p\) under rather general conditions: if we instead take \(X_0\sim p^\prime \neq p\), then under mild conditions the distribution \(p_t\) of \(X_t\) converges to \(p\). That is to say, if we know the score function \(\nabla \log p(x)\), then even if we start from a different distribution \(p^\prime\), we will eventually converge to the target distribution \(p\) by following the SDE.
This makes Langevin dynamics extremely useful; it accordingly serves as the basis for, e.g., molecular dynamics simulations and many other Markov chain Monte Carlo (MCMC) methods. For instance, in score-based generative models, we learn the score function \(\nabla \log p(x)\) from data. Then, we can still use Langevin dynamics to sample from the learned distribution \(p(x)\) even if we start from random noise, as in the sketch below.
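A minimal NumPy sketch of Langevin dynamics targeting \(p = \gN(2, 1)\) in 1D, whose score \(\nabla\log p(x) = -(x-2)\) is known in closed form (all constants are arbitrary choices):

```python
import numpy as np

def langevin(score, x0, n_steps=5000, h=1e-3, sigma=np.sqrt(2.0), seed=0):
    # Euler-Maruyama on dX_t = (sigma^2 / 2) * score(X_t) dt + sigma dW_t
    rng = np.random.default_rng(seed)
    x = float(x0)
    for _ in range(n_steps):
        x += h * 0.5 * sigma**2 * score(x) + np.sqrt(h) * sigma * rng.normal()
    return x

# Start far from the target p = N(2, 1); the chain still converges to p.
score = lambda x: -(x - 2.0)
samples = np.array([langevin(score, x0=-10.0, seed=s) for s in range(500)])
print(samples.mean(), samples.std())  # roughly 2 and 1
```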
Summary - Derivation of the Training Target
The flow training target is the marginal vector field \(u_t^\text{target}\). To construct it, we choose a conditional probability path \(p_t(x|z)\) that fulfills \(p_0(\cdot|z) = p_{\text{init}}\) and \(p_1(\cdot|z) = \delta_z\) for all \(z\in \sR^d\). Next, we find a conditional vector field \(u_t^\text{target}(x|z)\) such that its corresponding flow \(\psi_t^{\text{target}}(x|z)\) satisfies
\[ X_0 \sim p_{\mathrm{init}} \;\Rightarrow\; X_t := \psi_t^{\text{target}}(X_0|z) \sim p_t(\cdot|z) \quad (0\le t\le 1), \]
or equivalently, that \(u_t^\text{target}(\cdot|z)\) satisfies the continuity equation for \(p_t(\cdot|z)\). Then, the marginal vector field, defined by:
\[ u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z)\,\frac{p_t(x|z)\,p_{\mathrm{data}}(z)}{p_t(x)}\,dz, \]
follows the marginal probability path, i.e.,
\[ X_0 \sim p_{\mathrm{init}}, \quad \frac{d}{dt}X_t = u_t^{\text{target}}(X_t) \;\Rightarrow\; X_t \sim p_t \quad (0\le t\le 1). \]
In particular, \(X_1\sim p_\text{data}\) for this ODE, so that \(u_t^\text{target}\) converts noise into data.
SDE extension: For a time-dependent diffusion coefficient \(\sigma_t\geq 0\), we can extend the above ODE to an SDE with the same marginal probability path:
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = \Big[u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2}\,\nabla\log p_t(X_t)\Big]dt + \sigma_t\,dW_t \;\Rightarrow\; X_t \sim p_t \quad (0\le t\le 1), \]
where \(\nabla\log p_t(x)\) is the marginal score function:
\[ \nabla\log p_t(x) = \int \nabla\log p_t(x|z)\,\frac{p_t(x|z)\,p_{\mathrm{data}}(z)}{p_t(x)}\,dz. \]
An important example is the Gaussian probability path, yielding the formulas:
\[ p_t(x|z) = \gN(\alpha_t z, \beta_t^2 I_d), \qquad u_t^{\text{target}}(x|z) = \Big(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\,\alpha_t\Big) z + \frac{\dot{\beta}_t}{\beta_t}\,x, \qquad \nabla\log p_t(x|z) = -\frac{x - \alpha_t z}{\beta_t^2}, \]
for noise schedulers \(\alpha_t, \beta_t\in \gR\): continuously differentiable, monotonic functions such that \(\alpha_0=\beta_1=0\) and \(\alpha_1=\beta_0=1\).
Training the Generative Model
Flow Matching
Let us consider a flow model given by:
\[ X_0 \sim p_{\mathrm{init}}, \qquad \frac{d}{dt}X_t = u_t^\theta(X_t). \]
An intuitive way of training this model is to use a mean squared error (MSE), i.e., the flow matching loss defined as:
\[ \gL_{\text{FM}}(\theta) = \mathbb{E}_{t\sim \text{Unif}[0,1],\, x\sim p_t}\Big[\big\|u_t^\theta(x) - u_t^{\text{target}}(x)\big\|^2\Big]. \]
Here \(\text{Unif}\) denotes the uniform distribution over \([0, 1]\), and \(p_t(x)=\int p_t(x|z)p_{\text{data}}(z)dz\) is the marginal probability path. While we do know:
\[ u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z)\,\frac{p_t(x|z)\,p_{\text{data}}(z)}{p_t(x)}\,dz, \]
we cannot compute it efficiently, as the above integral is intractable. Thus, we instead look to the conditional vector field \(u_t^\text{target}(x|z)\), which is tractable. To do so, let us define the conditional flow matching loss:
\[ \gL_{\text{CFM}}(\theta) = \mathbb{E}_{t\sim \text{Unif}[0,1],\, z\sim p_{\text{data}},\, x\sim p_t(\cdot|z)}\Big[\big\|u_t^\theta(x) - u_t^{\text{target}}(x|z)\big\|^2\Big]. \]
However, don't forget that we aim to regress the marginal vector field. The good news is that by explicitly regressing against the tractable, conditional vector field, we are implicitly regressing against the intractable, marginal vector field. We can prove it by expanding the square:
\[ \gL_{\text{FM}}(\theta) = \mathbb{E}_{t,\,x\sim p_t}\big[\|u_t^\theta(x)\|^2\big] - 2\,\mathbb{E}_{t,\,x\sim p_t}\big[u_t^\theta(x)^\top u_t^{\text{target}}(x)\big] + \underbrace{\mathbb{E}_{t,\,x\sim p_t}\big[\|u_t^{\text{target}}(x)\|^2\big]}_{=:C_1}. \]
We can re-express the second term, using the marginalization trick, as:
\[ \mathbb{E}_{t,\,x\sim p_t}\big[u_t^\theta(x)^\top u_t^{\text{target}}(x)\big] = \mathbb{E}_{t,\,z\sim p_{\text{data}},\,x\sim p_t(\cdot|z)}\big[u_t^\theta(x)^\top u_t^{\text{target}}(x|z)\big]. \]
Then, we have:
\[ \gL_{\text{FM}}(\theta) = \gL_{\text{CFM}}(\theta) + C_1 - \mathbb{E}_{t,\,z,\,x\sim p_t(\cdot|z)}\big[\|u_t^{\text{target}}(x|z)\|^2\big] = \gL_{\text{CFM}}(\theta) + C_2, \]
where \(C_2\) does not depend on \(\theta\). Thus, the flow matching loss \(\gL_{\text{FM}}\) is equivalent to the conditional flow matching loss \(\gL_{\text{CFM}}\) up to a constant \(C_2\); in particular, their gradients and minimizers coincide. Once \(u_t^\theta\) is trained, we may simulate the flow model:
\[ X_0 \sim p_{\mathrm{init}}, \qquad \frac{d}{dt}X_t = u_t^\theta(X_t). \]
This procedure is called flow matching. Let us consider the example of Gaussian probability paths \(p_t(\cdot|z) = \gN(\alpha_t z, \beta_t^2 I_d)\), where we may sample from the conditional path via:
\[ x = \alpha_t z + \beta_t \epsilon, \qquad \epsilon \sim \gN(0, I_d). \]
As we have already seen, the conditional vector field \(u_t^\text{target}(x|z)\) is given by:
\[ u_t^{\text{target}}(x|z) = \Big(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\,\alpha_t\Big) z + \frac{\dot{\beta}_t}{\beta_t}\,x, \]
where \(\dot{\alpha}_t = \partial_t\alpha_t\) and \(\dot{\beta}_t = \partial_t\beta_t\) are the respective time derivatives. Then, we can obtain the conditional flow matching loss:
\[ \gL_{\text{CFM}}(\theta) = \mathbb{E}_{t,z,\epsilon}\Big[\big\|u_t^\theta(\alpha_t z + \beta_t\epsilon) - \big(\dot{\alpha}_t z + \dot{\beta}_t \epsilon\big)\big\|^2\Big]. \]
Now consider the special case \(\alpha_t=t\) and \(\beta_t=1-t\). The corresponding path \(p_t(x|z) = \gN(tz, (1-t)^2 I_d)\) is sometimes referred to as the (Gaussian) CondOT probability path; then \(\dot{\alpha}_t=1\) and \(\dot{\beta}_t=-1\), and the conditional flow matching loss becomes:
\[ \gL_{\text{CFM}}(\theta) = \mathbb{E}_{t,z,\epsilon}\Big[\big\|u_t^\theta\big(tz + (1-t)\epsilon\big) - (z - \epsilon)\big\|^2\Big]. \]
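In code, the CondOT conditional flow matching loss is only a few lines; a minimal PyTorch sketch where the interface `model(x, t)` returning \(u_t^\theta(x)\) and the data batch `z` are assumptions:

```python
import torch

def cfm_loss(model, z):
    # z: batch of data points, shape (batch, dim); CondOT schedules
    # alpha_t = t, beta_t = 1 - t, so u_t^target(x|z) = z - eps.
    t = torch.rand(z.shape[0], 1)        # t ~ Unif[0, 1]
    eps = torch.randn_like(z)            # eps ~ N(0, I_d)
    x = t * z + (1.0 - t) * eps          # x ~ p_t(.|z) = N(t z, (1-t)^2 I_d)
    return ((model(x, t) - (z - eps)) ** 2).sum(dim=-1).mean()

# Training step (hypothetical `model`, `optimizer`, and data batch `z`):
# loss = cfm_loss(model, z); optimizer.zero_grad(); loss.backward(); optimizer.step()
```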
Many SOTA models, such as Stable Diffusion 3 and Meta's Movie Gen Video, have been trained using this simple yet effective procedure.
Score Matching
We can also extend the algorithm we just derived from ODEs to SDEs. Recall that we can extend the ODE to an SDE via:
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = \Big[u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2}\,\nabla\log p_t(X_t)\Big]dt + \sigma_t\,dW_t, \]
where \(u_t^\text{target}\) is the marginal vector field and \(\nabla \log p_t(x)\) is the marginal score function:
\[ \nabla\log p_t(x) = \int \nabla\log p_t(x|z)\,\frac{p_t(x|z)\,p_{\text{data}}(z)}{p_t(x)}\,dz. \]
We can use a neural network \(s_t^\theta: \sR^d \times [0, 1] \rightarrow \sR^d\) to approximate the score function \(\nabla \log p_t(x)\), and design a score matching loss and a conditional score matching loss:
\[ \gL_{\text{SM}}(\theta) = \mathbb{E}_{t,\,x\sim p_t}\big[\|s_t^\theta(x) - \nabla\log p_t(x)\|^2\big], \qquad \gL_{\text{CSM}}(\theta) = \mathbb{E}_{t,\,z\sim p_{\text{data}},\,x\sim p_t(\cdot|z)}\big[\|s_t^\theta(x) - \nabla\log p_t(x|z)\|^2\big]. \]
And of course, we can prove that the score matching loss \(\gL_{\text{SM}}\) is equivalent to the conditional score matching loss \(\gL_{\text{CSM}}\) up to a constant, by the same argument as for flow matching. After training with score matching, we can simulate the SDE with an arbitrary diffusion coefficient \(\sigma_t\geq 0\):
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = \Big[u_t^\theta(X_t) + \frac{\sigma_t^2}{2}\,s_t^\theta(X_t)\Big]dt + \sigma_t\,dW_t \]
to generate \(X_1\sim p_{\text{data}}\). In theory, every choice of \(\sigma_t\) should give samples \(X_1\sim p_{\text{data}}\) under perfect training. In practice, we face 1) numerical errors and 2) training errors, so there is an optimal \(\sigma_t\) that must be determined empirically.
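A minimal PyTorch sketch of this sampler via Euler-Maruyama; the interfaces `u_model(x, t)` and `s_model(x, t)` and the constant diffusion coefficient are assumptions of this sketch:

```python
import torch

@torch.no_grad()
def sample_sde(u_model, s_model, n_samples, dim, sigma=1.0, n_steps=500):
    x = torch.randn(n_samples, dim)  # X_0 ~ p_init = N(0, I_d)
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((n_samples, 1), i * h)
        # drift = u_t^theta(x) + (sigma^2 / 2) * s_t^theta(x)
        drift = u_model(x, t) + 0.5 * sigma**2 * s_model(x, t)
        x = x + h * drift + sigma * (h ** 0.5) * torch.randn_like(x)
    return x  # approximately X_1 ~ p_data
```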
Note that it might seem a disadvantage that we have to learn both \(s_t^\theta\) and \(u_t^\theta\) if we want to use a diffusion model. However, we can model both \(s_t^\theta\) and \(u_t^\theta\) with a single neural network with two output heads. Moreover, for the special case of the Gaussian probability path, \(s_t^\theta\) and \(u_t^\theta\) can be converted into each other.
Denoising Diffusion Models: Score Matching for Gaussian Probability Paths
Let us instantiate the denoising score matching loss for the case of \(p_t(x|z) = \gN(\alpha_t z, \beta_t^2 I_d)\). The conditional score function is given by:
\[ \nabla\log p_t(x|z) = -\frac{x - \alpha_t z}{\beta_t^2}. \]
Then, sampling \(x = \alpha_t z + \beta_t\epsilon\) with \(\epsilon\sim\gN(0, I_d)\), the conditional score matching loss becomes:
\[ \gL_{\text{CSM}}(\theta) = \mathbb{E}_{t,z,\epsilon}\Big[\Big\|s_t^\theta(\alpha_t z + \beta_t\epsilon) + \frac{\epsilon}{\beta_t}\Big\|^2\Big] = \mathbb{E}_{t,z,\epsilon}\Big[\frac{1}{\beta_t^2}\big\|\beta_t\,s_t^\theta(\alpha_t z + \beta_t\epsilon) + \epsilon\big\|^2\Big]. \]
Here the network \(s_t^\theta\) essentially learns to predict the noise that was used to corrupt the data; therefore, the above procedure is also known as denoising score matching. The above loss is numerically unstable for \(\beta_t\approx 0\), i.e., when the noise level is very low. Thus, in DDPM, the weighting term \(\frac{1}{\beta_t^2}\) is dropped and \(s_t^\theta\) is reparameterized into a noise-predictor network \(\epsilon_t^\theta(x) = -\beta_t\,s_t^\theta(x)\), yielding:
\[ \gL_{\text{DDPM}}(\theta) = \mathbb{E}_{t,z,\epsilon}\Big[\big\|\epsilon_t^\theta(\alpha_t z + \beta_t\epsilon) - \epsilon\big\|^2\Big]. \]
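A minimal PyTorch sketch of the noise-prediction objective for a Gaussian path; the network interface `eps_model(x, t)` and the CondOT-style schedules are assumptions of this sketch:

```python
import torch

def ddpm_loss(eps_model, z, alpha=lambda t: t, beta=lambda t: 1.0 - t):
    # Corrupt the data with the Gaussian path, then regress the noise itself.
    t = torch.rand(z.shape[0], 1)
    eps = torch.randn_like(z)
    x = alpha(t) * z + beta(t) * eps     # x ~ N(alpha_t z, beta_t^2 I_d)
    return ((eps_model(x, t) - eps) ** 2).sum(dim=-1).mean()
```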
Conversion formula for Gaussian probability path
For the Gaussian probability path \(p_t(x|z) = \gN(\alpha_t z, \beta_t^2 I_d)\), it holds that the conditional/marginal vector field can be converted into the conditional/marginal score:
\[ u_t^{\text{target}}(x|z) = \Big(\beta_t^2\,\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\beta_t\Big)\nabla\log p_t(x|z) + \frac{\dot{\alpha}_t}{\alpha_t}\,x, \qquad u_t^{\text{target}}(x) = \Big(\beta_t^2\,\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\beta_t\Big)\nabla\log p_t(x) + \frac{\dot{\alpha}_t}{\alpha_t}\,x, \]
where the ODE defined by the above marginal vector field \(u_t^\text{target}\) is called the probability flow ODE in the literature.
Proof: For the Gaussian path, \(\nabla\log p_t(x|z) = -\frac{x - \alpha_t z}{\beta_t^2}\), so
\[ \Big(\beta_t^2\,\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\beta_t\Big)\nabla\log p_t(x|z) + \frac{\dot{\alpha}_t}{\alpha_t}\,x = \Big(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\,\alpha_t\Big) z + \frac{\dot{\beta}_t}{\beta_t}\,x = u_t^{\text{target}}(x|z). \]
By integrating this identity against \(\frac{p_t(x|z)\,p_{\text{data}}(z)}{p_t(x)}\,dz\), the same identity holds for the marginal vector field \(u_t^\text{target}(x)\):
\[ u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z)\,\frac{p_t(x|z)\,p_{\text{data}}(z)}{p_t(x)}\,dz = \Big(\beta_t^2\,\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\beta_t\Big)\nabla\log p_t(x) + \frac{\dot{\alpha}_t}{\alpha_t}\,x. \]
Thus, we can use the conversion formula to parameterize the score network \(s_t^\theta\) and the vector field network \(u_t^\theta\) in terms of one another via:
\[ u_t^\theta(x) = \Big(\beta_t^2\,\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\beta_t\Big)\,s_t^\theta(x) + \frac{\dot{\alpha}_t}{\alpha_t}\,x. \]
Similarly, as \(\beta^2_t\dot{\alpha}_t - \alpha_t\dot{\beta}_t\beta_t \neq 0\) (always true for \(t\in [0, 1)\)), we can also have:
\[ s_t^\theta(x) = \frac{\alpha_t\,u_t^\theta(x) - \dot{\alpha}_t\,x}{\beta_t^2\dot{\alpha}_t - \alpha_t\dot{\beta}_t\beta_t}. \]
We can therefore conclude that, for Gaussian probability paths, there is no need to separately train both the marginal score and the marginal vector field, as knowledge of one is sufficient to compute the other. In particular, we can choose whether we want to use flow matching or score matching to train it.
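For instance, a small helper converting a vector field prediction into a score estimate via the conversion formula; passing the schedule values at time \(t\) as plain numbers is an assumption of this sketch:

```python
def score_from_vector_field(u, x, alpha, beta, alpha_dot, beta_dot):
    # s_t(x) = (alpha_t * u_t(x) - alpha_dot_t * x)
    #          / (beta_t^2 * alpha_dot_t - alpha_t * beta_dot_t * beta_t)
    num = alpha * u - alpha_dot * x
    den = beta**2 * alpha_dot - alpha * beta_dot * beta
    return num / den

# Example with CondOT schedules alpha_t = t, beta_t = 1 - t at time t:
# score_from_vector_field(u, x, alpha=t, beta=1 - t, alpha_dot=1.0, beta_dot=-1.0)
```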
Building an Image Generator
Guidance
Given a conditioning variable \(y\) (e.g., a class label or a text prompt), the goal of guided generative modeling is to be able to sample from \(p_\text{data}(x|y)\) for any such \(y\).
Guidance for Flow Models
If we imagine fixing our choice of \(y\) and taking our data distribution to be \(p_{\text{data}}(x|y)\), then we have recovered the unguided generative problem, and can accordingly construct a generative model using the conditional flow matching objective:
\[ \mathbb{E}_{t\sim\text{Unif},\, z\sim p_{\text{data}}(\cdot|y),\, x\sim p_t(\cdot|z)}\Big[\big\|u_t^\theta(x|y) - u_t^{\text{target}}(x|z)\big\|^2\Big]. \]
Note that \(y\) does not affect the conditional probability path \(p_t(x|z)\) or the conditional vector field \(u_t^\text{target}(x|z)\) (in principle, we could make it dependent). Expanding the expectation over all such choices of \(y\) gives the guided conditional flow matching objective:
\[ \gL_{\text{CFM}}^{\text{guided}}(\theta) = \mathbb{E}_{t\sim\text{Unif},\, (z,y)\sim p_{\text{data}},\, x\sim p_t(\cdot|z)}\Big[\big\|u_t^\theta(x|y) - u_t^{\text{target}}(x|z)\big\|^2\Big]. \]
One of the main differences is that here we sample \((z, y)\sim p_{\text{data}}(z, y)\) rather than just \(z\sim p_{\text{data}}(z)\).
Classifier-free guidance: It is empirically observed that image samples generated with this procedure do not fit the desired label \(y\) well enough, and that perceptual quality increases when the effect of the guidance variable \(y\) is artificially amplified. This idea is distilled into classifier-free guidance. Recall that the Gaussian conditional probability path is given by:
\[ p_t(x|z) = \gN(\alpha_t z, \beta_t^2 I_d). \]
Using the conversion formula, we can rewrite the guided vector field \(u_t^\text{target}(x|y)\) using the guided score function \(\nabla \log p_t(x|y)\):
\[ u_t^{\text{target}}(x|y) = a_t\,x + b_t\,\nabla\log p_t(x|y), \]
where \((a_t, b_t) = \big(\frac{\dot{\alpha}_t}{\alpha_t},\ \beta_t^2\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\beta_t\big)\). Using Bayes' rule \(p_t(x|y) = \frac{p_t(y|x)\,p_t(x)}{p_t(y)}\), and noting that \(p_t(y)\) does not depend on \(x\), we can rewrite the guided score function as:
\[ \nabla\log p_t(x|y) = \nabla\log p_t(x) + \nabla\log p_t(y|x). \]
Then, we can rewrite the guided vector field as:
\[ u_t^{\text{target}}(x|y) = a_t\,x + b_t\big(\nabla\log p_t(x) + \nabla\log p_t(y|x)\big) = u_t^{\text{target}}(x) + b_t\,\nabla\log p_t(y|x). \]
Since people observed that the image samples don't fit their prompt well enough, it is a natural idea to scale up the contribution of the \(\nabla \log p_t(y|x)\) term, yielding:
\[ \tilde{u}_t^{\text{target}}(x|y) = u_t^{\text{target}}(x) + w\,b_t\,\nabla\log p_t(y|x), \]
where \(w>1\) is known as the guidance scale. Note that for \(w\neq 1\), it holds that \(\tilde{u}_t^\text{target}(x|y) \neq u_t^\text{target}(x|y)\). Therefore, this guided vector field is not technically correct, but empirically it works well.
We can use the above equalities to obtain:
\[ \tilde{u}_t^{\text{target}}(x|y) = u_t^{\text{target}}(x) + w\big(u_t^{\text{target}}(x|y) - u_t^{\text{target}}(x)\big) = (1-w)\,u_t^{\text{target}}(x) + w\,u_t^{\text{target}}(x|y). \]
Thus, we can express the scaled guided vector field \(\tilde{u}_t^\text{target}(x|y)\) as a linear combination of the unguided vector field \(u_t^\text{target}(x)\) and the guided vector field \(u_t^\text{target}(x|y)\). We can train both fields during training and then combine them at inference time. The idea for training both fields is to augment our label set with an additional \(\emptyset\) label that denotes the absence of conditioning. With that, we can then treat \(u_t^\text{target}(x) = u_t^\text{target}(x|\emptyset)\). This approach of training both models in one is known as classifier-free guidance. An inference-time sketch follows.
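A minimal PyTorch sketch of the inference-time combination; `model(x, t, y)` accepting a null label to mean "no conditioning" is an assumed interface:

```python
import torch

@torch.no_grad()
def cfg_vector_field(model, x, t, y, w=3.0, null_label=None):
    # (1 - w) * u_t(x) + w * u_t(x|y); w = 1 recovers plain guidance.
    u_uncond = model(x, t, null_label)   # u_t^theta(x) = u_t^theta(x | empty)
    u_cond = model(x, t, y)
    return (1.0 - w) * u_uncond + w * u_cond
```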
Guidance for Diffusion Models
Similarly, we can generalize the conditional score matching loss to obtain the guided conditional score matching objective:
\[ \gL_{\text{CSM}}^{\text{guided}}(\theta) = \mathbb{E}_{t\sim\text{Unif},\,(z,y)\sim p_{\text{data}},\, x\sim p_t(\cdot|z)}\Big[\big\|s_t^\theta(x|y) - \nabla\log p_t(x|z)\big\|^2\Big]. \]
We combine the guided score network \(s_t^\theta(x|y)\) with the guided vector field \(u_t^\theta(x|y)\) to simulate the SDE:
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = \Big[u_t^\theta(X_t|y) + \frac{\sigma_t^2}{2}\,s_t^\theta(X_t|y)\Big]dt + \sigma_t\,dW_t. \]
By Bayes' rule, we can rewrite the guided score function as:
\[ \nabla\log p_t(x|y) = \nabla\log p_t(x) + \nabla\log p_t(y|x). \]
Then, we can rewrite the guided score and vector field with an amplified \(\nabla\log p_t(y|x)\) term, so for a guidance scale \(w>1\), we have:
\[ \tilde{s}_t(x|y) = (1-w)\,\nabla\log p_t(x) + w\,\nabla\log p_t(x|y), \qquad \tilde{u}_t^{\text{target}}(x|y) = (1-w)\,u_t^{\text{target}}(x) + w\,u_t^{\text{target}}(x|y). \]
References
[1] MIT 6.S184: Generative AI with Stochastic Differential Equations. A really great course on diffusion models :)