Flow Matching and Diffusion Models
Flow and Diffusion Models
Flow Models
Ordinary differential equations (ODEs): Every ODE defines a vector field \(u\), i.e., for every time \(t\) and position \(x\), we have a vector \(u_t(x)\) specifying a velocity in space. The solution of this ODE is a trajectory \(X = (X_t)_{0\le t\le 1}\) that follows the vector field \(u\):
\[ X_0 = x_0, \qquad \frac{d}{dt}X_t = u_t(X_t). \]
A flow is a collection of maps \(\psi_t: \sR^d \rightarrow \sR^d\) that send the initial position \(x_0\) to the position \(x_t = \psi_t(x_0)\) at time \(t\), following the given vector field \(u\):
\[ \psi_0(x_0) = x_0, \qquad \frac{d}{dt}\psi_t(x_0) = u_t(\psi_t(x_0)). \]
Thus, we can say that a vector field defines an ODE whose solution is a flow.
Theorem 3: The flow exists and is unique if the vector field \(u\) is Lipschitz continuous.
Simulating an ODE
In general, it is not possible to compute the flow \(\psi_t\) exactly. We can approximate it using the Euler method: given a time step \(h\), we update
\[ X_{t+h} = X_t + h\,u_t(X_t). \]
We can also use more sophisticated methods like Heun's method:
\[ X'_{t+h} = X_t + h\,u_t(X_t), \qquad X_{t+h} = X_t + \frac{h}{2}\big(u_t(X_t) + u_{t+h}(X'_{t+h})\big). \]
Intuitively, Heun's method first guesses the next position using an Euler step, then takes the actual step using the average of the vector field at the current position and at the guessed next position. A comparison of both integrators is sketched below.
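To make the two integrators concrete, here is a minimal NumPy sketch under an assumed toy vector field \(u_t(x) = -x\) (all names are illustrative):

```python
import numpy as np

def euler_step(u, t, x, h):
    # X_{t+h} = X_t + h * u_t(X_t)
    return x + h * u(t, x)

def heun_step(u, t, x, h):
    # First guess the next position with an Euler step, then step with
    # the average of the vector field at the two endpoints.
    x_guess = x + h * u(t, x)
    return x + 0.5 * h * (u(t, x) + u(t + h, x_guess))

def simulate_ode(u, x0, n_steps=100, step=euler_step):
    # Integrate d/dt X_t = u_t(X_t) from t = 0 to t = 1.
    x, h = np.asarray(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        x = step(u, i * h, x, h)
    return x

# Toy vector field u_t(x) = -x: the exact solution is X_1 = exp(-1) * X_0.
u = lambda t, x: -x
print(simulate_ode(u, [1.0, 1.0]))                  # Euler approximation
print(simulate_ode(u, [1.0, 1.0], step=heun_step))  # Heun is more accurate
```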
Flow models
We can construct a generative model that converts a simple distribution \(p_{\mathrm{init}}\) into a complex distribution \(p_{\mathrm{data}}\) via an ODE; this is called a flow model, defined by the following equation:
\[ X_0 \sim p_{\mathrm{init}}, \qquad \frac{d}{dt}X_t = u_t^\theta(X_t). \]
Here \(u_t^\theta: \sR^d \times [0, 1] \rightarrow \sR^d\) is a neural network. Our goal is to make the endpoint \(X_1\) of the trajectory have a distribution close to \(p_{\mathrm{data}}\):
\[ X_1 \sim p_{\mathrm{data}}. \]
NOTE: Although this is called a flow model, the network itself parameterizes the vector field, not the flow.
Algorithm 1: Sampling from a flow model via Euler's method
- Sample \(X_0 \sim p_{\mathrm{init}}\), set \(t \leftarrow 0\) and step size \(h = 1/n\)
- For \(i = 0, \ldots, n-1\) do
- \(X_{t+h} = X_t + h\, u_t^\theta(X_t)\)
- Update \(t \leftarrow t+h\)
- Return \(X_1\)
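A minimal PyTorch sketch of Algorithm 1; the network interface `model(x, t)` returning \(u_t^\theta(x)\) is an assumption of this sketch:

```python
import torch

@torch.no_grad()
def sample_flow(model, n_samples, dim, n_steps=100):
    # X_0 ~ p_init; here p_init = N(0, I_d).
    x = torch.randn(n_samples, dim)
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((n_samples, 1), i * h)
        x = x + h * model(x, t)  # Euler step: X_{t+h} = X_t + h * u_t^theta(X_t)
    return x  # approximately distributed as p_data after training
```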
Diffusion Models
Brownian motion. SDEs are built on Brownian motion, defined as \(W = (W_t)_{0\leq t \leq 1}\), a stochastic process with continuous sample paths satisfying the following properties:
- \(W_0 = 0\) and \(W_t - W_s \sim \gN(0, (t-s)I_d)\) for all \(0 \leq s < t \leq 1\) (i.e., the variance of the increments grows linearly with time).
- Independent increments: for any \(0 \leq t_0 < t_1 < \cdots < t_n \leq 1\), the increments \(W_{t_1} - W_{t_0}, \ldots, W_{t_n} - W_{t_{n-1}}\) are mutually independent.
We can easily simulate a Brownian motion approximately with step size \(h\) by setting \(W_0 = 0\) and updating:
\[ W_{t+h} = W_t + \sqrt{h}\,\epsilon_t, \qquad \epsilon_t \sim \gN(0, I_d). \]
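A short NumPy sketch of this approximate simulation (the step count is an arbitrary choice):

```python
import numpy as np

def simulate_brownian(d, n_steps=1000, seed=0):
    # W_0 = 0; W_{t+h} = W_t + sqrt(h) * eps_t, eps_t ~ N(0, I_d).
    rng = np.random.default_rng(seed)
    h = 1.0 / n_steps
    increments = rng.normal(scale=np.sqrt(h), size=(n_steps, d))
    return np.vstack([np.zeros(d), np.cumsum(increments, axis=0)])

W = simulate_brownian(d=2)  # W[k] approximates W_{k/1000}
```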
From ODEs to SDEs. The idea of an SDE is to extend the deterministic dynamics of an ODE with Brownian motion. Since the resulting trajectories are stochastic and not differentiable in time, we can no longer take derivatives; instead, we rewrite the ODE via infinitesimal updates:
\[ X_{t+h} = X_t + h\,u_t(X_t) + h\,R_t(h), \]
where \(R_t(h)\) is an error term with \(\lim_{h \rightarrow 0} R_t(h) = 0\). For SDEs, we add a contribution from Brownian motion:
\[ X_{t+h} = X_t + h\,u_t(X_t) + \sigma_t\,(W_{t+h} - W_t) + h\,R_t(h), \]
where \(\sigma_t\geq 0\) is the diffusion coefficient. The above describes a stochastic differential equation (SDE), and it is common to write it in the form:
\[ dX_t = u_t(X_t)\,dt + \sigma_t\,dW_t. \]
For SDEs, there is no flow \(\psi_t\) anymore, as \(X_t\) is no longer fully determined by \(X_0\). Note that every ODE is a special case of an SDE with \(\sigma_t = 0\).
Simulating an SDE. We can simulate an SDE using the Euler-Maruyama method. Given a time step \(h\), we can approximate the SDE as follows:
\[ X_{t+h} = X_t + h\,u_t(X_t) + \sqrt{h}\,\sigma_t\,\epsilon_t, \qquad \epsilon_t \sim \gN(0, I_d). \]
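A minimal NumPy sketch of the Euler-Maruyama scheme, using an assumed Ornstein-Uhlenbeck-style toy SDE:

```python
import numpy as np

def euler_maruyama(u, sigma, x0, n_steps=1000, seed=0):
    # X_{t+h} = X_t + h * u_t(X_t) + sqrt(h) * sigma_t * eps_t
    rng = np.random.default_rng(seed)
    x, h = np.asarray(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        x = x + h * u(t, x) + np.sqrt(h) * sigma(t) * rng.normal(size=x.shape)
    return x

# Toy SDE dX_t = -X_t dt + dW_t (constant diffusion coefficient).
print(euler_maruyama(lambda t, x: -x, lambda t: 1.0, x0=[1.0, 1.0]))
```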
Diffusion models. We can construct a generative model that converts a simple distribution \(p_{\mathrm{init}}\) into a complex distribution \(p_{\mathrm{data}}\) via an SDE; this is called a diffusion model, defined by the following equation:
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = u_t^\theta(X_t)\,dt + \sigma_t\,dW_t. \]
Constructing the Training Target
We have constructed flow and diffusion models where we obtain trajectories \((X_t)_{0\leq t \leq 1}\) by simulating the ODE/SDE:
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = u_t^\theta(X_t)\,dt + \sigma_t\,dW_t \quad (\sigma_t = 0 \text{ for a flow model}). \]
To train these models, we need to find a training target \(u_t^\text{target}\) for \(u_t^\theta\) to approximate.
Conditional and Marginal Probability Path
The probability path can be considered as a gradual interpolation between noise \(p_{\mathrm{init}}\) and data \(p_{\mathrm{data}}\). Let us consider a data point \(z\in \sR^d\), represented by the Dirac delta "distribution" \(\delta_z\): sampling from \(\delta_z\) is deterministic and always gives \(z\).
A conditional (interpolating) probability path is a set of distributions \(p_t(x|z)\) over \(\sR^d\) such that:
\[ p_0(\cdot|z) = p_{\mathrm{init}} \quad \text{and} \quad p_1(\cdot|z) = \delta_z \quad \text{for all } z\in\sR^d. \]
That is to say, a conditional probability path gradually converts a simple distribution \(p_{\mathrm{init}}\) into a single data point \(z\).
Every conditional probability path induces a marginal probability path \(p_t(x)\):
\[ p_t(x) = \int p_t(x|z)\,p_{\mathrm{data}}(z)\,dz. \]
NOTE: Here we know how to sample from \(p_t(x)\) (draw \(z \sim p_{\mathrm{data}}\), then \(x \sim p_t(\cdot|z)\)), but we don't know the density value of \(p_t(x)\), as the integral is intractable.
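Sampling from \(p_t\) is straightforward even though its density is unknown; a minimal sketch, assuming the Gaussian conditional path \(p_t(\cdot|z)=\gN(\alpha_t z, \beta_t^2 I_d)\) with schedules \(\alpha_t = t\), \(\beta_t = 1-t\) introduced below, and a placeholder dataset standing in for \(p_{\mathrm{data}}\):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, size=(10_000, 2))  # placeholder samples from p_data

def sample_marginal(t, n):
    # z ~ p_data, then x ~ p_t(.|z) = N(alpha_t * z, beta_t^2 * I_d)
    z = data[rng.integers(len(data), size=n)]
    eps = rng.normal(size=z.shape)
    return t * z + (1.0 - t) * eps  # alpha_t = t, beta_t = 1 - t

x = sample_marginal(t=0.5, n=5)  # samples from p_0.5; its density stays unknown
```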
Conditional and Marginal Vector Fields
We now consider the problem of constructing a training target \(u_t^\text{target}\) for a flow model.
Continuity Equation
The divergence of a vector field \(v_t\) is defined as:
\[ \mathrm{div}(v_t)(x) = \sum_{i=1}^d \frac{\partial}{\partial x_i}\,(v_t)_i(x). \]
Intuitively, the divergence is a scalar field that measures the outward flux of a vector field per unit volume at each point. The continuity equation states that the rate of change of the density along the probability path equals the negative divergence of the vector field \(u_t^\text{target}\) scaled by the probability density:
\[ \partial_t p_t(x) = -\,\mathrm{div}\big(p_t\,u_t^{\text{target}}\big)(x). \]
Marginalization trick
Based on the continuity equation, we can derive the marginal vector field \(u_t^\text{target}\) as follows:
\[ \partial_t p_t(x) = \int \partial_t p_t(x|z)\,p_{\mathrm{data}}(z)\,dz = -\int \mathrm{div}\big(p_t(\cdot|z)\,u_t^{\text{target}}(\cdot|z)\big)(x)\,p_{\mathrm{data}}(z)\,dz = -\,\mathrm{div}\Big(p_t \int u_t^{\text{target}}(\cdot|z)\,\frac{p_t(\cdot|z)\,p_{\mathrm{data}}(z)}{p_t}\,dz\Big)(x). \]
Thus we have:
\[ u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z)\,\frac{p_t(x|z)\,p_{\mathrm{data}}(z)}{p_t(x)}\,dz. \]
Theorem: Marginalization trick
For every data point \(z\in \sR^d\), let \(u_t^\text{target}(\cdot|z)\) denote a conditional vector field, defined so that the corresponding ODE yields the conditional probability path \(p_t(\cdot|z)\), viz:
\[ X_0 \sim p_{\mathrm{init}}, \quad \frac{d}{dt}X_t = u_t^{\text{target}}(X_t|z) \;\Rightarrow\; X_t \sim p_t(\cdot|z) \quad (0 \le t \le 1). \]
Then, for the marginal probability path to satisfy:
\[ X_0 \sim p_{\mathrm{init}}, \quad \frac{d}{dt}X_t = u_t^{\text{target}}(X_t) \;\Rightarrow\; X_t \sim p_t \quad (0 \le t \le 1), \]
the marginal vector field \(u_t^\text{target}(x)\) should be defined as:
\[ u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z)\,\frac{p_t(x|z)\,p_{\mathrm{data}}(z)}{p_t(x)}\,dz. \]
Importance of the marginalization trick: This allows us to construct the marginal vector field from a conditional vector field. This simplifies the problem, as we can often find in closed form a conditional vector field \(u_t^\text{target}(\cdot|z)\) whose ODE trajectories satisfy \(X_t \sim p_t(\cdot|z)\) for each data point \(z\). Here is an example:
Target ODE for Gaussian probability path: Let \(p_t(\cdot|z) = \gN(\alpha_t z, \beta_t^2 I_d)\) for noise schedulers \(\alpha_t, \beta_t\). Here \(\alpha_t, \beta_t\) are two continuously differentiable, monotonic functions with \(\alpha_0=\beta_1=0\) and \(\alpha_1=\beta_0=1\). Let \(\dot{\alpha}_t = \partial_t\alpha_t\) and \(\dot{\beta}_t = \partial_t\beta_t\). Then the conditional Gaussian vector field, given by:
\[ u_t^{\text{target}}(x|z) = \Big(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\,\alpha_t\Big) z + \frac{\dot{\beta}_t}{\beta_t}\,x, \]
is a valid conditional vector field, with its ODE trajectories \(X_t\) satisfying \(X_t \sim p_t(\cdot|z) = \gN(\alpha_t z, \beta_t^2 I_d)\) if \(X_0\sim \gN(0, I_d)\).
Let us construct a conditional flow model \(\psi_t^{\text{target}}(x|z)\) first by defining:
\[ \psi_t^{\text{target}}(x|z) = \alpha_t z + \beta_t x. \]
We can then extract the conditional vector field \(u_t^{\text{target}}(x|z)\) from \(\psi_t^{\text{target}}(x|z)\) via the defining relation:
\[ \frac{d}{dt}\psi_t^{\text{target}}(x|z) = u_t^{\text{target}}\big(\psi_t^{\text{target}}(x|z)\,\big|\,z\big). \]
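Carrying out this extraction explicitly (a short worked derivation): differentiating the flow gives
\[ \frac{d}{dt}\psi_t^{\text{target}}(x|z) = \dot{\alpha}_t z + \dot{\beta}_t x, \]
and inverting the flow gives \(x = \frac{x' - \alpha_t z}{\beta_t}\) for \(x' = \psi_t^{\text{target}}(x|z)\). Substituting,
\[ u_t^{\text{target}}(x'|z) = \dot{\alpha}_t z + \dot{\beta}_t\,\frac{x' - \alpha_t z}{\beta_t} = \Big(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\,\alpha_t\Big) z + \frac{\dot{\beta}_t}{\beta_t}\,x', \]
which recovers the conditional Gaussian vector field above.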
Conditional and Marginal Score Functions
The marginal score function \(\nabla \log p_t(x)\) is defined as the gradient of the log-likelihood of the marginal probability path \(p_t(x)\). We can express the marginal score function via the conditional score function \(\nabla \log p_t(x|z)\):
\[ \nabla \log p_t(x) = \frac{\nabla p_t(x)}{p_t(x)} = \frac{\int \nabla p_t(x|z)\,p_{\mathrm{data}}(z)\,dz}{p_t(x)} = \int \nabla \log p_t(x|z)\,\frac{p_t(x|z)\,p_{\mathrm{data}}(z)}{p_t(x)}\,dz, \]
and the conditional score function \(\nabla \log p_t(x|z)\) can often be derived analytically. For instance, for the Gaussian path \(p_t(x|z) = \gN(\alpha_t z, \beta_t^2 I_d)\), we have:
\[ \nabla \log p_t(x|z) = -\frac{x - \alpha_t z}{\beta_t^2}. \]
Fokker-Planck equation
Let us first define the Laplacian operator \(\Delta\) via:
\[ \Delta f(x) = \sum_{i=1}^d \frac{\partial^2}{\partial x_i^2} f(x) = \mathrm{div}(\nabla f)(x). \]
Let \(p_t\) be a probability path and let us consider the SDE:
\[ X_0 \sim p_0, \qquad dX_t = u_t(X_t)\,dt + \sigma_t\,dW_t. \]
Then \(X_t\) has distribution \(p_t\) for all \(0\leq t \leq 1\) if and only if the Fokker-Planck equation holds:
\[ \partial_t p_t(x) = -\,\mathrm{div}(p_t u_t)(x) + \frac{\sigma_t^2}{2}\,\Delta p_t(x). \]
Marginalization trick for SDEs
Based on the Fokker-Planck equation and the identity \(\nabla p_t = p_t \nabla\log p_t\) (so that \(\Delta p_t = \mathrm{div}(p_t \nabla\log p_t)\)), we can have:
\[ \partial_t p_t(x) = -\,\mathrm{div}\big(p_t u_t^{\text{target}}\big)(x) = -\,\mathrm{div}\Big(p_t\Big[u_t^{\text{target}} + \frac{\sigma_t^2}{2}\nabla\log p_t\Big]\Big)(x) + \frac{\sigma_t^2}{2}\,\Delta p_t(x). \]
SDE extension trick
Based on the proof above, we can construct an SDE as follows:
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = \Big[u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2}\,\nabla\log p_t(X_t)\Big]dt + \sigma_t\,dW_t. \]
Here \(X_t \sim p_t\) for all \(t\), and in particular \(X_1\sim p_{\text{data}}\) for this SDE. The same identity holds if we replace the marginal probability path \(p_t(x)\) and vector field \(u_t^{\text{target}}(x)\) with the conditional probability path \(p_t(x|z)\) and vector field \(u_t^{\text{target}}(x|z)\), respectively.
Langevin dynamics
The above construction has a special case when the probability path is static, i.e., \(p_t=p\) for a fixed distribution \(p\). In this case, the corresponding \(u_t^{\text{target}}=0\), and we have:
\[ dX_t = \frac{\sigma_t^2}{2}\,\nabla\log p(X_t)\,dt + \sigma_t\,dW_t. \]
This is commonly known as Langevin dynamics. The fact that \(p_t\) is static implies that \(\partial_t p_t(x) = 0\), and one checks that this equation satisfies the Fokker-Planck equation. Thus, we have:
\[ X_0 \sim p \;\Rightarrow\; X_t \sim p \quad \text{for all } t, \]
i.e., \(p\) is a stationary distribution of these dynamics. Moreover, these dynamics converge to the stationary distribution \(p\) under rather general conditions: if we instead take \(X_0\sim p^\prime \neq p\), then under mild conditions the distribution \(p_t\) of \(X_t\) converges to \(p\). That is to say, if we know the score function \(\nabla \log p(x)\), then even if we start from a different distribution \(p^\prime\), we will eventually converge to the target distribution \(p\) by following the SDE.
This makes Langevin dynamics extremely useful; it accordingly serves as the basis for, e.g., molecular dynamics simulations and many other Markov chain Monte Carlo (MCMC) methods. For instance, in score-based generative models, we learn the score function \(\nabla \log p(x)\) from data. Then, we can still use Langevin dynamics to sample from the learned distribution \(p(x)\) even if we start from random noise, as in the sketch below.
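A minimal NumPy sketch of Langevin dynamics targeting \(p = \gN(2, 1)\) in 1D, whose score \(\nabla\log p(x) = -(x-2)\) is known in closed form (all constants are arbitrary choices):

```python
import numpy as np

def langevin(score, x0, n_steps=5000, h=1e-3, sigma=np.sqrt(2.0), seed=0):
    # Euler-Maruyama on dX_t = (sigma^2 / 2) * score(X_t) dt + sigma dW_t
    rng = np.random.default_rng(seed)
    x = float(x0)
    for _ in range(n_steps):
        x += h * 0.5 * sigma**2 * score(x) + np.sqrt(h) * sigma * rng.normal()
    return x

# Start far from the target p = N(2, 1); the chain still converges to p.
score = lambda x: -(x - 2.0)
samples = np.array([langevin(score, x0=-10.0, seed=s) for s in range(500)])
print(samples.mean(), samples.std())  # roughly 2 and 1
```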
Summary - Derivation of the Training Target
The flow training target is the marginal vector field \(u_t^\text{target}\). To construct it, we choose a conditional probability path \(p_t(x|z)\) that fulfills \(p_0(\cdot|z) = p_{\text{init}}\) and \(p_1(\cdot|z) = \delta_z\) for all \(z\in \sR^d\). Next, we find a conditional vector field \(u_t^\text{target}(x|z)\) such that its corresponding flow \(\psi_t^{\text{target}}(x|z)\) satisfies
\[ X_0 \sim p_{\mathrm{init}} \;\Rightarrow\; X_t := \psi_t^{\text{target}}(X_0|z) \sim p_t(\cdot|z) \quad (0\le t\le 1), \]
or equivalently, that \(u_t^\text{target}(\cdot|z)\) satisfies the continuity equation for \(p_t(\cdot|z)\). Then, the marginal vector field, defined by:
\[ u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z)\,\frac{p_t(x|z)\,p_{\mathrm{data}}(z)}{p_t(x)}\,dz, \]
follows the marginal probability path, i.e.,
\[ X_0 \sim p_{\mathrm{init}}, \quad \frac{d}{dt}X_t = u_t^{\text{target}}(X_t) \;\Rightarrow\; X_t \sim p_t \quad (0\le t\le 1). \]
In particular, \(X_1\sim p_\text{data}\) for this ODE, so that \(u_t^\text{target}\) converts noise into data.
SDE extension: For a time-dependent diffusion coefficient \(\sigma_t\geq 0\), we can extend the above ODE to an SDE with the same marginal probability path:
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = \Big[u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2}\,\nabla\log p_t(X_t)\Big]dt + \sigma_t\,dW_t \;\Rightarrow\; X_t \sim p_t \quad (0\le t\le 1), \]
where \(\nabla\log p_t(x)\) is the marginal score function:
\[ \nabla\log p_t(x) = \int \nabla\log p_t(x|z)\,\frac{p_t(x|z)\,p_{\mathrm{data}}(z)}{p_t(x)}\,dz. \]
An important example is the Gaussian probability path, yielding the formulas:
\[ p_t(x|z) = \gN(\alpha_t z, \beta_t^2 I_d), \qquad u_t^{\text{target}}(x|z) = \Big(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\,\alpha_t\Big) z + \frac{\dot{\beta}_t}{\beta_t}\,x, \qquad \nabla\log p_t(x|z) = -\frac{x - \alpha_t z}{\beta_t^2}, \]
for noise schedulers \(\alpha_t, \beta_t\in \gR\): continuously differentiable, monotonic functions such that \(\alpha_0=\beta_1=0\) and \(\alpha_1=\beta_0=1\).
Training the Generative Model
Flow Matching
Let us consider a flow model given by:
\[ X_0 \sim p_{\mathrm{init}}, \qquad \frac{d}{dt}X_t = u_t^\theta(X_t). \]
An intuitive way of training this model is to use a mean squared error (MSE), i.e., the flow matching loss defined as:
\[ \gL_{\text{FM}}(\theta) = \mathbb{E}_{t\sim \text{Unif}[0,1],\, x\sim p_t}\Big[\big\|u_t^\theta(x) - u_t^{\text{target}}(x)\big\|^2\Big]. \]
Here \(\text{Unif}\) denotes the uniform distribution over \([0, 1]\), and \(p_t(x)=\int p_t(x|z)p_{\text{data}}(z)dz\) is the marginal probability path. While we do know:
\[ u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z)\,\frac{p_t(x|z)\,p_{\text{data}}(z)}{p_t(x)}\,dz, \]
we cannot compute it efficiently, as the above integral is intractable. Thus, we instead look to the conditional vector field \(u_t^\text{target}(x|z)\), which is tractable. To do so, let us define the conditional flow matching loss:
\[ \gL_{\text{CFM}}(\theta) = \mathbb{E}_{t\sim \text{Unif}[0,1],\, z\sim p_{\text{data}},\, x\sim p_t(\cdot|z)}\Big[\big\|u_t^\theta(x) - u_t^{\text{target}}(x|z)\big\|^2\Big]. \]
However, don't forget that we aim to regress the marginal vector field. The good news is that by explicitly regressing against the tractable, conditional vector field, we are implicitly regressing against the intractable, marginal vector field. We can prove it by expanding the square:
\[ \gL_{\text{FM}}(\theta) = \mathbb{E}_{t,\,x\sim p_t}\big[\|u_t^\theta(x)\|^2\big] - 2\,\mathbb{E}_{t,\,x\sim p_t}\big[u_t^\theta(x)^\top u_t^{\text{target}}(x)\big] + \underbrace{\mathbb{E}_{t,\,x\sim p_t}\big[\|u_t^{\text{target}}(x)\|^2\big]}_{=:C_1}. \]
We can re-express the second term, using the marginalization trick, as:
\[ \mathbb{E}_{t,\,x\sim p_t}\big[u_t^\theta(x)^\top u_t^{\text{target}}(x)\big] = \mathbb{E}_{t,\,z\sim p_{\text{data}},\,x\sim p_t(\cdot|z)}\big[u_t^\theta(x)^\top u_t^{\text{target}}(x|z)\big]. \]
Then, we have:
\[ \gL_{\text{FM}}(\theta) = \gL_{\text{CFM}}(\theta) + C_1 - \mathbb{E}_{t,\,z,\,x\sim p_t(\cdot|z)}\big[\|u_t^{\text{target}}(x|z)\|^2\big] = \gL_{\text{CFM}}(\theta) + C_2, \]
where \(C_2\) does not depend on \(\theta\). Thus, the flow matching loss \(\gL_{\text{FM}}\) is equivalent to the conditional flow matching loss \(\gL_{\text{CFM}}\) up to a constant \(C_2\); in particular, their gradients and minimizers coincide. Once \(u_t^\theta\) is trained, we may simulate the flow model:
\[ X_0 \sim p_{\mathrm{init}}, \qquad \frac{d}{dt}X_t = u_t^\theta(X_t). \]
This procedure is called flow matching. Let us consider the example of Gaussian probability paths \(p_t(\cdot|z) = \gN(\alpha_t z, \beta_t^2 I_d)\), where we may sample from the conditional path via:
\[ x = \alpha_t z + \beta_t \epsilon, \qquad \epsilon \sim \gN(0, I_d). \]
As we have already seen, the conditional vector field \(u_t^\text{target}(x|z)\) is given by:
\[ u_t^{\text{target}}(x|z) = \Big(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\,\alpha_t\Big) z + \frac{\dot{\beta}_t}{\beta_t}\,x, \]
where \(\dot{\alpha}_t = \partial_t\alpha_t\) and \(\dot{\beta}_t = \partial_t\beta_t\) are the respective time derivatives. Then, we can obtain the conditional flow matching loss:
\[ \gL_{\text{CFM}}(\theta) = \mathbb{E}_{t,z,\epsilon}\Big[\big\|u_t^\theta(\alpha_t z + \beta_t\epsilon) - \big(\dot{\alpha}_t z + \dot{\beta}_t \epsilon\big)\big\|^2\Big]. \]
Now consider the special case \(\alpha_t=t\) and \(\beta_t=1-t\). The corresponding path \(p_t(x|z) = \gN(tz, (1-t)^2 I_d)\) is sometimes referred to as the (Gaussian) CondOT probability path; then \(\dot{\alpha}_t=1\) and \(\dot{\beta}_t=-1\), and the conditional flow matching loss becomes:
\[ \gL_{\text{CFM}}(\theta) = \mathbb{E}_{t,z,\epsilon}\Big[\big\|u_t^\theta\big(tz + (1-t)\epsilon\big) - (z - \epsilon)\big\|^2\Big]. \]
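In code, the CondOT conditional flow matching loss is only a few lines; a minimal PyTorch sketch where the interface `model(x, t)` returning \(u_t^\theta(x)\) and the data batch `z` are assumptions:

```python
import torch

def cfm_loss(model, z):
    # z: batch of data points, shape (batch, dim); CondOT schedules
    # alpha_t = t, beta_t = 1 - t, so u_t^target(x|z) = z - eps.
    t = torch.rand(z.shape[0], 1)        # t ~ Unif[0, 1]
    eps = torch.randn_like(z)            # eps ~ N(0, I_d)
    x = t * z + (1.0 - t) * eps          # x ~ p_t(.|z) = N(t z, (1-t)^2 I_d)
    return ((model(x, t) - (z - eps)) ** 2).sum(dim=-1).mean()

# Training step (hypothetical `model`, `optimizer`, and data batch `z`):
# loss = cfm_loss(model, z); optimizer.zero_grad(); loss.backward(); optimizer.step()
```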
Many SOTA models, such as Stable Diffusion 3 and Meta's Movie Gen Video, have been trained using this simple yet effective procedure.
Score Matching
We can also extend the algorithm we just derived from ODEs to SDEs. Recall that we can extend the ODE to an SDE via:
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = \Big[u_t^{\text{target}}(X_t) + \frac{\sigma_t^2}{2}\,\nabla\log p_t(X_t)\Big]dt + \sigma_t\,dW_t, \]
where \(u_t^\text{target}\) is the marginal vector field and \(\nabla \log p_t(x)\) is the marginal score function:
\[ \nabla\log p_t(x) = \int \nabla\log p_t(x|z)\,\frac{p_t(x|z)\,p_{\text{data}}(z)}{p_t(x)}\,dz. \]
We can use a neural network \(s_t^\theta: \sR^d \times [0, 1] \rightarrow \sR^d\) to approximate the score function \(\nabla \log p_t(x)\), and design a score matching loss and a conditional score matching loss:
\[ \gL_{\text{SM}}(\theta) = \mathbb{E}_{t,\,x\sim p_t}\big[\|s_t^\theta(x) - \nabla\log p_t(x)\|^2\big], \qquad \gL_{\text{CSM}}(\theta) = \mathbb{E}_{t,\,z\sim p_{\text{data}},\,x\sim p_t(\cdot|z)}\big[\|s_t^\theta(x) - \nabla\log p_t(x|z)\|^2\big]. \]
And of course, we can prove that the score matching loss \(\gL_{\text{SM}}\) is equivalent to the conditional score matching loss \(\gL_{\text{CSM}}\) up to a constant, by the same argument as for flow matching. After training with score matching, we can simulate the SDE with an arbitrary diffusion coefficient \(\sigma_t\geq 0\):
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = \Big[u_t^\theta(X_t) + \frac{\sigma_t^2}{2}\,s_t^\theta(X_t)\Big]dt + \sigma_t\,dW_t \]
to generate \(X_1\sim p_{\text{data}}\). In theory, every choice of \(\sigma_t\) should give samples \(X_1\sim p_{\text{data}}\) under perfect training. In practice, we face 1) numerical errors and 2) training errors, so there is an optimal \(\sigma_t\) that must be determined empirically.
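A minimal PyTorch sketch of this sampler via Euler-Maruyama; the interfaces `u_model(x, t)` and `s_model(x, t)` and the constant diffusion coefficient are assumptions of this sketch:

```python
import torch

@torch.no_grad()
def sample_sde(u_model, s_model, n_samples, dim, sigma=1.0, n_steps=500):
    x = torch.randn(n_samples, dim)  # X_0 ~ p_init = N(0, I_d)
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((n_samples, 1), i * h)
        # drift = u_t^theta(x) + (sigma^2 / 2) * s_t^theta(x)
        drift = u_model(x, t) + 0.5 * sigma**2 * s_model(x, t)
        x = x + h * drift + sigma * (h ** 0.5) * torch.randn_like(x)
    return x  # approximately X_1 ~ p_data
```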
Note that it might seem a disadvantage that we have to learn both \(s_t^\theta\) and \(u_t^\theta\) if we want to use a diffusion model. However, we can model both \(s_t^\theta\) and \(u_t^\theta\) with a single neural network with two output heads. Moreover, for the special case of the Gaussian probability path, \(s_t^\theta\) and \(u_t^\theta\) can be converted into each other.
Denoising Diffusion Models: Score Matching for Gaussian Probability Paths
Let us instantiate the denoising score matching loss for the case of \(p_t(x|z) = \gN(\alpha_t z, \beta_t^2 I_d)\). The conditional score function is given by:
\[ \nabla\log p_t(x|z) = -\frac{x - \alpha_t z}{\beta_t^2}. \]
Then, sampling \(x = \alpha_t z + \beta_t\epsilon\) with \(\epsilon\sim\gN(0, I_d)\), the conditional score matching loss becomes:
\[ \gL_{\text{CSM}}(\theta) = \mathbb{E}_{t,z,\epsilon}\Big[\Big\|s_t^\theta(\alpha_t z + \beta_t\epsilon) + \frac{\epsilon}{\beta_t}\Big\|^2\Big] = \mathbb{E}_{t,z,\epsilon}\Big[\frac{1}{\beta_t^2}\big\|\beta_t\,s_t^\theta(\alpha_t z + \beta_t\epsilon) + \epsilon\big\|^2\Big]. \]
Here the network \(s_t^\theta\) essentially learns to predict the noise that was used to corrupt the data; therefore, the above procedure is also known as denoising score matching. The above loss is numerically unstable for \(\beta_t\approx 0\), i.e., when the noise level is very low. Thus, in DDPM, the weighting term \(\frac{1}{\beta_t^2}\) is dropped and \(s_t^\theta\) is reparameterized into a noise-predictor network \(\epsilon_t^\theta(x) = -\beta_t\,s_t^\theta(x)\), yielding:
\[ \gL_{\text{DDPM}}(\theta) = \mathbb{E}_{t,z,\epsilon}\Big[\big\|\epsilon_t^\theta(\alpha_t z + \beta_t\epsilon) - \epsilon\big\|^2\Big]. \]
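A minimal PyTorch sketch of the noise-prediction objective for a Gaussian path; the network interface `eps_model(x, t)` and the CondOT-style schedules are assumptions of this sketch:

```python
import torch

def ddpm_loss(eps_model, z, alpha=lambda t: t, beta=lambda t: 1.0 - t):
    # Corrupt the data with the Gaussian path, then regress the noise itself.
    t = torch.rand(z.shape[0], 1)
    eps = torch.randn_like(z)
    x = alpha(t) * z + beta(t) * eps     # x ~ N(alpha_t z, beta_t^2 I_d)
    return ((eps_model(x, t) - eps) ** 2).sum(dim=-1).mean()
```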
Conversion formula for Gaussian probability path
For the Gaussian probability path \(p_t(x|z) = \gN(\alpha_t z, \beta_t^2 I_d)\), it holds that the conditional/marginal vector field can be converted into the conditional/marginal score:
\[ u_t^{\text{target}}(x|z) = \Big(\beta_t^2\,\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\beta_t\Big)\nabla\log p_t(x|z) + \frac{\dot{\alpha}_t}{\alpha_t}\,x, \qquad u_t^{\text{target}}(x) = \Big(\beta_t^2\,\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\beta_t\Big)\nabla\log p_t(x) + \frac{\dot{\alpha}_t}{\alpha_t}\,x, \]
where the ODE defined by the above marginal vector field \(u_t^\text{target}\) is called the probability flow ODE in the literature.
Proof: For the Gaussian path, \(\nabla\log p_t(x|z) = -\frac{x - \alpha_t z}{\beta_t^2}\), so
\[ \Big(\beta_t^2\,\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\beta_t\Big)\nabla\log p_t(x|z) + \frac{\dot{\alpha}_t}{\alpha_t}\,x = \Big(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\,\alpha_t\Big) z + \frac{\dot{\beta}_t}{\beta_t}\,x = u_t^{\text{target}}(x|z). \]
By integrating this identity against \(\frac{p_t(x|z)\,p_{\text{data}}(z)}{p_t(x)}\,dz\), the same identity holds for the marginal vector field \(u_t^\text{target}(x)\):
\[ u_t^{\text{target}}(x) = \int u_t^{\text{target}}(x|z)\,\frac{p_t(x|z)\,p_{\text{data}}(z)}{p_t(x)}\,dz = \Big(\beta_t^2\,\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\beta_t\Big)\nabla\log p_t(x) + \frac{\dot{\alpha}_t}{\alpha_t}\,x. \]
Thus, we can use the conversion formula to parameterize the score network \(s_t^\theta\) and the vector field network \(u_t^\theta\) in terms of one another via:
\[ u_t^\theta(x) = \Big(\beta_t^2\,\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\beta_t\Big)\,s_t^\theta(x) + \frac{\dot{\alpha}_t}{\alpha_t}\,x. \]
Similarly, as \(\beta^2_t\dot{\alpha}_t - \alpha_t\dot{\beta}_t\beta_t \neq 0\) (always true for \(t\in [0, 1)\)), we can also have:
\[ s_t^\theta(x) = \frac{\alpha_t\,u_t^\theta(x) - \dot{\alpha}_t\,x}{\beta_t^2\dot{\alpha}_t - \alpha_t\dot{\beta}_t\beta_t}. \]
We can therefore conclude that, for Gaussian probability paths, there is no need to separately train both the marginal score and the marginal vector field, as knowledge of one is sufficient to compute the other. In particular, we can choose whether we want to use flow matching or score matching to train it.
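For instance, a small helper converting a vector field prediction into a score estimate via the conversion formula; passing the schedule values at time \(t\) as plain numbers is an assumption of this sketch:

```python
def score_from_vector_field(u, x, alpha, beta, alpha_dot, beta_dot):
    # s_t(x) = (alpha_t * u_t(x) - alpha_dot_t * x)
    #          / (beta_t^2 * alpha_dot_t - alpha_t * beta_dot_t * beta_t)
    num = alpha * u - alpha_dot * x
    den = beta**2 * alpha_dot - alpha * beta_dot * beta
    return num / den

# Example with CondOT schedules alpha_t = t, beta_t = 1 - t at time t:
# score_from_vector_field(u, x, alpha=t, beta=1 - t, alpha_dot=1.0, beta_dot=-1.0)
```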
Building an Image Generator
Guidance
Given a conditioning variable \(y\) (e.g., a class label or a text prompt), the goal of guided generative modeling is to be able to sample from \(p_\text{data}(x|y)\) for any such \(y\).
Guidance for Flow Models
If we imagine fixing our choice of \(y\) and taking our data distribution to be \(p_{\text{data}}(x|y)\), then we have recovered the unguided generative problem, and can accordingly construct a generative model using the conditional flow matching objective:
\[ \mathbb{E}_{t\sim\text{Unif},\, z\sim p_{\text{data}}(\cdot|y),\, x\sim p_t(\cdot|z)}\Big[\big\|u_t^\theta(x|y) - u_t^{\text{target}}(x|z)\big\|^2\Big]. \]
Note that \(y\) does not affect the conditional probability path \(p_t(x|z)\) or the conditional vector field \(u_t^\text{target}(x|z)\) (in principle, we could make it dependent). Expanding the expectation over all such choices of \(y\) gives the guided conditional flow matching objective:
\[ \gL_{\text{CFM}}^{\text{guided}}(\theta) = \mathbb{E}_{t\sim\text{Unif},\, (z,y)\sim p_{\text{data}},\, x\sim p_t(\cdot|z)}\Big[\big\|u_t^\theta(x|y) - u_t^{\text{target}}(x|z)\big\|^2\Big]. \]
One of the main differences is that here we sample \((z, y)\sim p_{\text{data}}(z, y)\) rather than just \(z\sim p_{\text{data}}(z)\).
Classifier-free guidance: It is empirically observed that image samples generated with this procedure do not fit the desired label \(y\) well enough, and that perceptual quality increases when the effect of the guidance variable \(y\) is artificially amplified. This idea is distilled into classifier-free guidance. Recall that the Gaussian conditional probability path is given by:
\[ p_t(x|z) = \gN(\alpha_t z, \beta_t^2 I_d). \]
Using the conversion formula, we can rewrite the guided vector field \(u_t^\text{target}(x|y)\) using the guided score function \(\nabla \log p_t(x|y)\):
\[ u_t^{\text{target}}(x|y) = a_t\,x + b_t\,\nabla\log p_t(x|y), \]
where \((a_t, b_t) = \big(\frac{\dot{\alpha}_t}{\alpha_t},\ \beta_t^2\frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t\beta_t\big)\). Using Bayes' rule \(p_t(x|y) = \frac{p_t(y|x)\,p_t(x)}{p_t(y)}\), and noting that \(p_t(y)\) does not depend on \(x\), we can rewrite the guided score function as:
\[ \nabla\log p_t(x|y) = \nabla\log p_t(x) + \nabla\log p_t(y|x). \]
Then, we can rewrite the guided vector field as:
\[ u_t^{\text{target}}(x|y) = a_t\,x + b_t\big(\nabla\log p_t(x) + \nabla\log p_t(y|x)\big) = u_t^{\text{target}}(x) + b_t\,\nabla\log p_t(y|x). \]
Since people observed that the image samples don't fit their prompt well enough, it is a natural idea to scale up the contribution of the \(\nabla \log p_t(y|x)\) term, yielding:
\[ \tilde{u}_t^{\text{target}}(x|y) = u_t^{\text{target}}(x) + w\,b_t\,\nabla\log p_t(y|x), \]
where \(w>1\) is known as the guidance scale. Note that for \(w\neq 1\), it holds that \(\tilde{u}_t^\text{target}(x|y) \neq u_t^\text{target}(x|y)\). Therefore, this guided vector field is not technically correct, but empirically it works well.
We can use the above equalities to obtain:
\[ \tilde{u}_t^{\text{target}}(x|y) = u_t^{\text{target}}(x) + w\big(u_t^{\text{target}}(x|y) - u_t^{\text{target}}(x)\big) = (1-w)\,u_t^{\text{target}}(x) + w\,u_t^{\text{target}}(x|y). \]
Thus, we can express the scaled guided vector field \(\tilde{u}_t^\text{target}(x|y)\) as a linear combination of the unguided vector field \(u_t^\text{target}(x)\) and the guided vector field \(u_t^\text{target}(x|y)\). We can train both fields during training and then combine them at inference time. The idea for training both fields is to augment our label set with an additional \(\emptyset\) label that denotes the absence of conditioning. With that, we can then treat \(u_t^\text{target}(x) = u_t^\text{target}(x|\emptyset)\). This approach of training both models in one is known as classifier-free guidance. An inference-time sketch follows.
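A minimal PyTorch sketch of the inference-time combination; `model(x, t, y)` accepting a null label to mean "no conditioning" is an assumed interface:

```python
import torch

@torch.no_grad()
def cfg_vector_field(model, x, t, y, w=3.0, null_label=None):
    # (1 - w) * u_t(x) + w * u_t(x|y); w = 1 recovers plain guidance.
    u_uncond = model(x, t, null_label)   # u_t^theta(x) = u_t^theta(x | empty)
    u_cond = model(x, t, y)
    return (1.0 - w) * u_uncond + w * u_cond
```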
Guidance for Diffusion Models
Similarly, we can generalize the conditional score matching loss to obtain the guided conditional score matching objective:
\[ \gL_{\text{CSM}}^{\text{guided}}(\theta) = \mathbb{E}_{t\sim\text{Unif},\,(z,y)\sim p_{\text{data}},\, x\sim p_t(\cdot|z)}\Big[\big\|s_t^\theta(x|y) - \nabla\log p_t(x|z)\big\|^2\Big]. \]
We combine the guided score network \(s_t^\theta(x|y)\) with the guided vector field \(u_t^\theta(x|y)\) to simulate the SDE:
\[ X_0 \sim p_{\mathrm{init}}, \qquad dX_t = \Big[u_t^\theta(X_t|y) + \frac{\sigma_t^2}{2}\,s_t^\theta(X_t|y)\Big]dt + \sigma_t\,dW_t. \]
By Bayes' rule, we can rewrite the guided score function as:
\[ \nabla\log p_t(x|y) = \nabla\log p_t(x) + \nabla\log p_t(y|x). \]
Then, we can rewrite the guided score and vector field with an amplified \(\nabla\log p_t(y|x)\) term, so for a guidance scale \(w>1\), we have:
\[ \tilde{s}_t(x|y) = (1-w)\,\nabla\log p_t(x) + w\,\nabla\log p_t(x|y), \qquad \tilde{u}_t^{\text{target}}(x|y) = (1-w)\,u_t^{\text{target}}(x) + w\,u_t^{\text{target}}(x|y). \]
References
[1] MIT 6.S184: Generative AI with Stochastic Differential Equations. A really great course on diffusion models :)