
3D Diffusion Models

There are several types of 3D diffusion:

  1. SDS-based models: Use SDS to sample from 2D/3D diffusion models in order to optimize a 3D representation (e.g., a NeRF).

  2. Two-stage 3D diffusion: Stage 1: a view-dependent diffusion model for novel view synthesis (NVS). Stage 2: fuse the novel views into a 3D representation, with or without SDS.

  3. Feed-forward 3D diffusion: Use a feed-forward diffusion model to generate the 3D representation itself.

Score Distillation Sampling (SDS)

The diffusion objective can be written as:

\[ \mathcal{L}_{\mathrm{Diff}}(\phi, \mathbf{x}) = \mathbb{E}_{t\sim \mathcal{U}(0, 1),\, \epsilon\sim \mathcal{N}(\mathbf{0}, \mathbf{I})}\left[w(t) \|\epsilon_\phi(\alpha_t\mathbf{x} + \sigma_t\epsilon; y, t) - \epsilon\|^2_2\right] \]

Here, \(\epsilon_\phi(\mathbf{z}_t; y, t)\) is the learned denoiser, which predicts the noise in the noisy latent \(\mathbf{z}_t\) given the text condition \(y\) and the timestep \(t\).
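To make the objective concrete, here is a minimal PyTorch-style sketch of one Monte Carlo estimate of this loss. The names (`eps_model`, `alphas`, `sigmas`, `w`) are stand-ins for illustration, not any particular library's API:

```python
import torch

def diffusion_loss(eps_model, x, y, alphas, sigmas, w):
    """One Monte Carlo sample of L_Diff(phi, x).

    eps_model: denoiser eps_phi(z_t; y, t) -- assumed callable, stand-in name.
    x:         clean data, shape (B, ...).
    y:         conditioning (e.g., a text embedding).
    alphas, sigmas, w: 1-D tensors giving the noise schedule and weighting,
                       indexed by a discrete timestep t.
    """
    B = x.shape[0]
    t = torch.randint(0, len(alphas), (B,), device=x.device)        # t ~ U over timesteps
    eps = torch.randn_like(x)                                        # eps ~ N(0, I)
    shape = (B,) + (1,) * (x.dim() - 1)
    z_t = alphas[t].view(shape) * x + sigmas[t].view(shape) * eps    # noisy latent
    eps_hat = eps_model(z_t, y, t)                                   # predicted noise
    per_sample = ((eps_hat - eps) ** 2).flatten(1).sum(dim=-1)       # squared L2 residual
    return (w[t] * per_sample).mean()
```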

An intuition: at each step, \(\epsilon_\phi(\mathbf{z}_t; y, t)\) predicts the noise that would map the noisy latent \(\mathbf{z}_t\) to the mean of \(\mathbf{z}_{t-1}\). That is to say, the learned denoiser has a sense of what the original, better image looks like, and the noise residual is the difference between that better image and the rendered one. This is the root of score distillation sampling (SDS).

Given that intuition, we can obtain the gradient with respect to the parameters \(\theta\) of our image renderer \(g(\theta)\):

\[ \nabla_\theta \mathcal{L}_{\text{Diff}}(\phi, \mathbf{x}=g(\theta))=\mathbb{E}_{t, \epsilon}\Big[w(t) \underbrace{\left(\hat{\epsilon}_\phi\left(\mathbf{z}_t ; y, t\right)-\epsilon\right)}_{\text{Noise Residual}} \underbrace{\frac{\partial \hat{\epsilon}_\phi\left(\mathbf{z}_t ; y, t\right)}{\partial \mathbf{z}_t}}_{\text{U-Net Jacobian}} \underbrace{\frac{\partial \mathbf{x}}{\partial \theta}}_{\text{Generator Jacobian}}\Big] \]

Here we absorb \(\partial\mathbf{z}_t/\partial\mathbf{x}=\alpha_t\mathbf{I}\) into \(w(t)\). However, in DreamFusion, the authors found that the U-Net Jacobian is expensive to compute and poorly conditioned at small noise levels. They found that simply omitting the U-Net Jacobian term and keeping only the generator Jacobian is sufficient for high-quality sampling:

\[ \nabla_\theta \mathcal{L}_{\text {Diff }}(\phi, \mathbf{x}=g(\theta))=\mathbb{E}_{t, \epsilon}[w(t) \left(\hat{\epsilon}_\phi\left(\mathbf{z}_t ; y, t\right)-\epsilon\right) \frac{\partial \mathbf{x}}{\partial \theta}] \]

And if we keep the intuition above in mind, omitting the U-Net Jacobian is quite a natural idea: the noise residual already tells us how the rendered image should change, so all that remains is to push it through the generator Jacobian :)
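In practice, this simplified gradient is usually implemented with a stop-gradient trick: the noise residual is computed with the pretrained denoiser frozen and gradient tracking disabled (so the U-Net Jacobian is never formed), and is then injected as the gradient of the rendered image so that autograd only backpropagates through the renderer \(g(\theta)\). A minimal sketch, reusing the stand-in names from the snippet above plus an assumed differentiable `render` function and an optimizer over \(\theta\):

```python
import torch

def sds_step(render, eps_model, optimizer, y, alphas, sigmas, w):
    """One SDS update of the renderer parameters theta (e.g., a NeRF).

    render:    differentiable renderer, returns x = g(theta) with grad enabled.
    eps_model: frozen pretrained denoiser eps_phi(z_t; y, t) -- stand-in name.
    optimizer: optimizer over the renderer's parameters theta.
    """
    x = render()                                    # x = g(theta)
    B = x.shape[0]
    t = torch.randint(0, len(alphas), (B,), device=x.device)
    eps = torch.randn_like(x)
    shape = (B,) + (1,) * (x.dim() - 1)
    z_t = alphas[t].view(shape) * x + sigmas[t].view(shape) * eps

    with torch.no_grad():                           # the U-Net Jacobian is never formed
        eps_hat = eps_model(z_t, y, t)

    # alpha_t from dz_t/dx is assumed absorbed into w(t), as in the text.
    grad = w[t].view(shape) * (eps_hat - eps)       # weighted noise residual

    optimizer.zero_grad()
    x.backward(gradient=grad)                       # chain the residual through dx/dtheta only
    optimizer.step()
```

Equivalently, one can form `loss = (grad * x).sum()` with `grad` detached and call `loss.backward()`; both routes realize exactly the gradient in the equation above.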

Two-Stage Feed-Forward

Reference