License: CC BY 4.0
arXiv:2401.00110v5 [cs.CV] 06 Mar 2024

Diffusion Model with Perceptual Loss

Shanchuan Lin  Xiao Yang
ByteDance Inc.
{peterlin,yangxiao.0}@bytedance.com
Abstract

Diffusion models trained with mean squared error loss tend to generate unrealistic samples. Current state-of-the-art models rely on classifier-free guidance to improve sample quality, yet its surprising effectiveness is not fully understood. In this paper, we show that the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance. As a result, we can directly incorporate perceptual loss in diffusion training to improve sample quality. Since the score matching objective used in diffusion training strongly resembles the denoising autoencoder objective used in unsupervised training of perceptual networks, the diffusion model itself is a perceptual network and can be used to generate meaningful perceptual loss. We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples. For conditional generation, our method only improves sample quality without entanglement with the conditional input and therefore does not sacrifice sample diversity. Our method can also improve sample quality for unconditional generation, which was not possible with classifier-free guidance before.

Model: hf.co/ByteDance/sd2.1-base-zsnr-laionaes6-perceptual
Code: See Algorithm 1.

1 Introduction

Diffusion models [10, 35, 37] are a rising class of generative model. Conceptually, they work by converting pure noise to a data sample through repeated denoising. Formally, each denoising step can be viewed from the lens of score matching [37] such that the model learns to predict the gradient (score) of an ordinary differential equation (ODE) or a stochastic differential equation (SDE) that transforms samples from one distribution (noise) to another distribution (image, video, etc.) [18]. In this paper, we focus on image generation but the findings are also applicable to other modalities.

Diffusion models are commonly parameterized as neural networks and the training objective minimizes the squared distance between the model prediction and the ground truth score through stochastic gradient descent [12]. This is also commonly referred to as the mean squared error (MSE) loss.

Although diffusion models are supposed to transport samples from noise to image distribution by theory, images generated by a diffusion model in its raw form are often of poor quality, despite the improvements in model architecture [4, 14, 22, 24, 29, 27, 28], formulation [17, 18, 13], and sampling strategy [36, 19, 20, 13].

What drives diffusion models into the mainstream is the advent of classifier guidance [5] and classifier-free guidance [11]. Classifier guidance shows that we can add classifier gradients on top of the predicted score during the inference process to guide the sample generation toward the classifier’s direction. Therefore, it can turn an unconditional diffusion model conditional. However, it is surprising that applying classifier guidance on an already conditional diffusion model can significantly improve sample quality. Classifier-free guidance [11] improves classifier guidance by removing the need for an external classifier network. It simultaneously trains the diffusion model as both conditional and unconditional by dropping the condition with a certain probability. At inference, it queries the model both conditionally and unconditionally at every step and uses their difference as the conditional gradient direction. The score prediction is then amplified toward this conditional direction.
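The inference-time combination described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code; the helper name `apply_cfg` and its argument names are assumptions made for this example, and it follows the common convention in which a scale of 1 recovers the conditional prediction.

```python
import numpy as np

def apply_cfg(pred_cond, pred_uncond, scale):
    """Classifier-free guidance (hypothetical helper).

    The difference between the conditional and unconditional predictions is
    treated as the conditional direction, and the prediction is pushed
    along that direction by the guidance scale.
    """
    direction = pred_cond - pred_uncond
    return pred_uncond + scale * direction

# Toy predictions standing in for two model queries at one sampling step.
pred_cond = np.array([1.0, 2.0])
pred_uncond = np.array([0.5, 1.0])
guided = apply_cfg(pred_cond, pred_uncond, scale=7.5)
```

A scale above 1 extrapolates past the conditional prediction, which is the amplification the paragraph refers to.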

Classifier-free guidance is applied almost ubiquitously in state-of-the-art diffusion models across modalities, e.g. text-to-image [24, 28, 29, 4, 27, 45, 23], text-to-video [7, 2, 46, 9, 34, 1], text-to-3d [33, 41, 25], image-to-video [1, 43], video-to-video [3, 6], etc. Yet, its surprising effectiveness in improving sample quality is not fully understood. Classifier-free guidance also has many limitations, such as having a sensitive hyperparameter that can cause image over-exposure and over-saturation, etc. [15, 29], but there are no other viable alternatives.

In this paper, we elucidate that the effectiveness of classifier-free guidance in improving sample quality partly originates from it being an implicit form of perceptual guidance. We show that perceptual loss can be directly applied to diffusion training to improve sample quality. Instead of using external perceptual networks, we find the diffusion model itself is already a well-trained perceptual network. We propose a novel self-perceptual objective that utilizes the diffusion model itself to generate meaningful perceptual loss. Our training objective results in an improvement in sample quality. Unlike classifier-free guidance, our method does not rely on conditional input and therefore does not trade off sample diversity.

2 Problem

2.1 Diffusion Models Generate Bad Samples

Let’s first elucidate why diffusion models in their raw form generate bad samples. As shown in Figure 1, given finite training samples, the underlying data distribution is ambiguous. The maximum likelihood estimate (MLE) is a distribution that assigns equal probability to each observed sample and zero probability everywhere else. An ideal score model would learn this ground-truth probability flow, always reproduce observed samples, and generate no new data. In practice, neural networks do not fit this flow exactly, which allows generalization and thus the generation of new data.

Figure 1: The underlying data distribution is ambiguous given finite training data.

Figure 2: Diffusion models learn a plausible pixel distribution from the training samples but it does not align with the distribution of real images. Panels: (a), (b).

However, the learned distribution may not match the real underlying distribution. As illustrated in Figure 2(a), consider a simple distribution where the images always contain a solid circle at an arbitrary location. Given limited training samples (top row), we want to generate a new sample from this distribution (bottom left), but the actual generation can be far outside the distribution (bottom right). This problem also exists for real images, as illustrated in Figure 2(b). The diffusion model tries to learn a plausible pixel distribution from finite training samples, but it does not align with the intended distribution of photorealistic images.

The ground-truth probability flow is determined by the dataset and forward diffusion function alone, but the learned distribution is determined by the model capacity and loss function. Existing works have focused on expanding the dataset [32, 4], improving the model architecture [24, 4, 14, 22, 29, 27], simplifying the ODE trajectory [18, 17, 13], and improving the sampler [36, 19, 20]. However, almost all works use the mean squared error (MSE) loss in training and no study has explored the use of perceptual loss.

2.2 Perceptual Loss is the Hidden Gem

Prior works have shown that the mean squared error (MSE) metric aligns poorly with human perception [40, 42, 44]. For example, when comparing two faces, humans are much more sensitive to a mismatch between the eyes than between the hair strands. As another example, shifting an image by a few pixels is almost undetectable by human perception but can cause a large MSE value. Thus, the MSE loss used in diffusion training penalizes imperceptible pixel mismatches more than perceptible structural differences. This is clearly not an ideal use of model capacity.
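The pixel-shift example can be made concrete with a toy computation. The sketch below is an illustration under our own assumptions (an 8x8 striped image, a one-pixel shift, and a uniform brightness offset of 0.1); it is not an experiment from the paper.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

# 8x8 image made of one-pixel-wide vertical stripes with values 0 and 1.
image = np.tile(np.array([0.0, 1.0]), (8, 4))

# Shift the image by one pixel horizontally: the texture looks the same,
# but every pixel now disagrees with the original.
shifted = np.roll(image, shift=1, axis=1)

# Add a uniform brightness offset of 0.1 to every pixel.
brightened = image + 0.1

mse_shift = mse(image, shifted)
mse_bright = mse(image, brightened)
```

Here the shifted copy receives a far larger MSE than the brightened copy, even though the shift preserves the structure of the image, which is the mismatch between MSE and perception that the paragraph describes.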

Prior works have found that the distance computed on the hidden features of deep neural networks can be used as a metric that resembles human perception better than the distance computed directly on image pixels [44]. This is because deep neural networks can learn high-level semantics rather than only focusing on pixel differences.

In fact, we will show that this neural perceptual property has already been implicitly applied in classifier guidance [5] and classifier-free guidance [11] in the following sections.

2.3 Perception in Classifier Guidance

Originally, classifier guidance [5] was proposed to guide an unconditional diffusion model to generate class conditional samples, but prior works have found that it can also help an already conditional diffusion model to improve sample quality and attributed it as a side effect of trading off sample diversity.

However, we argue that the surprising effectiveness of classifier guidance also partially originates from it being a form of perceptual guidance. Specifically, the classifier network, e.g. CLIP [26], is a trained perceptual neural network that can provide perceptual guidance. The classifier gradients guide toward perceptually more probable images as a precondition to better text alignment. This results in the generation of more photorealistic images.

2.4 Perception in Classifier-Free Guidance

Classifier-free guidance (CFG) [11] finds that an implicit classifier can be derived using Bayes’ rule, $p(c|x_t) \propto p(x_t|c)/p(x_t)$, and that the score model itself can be used to provide guidance, $\nabla \log p(c|x_t) \propto s(x_t|c) - s(x_t)$. Therefore, classifier-free guidance can also be viewed from the lens of perceptual guidance.
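For reference, the guidance relation above follows from taking the logarithm of Bayes' rule and differentiating with respect to $x_t$. The term $\log p(c)$ does not depend on $x_t$, so its gradient vanishes. Here $s(\cdot)$ is read as the score $\nabla_{x_t} \log p(\cdot)$, which is our interpretation of the notation:

```latex
\begin{aligned}
\log p(c \mid x_t) &= \log p(x_t \mid c) + \log p(c) - \log p(x_t), \\
\nabla_{x_t} \log p(c \mid x_t) &= \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t)
  = s(x_t \mid c) - s(x_t).
\end{aligned}
```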

2.5 Limitations of Classifier-Free Guidance

Classifier-free guidance has numerous limitations. This incentivizes us to explore directly incorporating perceptual loss in diffusion training. Specifically:

  • Classifier-free guidance only works for conditional models. Perceptual loss works for unconditional models as well.

  • Classifier-free guidance entangles text alignment with sample quality. Perceptual loss only improves sample quality and is condition-agnostic.

  • Classifier-free guidance is added post-training. A high guidance scale can lead to overexposure [15, 29]. Perceptual loss is applied at training and does not suffer from this issue.

3 Method

Our goal is to incorporate perceptual loss into diffusion training to improve sample quality. In Section 3.1 we introduce the diffusion background and our model formulation. In Section 3.2, we propose a novel self-perceptual objective and show that the diffusion model itself can be used as a perceptual network to provide meaningful perceptual loss.

3.1 Background

We follow the setup of Stable Diffusion, a latent diffusion model [28]. Given image latent samples $x_0 \sim \pi_0$, noise samples $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, and time $t \sim \mathcal{U}(1, T)$, where $t$ is an integer timestep and $T = 1000$, the forward diffusion process is defined as:

$$x_t = \mathrm{forward}(x_0, \epsilon, t) = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon \tag{1}$$

We use a diffusion schedule with the zero terminal SNR fix [15]. The specific $\bar{\alpha}_t$ values are defined in [15].
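The forward process in Equation 1 can be sketched in NumPy as below. The linear-beta schedule used to build `alphas_cumprod` is only a placeholder assumption so that the example runs; it is not the zero-terminal-SNR schedule from [15] that the paper uses. Timesteps are zero-indexed here (0 to T-1), and the helper name `make_alphas_cumprod` is hypothetical.

```python
import numpy as np

def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Placeholder schedule (assumption): cumulative product of 1 - beta."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward(x_0, eps, t, alphas_cumprod):
    """Equation 1: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar = alphas_cumprod[t]
    # Reshape so the per-sample coefficient broadcasts over non-batch dims.
    abar = abar.reshape(-1, *([1] * (x_0.ndim - 1)))
    return np.sqrt(abar) * x_0 + np.sqrt(1.0 - abar) * eps

rng = np.random.default_rng(0)
alphas_cumprod = make_alphas_cumprod()
x_0 = rng.standard_normal((4, 4, 8, 8))   # batch of toy latents
eps = rng.standard_normal(x_0.shape)      # Gaussian noise
t = rng.integers(0, 1000, size=4)         # one timestep per sample
x_t = forward(x_0, eps, t, alphas_cumprod)
```

Because the two coefficients satisfy $\bar{\alpha}_t + (1-\bar{\alpha}_t) = 1$, the process preserves variance when $x_0$ and $\epsilon$ both have unit variance.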

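Our neural network $f_\theta: \mathbb{R}^d \to \mathbb{R}^d$ is conditioned on text prompt $c$ and uses the v-prediction formulation [31, 15]:

$$v_t = \sqrt{\bar{\alpha}_t}\,\epsilon - \sqrt{1-\bar{\alpha}_t}\,x_0 \tag{2}$$
$$\hat{v}_t = f_\theta(x_t, t, c) \tag{3}$$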

With the vanilla diffusion training objective, we optimize the following MSE loss:

$$\mathcal{L}_{mse} = \lVert \hat{v}_t - v_t \rVert_2^2 \tag{4}$$

3.2 Self-Perceptual Objective

Prior works have shown that the score matching objective strongly resembles the denoising autoencoder objective [38]. The denoising autoencoder objective is often used for unsupervised pre-training of neural networks [39]. Thus the diffusion model trained with vanilla MSE loss is effectively a perfect unsupervised perceptual network trained on the target dataset, on the latent space, and on all noise levels $x_t$. In this section, we show that we can exploit the diffusion model itself as a perceptual network to provide meaningful perceptual loss.

First, we copy and freeze the diffusion model trained with vanilla MSE loss, and modify the architecture to return the hidden feature at layer $l$. We denote this frozen network as $f_*^l$.

During training, we sample $x_0 \sim \pi_0$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, $t \sim \mathcal{U}(1, T)$ and compute $x_t$ through forward diffusion:

$$x_t = \mathrm{forward}(x_0, \epsilon, t) \tag{5}$$

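We use the online network $f_\theta$ to predict $\hat{v}_t$ and convert the prediction to $\hat{x}_0$ and $\hat{\epsilon}$:

$$\hat{v}_t = f_\theta(x_t, t, c) \tag{6}$$
$$\hat{x}_0 = \sqrt{\bar{\alpha}_t}\,x_t - \sqrt{1-\bar{\alpha}_t}\,\hat{v}_t \tag{7}$$
$$\hat{\epsilon} = \sqrt{\bar{\alpha}_t}\,\hat{v}_t + \sqrt{1-\bar{\alpha}_t}\,x_t \tag{8}$$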
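The conversions in Equations 7 and 8 invert the v-prediction target of Equation 2. The NumPy sketch below checks this round trip numerically; the scalar `abar` stands for $\bar{\alpha}_t$ and its value is arbitrary, chosen only for illustration, and the function names mirror the helper names used in Algorithm 1.

```python
import numpy as np

def v_target(x_0, eps, abar):
    """Equation 2: v_t = sqrt(abar) * eps - sqrt(1 - abar) * x_0."""
    return np.sqrt(abar) * eps - np.sqrt(1.0 - abar) * x_0

def to_x_0(v, x_t, abar):
    """Equation 7: x_0 = sqrt(abar) * x_t - sqrt(1 - abar) * v."""
    return np.sqrt(abar) * x_t - np.sqrt(1.0 - abar) * v

def to_eps(v, x_t, abar):
    """Equation 8: eps = sqrt(abar) * v + sqrt(1 - abar) * x_t."""
    return np.sqrt(abar) * v + np.sqrt(1.0 - abar) * x_t

rng = np.random.default_rng(0)
x_0 = rng.standard_normal((2, 3))
eps = rng.standard_normal((2, 3))
abar = 0.3  # arbitrary value of abar_t in (0, 1)

x_t = np.sqrt(abar) * x_0 + np.sqrt(1.0 - abar) * eps  # Equation 1
v = v_target(x_0, eps, abar)

# With the exact v, the conversions recover the ground truth x_0 and eps.
x_0_rec = to_x_0(v, x_t, abar)
eps_rec = to_eps(v, x_t, abar)
```

With an imperfect prediction $\hat{v}_t$, the same conversions give the estimates $\hat{x}_0$ and $\hat{\epsilon}$ that the self-perceptual objective feeds back through forward diffusion.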

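Then, we sample a new timestep $t' \sim \mathcal{U}(1, T)$, and compute $x_{t'}$ and $\hat{x}_{t'}$ through forward diffusion:

$$x_{t'} = \mathrm{forward}(x_0, \epsilon, t') \tag{9}$$
$$\hat{x}_{t'} = \mathrm{forward}(\hat{x}_0, \hat{\epsilon}, t') \tag{10}$$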

Finally, we pass them through the frozen network $f_*^l$ and compute MSE on its hidden feature at layer $l$. We find that using only the hidden feature at the midblock layer yields the best result. We refer to our method as the Self-Perceptual (SP) objective:

$$\mathcal{L}_{sp} = \lVert f_*^l(\hat{x}_{t'}, t', c) - f_*^l(x_{t'}, t', c) \rVert_2^2 \tag{11}$$

The pseudo code is provided in Algorithm 1.

4 Evaluation

We first finetune Stable Diffusion v2.1 [28] using our formulation and the MSE loss $\mathcal{L}_{mse}$ on a subset of the LAION dataset [32] where the images have resolution greater than 512px and aesthetic score above 6. We use learning rate 3e-5, batch size 896, and EMA decay 0.9995 for 60k iterations. We also use 10% conditional dropout to support CFG for evaluation comparison. Then we copy and freeze the model as our perceptual network, and continue training the online network with our self-perceptual objective $\mathcal{L}_{sp}$ for 50k iterations. This second stage does not use conditional dropout.
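Two of the training details above, the EMA of the weights and the 10% conditional dropout, can be sketched in plain Python. This is an illustration under our own assumptions: the helper names are hypothetical, the EMA is shown for a single scalar weight, and dropping a condition is modeled as replacing the prompt with an empty string.

```python
import random

def ema_update(ema_value, new_value, decay=0.9995):
    """One exponential-moving-average step for a single scalar weight."""
    return decay * ema_value + (1.0 - decay) * new_value

def drop_condition(prompt, drop_prob=0.1, rng=random):
    """Return an empty prompt with probability drop_prob (assumed convention)."""
    return "" if rng.random() < drop_prob else prompt

# Empirically check the dropout rate with a seeded generator.
rng = random.Random(0)
prompts = [drop_condition("a photo of a cat", rng=rng) for _ in range(10000)]
drop_rate = prompts.count("") / len(prompts)
```

Training on a mix of prompted and empty-prompt samples is what lets a single network be queried both conditionally and unconditionally for CFG at evaluation time.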

For inference, we use deterministic DDIM sampler [36], and make sure the sampler starts from the last timestep T [15].
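A single deterministic DDIM update for a v-prediction model can be sketched as below. This is a minimal NumPy illustration, not the authors' sampler: the function name is hypothetical, and the scalars `abar_t` and `abar_prev` stand for $\bar{\alpha}$ at the current and the next (less noisy) timestep, with values chosen only for illustration.

```python
import numpy as np

def ddim_step(x_t, v_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) for a v-prediction model."""
    # Convert the v prediction to estimates of the clean sample and noise.
    x_0_pred = np.sqrt(abar_t) * x_t - np.sqrt(1.0 - abar_t) * v_pred
    eps_pred = np.sqrt(abar_t) * v_pred + np.sqrt(1.0 - abar_t) * x_t
    # Re-noise the predicted clean sample to the previous noise level.
    return np.sqrt(abar_prev) * x_0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

# With zero terminal SNR, abar at the last timestep is 0, so x_T is pure
# noise and the first step relies entirely on the model prediction.
rng = np.random.default_rng(0)
x_T = rng.standard_normal((1, 4))
v_pred = rng.standard_normal((1, 4))  # stand-in for a model output
x_prev = ddim_step(x_T, v_pred, abar_t=0.0, abar_prev=0.1)
```

At $\bar{\alpha}_T = 0$ the predicted clean sample reduces to $-\hat{v}_T$, which illustrates why the sampler needs to start from the last timestep when the schedule has zero terminal SNR.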

4.1 Qualitative

Figure 3 shows example generation results. Our self-perceptual objective yields a visible quality improvement over the vanilla MSE objective, but the overall sample quality is still worse than classifier-free guidance. This is because our objective only improves sample quality whereas classifier-free guidance has the additional effect of enhancing text alignment. This is especially obvious in Figure 3(j), where our objective only enhances the sample quality without extra emphasis on the text condition.

Notice that the results of vanilla MSE and self-perceptual objective share similar image content and layout when generated from the same initial noise, whereas classifier-free guidance will largely alter the result. Our self-perceptual objective maintains the same diversity whereas classifier-free guidance does not.

Additionally, Figure 3(i) shows the negative artifact of classifier-free guidance. The model has already overfitted the image to the very specific prompt, and the high classifier-free guidance scale causes unnatural artifacts. Our self-perceptual objective does not suffer from this issue.

Columns (repeated twice across the figure): MSE, Self-Perceptual, MSE + CFG.
(a) New York Skyline with ‘Deep Learning’ written with fireworks on the sky.
(b) An elephant is behind a tree. You can see the trunk on one side and the back legs on the other.
(c) A couple of glasses are sitting on a table.
(d) One car on the street.
(e) A laptop on top of a teddy bear.
(f) A giraffe underneath a microwave.
(g) A yellow book and a red vase.
(h) One cat and one dog sitting on the grass.
(i) A painting by Grant Wood of an astronaut couple, american gothic style.
(j) A blue colored dog.
(k) McDonalds Church.
(l) Photo of a cat singing in a barbershop quartet.
Figure 3: Text-to-image generation on DrawBench prompts [29]. Our self-perceptual objective improves sample quality over the vanilla MSE objective while largely maintaining the image content and layout. Classifier-free guidance has the additional effect of enhancing text alignment by sacrificing sample diversity. Images are generated with DDIM 50 NFEs. More analysis in Sec. 4.1.

Columns (repeated twice across the figure): MSE, Self-Perceptual, MSE + CFG.
(a) A black colored car.
(b) A black colored sandwich.
(c) A fish eating a pelican.
(d) A green apple and a black backpack.
(e) A green cup and a blue cell phone.
(f) A horse riding an astronaut.
(g) A pink colored giraffe.
(h) A red colored car.
(i) A separate seat for one person, typically with a back and four legs.
(j) A sign that says ‘Diffusion’.
(k) A sheep to the right of a wine glass.
(l) A photo of a confused grizzly bear in calculus class.
Figure 4: Text-to-image generation on DrawBench prompts [29]. Our self-perceptual objective improves sample quality over the vanilla MSE objective while largely maintaining the image content and layout. Classifier-free guidance has the additional effect of enhancing text alignment by sacrificing sample diversity. Images are generated with DDIM 50 NFEs. More analysis in Sec. 4.1.

4.2 Quantitative

Table 1 shows our quantitative evaluation. We follow the convention to calculate Fréchet Inception Distance (FID) [8, 21] and Inception Score (IS) [30]. We select the first 10k samples from the MSCOCO 2014 validation dataset [16] and use our models to generate images for the corresponding captions. Our self-perceptual objective improves FID and IS over the vanilla MSE objective, which aligns with our observed improvement in sample quality. However, classifier-free guidance still achieves better sample quality than our self-perceptual objective.

| Loss | CFG | Rescale | Steps | NFE | FID | IS |
|---|---|---|---|---|---|---|
| Ground truth | - | - | - | - | 0.00 | 35.28 |
| $\mathcal{L}_{mse}$ | - | - | 25 | 25 | 32.68 | 22.20 |
| $\mathcal{L}_{mse}$ | - | - | 50 | 50 | 29.63 | 22.86 |
| $\mathcal{L}_{mse}$ | 7.5 | - | 25 | 50 | 24.41 | 32.10 |
| $\mathcal{L}_{mse}$ | 7.5 | 0.7 | 25 | 50 | 18.67 | 34.17 |
| $\mathcal{L}_{sp}$ | - | - | 25 | 25 | 25.89 | 27.76 |
| $\mathcal{L}_{sp}$ | - | - | 50 | 50 | 24.42 | 28.07 |

Table 1: Quantitative evaluation on MSCOCO 10K validation dataset. Our self-perceptual (SP) objective improves FID and IS metrics over vanilla MSE objective, but is still worse than using classifier-free guidance [11] with rescale [15]. This is expected because classifier-free guidance additionally enhances text alignment and trades off sample diversity. Since classifier-free guidance with 25 steps incurs 50 NFEs (number of function evaluations), we show both 25 steps and 50 steps metrics.

5 Ablation Study

In this section, we evaluate the individual hyperparameters we choose for our self-perceptual objective. All metrics are calculated on the same MSCOCO 10k validation samples as in Section 4.2 and use 25 steps DDIM inference.

5.1 Layer l

| Layer | FID | IS |
|---|---|---|
| All Encoder Layers | 26.64 | 26.89 |
| All Decoder Layers | 42.42 | 19.98 |
| All Encoder Layers + Midblock Layer | 26.96 | 27.24 |
| Only Midblock Layer | 25.89 | 27.76 |

Table 2: Comparing computing perceptual loss on different layers. We find that only computing loss on the midblock hidden features yields the best results.

We compare the effect of computing loss on hidden features from different layers l. We try using all the hidden features from every encoder layer and decoder layer by summing up the losses. We also try using only features from the midblock layer. As shown in Table 2, we find that only using features from the midblock layer yields the best metrics.

5.2 Timestep $t'$

We compare the effect of choosing timestep $t'$ for the perceptual network. First, we show that $t' = t$ is invalid because $\hat{x}_t$ always equals $x_t$. This makes the two inputs to the perceptual network identical, which prevents any meaningful loss:

$$\hat{x}_t = \mathrm{forward}(\hat{x}_0, \hat{\epsilon}, t) = \mathrm{forward}(x_0, \epsilon, t) = x_t \tag{12}$$
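To see why this identity holds, substitute the conversions of Equations 7 and 8 into the forward process of Equation 1 at the same timestep $t$. The terms in $\hat{v}_t$ cancel, so the result equals $x_t$ for any prediction, not only a correct one:

```latex
\begin{aligned}
\hat{x}_t &= \sqrt{\bar{\alpha}_t}\,\hat{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\hat{\epsilon} \\
&= \sqrt{\bar{\alpha}_t}\left(\sqrt{\bar{\alpha}_t}\,x_t - \sqrt{1-\bar{\alpha}_t}\,\hat{v}_t\right)
 + \sqrt{1-\bar{\alpha}_t}\left(\sqrt{\bar{\alpha}_t}\,\hat{v}_t + \sqrt{1-\bar{\alpha}_t}\,x_t\right) \\
&= \bar{\alpha}_t\,x_t + (1-\bar{\alpha}_t)\,x_t = x_t .
\end{aligned}
```

Since the output carries no dependence on $\hat{v}_t$, no gradient would reach the online network at $t' = t$.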

We compare three different choices for $t'$. Table 3 shows that uniform sampling of $t'$ yields the best IS, while $t' \sim \mathcal{N}(t, 100)$ yields a slightly lower FID. We use uniform sampling.

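| Timestep ($t'$ clamped to $[1, T]$) | FID | IS |
|---|---|---|
| $t' = t \pm 40$ | 27.24 | 23.31 |
| $t' \sim \mathcal{N}(t, 100)$ | 24.54 | 25.42 |
| $t' \sim \mathcal{U}(1, T)$ | 25.89 | 27.76 |

Table 3: Comparing the choice of timestep $t'$. We find that simply sampling $t'$ uniformly can yield reasonably good results.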

5.3 Distance Function

We compare using different distance functions on the hidden features. Table 4 shows that MSE and MAE yield very similar results, so we stick to MSE.

| Distance | FID | IS |
|---|---|---|
| Mean Absolute Distance ($\ell_1$) | 25.28 | 27.41 |
| Mean Squared Distance ($\ell_2^2$) | 25.89 | 27.76 |

Table 4: Comparing the choice of distance function. We find that mean squared distance and mean absolute distance have similar results, so we stick to mean squared distance.

5.4 Formulation

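We experiment with an alternative formulation, which pairs the predicted $\hat{x}_0$ with the ground-truth $\epsilon$, and the predicted $\hat{\epsilon}$ with the ground-truth $x_0$. This formulation allows gradient feedback at $t' = t$:

$$x_{t'} = \mathrm{forward}(x_0, \epsilon, t') \tag{13}$$
$$\hat{x}_{t'}^{x} = \mathrm{forward}(\hat{x}_0, \epsilon, t') \tag{14}$$
$$\hat{x}_{t'}^{\epsilon} = \mathrm{forward}(x_0, \hat{\epsilon}, t') \tag{15}$$
$$\mathcal{L}_{sp2} = \lVert f_*^l(\hat{x}_{t'}^{x}, t', c) - f_*^l(x_{t'}, t', c) \rVert_2^2 + \lVert f_*^l(\hat{x}_{t'}^{\epsilon}, t', c) - f_*^l(x_{t'}, t', c) \rVert_2^2 \tag{16}$$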

Table 5 shows that the alternative formulation yields worse performance.

| Formulation | FID | IS |
|---|---|---|
| $\mathcal{L}_{sp}$ | 25.89 | 27.76 |
| $\mathcal{L}_{sp2}$ | 29.83 | 24.54 |

Table 5: Comparing different formulations. We find that the merged formulation $\mathcal{L}_{sp}$ yields the best results.

5.5 Repeat Perceptual Network

We experiment with using the network trained with the self-perceptual objective as the perceptual metric network $f_*^l$ and repeat the training process. Table 6 shows that repeating the self-perceptual training results in worse performance. This is why we decide to just freeze the MSE model instead of using an exponential moving average (EMA) for the perceptual network.

| Perceptual network | FID | IS |
|---|---|---|
| MSE model as perceptual network | 25.89 | 27.76 |
| SP model as perceptual network | 26.61 | 26.41 |

Table 6: Repeating the self-perceptual process yields worse performance.

5.6 Combine with Classifier-Free Guidance

We experiment with applying classifier-free guidance on the model trained with our self-perceptual objective. Table 7 shows that classifier-free guidance indeed can improve sample quality further on the self-perceptual model but it does not surpass classifier-free guidance applied on the MSE model.

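| Loss | CFG | Rescale | FID | IS |
|---|---|---|---|---|
| $\mathcal{L}_{mse}$ | 7.5 | 0.7 | 18.67 | 34.17 |
| $\mathcal{L}_{sp}$ | - | - | 25.89 | 27.76 |
| $\mathcal{L}_{sp}$ | 2.0 | 0.7 | 21.19 | 32.22 |
| $\mathcal{L}_{sp}$ | 3.0 | 0.7 | 20.65 | 33.49 |
| $\mathcal{L}_{sp}$ | 4.0 | 0.7 | 20.67 | 33.34 |
| $\mathcal{L}_{sp}$ | 7.5 | 0.7 | 23.49 | 31.64 |

Table 7: Combining our self-perceptual objective with classifier-free guidance does improve sample quality but does not surpass the MSE objective with classifier-free guidance.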
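For context, the Rescale column refers to the guidance rescale technique of [15]. The sketch below reflects our understanding of that technique and is not code from this paper: the guided output is rescaled so that its per-sample standard deviation matches that of the conditional prediction, and the result is blended with the unrescaled output by a factor `phi` (0.7 in the tables). The function and argument names are hypothetical.

```python
import numpy as np

def cfg_with_rescale(pred_cond, pred_uncond, scale, phi):
    """Classifier-free guidance followed by standard-deviation rescaling."""
    guided = pred_uncond + scale * (pred_cond - pred_uncond)
    # Per-sample standard deviation over all non-batch dimensions.
    axes = tuple(range(1, pred_cond.ndim))
    std_cond = pred_cond.std(axis=axes, keepdims=True)
    std_guided = guided.std(axis=axes, keepdims=True)
    rescaled = guided * (std_cond / std_guided)
    # Blend rescaled and plain guided outputs.
    return phi * rescaled + (1.0 - phi) * guided

rng = np.random.default_rng(0)
pred_cond = rng.standard_normal((2, 4, 8, 8))
pred_uncond = rng.standard_normal((2, 4, 8, 8))
out = cfg_with_rescale(pred_cond, pred_uncond, scale=7.5, phi=0.7)
```

Setting `phi` to 0 recovers plain classifier-free guidance, while `phi` equal to 1 fully matches the standard deviation of the conditional prediction.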

5.7 Unconditional Generation

Figure 5: Unconditional generation (panels: MSE, Self-Perceptual). Both use DDIM 1000 steps with the same seed. Our self-perceptual objective can improve unconditional generation quality. This was previously not possible with classifier-free guidance because it only works for conditional models. More analysis in Sec. 5.7.

Figure 6: Model prediction at each step during inference converted to $\hat{x}_0$ space. Column labels: 1000, 900, 800, 700, 600, 500, 400, 300, 200, 100, Final. Panels: (a) MSE, (b) MSE + CFG, (c) Self-Perceptual. This shows that the model trained with perceptual loss behaves very differently than MSE loss. More analysis in Sec. 5.8.

We train an unconditional image generation model following the same procedure except we always use an empty prompt during training and inference.

Figure 5 shows that the MSE objective generates unrealistic images even when using 1000 sampling steps, whereas the self-perceptual objective generates much better results. This validates that our self-perceptual objective can improve the quality of unconditional diffusion models, which was not possible with classifier-free guidance before.

Table 8 shows the quantitative metric, which also shows that the self-perceptual objective is effective in improving unconditional generation quality.

| Loss | IS |
|---|---|
| $\mathcal{L}_{mse}$ | 11.18 |
| $\mathcal{L}_{sp}$ | 12.04 |

Table 8: Unconditional generation metric. The self-perceptual objective improves the inception score.

5.8 Inference Behavior

In Figure 6, we visualize the model prediction by converting it to $\hat{x}_0$ at every inference step. We find that the model trained with the self-perceptual objective generates shapes and contours earlier in the inference process. We also notice that it has grid-like pattern artifacts, likely resulting from the convolutional downsampling in the perceptual network. This artifact does not affect overall generation much. We leave it for future investigation and improvement.

6 Conclusion

In summary, we have shown that the effectiveness of classifier guidance and classifier-free guidance can be viewed from the lens of perceptual guidance. We have found that perceptual loss can be directly applied to diffusion training to improve sample quality. Specifically, we have proposed a novel self-perceptual objective, which uses the diffusion model itself as a perceptual network. Our objective can be generalized to all modalities, e.g. image, video, and audio, and supports unconditional generation, which was not possible with classifier-free guidance before. For conditional generation, our objective improves sample quality without entanglement with conditional input.

However, for text-to-image generation, classifier-free guidance still generates overall better images than our self-perceptual objective. This is because classifier-free guidance has the additional effect of increasing text alignment by trading off diversity.

We hope our work paves the way for more future explorations on diffusion training loss.

from copy import deepcopy

import torch
import torch.nn.functional as F
from torch.optim import Adam

# NOTE: create_dataloader, create_model, forward, to_x_0, and to_eps are
# placeholders that this pseudo code assumes are defined elsewhere.

# Create dataloader.
dataloader = create_dataloader()

# Create model by loading from MSE-pretrained weights.
model = create_model(mse_pretrained=True)
optimizer = Adam(model.parameters(), lr=3e-5)

# Create perceptual model and freeze it.
perceptual_model = deepcopy(model)
perceptual_model.requires_grad_(False)
perceptual_model.eval()

# Dataloader yields image (latent) x_0 and conditional prompt c.
for x_0, c in dataloader:
    batch_size = x_0.shape[0]

    # Sample timesteps (zero-indexed, 0 to 999) and epsilon noises.
    # Then perform forward diffusion.
    t = torch.randint(0, 1000, size=[batch_size], device=x_0.device)
    eps = torch.randn_like(x_0)
    x_t = forward(x_0, eps, t)  # Equation 1.

    # Pass through model to get v prediction.
    # Then convert v_pred to x_0_pred and eps_pred.
    v_pred = model(x_t, t, c)
    x_0_pred = to_x_0(v_pred, x_t, t)  # Equation 7.
    eps_pred = to_eps(v_pred, x_t, t)  # Equation 8.

    # Sample new timesteps.
    # Then perform forward diffusion twice.
    # One uses ground truth x_0 and eps.
    # Another uses predicted x_0_pred and eps_pred.
    tt = torch.randint(0, 1000, size=[batch_size], device=x_0.device)
    x_tt = forward(x_0, eps, tt)  # Equation 9.
    x_tt_pred = forward(x_0_pred, eps_pred, tt)  # Equation 10.

    # Pass through perceptual model.
    # Get hidden feature from midblock.
    feature_real = perceptual_model(x_tt, tt, c, return_feature="midblock")
    feature_pred = perceptual_model(x_tt_pred, tt, c, return_feature="midblock")

    # Compute loss on hidden features (Equation 11).
    loss = F.mse_loss(feature_pred, feature_real)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
Algorithm 1: Pseudo PyTorch code for self-perceptual training.

References

  • [1] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023.
  • [2] A. Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22563–22575, 2023.
  • [3] Di Chang, Yichun Shi, Quankai Gao, Jessica Fu, Hongyi Xu, Guoxian Song, Qing Yan, Xiao Yang, and Mohammad Soleymani. Magicdance: Realistic human dance video generation with motions & facial expressions transfer, 2023.
  • [4] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations, 2024.
  • [5] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 8780–8794, 2021.
  • [6] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7312–7322, 2023.
  • [7] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations, 2024.
  • [8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6626–6637, 2017.
  • [9] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models, 2022.
  • [10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • [11] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • [12] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709, 2005.
  • [13] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [14] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models, 2023.
  • [15] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5404–5411, January 2024.
  • [16] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
  • [17] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.
  • [18] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022.
  • [19] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [20] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models, 2023.
  • [21] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11400–11410, 2022.
  • [22] William S. Peebles and Saining Xie. Scalable diffusion models with transformers. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2022.
  • [23] Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. In The Twelfth International Conference on Learning Representations, 2024.
  • [24] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024.
  • [25] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. In The Eleventh International Conference on Learning Representations, 2023.
  • [26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
  • [27] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents, 2022.
  • [28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2022.
  • [29] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [30] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226–2234, 2016.
  • [31] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022.
  • [32] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • [33] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. In The Twelfth International Conference on Learning Representations, 2024.
  • [34] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, 2023.
  • [35] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2256–2265. JMLR.org, 2015.
  • [36] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
  • [37] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
  • [38] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
  • [39] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
  • [40] Zhou Wang, Alan Conrad Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13:600–612, 2004.
  • [41] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [42] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402, 2003.
  • [43] Hanshu Yan, Jun Hao Liew, Long Mai, Shanchuan Lin, and Jiashi Feng. MagicProp: Diffusion-based video editing via motion-aware appearance propagation, 2023.
  • [44] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 586–595. IEEE Computer Society, 2018.
  • [45] Chuanxia Zheng, Long Tung Vuong, Jianfei Cai, and Dinh Phung. MoVQ: Modulating quantized vectors for high-fidelity image generation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [46] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. MagicVideo: Efficient video generation with latent diffusion models, 2023.