Diffusion Model with Perceptual Loss
Abstract
Diffusion models trained with mean squared error loss tend to generate unrealistic samples. Current state-of-the-art models rely on classifier-free guidance to improve sample quality, yet its surprising effectiveness is not fully understood. In this paper, we show that the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance. As a result, we can directly incorporate perceptual loss in diffusion training to improve sample quality. Since the score matching objective used in diffusion training strongly resembles the denoising autoencoder objective used in unsupervised training of perceptual networks, the diffusion model itself is a perceptual network and can be used to generate meaningful perceptual loss. We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples. For conditional generation, our method only improves sample quality without entanglement with the conditional input and therefore does not sacrifice sample diversity. Our method can also improve sample quality for unconditional generation, which was not possible with classifier-free guidance before.
1 Introduction
Diffusion models [10, 35, 37] are a rising class of generative models. Conceptually, they work by converting pure noise to a data sample through repeated denoising. Formally, each denoising step can be viewed through the lens of score matching [37] such that the model learns to predict the gradient (score) of an ordinary differential equation (ODE) or a stochastic differential equation (SDE) that transforms samples from one distribution (noise) to another distribution (image, video, etc.) [18]. In this paper, we focus on image generation but the findings are also applicable to other modalities.
Diffusion models are commonly parameterized as neural networks and the training objective minimizes the squared distance between the model prediction and the ground truth score through stochastic gradient descent [12]. This is also commonly referred to as the mean squared error (MSE) loss.
Although diffusion models are supposed to transport samples from noise to image distribution by theory, images generated by a diffusion model in its raw form are often of poor quality, despite the improvements in model architecture [4, 14, 22, 24, 29, 27, 28], formulation [17, 18, 13], and sampling strategy [36, 19, 20, 13].
What drives diffusion models into the mainstream is the advent of classifier guidance [5] and classifier-free guidance [11]. Classifier guidance shows that we can add classifier gradients on top of the predicted score during the inference process to guide the sample generation toward the classifier’s direction. Therefore, it can turn an unconditional diffusion model conditional. However, it is surprising that applying classifier guidance on an already conditional diffusion model can significantly improve sample quality. Classifier-free guidance [11] improves classifier guidance by removing the need for an external classifier network. It simultaneously trains the diffusion model as both conditional and unconditional by dropping the condition with a certain probability. At inference, it queries the model both conditionally and unconditionally at every step and uses their difference as the conditional gradient direction. The score prediction is then amplified toward this conditional direction.
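The inference-time combination described above can be sketched in a few lines. This is a minimal illustration rather than code from the paper: `cfg_combine` and the guidance scale `w` are hypothetical names, predictions are plain lists of floats, and the sketch assumes the common convention in which `w = 1` recovers the conditional prediction.

```python
def cfg_combine(cond, uncond, w):
    """Amplify the prediction along the conditional direction (cond - uncond).

    Assumed convention: w = 1 returns the conditional prediction unchanged,
    and w > 1 pushes the result further toward the condition.
    """
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


# Toy predictions for a 3-element sample (illustrative values only).
cond = [1.0, 2.0, 3.0]
uncond = [0.5, 2.0, 2.0]
guided = cfg_combine(cond, uncond, 7.5)
```

Elements where the two predictions agree are left untouched, while elements where they differ are extrapolated, which is one way to see why a large scale can push values out of range.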
Classifier-free guidance is applied almost ubiquitously in state-of-the-art diffusion models across modalities, e.g. text-to-image [24, 28, 29, 4, 27, 45, 23], text-to-video [7, 2, 46, 9, 34, 1], text-to-3D [33, 41, 25], image-to-video [1, 43], video-to-video [3, 6], etc. Yet, its surprising effectiveness in improving sample quality is not fully understood. Classifier-free guidance also has many limitations, such as having a sensitive hyperparameter that can cause image overexposure and oversaturation, etc. [15, 29], but there are no other viable alternatives.
In this paper, we elucidate that the effectiveness of classifier-free guidance in improving sample quality partly originates from it being an implicit form of perceptual guidance. We show that perceptual loss can be directly applied to diffusion training to improve sample quality. Instead of using external perceptual networks, we find the diffusion model itself is already a well-trained perceptual network. We propose a novel self-perceptual objective that utilizes the diffusion model itself to generate meaningful perceptual loss. Our training objective results in an improvement in sample quality. Unlike classifier-free guidance, our method does not rely on conditional input and therefore does not trade off sample diversity.
2 Problem
2.1 Diffusion Models Generate Bad Samples
Let’s first elucidate why diffusion models in their raw form generate bad samples. As shown in Figure 1, given finite training samples, the underlying data distribution is ambiguous. The maximum likelihood estimation (MLE) is a distribution that assigns even probability only to the observed samples and zero everywhere else. An ideal score model will learn this ground-truth probability flow to always produce observed samples and generate no new data. In practice, neural networks do not overfit to this flow exactly, allowing generalization and thus the generation of new data.
However, the learned distributions may not match the real underlying distribution. As illustrated in Figure 1(a), consider a simple distribution where the images always contain a solid circle at arbitrary locations. Given limited training samples (top row), we want to generate a new sample from this distribution (bottom left), but the actual generation can be far out of the distribution (bottom right). This problem exists for actual images as illustrated in Figure 1(b). The diffusion model tries to learn a plausible pixel distribution from finite training samples, but it does not align with the intended distribution of photorealistic images.
The ground-truth probability flow is determined by the dataset and forward diffusion function alone, but the learned distribution is determined by the model capacity and loss function. Existing works have focused on expanding the dataset [32, 4], improving the model architecture [24, 4, 14, 22, 29, 27], simplifying the ODE trajectory [18, 17, 13], and improving the sampler [36, 19, 20]. However, almost all works use the mean squared error (MSE) loss in training and no study has explored the use of perceptual loss.
2.2 Perceptual Loss is the Hidden Gem
Prior works have shown that the mean squared error (MSE) metric aligns poorly with human perception [40, 42, 44]. For example, when comparing two faces, humans are much more sensitive if there is a mismatch between the eyes than the hair strands. As another example, shifting an image by a few pixels is almost undetectable by human perception but can cause a large MSE value. Thus, when training diffusion models with MSE loss, it penalizes the imperceivable pixel mismatch more than perceivable structural features. This is clearly not an ideal use of model capacity.
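The pixel-shift example can be made concrete with a toy image. The sketch below is illustrative only: the 8x8 striped image, the wrap-around shift, and the helper names `mse` and `shift_right` are assumptions chosen to keep the arithmetic easy to follow.

```python
def mse(a, b):
    """Mean squared error between two equally sized 2D images (lists of rows)."""
    n = sum(len(row) for row in a)
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


def shift_right(img, k):
    """Shift every row right by k pixels, wrapping around at the border."""
    return [row[-k:] + row[:-k] for row in img]


# 8x8 image of alternating black (0.0) and white (1.0) one-pixel vertical stripes.
stripes = [[float(x % 2) for x in range(8)] for _ in range(8)]
shifted = shift_right(stripes, 1)                      # same pattern, moved by one pixel
dimmed = [[0.9 * p for p in row] for row in stripes]   # global 10% dimming

mse_shift = mse(stripes, shifted)  # every pixel flips between 0 and 1
mse_dim = mse(stripes, dimmed)     # only the white pixels change, and only slightly
```

In this toy case the one-pixel shift produces the largest MSE the value range allows, even though the striped pattern is structurally unchanged, while the global dimming produces an MSE that is orders of magnitude smaller.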
Prior works have found that the distance computed on the hidden features of deep neural networks can be used as a metric that resembles human perception better than the distance computed directly on image pixels [44]. This is because deep neural networks can learn high-level semantics rather than only focusing on pixel differences.
In fact, we will show that this neural perceptual property has already been implicitly applied in classifier guidance [5] and classifier-free guidance [11] in the following sections.
2.3 Perception in Classifier Guidance
Originally, classifier guidance [5] was proposed to guide an unconditional diffusion model to generate class conditional samples, but prior works have found that it can also help an already conditional diffusion model to improve sample quality and attributed it as a side effect of trading off sample diversity.
However, we argue that the surprising effectiveness of classifier guidance also partially originates from it being a form of perceptual guidance. Specifically, the classifier network, i.e. CLIP [26], is a trained perceptual neural network that can provide perceptual guidance. The classifier gradients guide toward perceptually more probable images as a precondition to better text alignment. This results in the generation of more photorealistic images.
2.4 Perception in Classifier-Free Guidance
Classifier-free guidance (CFG) [11] finds that an implicit classifier can be derived from using Bayes’ rule: $p(c \mid x_{t})\propto p(x_{t} \mid c)/p(x_{t})$ and the score model itself can be used to provide guidance $\nabla \log p(c \mid x_{t})\propto s(x_{t} \mid c)-s(x_{t})$. Therefore, classifier-free guidance can also be viewed from the lens of perceptual guidance.
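For completeness, the step from Bayes’ rule to the score difference can be written out. This is the standard derivation, assuming $s(\cdot)$ denotes the gradient of the log-density with respect to $x_{t}$; the term $\log p(c)$ does not depend on $x_{t}$ and therefore vanishes under the gradient:
$$\log p(c \mid x_{t}) = \log p(x_{t} \mid c) + \log p(c) - \log p(x_{t})$$
$$\nabla_{x_{t}} \log p(c \mid x_{t}) = \nabla_{x_{t}} \log p(x_{t} \mid c) - \nabla_{x_{t}} \log p(x_{t}) = s(x_{t} \mid c) - s(x_{t})$$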
2.5 Limitations of Classifier-Free Guidance
Classifier-free guidance has numerous limitations. This incentivizes us to explore directly incorporating perceptual loss in diffusion training. Specifically:

• Classifier-free guidance only works for conditional models. Perceptual loss works for unconditional models as well.
• Classifier-free guidance tangles text alignment with sample quality. Perceptual loss only improves sample quality and is condition agnostic.
• Classifier-free guidance is added post-training. High scale can lead to overexposure [15, 29]. Perceptual loss is applied at training and does not suffer this issue.
3 Method
Our goal is to incorporate perceptual loss into diffusion training to improve sample quality. In Section 3.1 we introduce the diffusion background and our model formulation. In Section 3.2, we propose a novel self-perceptual objective and show that the diffusion model itself can be used as a perceptual network to provide meaningful perceptual loss.
3.1 Background
We follow the setup of Stable Diffusion, a latent diffusion model [28]. Given image latent samples of $x_{0}\sim \pi_{0}$, noise samples of $\epsilon\sim \mathcal{N}(0,\mathbf{I})$, and time $t\sim \mathcal{U}(1,T)$, where $t\in \mathbb{Z},T=1000$, the forward diffusion process is defined as:
$$x_{t}=\mathrm{forward}(x_{0},\epsilon,t)=\sqrt{\overline{\alpha}_{t}}\,x_{0}+\sqrt{1-\overline{\alpha}_{t}}\,\epsilon$$  (1)
We use the diffusion schedule with the zero terminal SNR fix [15]. The specific $\overline{\alpha}_{t}$ values are defined in [15].
Our neural network $f_{\theta}:\mathbb{R}^{d}\to \mathbb{R}^{d}$ is conditioned on text prompt $c$ and uses the $v$-prediction formulation [31, 15]:
$$v_{t}=\sqrt{\overline{\alpha}_{t}}\,\epsilon-\sqrt{1-\overline{\alpha}_{t}}\,x_{0}$$  (2)
$$\widehat{v}_{t}=f_{\theta}(x_{t},t,c)$$  (3)
With the vanilla diffusion training objective, we optimize the following MSE loss:
$$\mathcal{L}_{mse}=\Vert \widehat{v}_{t}-v_{t}\Vert_{2}^{2}$$  (4)
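Equations (1), (2) and (4) can be sketched on plain lists of floats as follows. This is a minimal illustration under stated assumptions: the value of `alpha_bar` is hypothetical (the actual schedule is defined in [15]), the "model" is a placeholder that predicts zeros, and the loss is averaged over elements, which is a common implementation choice rather than something the equations prescribe.

```python
import math
import random


def forward(x0, eps, alpha_bar):
    """Equation (1): mix a clean sample and noise for a given alpha_bar in [0, 1]."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def v_target(x0, eps, alpha_bar):
    """Equation (2): the v-prediction target."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * e - b * x for x, e in zip(x0, eps)]


def mse_loss(pred, target):
    """Equation (4), averaged over elements."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # stand-in for an image latent
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]    # Gaussian noise
alpha_bar = 0.3                                  # hypothetical value, not from the schedule in [15]

x_t = forward(x0, eps, alpha_bar)
v_t = v_target(x0, eps, alpha_bar)
loss = mse_loss([0.0] * 4, v_t)  # loss of a placeholder model that predicts all zeros
```

Note that `alpha_bar = 0` makes `forward` return pure noise, which is the situation the zero terminal SNR fix enforces at the last timestep.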
3.2 Self-Perceptual Objective
Prior works have shown that the score matching objective strongly resembles the denoising autoencoder objective [38]. The denoising autoencoder objective is often used for unsupervised pretraining of neural networks [39]. Thus the diffusion model trained with vanilla MSE loss is effectively a perfect unsupervised perceptual network trained on the target dataset, on the latent space, and on all noise levels $x_{t}$. In this section, we show that we can exploit the diffusion model itself as a perceptual network to provide meaningful perceptual loss.
First, we copy and freeze the diffusion model trained with vanilla MSE loss, and modify the architecture to return the hidden feature at layer $l$. We denote this frozen network as $f_{*}^{l}$.
During training, we sample $x_{0}\sim \pi_{0}$, $\epsilon\sim \mathcal{N}(0,\mathbf{I})$, $t\sim \mathcal{U}(1,T)$ and compute $x_{t}$ through forward diffusion:
$$x_{t}=\mathrm{forward}(x_{0},\epsilon,t)$$  (5)
We use the online network $f_{\theta}$ to predict $\widehat{v}$ and convert the prediction to $\widehat{x}_{0}$ and $\widehat{\epsilon}$:
$$\widehat{v}_{t}=f_{\theta}(x_{t},t,c)$$  (6)
$$\widehat{x}_{0}=\sqrt{\overline{\alpha}_{t}}\,x_{t}-\sqrt{1-\overline{\alpha}_{t}}\,\widehat{v}$$  (7)
$$\widehat{\epsilon}=\sqrt{\overline{\alpha}_{t}}\,\widehat{v}+\sqrt{1-\overline{\alpha}_{t}}\,x_{t}$$  (8)
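As a consistency check, Equations (7) and (8) follow from Equations (1) and (2). Writing $a=\sqrt{\overline{\alpha}_{t}}$ and $b=\sqrt{1-\overline{\alpha}_{t}}$ (shorthand introduced here, with $a^{2}+b^{2}=1$) and substituting the ground-truth $v_{t}$:
$$a\,x_{t}-b\,v_{t}=a(a\,x_{0}+b\,\epsilon)-b(a\,\epsilon-b\,x_{0})=(a^{2}+b^{2})\,x_{0}=x_{0}$$
$$a\,v_{t}+b\,x_{t}=a(a\,\epsilon-b\,x_{0})+b(a\,x_{0}+b\,\epsilon)=(a^{2}+b^{2})\,\epsilon=\epsilon$$
Replacing $v_{t}$ with the prediction $\widehat{v}$ gives the corresponding estimates $\widehat{x}_{0}$ and $\widehat{\epsilon}$.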
Then, we sample a new $t^{\prime}\sim \mathcal{U}(1,T)$, and compute $x_{t^{\prime}}$ and $\widehat{x}_{t^{\prime}}$ through forward diffusion:
$$x_{t^{\prime}}=\mathrm{forward}(x_{0},\epsilon,t^{\prime})$$  (9)
$$\widehat{x}_{t^{\prime}}=\mathrm{forward}(\widehat{x}_{0},\widehat{\epsilon},t^{\prime})$$  (10)
Finally, we pass them through the frozen network $f_{*}^{l}$ and compute MSE on its hidden feature at layer $l$. We find only using the hidden feature at the midblock layer yields the best result. We refer to our method as the Self-Perceptual (SP) objective:
$$\mathcal{L}_{sp}=\Vert f_{*}^{l}(\widehat{x}_{t^{\prime}},t^{\prime},c)-f_{*}^{l}(x_{t^{\prime}},t^{\prime},c)\Vert_{2}^{2}$$  (11)
The pseudo code is provided in Algorithm 1.
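The steps in Equations (5) to (11) can also be sketched as runnable Python. This is not the paper's Algorithm 1 but a toy illustration under several assumptions: samples are short lists of floats, the text condition $c$ is omitted, `alpha_bars` is a made-up linear schedule, `frozen_feat` is a stand-in for the frozen midblock feature extractor, and `zero_model` and `make_oracle` are hypothetical online networks. Only the loss value is computed; in actual training the gradient would flow back through the frozen network into the online network.

```python
import math
import random


def forward(x0, eps, alpha_bar):
    """Forward diffusion, Equation (1)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def v_to_x0_eps(x_t, v_hat, alpha_bar):
    """Equations (7) and (8): recover x0 and eps estimates from a v-prediction."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x0_hat = [a * x - b * v for x, v in zip(x_t, v_hat)]
    eps_hat = [a * v + b * x for x, v in zip(x_t, v_hat)]
    return x0_hat, eps_hat


def self_perceptual_loss(online, frozen_feat, alpha_bars, x0, eps, t, t_prime):
    """One evaluation of Equation (11) for given timesteps t and t_prime."""
    x_t = forward(x0, eps, alpha_bars[t])                     # Eq. (5)
    v_hat = online(x_t, t)                                    # Eq. (6)
    x0_hat, eps_hat = v_to_x0_eps(x_t, v_hat, alpha_bars[t])  # Eqs. (7), (8)
    x_tp = forward(x0, eps, alpha_bars[t_prime])              # Eq. (9)
    x_tp_hat = forward(x0_hat, eps_hat, alpha_bars[t_prime])  # Eq. (10)
    f_hat = frozen_feat(x_tp_hat, t_prime)
    f_ref = frozen_feat(x_tp, t_prime)
    return sum((p - q) ** 2 for p, q in zip(f_hat, f_ref)) / len(f_ref)


# --- Toy stand-ins (assumptions, not the paper's networks or schedule) ---
T = 1000
alpha_bars = {t: 1.0 - t / T for t in range(1, T + 1)}  # linear toy schedule, alpha_bar_T = 0


def frozen_feat(x, t):
    """Placeholder for the frozen network's hidden feature at layer l."""
    return [math.tanh(v) for v in x]


def zero_model(x_t, t):
    """Placeholder online network that always predicts zeros."""
    return [0.0] * len(x_t)


def make_oracle(x0, eps, alpha_bars):
    """Return a 'perfect' model that outputs the true v-target (Eq. 2) for any t."""
    def oracle(x_t, t):
        a, b = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
        return [a * e - b * x for x, e in zip(x0, eps)]
    return oracle


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(4)]
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]
t, t_prime = rng.randint(1, T), rng.randint(1, T)

loss_zero = self_perceptual_loss(zero_model, frozen_feat, alpha_bars, x0, eps, t, t_prime)
loss_oracle = self_perceptual_loss(
    make_oracle(x0, eps, alpha_bars), frozen_feat, alpha_bars, x0, eps, t, t_prime
)
```

With the oracle model the loss is zero up to floating-point rounding, since a perfect $v$-prediction reproduces $x_{0}$ and $\epsilon$ exactly and the two inputs to the frozen network coincide.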
4 Evaluation
We first fine-tune Stable Diffusion v2.1 [28] using our formulation and MSE loss $\mathcal{L}_{mse}$ on a subset of the LAION dataset [32] where the images have resolution greater than 512px and aesthetic score above 6. We use learning rate 3e-5, batch size 896, and EMA decay 0.9995 for 60k iterations. We also use 10% conditional dropout to support CFG for evaluation comparison. Then we copy and freeze the model as our perceptual network, and continue training the online network with our self-perceptual objective $\mathcal{L}_{sp}$ for 50k iterations. This stage does not use conditional dropout.
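The 10% conditional dropout can be illustrated with a short sketch. The helper name `maybe_drop_condition` is hypothetical, and using the empty string as the "dropped" condition is an assumption made here for illustration.

```python
import random


def maybe_drop_condition(prompt, rng, p_drop=0.1, null_prompt=""):
    """Replace the prompt with a null prompt with probability p_drop."""
    return null_prompt if rng.random() < p_drop else prompt


rng = random.Random(0)
prompts = [maybe_drop_condition("a photo of a cat", rng) for _ in range(10000)]
drop_rate = prompts.count("") / len(prompts)  # empirical fraction of dropped prompts
```

Over many training samples the empirical drop rate approaches `p_drop`, so the same network sees both conditional and unconditional inputs.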
For inference, we use the deterministic DDIM sampler [36], and make sure the sampler starts from the last timestep $T$ [15].
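A single deterministic DDIM update under $v$-prediction can be sketched as below. This is an illustration under assumptions, not the evaluation code: `ddim_step` is a hypothetical helper, the `alpha_bar` values are made up, and the "model" is an oracle that returns the true $v$. The step converts the prediction to $\widehat{x}_{0}$ and $\widehat{\epsilon}$ and re-noises them to the previous timestep without adding fresh noise.

```python
import math


def ddim_step(x_t, v_hat, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM step (no added noise) for a v-prediction model."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    x0_hat = [a * x - b * v for x, v in zip(x_t, v_hat)]   # predicted clean sample
    eps_hat = [a * v + b * x for x, v in zip(x_t, v_hat)]  # predicted noise
    a_p, b_p = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    return [a_p * x + b_p * e for x, e in zip(x0_hat, eps_hat)]


# Start from the last timestep, where zero terminal SNR means alpha_bar = 0
# and the input is pure noise.
x0 = [0.5, -0.25]
eps = [1.0, -2.0]
x_T = eps                        # forward(x0, eps) with alpha_bar = 0
v_true = [-x for x in x0]        # v = sqrt(0) * eps - sqrt(1) * x0
x_prev = ddim_step(x_T, v_true, 0.0, 0.36)  # 0.36 is a made-up alpha_bar value
```

At the last timestep the input carries no signal, yet the $v$-prediction target equals $-x_{0}$, so the model is still asked for a meaningful prediction there.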
4.1 Qualitative
Figure 3 shows example generation results. Our self-perceptual objective has visible quality improvement over the vanilla MSE objective, but the overall sample quality is still worse than classifier-free guidance. This is because our objective only improves sample quality whereas classifier-free guidance has the additional effect of enhancing text alignment. This is especially obvious in Figure 2(j), where our objective only enhances the sample quality without extra emphasis on the text condition.
Notice that the results of vanilla MSE and self-perceptual objective share similar image content and layout when generated from the same initial noise, whereas classifier-free guidance will largely alter the result. Our self-perceptual objective maintains the same diversity whereas classifier-free guidance does not.
Additionally, Figure 2(i) shows the negative artifact of classifier-free guidance. The model has already overfitted the image to the very specific prompt, and the high classifier-free guidance scale causes unnatural artifacts. Our self-perceptual objective does not suffer from this issue.
[Figure: example generation results. Columns, left to right: MSE, Self-Perceptual, MSE + CFG.]
4.2 Quantitative
Table 1 shows our quantitative evaluation. We follow the convention to calculate Fréchet Inception Distance (FID) [8, 21] and Inception Score (IS) [30]. We select the first 10k samples from the MSCOCO 2014 validation dataset [16] and use our models to generate images of the corresponding captions. Our self-perceptual objective has improved FID/IS over the vanilla MSE objective, which aligns with our observed improvement in sample quality. However, classifier-free guidance still achieves better sample quality than our self-perceptual objective.
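Table 1
| Loss | CFG | Rescale | Steps | NFE | FID $\downarrow$ | IS $\uparrow$ |
| --- | --- | --- | --- | --- | --- | --- |
| Ground truth | - | - | - | - | 0.00 | 35.28 |
| $\mathcal{L}_{mse}$ | - | - | 25 | 25 | 32.68 | 22.20 |
| $\mathcal{L}_{mse}$ | - | - | 50 | 50 | 29.63 | 22.86 |
| $\mathcal{L}_{mse}$ | 7.5 | - | 25 | 50 | 24.41 | 32.10 |
| $\mathcal{L}_{mse}$ | 7.5 | 0.7 | 25 | 50 | 18.67 | 34.17 |
| $\mathcal{L}_{sp}$ | - | - | 25 | 25 | 25.89 | 27.76 |
| $\mathcal{L}_{sp}$ | - | - | 50 | 50 | 24.42 | 28.07 |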
5 Ablation Study
In this section, we evaluate the individual hyperparameters we choose for our self-perceptual objective. All metrics are calculated on the same MSCOCO 10k validation samples as in Section 4.2 and use 25-step DDIM inference.
5.1 Layer $l$
Table 2
| Layer | FID $\downarrow$ | IS $\uparrow$ | Chosen |
| --- | --- | --- | --- |
| All Encoder Layers | 26.64 | 26.89 | |
| All Decoder Layers | 42.42 | 19.98 | |
| All Encoder Layers + Midblock Layer | 26.96 | 27.24 | |
| Only Midblock Layer | 25.89 | 27.76 | ✓ |
We compare the effect of computing loss on hidden features from different layers $l$. We try using all the hidden features from every encoder layer and decoder layer by summing up the losses. We also try using only features from the midblock layer. As shown in Table 2, we find that only using features from the midblock layer yields the best metrics.
5.2 Timestep ${t}^{\prime}$
We compare the effect of choosing timestep ${t}^{\prime}$ for the perceptual network. First, we show that ${t}^{\prime}=t$ is invalid because ${\widehat{x}}_{t}$ always equals ${x}_{t}$. This will make the input to the perceptual network identical which prevents any meaningful loss:
$$\widehat{x}_{t}=\mathrm{forward}(\widehat{x}_{0},\widehat{\epsilon},t)=\mathrm{forward}(x_{0},\epsilon,t)=x_{t}$$  (12)
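The identity holds for any prediction $\widehat{v}$, not only for a perfect one. Writing $a=\sqrt{\overline{\alpha}_{t}}$ and $b=\sqrt{1-\overline{\alpha}_{t}}$ (shorthand introduced here, with $a^{2}+b^{2}=1$) and using the conversions $\widehat{x}_{0}=a\,x_{t}-b\,\widehat{v}$ and $\widehat{\epsilon}=a\,\widehat{v}+b\,x_{t}$, the $\widehat{v}$ terms cancel:
$$\widehat{x}_{t}=a\,\widehat{x}_{0}+b\,\widehat{\epsilon}=a(a\,x_{t}-b\,\widehat{v})+b(a\,\widehat{v}+b\,x_{t})=(a^{2}+b^{2})\,x_{t}=x_{t}$$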
We compare three different choices for ${t}^{\prime}$. Table 3 shows that uniform sampling of ${t}^{\prime}$ yields the best results.
Table 3
| Timestep ($t^{\prime}$ clamped to $[1,T]$) | FID $\downarrow$ | IS $\uparrow$ | Chosen |
| --- | --- | --- | --- |
| $t^{\prime}=t\pm 40$ | 27.24 | 23.31 | |
| $t^{\prime}\sim \mathcal{N}(t,100)$ | 24.54 | 25.42 | |
| $t^{\prime}\sim \mathcal{U}(1,T)$ | 25.89 | 27.76 | ✓ |
5.3 Distance Function
We compare using different distance functions on the hidden features. Table 4 shows that MSE and MAE yield very similar results, so we stick to MSE.
Table 4
| Distance | FID $\downarrow$ | IS $\uparrow$ | Chosen |
| --- | --- | --- | --- |
| Mean Absolute Distance ($\Vert \cdot \Vert_{1}$) | 25.28 | 27.41 | |
| Mean Squared Distance ($\Vert \cdot \Vert_{2}^{2}$) | 25.89 | 27.76 | ✓ |
5.4 Formulation
We experiment with an alternative formulation, which combines predicted $\widehat{x}_{t^{\prime}},\widehat{\epsilon}_{t^{\prime}}$ separately with ground-truth $x_{t^{\prime}},\epsilon_{t^{\prime}}$. This formulation allows gradient feedback at $t^{\prime}=t$:
$$x_{t^{\prime}}=\mathrm{forward}(x_{0},\epsilon,t^{\prime})$$  (13)
$$\widehat{x}_{t^{\prime}}^{x}=\mathrm{forward}(\widehat{x}_{0},\epsilon,t^{\prime})$$  (14)
$$\widehat{x}_{t^{\prime}}^{\epsilon}=\mathrm{forward}(x_{0},\widehat{\epsilon},t^{\prime})$$  (15)
$$\mathcal{L}_{sp2}=\Vert f_{*}^{l}(\widehat{x}_{t^{\prime}}^{x},t^{\prime},c)-f_{*}^{l}(x_{t^{\prime}},t^{\prime},c)\Vert_{2}^{2}+\Vert f_{*}^{l}(\widehat{x}_{t^{\prime}}^{\epsilon},t^{\prime},c)-f_{*}^{l}(x_{t^{\prime}},t^{\prime},c)\Vert_{2}^{2}$$  (16)
Table 5 shows that the alternative formulation yields worse performance.
Table 5
| Formulation | FID $\downarrow$ | IS $\uparrow$ | Chosen |
| --- | --- | --- | --- |
| $\mathcal{L}_{sp}$ | 25.89 | 27.76 | ✓ |
| $\mathcal{L}_{sp2}$ | 29.83 | 24.54 | |
5.5 Repeat Perceptual Network
We experiment using the network trained with the self-perceptual objective as the perceptual metric network $f_{*}^{l}$ and repeat the training process. Table 6 shows that repeating the self-perceptual training results in worse performance. This is why we decide to just freeze the MSE model instead of using an exponential moving average (EMA) for the perceptual network.
Table 6
| Formulation | FID $\downarrow$ | IS $\uparrow$ | Chosen |
| --- | --- | --- | --- |
| MSE model as perceptual network | 25.89 | 27.76 | ✓ |
| SP model as perceptual network | 26.61 | 26.41 | |
5.6 Combine with Classifier-Free Guidance
We experiment with applying classifier-free guidance on the model trained with our self-perceptual objective. Table 7 shows that classifier-free guidance indeed can improve sample quality further on the self-perceptual model but it does not surpass classifier-free guidance applied on the MSE model.
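Table 7
| Loss | CFG | Rescale | FID $\downarrow$ | IS $\uparrow$ |
| --- | --- | --- | --- | --- |
| $\mathcal{L}_{mse}$ | 7.5 | 0.7 | 18.67 | 34.17 |
| $\mathcal{L}_{sp}$ | - | - | 25.89 | 27.76 |
| $\mathcal{L}_{sp}$ | 2.0 | 0.7 | 21.19 | 32.22 |
| $\mathcal{L}_{sp}$ | 3.0 | 0.7 | 20.65 | 33.49 |
| $\mathcal{L}_{sp}$ | 4.0 | 0.7 | 20.67 | 33.34 |
| $\mathcal{L}_{sp}$ | 7.5 | 0.7 | 23.49 | 31.64 |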
5.7 Unconditional Generation
[Figure: rows: MSE, Self-Perceptual; columns: 1000, 900, 800, 700, 600, 500, 400, 300, 200, 100, Final.]
We train an unconditional image generation model following the same procedure except we always use an empty prompt during training and inference.
Figure 5 shows that the MSE objective generates unrealistic images even when using 1000 sample steps, whereas the self-perceptual objective generates much preferable results. This validates that our self-perceptual objective can improve the quality for unconditional diffusion models, which was not possible with classifier-free guidance before.
Table 8 reports the quantitative metric, which also shows that the self-perceptual objective is effective in improving unconditional generation quality.
Table 8
| Loss | IS $\uparrow$ |
| --- | --- |
| $\mathcal{L}_{mse}$ | 11.18 |
| $\mathcal{L}_{sp}$ | 12.04 |
5.8 Inference Behavior
In Figure 6, we visualize the model prediction by converting to $\widehat{x}_{0}$ at every inference step. We find that the model trained with the self-perceptual objective generates shapes and contours earlier in the inference process. We also notice that it has grid-like pattern artifacts, likely resulting from the convolution downsampling nature of the perceptual network. This artifact does not affect overall generation much. We leave it for future investigation and improvement.
6 Conclusion
In summary, we have shown that the effectiveness of classifier guidance and classifier-free guidance can be viewed from the lens of perceptual guidance. We have found that perceptual loss can be directly applied to diffusion training to improve sample quality. Specifically, we have proposed a novel self-perceptual objective, which uses the diffusion model itself as a perceptual network. Our objective can be generalized to all modalities, e.g. image, video, and audio, and supports unconditional generation, which was not possible with classifier-free guidance before. For conditional generation, our objective improves sample quality without entanglement with conditional input.
However, for text-to-image generation, classifier-free guidance still generates overall better images than our self-perceptual objective. This is because classifier-free guidance has the additional effect of increasing text alignment by trading off diversity.
We hope our work paves the way for more future explorations on diffusion training loss.
References
 [1] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023.
 [2] A. Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22563–22575, 2023.
 [3] Di Chang, Yichun Shi, Quankai Gao, Jessica Fu, Hongyi Xu, Guoxian Song, Qing Yan, Xiao Yang, and Mohammad Soleymani. Magicdance: Realistic human dance video generation with motions & facial expressions transfer, 2023.
 [4] Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations, 2024.
 [5] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 8780–8794, 2021.
 [6] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7312–7322, 2023.
 [7] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations, 2024.
 [8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6626–6637, 2017.
 [9] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models, 2022.
 [10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
 [11] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
 [12] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709, 2005.
 [13] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
 [14] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models, 2023.
 [15] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5404–5411, January 2024.
 [16] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
 [17] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.
 [18] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022.
 [19] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
 [20] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models, 2023.
 [21] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11400–11410, 2022.
 [22] William S. Peebles and Saining Xie. Scalable diffusion models with transformers. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2022.
 [23] Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. In The Twelfth International Conference on Learning Representations, 2024.
 [24] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024.
 [25] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations, 2023.
 [26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
 [27] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022.
 [28] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021.
 [29] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
 [30] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226–2234, 2016.
 [31] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022.
 [32] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
 [33] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3d generation. In The Twelfth International Conference on Learning Representations, 2024.
 [34] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, 2023.
 [35] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2256–2265. JMLR.org, 2015.
 [36] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
 [37] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
 [38] Pascal Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7):1661–1674, 2011.
 [39] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, 2010.
 [40] Zhou Wang, Alan Conrad Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13:600–612, 2004.
 [41] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
 [42] Z. Wang, E.P. Simoncelli, and A.C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402 Vol.2, 2003.
 [43] Hanshu Yan, Jun Hao Liew, Long Mai, Shanchuan Lin, and Jiashi Feng. Magicprop: Diffusion-based video editing via motion-aware appearance propagation, 2023.
 [44] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 586–595. IEEE Computer Society, 2018.
 [45] Chuanxia Zheng, Long Tung Vuong, Jianfei Cai, and Dinh Phung. MoVQ: Modulating quantized vectors for high-fidelity image generation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
 [46] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models, 2023.