Diffusion Model with Perceptual Loss
Abstract
Diffusion models trained with mean squared error loss tend to generate unrealistic samples. Current state-of-the-art models rely on classifier-free guidance to improve sample quality, yet its surprising effectiveness is not fully understood. In this paper, we show that the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance. As a result, we can directly incorporate perceptual loss in diffusion training to improve sample quality. Since the score matching objective used in diffusion training strongly resembles the denoising autoencoder objective used in unsupervised training of perceptual networks, the diffusion model itself is a perceptual network and can be used to generate meaningful perceptual loss. We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples. For conditional generation, our method only improves sample quality without entanglement with the conditional input and therefore does not sacrifice sample diversity. Our method can also improve sample quality for unconditional generation, which was not possible with classifier-free guidance before.
1 Introduction
Diffusion models [10, 35, 37] are a rising class of generative models. Conceptually, they work by converting pure noise to a data sample through repeated denoising. Formally, each denoising step can be viewed through the lens of score matching [37]: the model learns to predict the gradient (score) of an ordinary differential equation (ODE) or a stochastic differential equation (SDE) that transforms samples from one distribution (noise) to another (image, video, etc.) [18]. In this paper, we focus on image generation, but the findings are also applicable to other modalities.
Diffusion models are commonly parameterized as neural networks and the training objective minimizes the squared distance between the model prediction and the ground truth score through stochastic gradient descent [12]. This is also commonly referred to as the mean squared error (MSE) loss.
Although diffusion models should, in theory, transport samples from the noise distribution to the image distribution, images generated by a diffusion model in its raw form are often of poor quality, despite improvements in model architecture [4, 14, 22, 24, 29, 27, 28], formulation [17, 18, 13], and sampling strategy [36, 19, 20, 13].
What drove diffusion models into the mainstream was the advent of classifier guidance [5] and classifier-free guidance [11]. Classifier guidance shows that we can add classifier gradients on top of the predicted score during inference to guide sample generation toward the classifier’s direction. It can therefore turn an unconditional diffusion model into a conditional one. Surprisingly, applying classifier guidance to an already conditional diffusion model can also significantly improve sample quality. Classifier-free guidance [11] improves on classifier guidance by removing the need for an external classifier network. It trains the diffusion model as both conditional and unconditional at the same time by dropping the condition with a certain probability. At inference, it queries the model both conditionally and unconditionally at every step and uses the difference between the two predictions as the conditional gradient direction. The score prediction is then amplified along this conditional direction.
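The guidance step described above can be sketched as follows. This is a minimal illustration rather than the implementation of [11]: it assumes the common parameterization in which a scale of 1 recovers the purely conditional prediction, and the function name `cfg_combine` and the toy vectors are hypothetical.

```python
def cfg_combine(pred_uncond, pred_cond, scale):
    """Push the prediction along the conditional direction (cond - uncond).

    Assumed convention: scale == 1.0 returns the conditional prediction,
    and scale > 1.0 amplifies the conditional direction.
    """
    return [u + scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


# Toy score predictions; plain lists stand in for model outputs.
pred_uncond = [0.0, 1.0, -1.0]
pred_cond = [1.0, 1.0, 0.0]

no_guidance = cfg_combine(pred_uncond, pred_cond, scale=1.0)
guided = cfg_combine(pred_uncond, pred_cond, scale=7.5)
print(no_guidance)  # [1.0, 1.0, 0.0]
print(guided)       # [7.5, 1.0, 6.5]
```

Components on which the two predictions agree are left untouched, while components on which they differ are pushed well beyond the conditional prediction. This is one way to see why a large scale can over-saturate the output.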
Classifier-free guidance is applied almost ubiquitously in state-of-the-art diffusion models across modalities, e.g. text-to-image [24, 28, 29, 4, 27, 45, 23], text-to-video [7, 2, 46, 9, 34, 1], text-to-3d [33, 41, 25], image-to-video [1, 43], video-to-video [3, 6], etc. Yet, its surprising effectiveness in improving sample quality is not fully understood. Classifier-free guidance also has many limitations, such as having a sensitive hyperparameter that can cause image over-exposure and over-saturation, etc. [15, 29], but there are no other viable alternatives.
In this paper, we elucidate that the effectiveness of classifier-free guidance in improving sample quality partly originates from it being an implicit form of perceptual guidance. We show that perceptual loss can be directly applied to diffusion training to improve sample quality. Instead of using external perceptual networks, we find the diffusion model itself is already a well-trained perceptual network. We propose a novel self-perceptual objective that utilizes the diffusion model itself to generate meaningful perceptual loss. Our training objective results in an improvement in sample quality. Unlike classifier-free guidance, our method does not rely on conditional input and therefore does not trade off sample diversity.
2 Problem
2.1 Diffusion Models Generate Bad Samples
Let’s first elucidate why diffusion models in their raw form generate bad samples. As shown in Figure 1, given finite training samples, the underlying data distribution is ambiguous. The maximum likelihood estimation (MLE) is a distribution that assigns even probability only to the observed samples and zero everywhere else. An ideal score model will learn this ground-truth probability flow to always produce observed samples and generate no new data. In practice, neural networks do not overfit to this flow exactly, allowing generalization and thus the generation of new data.
However, the learned distribution may not match the real underlying distribution. As illustrated in Figure 1(a), consider a simple distribution where the images always contain a solid circle at an arbitrary location. Given limited training samples (top row), we want to generate a new sample from this distribution (bottom left), but the actual generation can be far out of the distribution (bottom right). This problem also exists for actual images, as illustrated in Figure 1(b). The diffusion model tries to learn a plausible pixel distribution from finite training samples, but the result does not align with the intended distribution of photorealistic images.
The ground-truth probability flow is determined by the dataset and forward diffusion function alone, but the learned distribution is determined by the model capacity and loss function. Existing works have focused on expanding the dataset [32, 4], improving the model architecture [24, 4, 14, 22, 29, 27], simplifying the ODE trajectory [18, 17, 13], and improving the sampler [36, 19, 20]. However, almost all works use the mean squared error (MSE) loss in training and no study has explored the use of perceptual loss.
2.2 Perceptual Loss is the Hidden Gem
Prior works have shown that the mean squared error (MSE) metric aligns poorly with human perception [40, 42, 44]. For example, when comparing two faces, humans are much more sensitive to a mismatch between the eyes than to a mismatch between hair strands. As another example, shifting an image by a few pixels is almost undetectable by human perception but can cause a large MSE value. Thus, training diffusion models with MSE loss penalizes imperceptible pixel mismatches more heavily than perceptible structural differences. This is clearly not an ideal use of model capacity.
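The pixel-shift example can be made concrete with a deliberately extreme toy case. The sketch below is illustrative only: the striped test image and the helper names `mse` and `stripes` are assumptions, and the pattern is chosen so that a one-pixel shift changes every pixel.

```python
def mse(a, b):
    """Mean squared error between two images stored as lists of rows."""
    diffs = [(p - q) ** 2 for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def stripes(width, height, offset=0):
    """Vertical one-pixel stripes alternating between 0.0 and 1.0."""
    return [[float((x + offset) % 2) for x in range(width)] for _ in range(height)]


image = stripes(8, 8)
shifted = stripes(8, 8, offset=1)           # same texture, moved by one pixel
flat_gray = [[0.5] * 8 for _ in range(8)]   # all structure removed

print(mse(image, shifted))    # 1.0
print(mse(image, flat_gray))  # 0.25
```

Under MSE, the shifted copy of the same texture is four times farther from the original than a featureless gray image, even though a viewer would likely judge the shifted copy to be the closer match.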
Prior works have found that the distance computed on the hidden features of deep neural networks can be used as a metric that resembles human perception better than the distance computed directly on image pixels [44]. This is because deep neural networks can learn high-level semantics rather than only focusing on pixel differences.
In fact, we will show that this neural perceptual property has already been implicitly applied in classifier guidance [5] and classifier-free guidance [11] in the following sections.
2.3 Perception in Classifier Guidance
Originally, classifier guidance [5] was proposed to guide an unconditional diffusion model to generate class-conditional samples, but prior works have found that it can also help an already conditional diffusion model improve sample quality, and have attributed this to a side effect of trading off sample diversity.
However, we argue that the surprising effectiveness of classifier guidance also partially originates from it being a form of perceptual guidance. Specifically, the classifier network, e.g. CLIP [26], is a trained perceptual neural network that can provide perceptual guidance. The classifier gradients guide toward perceptually more probable images as a precondition to better text alignment. This results in the generation of more photorealistic images.
2.4 Perception in Classifier-Free Guidance
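Classifier-free guidance (CFG) [11] finds that an implicit classifier can be derived using Bayes’ rule: $p(c \mid x_t) \propto p(x_t \mid c) / p(x_t)$, and that the score model itself can be used to provide guidance: $\nabla_{x_t} \log p(c \mid x_t) = \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t)$. Therefore, classifier-free guidance can also be viewed through the lens of perceptual guidance.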
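Written out, the Bayes' rule argument yields the guided score below. This is a sketch under a stated assumption: the guidance scale $w$ and the sharpened distribution $\tilde{p}(x_t \mid c) \propto p(x_t)\, p(c \mid x_t)^{w}$ follow a common convention and are not specified in the text above.

```latex
\begin{aligned}
\nabla_{x_t} \log p(c \mid x_t)
  &= \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t), \\
\nabla_{x_t} \log \tilde{p}(x_t \mid c)
  &= \nabla_{x_t} \log p(x_t) + w \, \nabla_{x_t} \log p(c \mid x_t) \\
  &= \nabla_{x_t} \log p(x_t)
     + w \left( \nabla_{x_t} \log p(x_t \mid c) - \nabla_{x_t} \log p(x_t) \right).
\end{aligned}
```

With $w = 1$ the expression reduces to the ordinary conditional score. For $w > 1$ the implicit classifier term, which is computed entirely by the score network, is amplified.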
2.5 Limitations of Classifier-Free Guidance
Classifier-free guidance has numerous limitations. This incentivizes us to explore directly incorporating perceptual loss in diffusion training. Specifically:
- Classifier-free guidance only works for conditional models. Perceptual loss works for unconditional models as well.
- Classifier-free guidance entangles text alignment with sample quality. Perceptual loss only improves sample quality and is condition-agnostic.
- Classifier-free guidance is applied after training, and a high guidance scale can lead to overexposure [15, 29]. Perceptual loss is applied at training time and does not suffer from this issue.
3 Method
Our goal is to incorporate perceptual loss into diffusion training to improve sample quality. In Section 3.1 we introduce the diffusion background and our model formulation. In Section 3.2, we propose a novel self-perceptual objective and show that the diffusion model itself can be used as a perceptual network to provide meaningful perceptual loss.
3.1 Background
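We follow the setup of Stable Diffusion, a latent diffusion model [28]. Given image latent samples $x_0$, noise samples $\epsilon \sim \mathcal{N}(0, I)$, and time $t \sim \mathcal{U}(1, T)$, where $T$ is the number of diffusion timesteps, the forward diffusion process is defined as:
$$x_t = \mathrm{forward}(x_0, \epsilon, t) = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \tag{1}$$
We use the diffusion schedule with the zero terminal SNR fix [15]. The specific $\bar{\alpha}_t$ values are defined in [15].
Our neural network $f_\theta$ is conditioned on the text prompt $c$ and uses the $v$-prediction formulation [31, 15]:
$$v_t = \sqrt{\bar{\alpha}_t}\, \epsilon - \sqrt{1 - \bar{\alpha}_t}\, x_0 \tag{2}$$
$$\hat{v}_t = f_\theta(x_t, t, c) \tag{3}$$
With the vanilla diffusion training objective, we optimize the following MSE loss:
$$\mathcal{L}_{\mathrm{mse}} = \mathbb{E}_{x_0, c, \epsilon, t} \left[ \lVert \hat{v}_t - v_t \rVert_2^2 \right] \tag{4}$$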
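The forward diffusion, the $v$-prediction target, and the MSE loss of this subsection can be sketched in plain Python. The sketch makes several assumptions: lists of floats stand in for latent tensors, the value `alpha_bar = 0.3` is arbitrary rather than taken from the schedule of [15], and the helper names are hypothetical. It also shows the linear inversion that recovers $x_0$ and $\epsilon$ from $x_t$ and $v_t$.

```python
import math
import random


def coeffs(alpha_bar):
    """Return (sqrt(alpha_bar), sqrt(1 - alpha_bar))."""
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)


def forward(x0, eps, alpha_bar):
    """Forward diffusion: mix the clean sample with noise."""
    a, b = coeffs(alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def v_target(x0, eps, alpha_bar):
    """v-prediction target of [31]."""
    a, b = coeffs(alpha_bar)
    return [a * e - b * x for x, e in zip(x0, eps)]


def v_to_x0_eps(xt, v, alpha_bar):
    """Invert forward() and v_target() to recover x0 and eps."""
    a, b = coeffs(alpha_bar)
    x0 = [a * x - b * w for x, w in zip(xt, v)]
    eps = [a * w + b * x for x, w in zip(xt, v)]
    return x0, eps


def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]
alpha_bar = 0.3  # hypothetical value; the actual schedule comes from [15]

xt = forward(x0, eps, alpha_bar)
v = v_target(x0, eps, alpha_bar)
x0_rec, eps_rec = v_to_x0_eps(xt, v, alpha_bar)
round_trip_error = max(abs(p - q) for p, q in zip(x0 + eps, x0_rec + eps_rec))
loss_of_zero_prediction = mse([0.0] * 4, v)  # loss of a model that outputs v = 0
```

The round trip is exact up to floating-point error because the two coefficients satisfy $\bar{\alpha}_t + (1 - \bar{\alpha}_t) = 1$, which makes the map from $(x_0, \epsilon)$ to $(x_t, v_t)$ a rotation.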
3.2 Self-Perceptual Objective
Prior works have shown that the score matching objective strongly resembles the denoising autoencoder objective [38]. The denoising autoencoder objective is often used for unsupervised pre-training of neural networks [39]. Thus, the diffusion model trained with vanilla MSE loss is effectively a perfect unsupervised perceptual network trained on the target dataset, on the latent space, and on all noise levels $t$. In this section, we show that we can exploit the diffusion model itself as a perceptual network to provide meaningful perceptual loss.
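First, we copy and freeze the diffusion model trained with vanilla MSE loss, and modify the architecture to return the hidden feature at layer $l$. We denote this frozen network as $f_*^l$.
During training, we sample $x_0$, $\epsilon$, and $t$, and compute $x_t$ through forward diffusion:
$$x_t = \mathrm{forward}(x_0, \epsilon, t) \tag{5}$$
We use the online network $f_\theta$ to predict $\hat{v}_t$ and convert the prediction to $\hat{x}_0$ and $\hat{\epsilon}$:
$$\hat{v}_t = f_\theta(x_t, t, c) \tag{6}$$
$$\hat{x}_0 = \sqrt{\bar{\alpha}_t}\, x_t - \sqrt{1 - \bar{\alpha}_t}\, \hat{v}_t \tag{7}$$
$$\hat{\epsilon} = \sqrt{\bar{\alpha}_t}\, \hat{v}_t + \sqrt{1 - \bar{\alpha}_t}\, x_t \tag{8}$$
Then, we sample a new timestep $t'$, and compute $x_{t'}$ and $\hat{x}_{t'}$ through forward diffusion:
$$x_{t'} = \mathrm{forward}(x_0, \epsilon, t') \tag{9}$$
$$\hat{x}_{t'} = \mathrm{forward}(\hat{x}_0, \hat{\epsilon}, t') \tag{10}$$
Finally, we pass them through the frozen network $f_*^l$ and compute MSE on its hidden feature at layer $l$. We find that using only the hidden feature at the midblock layer yields the best result. We refer to our method as the Self-Perceptual (SP) objective:
$$\mathcal{L}_{\mathrm{sp}} = \mathbb{E}_{x_0, c, \epsilon, t, t'} \left[ \lVert f_*^l(\hat{x}_{t'}, t', c) - f_*^l(x_{t'}, t', c) \rVert_2^2 \right] \tag{11}$$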
The pseudo code is provided in Algorithm 1.
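The procedure of this section can be sketched in plain Python as follows. This is a hedged illustration, not the authors' Algorithm 1. Lists of floats stand in for latent tensors, `alpha_bar` is a hypothetical cosine-shaped schedule (the paper uses the values of [15]), `T = 1000` is assumed from the Stable Diffusion setup, and the three toy networks are stand-ins invented for the example.

```python
import math
import random

T = 1000  # number of diffusion timesteps (assumed)


def alpha_bar(t):
    """Hypothetical cosine-shaped schedule, not the values from [15]."""
    return math.cos(0.5 * math.pi * t / T) ** 2


def coeffs(t):
    ab = alpha_bar(t)
    return math.sqrt(ab), math.sqrt(1.0 - ab)


def forward(x0, eps, t):
    a, b = coeffs(t)
    return [a * x + b * e for x, e in zip(x0, eps)]


def self_perceptual_loss(online_net, frozen_features, x0, c, rng):
    # Sample noise and a timestep, then diffuse the clean latent.
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    t = rng.randint(1, T)
    xt = forward(x0, eps, t)

    # The online network predicts v; convert it to x0 and eps estimates.
    v_hat = online_net(xt, t, c)
    a, b = coeffs(t)
    x0_hat = [a * x - b * v for x, v in zip(xt, v_hat)]
    eps_hat = [a * v + b * x for x, v in zip(xt, v_hat)]

    # Re-diffuse ground truth and prediction to a freshly sampled timestep.
    t_new = rng.randint(1, T)
    xt_new = forward(x0, eps, t_new)
    xt_new_hat = forward(x0_hat, eps_hat, t_new)

    # MSE between hidden features of the frozen network. Its weights stay
    # fixed; in an autograd framework only online_net would be updated.
    feat_true = frozen_features(xt_new, t_new, c)
    feat_pred = frozen_features(xt_new_hat, t_new, c)
    return sum((p - q) ** 2 for p, q in zip(feat_pred, feat_true)) / len(feat_true)


def toy_frozen_features(x, t, c):
    """Stand-in for midblock activations of the frozen MSE-trained model."""
    return [math.tanh(xi + c) for xi in x]


def zero_net(xt, t, c):
    """Untrained stand-in that always predicts v = 0."""
    return [0.0 for _ in xt]


x0 = [0.5, -1.0, 2.0]


def oracle_net(xt, t, c):
    """Cheating stand-in that knows x0 and returns the exact v target."""
    a, b = coeffs(t)
    eps = [(x - a * z) / b for x, z in zip(xt, x0)]
    return [a * e - b * z for e, z in zip(eps, x0)]


loss_zero = self_perceptual_loss(zero_net, toy_frozen_features, x0, 0.1, random.Random(0))
loss_oracle = self_perceptual_loss(oracle_net, toy_frozen_features, x0, 0.1, random.Random(0))
```

A prediction that matches the true $v_t$ drives the loss to zero up to floating-point error, so the objective shares its optimum with the MSE objective while measuring errors in feature space.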
4 Evaluation
We first finetune Stable Diffusion v2.1 [28] using our formulation and MSE loss on a subset of the LAION dataset [32] where the images have resolution greater than 512px and aesthetic score above 6. We use a learning rate of 3e-5, a batch size of 896, and an EMA decay of 0.9995 for 60k iterations. We also use 10% conditional dropout to support CFG for evaluation comparison. Then we copy and freeze the model as our perceptual network, and continue training the online network with our self-perceptual objective for 50k iterations. This second stage does not use conditional dropout.
For inference, we use the deterministic DDIM sampler [36] and make sure the sampler starts from the last timestep [15].
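A minimal sketch of the two inference ingredients mentioned above: a timestep selection that begins at the last timestep, and one deterministic DDIM update for a $v$-prediction model. The function names are hypothetical, `total=1000` is an assumption, and the timestep rule is one simple way to start from the last timestep; the exact rule used in the experiments is the one described in [15].

```python
import math


def trailing_timesteps(num_steps, total=1000):
    """Evenly spaced timesteps that begin at the last timestep `total`."""
    return [round(total - i * total / num_steps) for i in range(num_steps)]


def ddim_step(xt, v_hat, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) for a v-prediction model."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    x0_hat = [a * x - b * v for x, v in zip(xt, v_hat)]
    eps_hat = [a * v + b * x for x, v in zip(xt, v_hat)]
    a_prev, b_prev = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    return [a_prev * p + b_prev * q for p, q in zip(x0_hat, eps_hat)]


steps = trailing_timesteps(25)
print(steps[0], steps[-1], len(steps))  # 1000 40 25

# Sanity check with an exact v: stepping from pure noise (alpha_bar = 0)
# straight to alpha_bar = 1 returns the clean sample.
x0, eps = [0.5, -1.0], [0.3, 0.7]
xt = eps                      # at alpha_bar = 0 the sample is pure noise
v_exact = [-x for x in x0]    # v = sqrt(0) * eps - sqrt(1) * x0
one_step = ddim_step(xt, v_exact, alpha_bar_t=0.0, alpha_bar_prev=1.0)
print(one_step)  # [0.5, -1.0]
```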
4.1 Qualitative
Figure 3 shows example generation results. Our self-perceptual objective has visible quality improvement over the vanilla MSE objective, but the overall sample quality is still worse than classifier-free guidance. This is because our objective only improves sample quality whereas classifier-free guidance has the additional effect of enhancing text alignment. This is especially obvious in Figure 2(j), where our objective only enhances the sample quality without extra emphasis on the text condition.
Notice that the results of vanilla MSE and self-perceptual objective share similar image content and layout when generated from the same initial noise, whereas classifier-free guidance will largely alter the result. Our self-perceptual objective maintains the same diversity whereas classifier-free guidance does not.
Additionally, Figure 2(i) shows the negative artifact of classifier-free guidance. The model has already overfitted the image to the very specific prompt, and the high classifier-free guidance scale causes unnatural artifacts. Our self-perceptual objective does not suffer from this issue.
[Figure: qualitative comparison of generated samples. Columns, left to right: MSE, Self-Perceptual, MSE + CFG.]
4.2 Quantitative
Table 1 shows our quantitative evaluation. We follow the convention to calculate Fréchet Inception Distance (FID) [8, 21] and Inception Score (IS) [30]. We select the first 10k samples from the MSCOCO 2014 validation dataset [16] and use our models to generate images of the corresponding captions. Our self-perceptual objective improves FID and IS over the vanilla MSE objective, which aligns with our observed improvement in sample quality. However, classifier-free guidance still achieves better sample quality than our self-perceptual objective.
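Table 1: Quantitative evaluation. "-" means the setting is not used.
| Loss | CFG | Rescale | Steps | NFE | FID | IS |
| --- | --- | --- | --- | --- | --- | --- |
| Ground truth | - | - | - | - | 0.00 | 35.28 |
| MSE | - | - | 25 | 25 | 32.68 | 22.20 |
| MSE | - | - | 50 | 50 | 29.63 | 22.86 |
| MSE | 7.5 | - | 25 | 50 | 24.41 | 32.10 |
| MSE | 7.5 | 0.7 | 25 | 50 | 18.67 | 34.17 |
| Self-Perceptual | - | - | 25 | 25 | 25.89 | 27.76 |
| Self-Perceptual | - | - | 50 | 50 | 24.42 | 28.07 |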
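For reference, the two metrics reported in Table 1 have the following standard definitions. The symbols are introduced here for illustration: $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features of real and generated images, $p(y \mid x)$ is the Inception class posterior, and $p(y)$ is its average over generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
\qquad
\mathrm{IS} = \exp\!\left( \mathbb{E}_{x \sim p_g}
  \left[ D_{\mathrm{KL}}\!\left( p(y \mid x) \,\Vert\, p(y) \right) \right] \right)
```

Lower FID and higher IS are better. Comparing the ground-truth images with themselves gives an FID of zero, which explains the first row. The NFE column counts network evaluations: classifier-free guidance queries the model twice per step, so 25 steps cost 50 evaluations.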
5 Ablation Study
In this section, we evaluate the individual hyperparameters we choose for our self-perceptual objective. All metrics are calculated on the same MSCOCO 10k validation samples as in Section 4.2 and use 25 steps DDIM inference.
5.1 Layer
Table 2: Effect of the feature layer. ✓ marks the chosen setting.
| Layer | FID | IS | Chosen |
| --- | --- | --- | --- |
| All Encoder Layers | 26.64 | 26.89 | |
| All Decoder Layers | 42.42 | 19.98 | |
| All Encoder Layers + Midblock Layer | 26.96 | 27.24 | |
| Only Midblock Layer | 25.89 | 27.76 | ✓ |
We compare the effect of computing the loss on hidden features from different layers $l$. We try using all the hidden features from every encoder layer and decoder layer by summing up the losses. We also try using only features from the midblock layer. As shown in Table 2, we find that using only features from the midblock layer yields the best metrics.
5.2 Timestep
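We compare the effect of choosing the timestep $t'$ for the perceptual network. First, we show that $t' = t$ is invalid because $\hat{x}_{t'}$ always equals $x_{t'}$. This makes the two inputs to the perceptual network identical, which prevents any meaningful loss:
$$\hat{x}_t = \sqrt{\bar{\alpha}_t}\, \hat{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \hat{\epsilon} = \sqrt{\bar{\alpha}_t} \left( \sqrt{\bar{\alpha}_t}\, x_t - \sqrt{1 - \bar{\alpha}_t}\, \hat{v}_t \right) + \sqrt{1 - \bar{\alpha}_t} \left( \sqrt{\bar{\alpha}_t}\, \hat{v}_t + \sqrt{1 - \bar{\alpha}_t}\, x_t \right) = x_t \tag{12}$$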
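The cancellation behind Equation 12 can be checked numerically. The sketch below assumes the standard $v$-prediction conversions, an arbitrary `alpha_bar` value, and a hypothetical helper name; it feeds in a random, deliberately wrong prediction and still recovers the original noisy sample.

```python
import math
import random


def rediffuse_at_same_t(xt, v_hat, alpha_bar):
    """Convert v_hat to (x0_hat, eps_hat), then diffuse back to the same t."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x0_hat = [a * x - b * v for x, v in zip(xt, v_hat)]
    eps_hat = [a * v + b * x for x, v in zip(xt, v_hat)]
    return [a * p + b * q for p, q in zip(x0_hat, eps_hat)]


rng = random.Random(0)
xt = [rng.gauss(0.0, 1.0) for _ in range(5)]
v_hat = [rng.gauss(0.0, 1.0) for _ in range(5)]  # arbitrary, even wrong, prediction

xt_hat = rediffuse_at_same_t(xt, v_hat, alpha_bar=0.37)
max_gap = max(abs(p - q) for p, q in zip(xt_hat, xt))  # zero up to rounding
```

Because the result does not depend on `v_hat` at all, no gradient signal would reach the online network in this case.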
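We compare three different choices for $t'$. Table 3 shows that uniform sampling of $t'$ yields the best results.
Table 3: Effect of the timestep $t'$ (clamped). ✓ marks the chosen setting.
| Timestep $t'$ | FID | IS | Chosen |
| --- | --- | --- | --- |
| | 27.24 | 23.31 | |
| | 24.54 | 25.42 | |
| Uniform sampling | 25.89 | 27.76 | ✓ |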
5.3 Distance Function
We compare using different distance functions on the hidden features. Table 4 shows that MSE and MAE yield very similar results, so we stick to MSE.
Table 4: Effect of the distance function. ✓ marks the chosen setting.
| Distance | FID | IS | Chosen |
| --- | --- | --- | --- |
| Mean Absolute Distance | 25.28 | 27.41 | |
| Mean Squared Distance | 25.89 | 27.76 | ✓ |
5.4 Formulation
We experiment with an alternative formulation, which combines predicted separately with ground-truth . This formulation allows gradient feedback at :
(13) |
(14) |
(15) |
(16) | ||||
Table 5 shows that the alternative formulation yields worse performance.
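Table 5: Effect of the formulation. ✓ marks the chosen setting.
| Formulation | FID | IS | Chosen |
| --- | --- | --- | --- |
| Self-perceptual objective | 25.89 | 27.76 | ✓ |
| Alternative formulation | 29.83 | 24.54 | |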
5.5 Repeat Perceptual Network
We experiment using the network trained with self-perceptual objective as the perceptual metric network and repeat the training process. Table 6 shows that repeating the self-perceptual training results in worse performance. This is why we decide to just freeze the MSE model instead of using an exponential moving average (EMA) for the perceptual network.
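Table 6: Effect of the perceptual network. ✓ marks the chosen setting.
| Perceptual network | FID | IS | Chosen |
| --- | --- | --- | --- |
| MSE model as perceptual network | 25.89 | 27.76 | ✓ |
| SP model as perceptual network | 26.61 | 26.41 | |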
5.6 Combine with Classifier-Free Guidance
We experiment with applying classifier-free guidance on the model trained with our self-perceptual objective. Table 7 shows that classifier-free guidance indeed can improve sample quality further on the self-perceptual model but it does not surpass classifier-free guidance applied on the MSE model.
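Table 7: Combining the self-perceptual objective with classifier-free guidance. "-" means the setting is not used.
| Loss | CFG | Rescale | FID | IS |
| --- | --- | --- | --- | --- |
| MSE | 7.5 | 0.7 | 18.67 | 34.17 |
| Self-Perceptual | - | - | 25.89 | 27.76 |
| Self-Perceptual | 2.0 | 0.7 | 21.19 | 32.22 |
| Self-Perceptual | 3.0 | 0.7 | 20.65 | 33.49 |
| Self-Perceptual | 4.0 | 0.7 | 20.67 | 33.34 |
| Self-Perceptual | 7.5 | 0.7 | 23.49 | 31.64 |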
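The CFG and Rescale columns of Table 7 can be read against the following sketch. It is an approximation of the guidance rescale proposed in [15], written from memory of that technique and not taken from this paper: the guided prediction is scaled so that its standard deviation matches that of the conditional prediction, and the result is blended with the unscaled guided prediction by the rescale factor. The function name and toy vectors are hypothetical.

```python
import statistics


def cfg_with_rescale(pred_uncond, pred_cond, scale, rescale):
    """Classifier-free guidance followed by a standard-deviation rescale.

    Assumes the guided prediction is not constant (non-zero std).
    """
    guided = [u + scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]
    ratio = statistics.pstdev(pred_cond) / statistics.pstdev(guided)
    rescaled = [g * ratio for g in guided]
    # Blend the rescaled and the raw guided predictions.
    return [rescale * r + (1.0 - rescale) * g for r, g in zip(rescaled, guided)]


pred_uncond = [0.0, 1.0, -1.0, 0.5]
pred_cond = [1.0, 1.0, 0.0, -0.5]

plain = cfg_with_rescale(pred_uncond, pred_cond, scale=7.5, rescale=0.0)
full = cfg_with_rescale(pred_uncond, pred_cond, scale=7.5, rescale=1.0)
mixed = cfg_with_rescale(pred_uncond, pred_cond, scale=7.5, rescale=0.7)
print(plain)  # [7.5, 1.0, 6.5, -7.0]
```

With `rescale=0.0` the function reduces to plain guidance, and with `rescale=1.0` the output has the same spread as the conditional prediction. The value 0.7 used in the table sits between the two.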
5.7 Unconditional Generation
[Figure: model predictions visualized at inference timesteps 1000, 900, 800, 700, 600, 500, 400, 300, 200, 100, and the final result, for the MSE and Self-Perceptual models.]
We train an unconditional image generation model following the same procedure except we always use an empty prompt during training and inference.
Figure 5 shows that the MSE objective generates unrealistic images even when using 1000 sample steps, whereas the self-perceptual objective generates much preferable results. This validates that our self-perceptual objective can improve the quality for unconditional diffusion models, which was not possible with classifier-free guidance before.
Table 8 shows the quantitative metric, which also shows that the self-perceptual objective is effective in improving unconditional generation quality.
Table 8: Unconditional generation.
| Loss | IS |
| --- | --- |
| MSE | 11.18 |
| Self-Perceptual | 12.04 |
5.8 Inference Behavior
In Figure 6, we visualize the model prediction by converting $\hat{v}_t$ to $\hat{x}_0$ at every inference step. We find that the model trained with the self-perceptual objective generates shapes and contours earlier in the inference process. We also notice grid-like pattern artifacts, likely resulting from the convolutional downsampling in the perceptual network. These artifacts do not affect the overall generation much. We leave them for future investigation and improvement.
6 Conclusion
In summary, we have shown that the effectiveness of classifier guidance and classifier-free guidance can be viewed through the lens of perceptual guidance. We have found that perceptual loss can be directly applied to diffusion training to improve sample quality. Specifically, we have proposed a novel self-perceptual objective, which uses the diffusion model itself as a perceptual network. Our objective can be generalized to other modalities, e.g. image, video, and audio, and supports unconditional generation, which was not possible with classifier-free guidance before. For conditional generation, our objective improves sample quality without entanglement with the conditional input.
However, for text-to-image generation, classifier-free guidance still generates overall better images than our self-perceptual objective. This is because classifier-free guidance has the additional effect of increasing text alignment by trading off diversity.
We hope our work paves the way for more future explorations on diffusion training loss.
References
- [1] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023.
- [2] A. Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22563–22575, 2023.
- [3] Di Chang, Yichun Shi, Quankai Gao, Jessica Fu, Hongyi Xu, Guoxian Song, Qing Yan, Xiao Yang, and Mohammad Soleymani. Magicdance: Realistic human dance video generation with motions & facial expressions transfer, 2023.
- [4] Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations, 2024.
- [5] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 8780–8794, 2021.
- [6] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7312–7322, 2023.
- [7] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations, 2024.
- [8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6626–6637, 2017.
- [9] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models, 2022.
- [10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- [11] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
- [12] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709, 2005.
- [13] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
- [14] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models, 2023.
- [15] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5404–5411, January 2024.
- [16] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
- [17] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.
- [18] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022.
- [19] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
- [20] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models, 2023.
- [21] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11400–11410, 2022.
- [22] William S. Peebles and Saining Xie. Scalable diffusion models with transformers. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2022.
- [23] Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. In The Twelfth International Conference on Learning Representations, 2024.
- [24] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024.
- [25] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations, 2023.
- [26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
- [27] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022.
- [28] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021.
- [29] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
- [30] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226–2234, 2016.
- [31] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022.
- [32] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
- [33] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3d generation. In The Twelfth International Conference on Learning Representations, 2024.
- [34] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, 2023.
- [35] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2256–2265. JMLR.org, 2015.
- [36] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
- [37] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
- [38] Pascal Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7):1661–1674, 2011.
- [39] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, 2010.
- [40] Zhou Wang, Alan Conrad Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13:600–612, 2004.
- [41] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- [42] Z. Wang, E.P. Simoncelli, and A.C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402 Vol.2, 2003.
- [43] Hanshu Yan, Jun Hao Liew, Long Mai, Shanchuan Lin, and Jiashi Feng. Magicprop: Diffusion-based video editing via motion-aware appearance propagation, 2023.
- [44] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 586–595. IEEE Computer Society, 2018.
- [45] Chuanxia Zheng, Long Tung Vuong, Jianfei Cai, and Dinh Phung. MoVQ: Modulating quantized vectors for high-fidelity image generation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
- [46] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models, 2023.