High-Resolution Image Synthesis with Latent Diffusion Models
Abstract
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation makes it possible for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes, and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including text-to-image synthesis, unconditional image generation and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
1 Introduction
Image synthesis is one of the computer vision fields with the most spectacular recent development, but also among those with the greatest computational demands. High-resolution synthesis of complex, natural scenes in particular is presently dominated by scaling up likelihood-based models, potentially containing billions of parameters in autoregressive (AR) transformers [66, 67]. In contrast, the promising results of GANs [27, 3, 40] have been revealed to be mostly confined to data with comparably limited variability, as their adversarial learning procedure does not easily scale to modeling complex, multimodal distributions. Recently, diffusion models [82], which are built from a hierarchy of denoising autoencoders, have been shown to achieve impressive results in image synthesis [30, 85] and beyond [45, 7, 48, 57], and define the state of the art in class-conditional image synthesis [15, 31] and super-resolution [72]. Moreover, even unconditional DMs can readily be applied to tasks such as inpainting and colorization [85] or stroke-based synthesis [53], in contrast to other types of generative models [46, 69, 19]. Being likelihood-based models, they do not exhibit the mode collapse and training instabilities of GANs and, by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models [67].
Democratizing High-Resolution Image Synthesis
DMs belong to the class of likelihood-based models, whose mode-covering behavior makes them prone to spend excessive amounts of capacity (and thus compute resources) on modeling imperceptible details of the data [16, 73]. Although the reweighted variational objective [30] aims to address this by undersampling the initial denoising steps, DMs are still computationally demanding, since training and evaluating such a model requires repeated function evaluations (and gradient computations) in the high-dimensional space of RGB images. As an example, training the most powerful DMs often takes hundreds of GPU days (e.g. 150 - 1000 V100 days in [15]), and repeated evaluations on a noisy version of the input space also render inference expensive, so that producing 50k samples takes approximately 5 days [15] on a single A100 GPU. This has two consequences for the research community and users in general: Firstly, training such a model requires massive computational resources only available to a small fraction of the field, and leaves a huge carbon footprint [65, 86]. Secondly, evaluating an already trained model is also expensive in time and memory, since the same model architecture must run sequentially for a large number of steps (e.g. 25 - 1000 steps in [15]).
To increase the accessibility of this powerful model class and at the same time reduce its significant resource consumption, a method is needed that reduces the computational complexity for both training and sampling. Reducing the computational demands of DMs without impairing their performance is, therefore, key to enhancing their accessibility.
Departure to Latent Space
Our approach starts with the analysis of already trained diffusion models in pixel space: Fig. 2 shows the rate-distortion trade-off of a trained model. As with any likelihood-based model, learning can be roughly divided into two stages: First is a perceptual compression stage which removes high-frequency details but still learns little semantic variation. In the second stage, the actual generative model learns the semantic and conceptual composition of the data (semantic compression). We thus aim to first find a perceptually equivalent, but computationally more suitable space, in which we will train diffusion models for high-resolution image synthesis.
Following common practice [96, 67, 23, 11, 66], we separate training into two distinct phases: First, we train an autoencoder which provides a lower-dimensional (and thereby efficient) representational space which is perceptually equivalent to the data space. Importantly, and in contrast to previous work [23, 66], we do not need to rely on excessive spatial compression, as we train DMs in the learned latent space, which exhibits better scaling properties with respect to the spatial dimensionality. The reduced complexity also provides efficient image generation from the latent space with a single network pass. We dub the resulting model class Latent Diffusion Models (LDMs).
A notable advantage of this approach is that we need to train the universal autoencoding stage only once and can therefore reuse it for multiple DM trainings or to explore possibly completely different tasks [81]. This enables efficient exploration of a large number of diffusion models for various image-to-image and text-to-image tasks. For the latter, we design an architecture that connects transformers to the DM’s U-Net backbone [71] and enables arbitrary types of token-based conditioning mechanisms, see Sec. 3.3.
In sum, our work makes the following contributions:
(i) In contrast to purely transformer-based approaches [23, 66], our method scales more gracefully to higher dimensional data and can thus (a) work on a compression level which provides more faithful and detailed reconstructions than previous work (see Fig. 1) and (b) be efficiently applied to high-resolution synthesis of megapixel images.
(ii) We achieve competitive performance on multiple tasks (unconditional image synthesis, inpainting, stochastic super-resolution) and datasets while significantly lowering computational costs. Compared to pixel-based diffusion approaches, we also significantly decrease inference costs.
(iii) We show that, in contrast to previous work [93] which learns both an encoder/decoder architecture and a score-based prior simultaneously, our approach does not require a delicate weighting of reconstruction and generative abilities. This ensures extremely faithful reconstructions and requires very little regularization of the latent space.
(iv) We find that for densely conditioned tasks such as super-resolution, inpainting and semantic synthesis, our model can be applied in a convolutional fashion and render large, consistent images of $\sim {1024}^{2}$ px.
(v) Moreover, we design a general-purpose conditioning mechanism based on cross-attention, enabling multimodal training. We use it to train class-conditional, text-to-image and layout-to-image models.
(vi) Finally, we release pretrained latent diffusion and autoencoding models at https://github.com/CompVis/latentdiffusion which might be reusable for various tasks besides training of DMs [81].
2 Related Work
Generative Models for Image Synthesis The high dimensional nature of images presents distinct challenges to generative modeling. Generative Adversarial Networks (GAN) [27] allow for efficient sampling of high resolution images with good perceptual quality [3, 42], but are difficult to optimize [54, 2, 28] and struggle to capture the full data distribution [55]. In contrast, likelihood-based methods emphasize good density estimation which renders optimization more well-behaved. Variational autoencoders (VAE) [46] and flow-based models [18, 19] enable efficient synthesis of high resolution images [9, 92, 44], but sample quality is not on par with GANs. While autoregressive models (ARM) [95, 94, 6, 10] achieve strong performance in density estimation, computationally demanding architectures [97] and a sequential sampling process limit them to low resolution images. Because pixel based representations of images contain barely perceptible, high-frequency details [16, 73], maximum-likelihood training spends a disproportionate amount of capacity on modeling them, resulting in long training times. To scale to higher resolutions, several two-stage approaches [101, 67, 23, 103] use ARMs to model a compressed latent image space instead of raw pixels.
Recently, Diffusion Probabilistic Models (DM) [82] have achieved state-of-the-art results in density estimation [45] as well as in sample quality [15]. The generative power of these models stems from a natural fit to the inductive biases of image-like data when their underlying neural backbone is implemented as a U-Net [71, 30, 85, 15]. The best synthesis quality is usually achieved when a reweighted objective [30] is used for training. In this case, the DM corresponds to a lossy compressor and allows trading image quality for compression capabilities. Evaluating and optimizing these models in pixel space, however, has the downside of low inference speed and very high training costs. While the former can be partially addressed by advanced sampling strategies [84, 75, 47] and hierarchical approaches [31, 93], training on high-resolution image data always requires calculating expensive gradients. We address both drawbacks with our proposed LDMs, which work on a compressed latent space of lower dimensionality. This renders training computationally cheaper and speeds up inference with almost no reduction in synthesis quality (see Fig. 1).
Two-Stage Image Synthesis To mitigate the shortcomings of individual generative approaches, a lot of research [11, 70, 23, 103, 101, 67] has gone into combining the strengths of different methods into more efficient and performant models via a two stage approach. VQ-VAEs [101, 67] use autoregressive models to learn an expressive prior over a discretized latent space. [66] extend this approach to text-to-image generation by learning a joint distribution over discretized image and text representations. More generally, [70] uses conditionally invertible networks to provide a generic transfer between latent spaces of diverse domains. Different from VQ-VAEs, VQGANs [23, 103] employ a first stage with an adversarial and perceptual objective to scale autoregressive transformers to larger images. However, the high compression rates required for feasible ARM training, which introduces billions of trainable parameters [66, 23], limit the overall performance of such approaches, and less compression comes at the price of high computational cost [66, 23]. Our work prevents such trade-offs, as our proposed LDMs scale more gently to higher dimensional latent spaces due to their convolutional backbone. Thus, we are free to choose the level of compression which optimally mediates between learning a powerful first stage and not leaving too much perceptual compression up to the generative diffusion model, while guaranteeing high-fidelity reconstructions (see Fig. 1).
While approaches to jointly [93] or separately [80] learn an encoding/decoding model together with a score-based prior exist, the former still require a difficult weighting between reconstruction and generative capabilities [11] and are outperformed by our approach (Sec. 4), and the latter focus on highly structured images such as human faces.
3 Method
To lower the computational demands of training diffusion models towards high-resolution image synthesis, we observe that although diffusion models allow perceptually irrelevant details to be ignored by undersampling the corresponding loss terms [30], they still require costly function evaluations in pixel space, which causes huge demands in computation time and energy resources.
We propose to circumvent this drawback by introducing an explicit separation of the compressive from the generative learning phase (see Fig. 2). To achieve this, we utilize an autoencoding model which learns a space that is perceptually equivalent to the image space, but offers significantly reduced computational complexity.
Such an approach offers several advantages: (i) By leaving the high-dimensional image space, we obtain DMs which are computationally much more efficient because sampling is performed on a low-dimensional space. (ii) We exploit the inductive bias of DMs inherited from their U-Net architecture [71], which makes them particularly effective for data with spatial structure and therefore alleviates the need for aggressive, quality-reducing compression levels as required by previous approaches [23, 66]. (iii) Finally, we obtain general-purpose compression models whose latent space can be used to train multiple generative models and which can also be utilized for other downstream applications such as single-image CLIP-guided synthesis [25].
3.1 Perceptual Image Compression
Our perceptual compression model is based on previous work [23] and consists of an autoencoder trained by a combination of a perceptual loss [106] and a patch-based [33] adversarial objective [20, 23, 103]. This ensures that the reconstructions are confined to the image manifold by enforcing local realism and avoids the blurriness introduced by relying solely on pixel-space losses such as ${L}_{2}$ or ${L}_{1}$ objectives.
More precisely, given an image $x\in {\mathbb{R}}^{H\times W\times 3}$ in RGB space, the encoder $\mathcal{E}$ encodes $x$ into a latent representation $z=\mathcal{E}(x)$, and the decoder $\mathcal{D}$ reconstructs the image from the latent, giving $\stackrel{~}{x}=\mathcal{D}(z)=\mathcal{D}(\mathcal{E}(x))$, where $z\in {\mathbb{R}}^{h\times w\times c}$. Importantly, the encoder downsamples the image by a factor $f=H/h=W/w$, and we investigate different downsampling factors $f={2}^{m}$, with $m\in \mathbb{N}$.
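To make the relation $f=H/h=W/w$ with $f={2}^{m}$ concrete, the following minimal sketch computes the latent shape for several downsampling factors. The helper name `latent_shape` and the channel count `channels=4` are illustrative assumptions, not values stated in this section.

```python
def latent_shape(height, width, m, channels):
    """Return (h, w, c) of the latent z for downsampling factor f = 2**m."""
    f = 2 ** m
    if height % f or width % f:
        raise ValueError("H and W must be divisible by f")
    return height // f, width // f, channels


# A 256 x 256 RGB image under f in {1, 2, 4, 8, 16, 32}.
# channels=4 is a hypothetical latent channel count chosen for illustration.
for m in range(6):
    h, w, c = latent_shape(256, 256, m, channels=4)
    print(f"f={2 ** m:2d}: latent {h}x{w}x{c}, {h * w} spatial positions")
```

Each doubling of $f$ divides the number of spatial positions the diffusion model has to process by four, which is where the computational savings come from.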
In order to avoid arbitrarily high-variance latent spaces, we experiment with two different kinds of regularizations. The first variant, KL-reg., imposes a slight KL-penalty towards a standard normal on the learned latent, similar to a VAE [46, 69], whereas VQ-reg. uses a vector quantization layer [96] within the decoder. This model can be interpreted as a VQGAN [23] but with the quantization layer absorbed by the decoder. Because our subsequent DM is designed to work with the two-dimensional structure of our learned latent space $z=\mathcal{E}(x)$, we can use relatively mild compression rates and achieve very good reconstructions. This is in contrast to previous works [23, 66], which relied on an arbitrary 1D ordering of the learned space $z$ to model its distribution autoregressively and thereby ignored much of the inherent structure of $z$. Hence, our compression model preserves details of $x$ better (see Tab. 8). The full objective and training details can be found in the supplement.
3.2 Latent Diffusion Models
Diffusion Models [82] are probabilistic models designed to learn a data distribution $p(x)$ by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a fixed Markov Chain of length $T$. For image synthesis, the most successful models [30, 15, 72] rely on a reweighted variant of the variational lower bound on $p(x)$, which mirrors denoising score-matching [85]. These models can be interpreted as an equally weighted sequence of denoising autoencoders ${\epsilon}_{\theta}({x}_{t},t);\,t=1\dots T$, which are trained to predict a denoised variant of their input ${x}_{t}$, where ${x}_{t}$ is a noisy version of the input $x$. The corresponding objective can be simplified to (Sec. B)
$${L}_{DM}={\mathbb{E}}_{x,\epsilon\sim \mathcal{N}(0,1),t}\left[{\Vert \epsilon-{\epsilon}_{\theta}({x}_{t},t)\Vert}_{2}^{2}\right],$$  (1) 
with $t$ uniformly sampled from $\{1,\mathrm{\dots},T\}$.
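As an illustration of how a single Monte Carlo sample of the objective in Eq. (1) could be computed, consider the sketch below. It assumes the common DDPM-style forward process ${x}_{t}=\sqrt{{\bar{\alpha}}_{t}}\,x+\sqrt{1-{\bar{\alpha}}_{t}}\,\epsilon$ with a linear $\beta$ schedule; the schedule endpoints, the helper names and the zero-output stand-in for ${\epsilon}_{\theta}$ are assumptions made for the example and are not specified in this section.

```python
import math
import random


def alpha_bar(T, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    out, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def noisy_sample(x, eps, a_bar_t):
    """x_t = sqrt(a_bar_t) * x + sqrt(1 - a_bar_t) * eps, elementwise."""
    s, n = math.sqrt(a_bar_t), math.sqrt(1.0 - a_bar_t)
    return [s * xi + n * ei for xi, ei in zip(x, eps)]


def dm_loss(x, eps_model, a_bars, rng):
    """One Monte Carlo sample of ||eps - eps_theta(x_t, t)||^2."""
    t = rng.randint(1, len(a_bars))         # t uniform on {1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x]  # eps ~ N(0, 1)
    x_t = noisy_sample(x, eps, a_bars[t - 1])
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))


a_bars = alpha_bar(T=1000)
rng = random.Random(0)
zero_model = lambda x_t, t: [0.0] * len(x_t)  # untrained stand-in for eps_theta
loss = dm_loss([0.5, -0.2, 0.1, 0.9], zero_model, a_bars, rng)
```

In practice ${\epsilon}_{\theta}$ would be a neural network and the loss would be averaged over a batch and minimized by gradient descent; the sketch only shows how the terms of Eq. (1) fit together.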
Generative Modeling of Latent Representations With our trained perceptual compression models consisting of $\mathcal{E}$ and $\mathcal{D}$, we now have access to an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away. Compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models, as they can now (i) focus on the important, semantic bits of the data and (ii) train in a lower dimensional, computationally much more efficient space.
Unlike previous work that relied on autoregressive, attention-based transformer models in a highly compressed, discrete latent space [66, 23, 103], we can take advantage of image-specific inductive biases that our model offers. This includes the ability to build the underlying U-Net primarily from 2D convolutional layers, and further focusing the objective on the perceptually most relevant bits using the reweighted bound, which now reads
$${L}_{LDM}:={\mathbb{E}}_{\mathcal{E}(x),\epsilon\sim \mathcal{N}(0,1),t}\left[{\Vert \epsilon-{\epsilon}_{\theta}({z}_{t},t)\Vert}_{2}^{2}\right].$$  (2) 
The neural backbone ${\epsilon}_{\theta}(\circ ,t)$ of our model is realized as a time-conditional U-Net [71]. Since the forward process is fixed, ${z}_{t}$ can be efficiently obtained from $\mathcal{E}$ during training, and samples from $p(z)$ can be decoded to image space with a single pass through $\mathcal{D}$.
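The difference between Eq. (1) and Eq. (2) is only where the noise is added: on $z=\mathcal{E}(x)$ rather than on $x$. The sketch below illustrates this under strong simplifying assumptions: average pooling and nearest-neighbour upsampling are toy stand-ins for the learned $\mathcal{E}$ and $\mathcal{D}$, the value of ${\bar{\alpha}}_{t}$ is passed in directly, and a zero-output function replaces the U-Net. None of these stand-ins reflect the actual models.

```python
import math
import random


def encode(x, f):
    """Toy stand-in for E: average-pool an H x W image by factor f."""
    H, W = len(x), len(x[0])
    return [[sum(x[i * f + a][j * f + b] for a in range(f) for b in range(f)) / (f * f)
             for j in range(W // f)] for i in range(H // f)]


def decode(z, f):
    """Toy stand-in for D: nearest-neighbour upsampling by factor f."""
    return [[z[i // f][j // f] for j in range(len(z[0]) * f)]
            for i in range(len(z) * f)]


def ldm_loss(x, eps_model, a_bar_t, t, f, rng):
    """One sample of the latent objective: noise is added to z = E(x), not to x."""
    z = encode(x, f)
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in z]
    s, n = math.sqrt(a_bar_t), math.sqrt(1.0 - a_bar_t)
    z_t = [[s * v + n * e for v, e in zip(zr, er)] for zr, er in zip(z, eps)]
    pred = eps_model(z_t, t)
    return sum((e - p) ** 2 for er, pr in zip(eps, pred) for e, p in zip(er, pr))


rng = random.Random(0)
x = [[rng.random() for _ in range(8)] for _ in range(8)]   # 8 x 8 toy "image"
zero_model = lambda z_t, t: [[0.0] * len(r) for r in z_t]  # stand-in for eps_theta
loss = ldm_loss(x, zero_model, a_bar_t=0.5, t=10, f=4, rng=rng)
x_rec = decode(encode(x, 4), 4)                            # single pass through D
```

With $f=4$ the toy denoiser sees a $2\times 2$ grid instead of $8\times 8$ pixels, and mapping a latent back to image space costs one call to the decoder regardless of how many denoising steps were used.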
[Figure panels: CelebA-HQ, FFHQ, LSUN-Churches, LSUN-Beds, ImageNet]
3.3 Conditioning Mechanisms
Similar to other types of generative models [56, 83], diffusion models are in principle capable of modeling conditional distributions of the form $p(z\mid y)$. This can be implemented with a conditional denoising autoencoder ${\epsilon}_{\theta}({z}_{t},t,y)$ and paves the way to controlling the synthesis process through inputs $y$ such as text [68], semantic maps [61, 33] or other image-to-image translation tasks [34].
In the context of image synthesis, however, combining the generative power of DMs with other types of conditionings beyond class-labels [15] or blurred variants of the input image [72] is so far an under-explored area of research.
We turn DMs into more flexible conditional image generators by augmenting their underlying U-Net backbone with the cross-attention mechanism [97], which is effective for learning attention-based models of various input modalities [36, 35]. To pre-process $y$ from various modalities (such as language prompts) we introduce a domain specific encoder ${\tau}_{\theta}$ that projects $y$ to an intermediate representation ${\tau}_{\theta}(y)\in {\mathbb{R}}^{M\times {d}_{\tau}}$, which is then mapped to the intermediate layers of the U-Net via a cross-attention layer implementing $\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{Q{K}^{T}}{\sqrt{d}}\right)\cdot V$, with
$$Q={W}_{Q}^{(i)}\cdot {\phi}_{i}({z}_{t}),\quad K={W}_{K}^{(i)}\cdot {\tau}_{\theta}(y),\quad V={W}_{V}^{(i)}\cdot {\tau}_{\theta}(y).$$ 
Here, ${\phi}_{i}({z}_{t})\in {\mathbb{R}}^{N\times {d}_{\epsilon}^{i}}$ denotes a (flattened) intermediate representation of the U-Net implementing ${\epsilon}_{\theta}$, and ${W}_{Q}^{(i)}\in {\mathbb{R}}^{d\times {d}_{\epsilon}^{i}}$, ${W}_{K}^{(i)}\in {\mathbb{R}}^{d\times {d}_{\tau}}$ and ${W}_{V}^{(i)}\in {\mathbb{R}}^{d\times {d}_{\tau}}$ are learnable projection matrices [97, 36], so that the query projection acts on the U-Net features and the key and value projections act on the conditioning representation. See Fig. 3 for a visual depiction.
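A single-head version of this cross-attention layer can be sketched in a few lines. The sketch assumes projection shapes for which the products in the equation above are well-defined: the query projection maps the ${d}_{\epsilon}^{i}$-dimensional U-Net features to $d$ dimensions, and the key and value projections map the ${d}_{\tau}$-dimensional conditioning tokens to $d$ dimensions. Tokens are stored as rows, so projections are applied from the right; all sizes and the random inputs are arbitrary illustrative choices.

```python
import math
import random


def matmul(A, B):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax_rows(S):
    out = []
    for row in S:
        mx = max(row)                        # subtract the max for numerical stability
        ex = [math.exp(v - mx) for v in row]
        tot = sum(ex)
        out.append([e / tot for e in ex])
    return out


def cross_attention(phi, tau, W_q, W_k, W_v):
    """phi: N x d_eps (U-Net features), tau: M x d_tau (conditioning tokens)."""
    Q = matmul(phi, transpose(W_q))          # N x d
    K = matmul(tau, transpose(W_k))          # M x d
    V = matmul(tau, transpose(W_v))          # M x d
    d = len(Q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, transpose(K))]
    return matmul(softmax_rows(scores), V)   # N x d


rng = random.Random(0)
rand = lambda r, c: [[rng.gauss(0.0, 1.0) for _ in range(c)] for _ in range(r)]
N, M, d_eps, d_tau, d = 6, 3, 5, 4, 2        # toy sizes, chosen arbitrarily
out = cross_attention(rand(N, d_eps), rand(M, d_tau),
                      rand(d, d_eps), rand(d, d_tau), rand(d, d_tau))
```

Every one of the $N$ spatial positions attends over the $M$ conditioning tokens, so the output keeps the spatial layout of the U-Net features while mixing in information from $y$.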
Based on image-conditioning pairs, we then learn the conditional LDM via
$${L}_{LDM}:={\mathbb{E}}_{\mathcal{E}(x),y,\epsilon\sim \mathcal{N}(0,1),t}\left[{\Vert \epsilon-{\epsilon}_{\theta}({z}_{t},t,{\tau}_{\theta}(y))\Vert}_{2}^{2}\right],$$  (3) 
where both ${\tau}_{\theta}$ and ${\epsilon}_{\theta}$ are jointly optimized via Eq. 3. This conditioning mechanism is flexible as ${\tau}_{\theta}$ can be parameterized with domain-specific experts, e.g. (unmasked) transformers [97] when $y$ are text prompts (see Sec. 4.3.1).
4 Experiments
[Figure: Text-to-Image Synthesis on LAION. 1.45B Model. Prompts: ’A street sign that reads “Latent Diffusion” ’; ’A zombie in the style of Picasso’; ’An image of an animal half mouse half octopus’; ’An illustration of a slightly conscious neural network’; ’A painting of a squirrel eating a burger’; ’A watercolor painting of a chair that looks like an octopus’; ’A shirt with the inscription: “I love generative models!” ’]
LDMs provide a means for flexible and computationally tractable diffusion-based image synthesis of various image modalities, which we empirically show in the following. Firstly, however, we analyze the gains of our models compared to pixel-based diffusion models in both training and inference. Interestingly, we find that LDMs trained in VQ-regularized latent spaces sometimes achieve better sample quality, even though the reconstruction capabilities of VQ-regularized first stage models slightly fall behind those of their continuous counterparts, cf. Tab. 8. A visual comparison between the effects of first stage regularization schemes on LDM training and their generalization abilities to resolutions $>{256}^{2}$ can be found in Appendix D.1. In E.2 we list details on architecture, implementation, training and evaluation for all results presented in this section.
4.1 On Perceptual Compression Tradeoffs
This section analyzes the behavior of our LDMs with different downsampling factors $f\in \{1,2,4,8,16,32\}$ (abbreviated as LDM-$f$, where LDM-1 corresponds to pixel-based DMs). To obtain a comparable test-field, we fix the computational resources to a single NVIDIA A100 for all experiments in this section and train all models for the same number of steps and with the same number of parameters.
Tab. 8 shows hyperparameters and reconstruction performance of the first stage models used for the LDMs compared in this section. Fig. 6 shows sample quality as a function of training progress for 2M steps of class-conditional models on the ImageNet [12] dataset. We see that i) small downsampling factors for LDM-$\{$1,2$\}$ result in slow training progress, whereas ii) overly large values of $f$ cause stagnating fidelity after comparably few training steps. Revisiting the analysis above (Fig. 1 and 2) we attribute this to i) leaving most of perceptual compression to the diffusion model and ii) too strong first stage compression resulting in information loss and thus limiting the achievable quality. LDM-$\{$4-16$\}$ strike a good balance between efficiency and perceptually faithful results, which manifests in a significant FID [29] gap of 38 between pixel-based diffusion (LDM-1) and LDM-8 after 2M training steps.
In Fig. 7, we compare models trained on CelebA-HQ [39] and ImageNet in terms of sampling speed for different numbers of denoising steps with the DDIM sampler [84] and plot it against FID-scores [29]. LDM-$\{$4-8$\}$ outperform models with unsuitable ratios of perceptual and conceptual compression. Especially compared to pixel-based LDM-1, they achieve much lower FID scores while simultaneously significantly increasing sample throughput. Complex datasets such as ImageNet require reduced compression rates to avoid reducing quality. In summary, LDM-4 and -8 offer the best conditions for achieving high-quality synthesis results.
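To illustrate why the number of denoising steps trades quality against throughput, the sketch below shows one deterministic DDIM update in its commonly used $\eta=0$ form, together with an evenly spaced timestep subsequence. The exact sampler configuration is not restated in this section, so the update rule, the helper names and the even spacing are assumptions.

```python
import math


def ddim_step(z_t, eps_pred, a_bar_t, a_bar_prev):
    """One deterministic DDIM update (eta = 0) from timestep t to an earlier one."""
    s_t, n_t = math.sqrt(a_bar_t), math.sqrt(1.0 - a_bar_t)
    # Estimate of the clean latent implied by the predicted noise.
    z0_pred = [(z - n_t * e) / s_t for z, e in zip(z_t, eps_pred)]
    s_p, n_p = math.sqrt(a_bar_prev), math.sqrt(1.0 - a_bar_prev)
    # Re-noise the estimate to the (lower) noise level of the earlier timestep.
    return [s_p * z0 + n_p * e for z0, e in zip(z0_pred, eps_pred)]


def timestep_subsequence(T, num_steps):
    """Evenly spaced timesteps in descending order (one possible choice)."""
    stride = T // num_steps
    return list(range(T - 1, -1, -stride))[:num_steps]


steps_50 = timestep_subsequence(1000, 50)
steps_200 = timestep_subsequence(1000, 200)
```

Each entry of the subsequence costs one evaluation of ${\epsilon}_{\theta}$, so sampling time grows roughly linearly with the number of steps, while every evaluation is cheaper in a smaller latent space.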
CelebA-HQ $256\times 256$

| Method | FID $\downarrow$ | Prec. $\uparrow$ | Recall $\uparrow$ |
|---|---|---|---|
| DC-VAE [63] | 15.8 | - | - |
| VQGAN+T. [23] (k=400) | 10.2 | - | - |
| PGGAN [39] | 8.0 | - | - |
| LSGM [93] | 7.22 | - | - |
| UDM [43] | 7.16 | - | - |
| LDM-4 (ours, 500-s$^{\dagger}$) | 5.11 | 0.72 | 0.49 |

FFHQ $256\times 256$

| Method | FID $\downarrow$ | Prec. $\uparrow$ | Recall $\uparrow$ |
|---|---|---|---|
| ImageBART [21] | 9.57 | - | - |
| U-Net GAN (+aug) [77] | 10.9 (7.6) | - | - |
| UDM [43] | 5.54 | - | - |
| StyleGAN [41] | 4.16 | 0.71 | 0.46 |
| ProjectedGAN [76] | 3.08 | 0.65 | 0.46 |
| LDM-4 (ours, 200-s) | 4.98 | 0.73 | 0.50 |

LSUN-Churches $256\times 256$

| Method | FID $\downarrow$ | Prec. $\uparrow$ | Recall $\uparrow$ |
|---|---|---|---|
| DDPM [30] | 7.89 | - | - |
| ImageBART [21] | 7.32 | - | - |
| PGGAN [39] | 6.42 | - | - |
| StyleGAN [41] | 4.21 | - | - |
| StyleGAN2 [42] | 3.86 | - | - |
| ProjectedGAN [76] | 1.59 | 0.61 | 0.44 |
| LDM-8$^{*}$ (ours, 200-s) | 4.02 | 0.64 | 0.52 |

LSUN-Bedrooms $256\times 256$

| Method | FID $\downarrow$ | Prec. $\uparrow$ | Recall $\uparrow$ |
|---|---|---|---|
| ImageBART [21] | 5.51 | - | - |
| DDPM [30] | 4.9 | - | - |
| UDM [43] | 4.57 | - | - |
| StyleGAN [41] | 2.35 | 0.59 | 0.48 |
| ADM [15] | 1.90 | 0.66 | 0.51 |
| ProjectedGAN [76] | 1.52 | 0.61 | 0.34 |
| LDM-4 (ours, 200-s) | 2.95 | 0.66 | 0.48 |
Text-Conditional Image Synthesis

| Method | FID $\downarrow$ | IS $\uparrow$ | ${N}_{\text{params}}$ | Notes |
|---|---|---|---|---|
| CogView$^{\dagger}$ [17] | 27.10 | 18.20 | 4B | self-ranking, rejection rate 0.017 |
| LAFITE$^{\dagger}$ [109] | 26.94 | 26.02 | 75M | |
| GLIDE$^{*}$ [59] | 12.24 | - | 6B | 277 DDIM steps, c.f.g. [32] $s=3$ |
| Make-A-Scene$^{*}$ [26] | 11.84 | - | 4B | c.f.g for AR models [98] $s=5$ |
| LDM-KL-8 | 23.31 | 20.03$\pm$0.33 | 1.45B | 250 DDIM steps |
| LDM-KL-8-G$^{*}$ | 12.63 | 30.29$\pm$0.42 | 1.45B | 250 DDIM steps, c.f.g. [32] $s=1.5$ |
4.2 Image Generation with Latent Diffusion
We train unconditional models of ${256}^{2}$ images on CelebA-HQ [39], FFHQ [41], LSUN-Churches and -Bedrooms [102] and evaluate i) the sample quality and ii) their coverage of the data manifold using i) FID [29] and ii) Precision-and-Recall [50]. Tab. 1 summarizes our results. On CelebA-HQ, we report a new state-of-the-art FID of $5.11$, outperforming previous likelihood-based models as well as GANs. We also outperform LSGM [93] where a latent diffusion model is trained jointly together with the first stage. In contrast, we train diffusion models in a fixed space and avoid the difficulty of weighing reconstruction quality against learning the prior over the latent space, see Fig. 12.
We outperform prior diffusion based approaches on all but the LSUN-Bedrooms dataset, where our score is close to ADM [15], despite utilizing half its parameters and requiring 4-times less training resources (see Appendix E.3.5). Moreover, LDMs consistently improve upon GAN-based methods in Precision and Recall, thus confirming the advantages of their mode-covering likelihood-based training objective over adversarial approaches. In Fig. 4 we also show qualitative results on each dataset.
4.3 Conditional Latent Diffusion
4.3.1 Transformer Encoders for LDMs
By introducing cross-attention based conditioning into LDMs we open them up for various conditioning modalities previously unexplored for diffusion models. For text-to-image modeling, we train a 1.45B parameter KL-regularized LDM conditioned on language prompts on LAION-400M [78]. We employ the BERT-tokenizer [14] and implement ${\tau}_{\theta}$ as a transformer [97] to infer a latent code which is mapped into the U-Net via (multi-head) cross-attention (Sec. 3.3). This combination of domain specific experts for learning a language representation and visual synthesis results in a powerful model, which generalizes well to complex, user-defined text prompts, cf. Fig. 8 and 5. For quantitative analysis, we follow prior work and evaluate text-to-image generation on the MS-COCO [51] validation set, where our model improves upon powerful AR [66, 17] and GAN-based [109] methods, cf. Tab. 2. We note that applying classifier-free diffusion guidance [32] greatly boosts sample quality, such that the guided LDM-KL-8-G is on par with the recent state-of-the-art AR [26] and diffusion models [59] for text-to-image synthesis, while substantially reducing parameter count. To further analyze the flexibility of the cross-attention based conditioning mechanism we also train models to synthesize images based on semantic layouts on OpenImages [49], and finetune on COCO [4], see Fig. 8. See Sec. D.3 for the quantitative evaluation and implementation details.
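Classifier-free guidance [32] combines a conditional and an unconditional noise prediction at sampling time. The sketch below uses one common parameterization in which a scale $s=1$ recovers the purely conditional prediction and $s>1$ extrapolates away from the unconditional one; whether the scale $s$ reported in the tables follows exactly this convention is an assumption, and the function name is hypothetical.

```python
def guided_eps(eps_cond, eps_uncond, s):
    """Classifier-free guidance: extrapolate from the unconditional prediction.

    eps_cond   -- noise predicted with the conditioning y
    eps_uncond -- noise predicted with the conditioning dropped
    s          -- guidance scale (s = 1 gives the plain conditional prediction)
    """
    return [u + s * (c - u) for c, u in zip(eps_cond, eps_uncond)]


eps_c = [1.0, 2.0]
eps_u = [0.0, 0.0]
guided = guided_eps(eps_c, eps_u, 1.5)
```

The guided prediction simply replaces ${\epsilon}_{\theta}({z}_{t},t,{\tau}_{\theta}(y))$ inside the sampler, at the cost of two network evaluations per step.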
Lastly, following prior work [15, 3, 23, 21], we evaluate our best-performing class-conditional ImageNet models with $f\in \{4,8\}$ from Sec. 4.1 in Tab. 3, Fig. 4 and Sec. D.4. Here we outperform the state of the art diffusion model ADM [15] while significantly reducing computational requirements and parameter count, cf. Tab. 18.
| Method | FID $\downarrow$ | IS $\uparrow$ | Precision $\uparrow$ | Recall $\uparrow$ | ${N}_{\text{params}}$ | Notes |
|---|---|---|---|---|---|---|
| BigGan-deep [3] | 6.95 | 203.6$\pm$2.6 | 0.87 | 0.28 | 340M | - |
| ADM [15] | 10.94 | 100.98 | 0.69 | 0.63 | 554M | 250 DDIM steps |
| ADM-G [15] | 4.59 | 186.7 | 0.82 | 0.52 | 608M | 250 DDIM steps |
| LDM-4 (ours) | 10.56 | 103.49$\pm$1.24 | 0.71 | 0.62 | 400M | 250 DDIM steps |
| LDM-4-G (ours) | 3.60 | 247.67$\pm$5.59 | 0.87 | 0.48 | 400M | 250 steps, c.f.g [32], $s=1.5$ |
4.3.2 Convolutional Sampling Beyond ${256}^{2}$
By concatenating spatially aligned conditioning information to the input of ${\epsilon}_{\theta}$, LDMs can serve as efficient general-purpose image-to-image translation models. We use this to train models for semantic synthesis, super-resolution (Sec. 4.4) and inpainting (Sec. 4.5). For semantic synthesis, we use images of landscapes paired with semantic maps [61, 23] and concatenate downsampled versions of the semantic maps with the latent image representation of a $f=4$ model (VQ-reg., see Tab. 8). We train on an input resolution of ${256}^{2}$ (crops from ${384}^{2}$) but find that our model generalizes to larger resolutions and can generate images up to the megapixel regime when evaluated in a convolutional manner (see Fig. 9). We exploit this behavior to also apply the super-resolution models in Sec. 4.4 and the inpainting models in Sec. 4.5 to generate large images between ${512}^{2}$ and ${1024}^{2}$. For this application, the signal-to-noise ratio (induced by the scale of the latent space) significantly affects the results. In Sec. D.1 we illustrate this when learning an LDM on (i) the latent space as provided by a $f=4$ model (KL-reg., see Tab. 8), and (ii) a rescaled version, scaled by the component-wise standard deviation.
The latter, in combination with classifier-free guidance [32], also enables the direct synthesis of $>{256}^{2}$ images for the text-conditional LDM-KL-8-G as in Fig. 13.
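The rescaling of the latent space mentioned above can be sketched as follows. The sketch assumes that "component-wise" means one standard deviation per latent channel, estimated over a set of encoded samples, and that rescaling means dividing by it; both the interpretation and the helper names are assumptions for illustration.

```python
import math


def channel_std(latents):
    """Per-channel standard deviation.

    latents: list of samples, each a list of channels, each a flat list of values.
    """
    num_channels = len(latents[0])
    stds = []
    for c in range(num_channels):
        vals = [v for z in latents for v in z[c]]
        mean = sum(vals) / len(vals)
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)))
    return stds


def rescale(z, stds):
    """Divide every channel by its standard deviation so each has unit scale."""
    return [[v / s for v in ch] for ch, s in zip(z, stds)]


# Two toy latents with two channels of very different scale.
latents = [[[0.0, 2.0], [10.0, 30.0]],
           [[4.0, 6.0], [50.0, 70.0]]]
stds = channel_std(latents)
rescaled = [rescale(z, stds) for z in latents]
```

After rescaling, a fixed noise schedule corrupts every channel at a comparable rate, which is one way to read the remark that the signal-to-noise ratio is induced by the scale of the latent space.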
4.4 Super-Resolution with Latent Diffusion
LDMs can be efficiently trained for super-resolution by directly conditioning on low-resolution images via concatenation (cf. Sec. 3.3). In a first experiment, we follow SR3 [72] and fix the image degradation to a bicubic interpolation with $4\times$-downsampling and train on ImageNet following SR3’s data processing pipeline. We use the $f=4$ autoencoding model pretrained on OpenImages (VQ-reg., cf. Tab. 8) and concatenate the low-resolution conditioning $y$ and the inputs to the U-Net, i.e. ${\tau}_{\theta}$ is the identity. Our qualitative and quantitative results (see Fig. 10 and Tab. 5) show competitive performance, and LDM-SR outperforms SR3 in FID while SR3 has a better IS. A simple image regression model achieves the highest PSNR and SSIM scores; however, these metrics do not align well with human perception [106] and favor blurriness over imperfectly aligned high frequency details [72]. Further, we conduct a user study comparing the pixel-baseline with LDM-SR. We follow SR3 [72] where human subjects were shown a low-res image in between two high-res images and asked for preference. The results in Tab. 4 affirm the good performance of LDM-SR. PSNR and SSIM can be pushed by using a post-hoc guiding mechanism [15] and we implement this image-based guider via a perceptual loss, see Sec. D.6.
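The remark that PSNR favors blurriness over imperfectly aligned high-frequency detail can be reproduced on a toy signal. The sketch uses the standard definition $\text{PSNR}=10\,{\log}_{10}(\text{MAX}^{2}/\text{MSE})$; the one-dimensional signals are made up for illustration and are not taken from the evaluation.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for signals with values in [0, max_value]."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [0.0, 1.0] * 8   # high-frequency stripe pattern
blurry = [0.5] * 16          # all detail averaged away
shifted = [1.0, 0.0] * 8     # sharp, correct texture, but misaligned by one pixel

psnr_blurry = psnr(reference, blurry)
psnr_shifted = psnr(reference, shifted)
```

The flat, blurry estimate scores higher than the sharp but shifted one, although the latter contains exactly the right texture. This is the kind of mismatch with perception that motivates reporting FID and a user study alongside PSNR and SSIM.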
[Figure panels: bicubic, LDM-SR, SR3]
SR on ImageNet Inpainting on Places User Study PixelDM ($f1$) LDM4 LAMA [88] LDM4 Task 1: Preference vs GT $\uparrow $ 16.0% 30.4% 13.6% 21.0% Task 2: Preference Score $\uparrow $ 29.4% 70.6% 31.9% 68.1%
Since the bicubic degradation process does not generalize well to images which do not follow this preprocessing, we also train a generic model, LDMBSR, by using more diverse degradation. The results are shown in Sec. D.6.1.
| Method | FID $\downarrow$ | IS $\uparrow$ | PSNR $\uparrow$ | SSIM $\uparrow$ | ${N}_{\text{params}}$ | $\left[\frac{\text{samples}}{s}\right]^{(\ast)}$ |
| --- | --- | --- | --- | --- | --- | --- |
| Image Regression [72] | 15.2 | 121.1 | 27.9 | 0.801 | 625M | N/A |
| SR3 [72] | 5.2 | 180.1 | 26.4 | 0.762 | 625M | N/A |
| LDM-4 (ours, 100 steps) | 2.8$^{\dagger}$/4.8$^{\ddagger}$ | 166.3 | 24.4$\pm$3.8 | 0.69$\pm$0.14 | 169M | 4.62 |
| LDM-4 (ours, big, 100 steps) | 2.4$^{\dagger}$/4.3$^{\ddagger}$ | 174.9 | 24.7$\pm$4.1 | 0.71$\pm$0.15 | 552M | 4.5 |
| LDM-4 (ours, 50 steps, guiding) | 4.4$^{\dagger}$/6.4$^{\ddagger}$ | 153.7 | 25.8$\pm$3.7 | 0.74$\pm$0.12 | 184M | 0.38 |
4.5 Inpainting with Latent Diffusion
Inpainting is the task of filling masked regions of an image with new content, either because parts of the image are corrupted or to replace existing but undesired content within the image. We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches for this task. Our evaluation follows the protocol of LaMa [88], a recent inpainting model that introduces a specialized architecture relying on Fast Fourier Convolutions [8]. The exact training & evaluation protocol on Places [108] is described in Sec. E.2.2.
We first analyze the effect of different design choices for the first stage.
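| Model (reg.-type) | train throughput (samples/sec.) | sampling throughput$^{\dagger}$ @256 | sampling throughput$^{\dagger}$ @512 | train+val (hours/epoch) | FID@2k (epoch 6) |
| --- | --- | --- | --- | --- | --- |
| LDM-1 (no first stage) | 0.11 | 0.26 | 0.07 | 20.66 | 24.74 |
| LDM-4 (KL, w/ attn) | 0.32 | 0.97 | 0.34 | 7.66 | 15.21 |
| LDM-4 (VQ, w/ attn) | 0.33 | 0.97 | 0.34 | 7.04 | 14.99 |
| LDM-4 (VQ, w/o attn) | 0.35 | 0.99 | 0.36 | 6.66 | 15.95 |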
[Figure: inpainting example; panels: input, result.]
In particular, we compare the inpainting efficiency of LDM-1 (i.e. a pixel-based conditional DM) with LDM-4, for both KL and VQ regularizations, as well as VQ-LDM-4 without any attention in the first stage (see Tab. 8), where the latter reduces GPU memory for decoding at high resolutions. For comparability, we fix the number of parameters for all models. Tab. 6 reports the training and sampling throughput at resolution ${256}^{2}$ and ${512}^{2}$, the total training time in hours per epoch and the FID score on the validation split after six epochs. Overall, we observe a speed-up of at least $2.7\times$ between pixel- and latent-based diffusion models while improving FID scores by a factor of at least $1.6\times$.
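The two factors quoted above can be recomputed from Tab. 6. The sketch below assumes that the speed-up is read from the hours-per-epoch column; the throughput columns give somewhat larger ratios. The variable names are illustrative.

```python
# Rows copied from Tab. 6: model -> (train+val hours/epoch, FID@2k after epoch 6).
table6 = {
    "LDM-1 (no first stage)": (20.66, 24.74),
    "LDM-4 (KL, w/ attn)": (7.66, 15.21),
    "LDM-4 (VQ, w/ attn)": (7.04, 14.99),
    "LDM-4 (VQ, w/o attn)": (6.66, 15.95),
}

base_hours, base_fid = table6["LDM-1 (no first stage)"]
latent_rows = {k: v for k, v in table6.items() if k.startswith("LDM-4")}

# Ratio of the pixel-based baseline to each latent model.
speedups = {k: base_hours / hours for k, (hours, _) in latent_rows.items()}
fid_gains = {k: base_fid / fid for k, (_, fid) in latent_rows.items()}

min_speedup = min(speedups.values())
min_fid_gain = min(fid_gains.values())
print(f"smallest speed-up: {min_speedup:.2f}x, smallest FID ratio: {min_fid_gain:.2f}x")
```

The smallest speed-up comes from the KL-regularized model and the smallest FID ratio from the VQ model without attention; both agree with the quoted factors after rounding to one decimal.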
The comparison with other inpainting approaches in Tab. 7 shows that our model with attention improves the overall image quality as measured by FID over that of [88]. LPIPS between the unmasked images and our samples is slightly higher than that of [88]. We attribute this to [88] only producing a single result which tends to recover more of an average image compared to the diverse results produced by our LDM (cf. Fig. 21). Additionally, in a user study (Tab. 4) human subjects favor our results over those of [88].
Based on these initial results, we also trained a larger diffusion model (big in Tab. 7) in the latent space of the VQ-regularized first stage without attention. Following [15], the UNet of this diffusion model uses attention layers on three levels of its feature hierarchy, the BigGAN [3] residual block for up- and downsampling and has 387M parameters instead of 215M. After training, we noticed a discrepancy in the quality of samples produced at resolutions ${256}^{2}$ and ${512}^{2}$, which we hypothesize to be caused by the additional attention modules. However, fine-tuning the model for half an epoch at resolution ${512}^{2}$ allows the model to adjust to the new feature statistics and sets a new state-of-the-art FID on image inpainting (big, w/o attn, w/ ft in Tab. 7, Fig. 11).
| Method | 40-50% masked: FID $\downarrow$ | 40-50% masked: LPIPS $\downarrow$ | All samples: FID $\downarrow$ | All samples: LPIPS $\downarrow$ |
| --- | --- | --- | --- | --- |
| LDM-4 (ours, big, w/ ft) | 9.39 | 0.246$\pm$0.042 | 1.50 | 0.137$\pm$0.080 |
| LDM-4 (ours, big, w/o ft) | 12.89 | 0.257$\pm$0.047 | 2.40 | 0.142$\pm$0.085 |
| LDM-4 (ours, w/ attn) | 11.87 | 0.257$\pm$0.042 | 2.15 | 0.144$\pm$0.084 |
| LDM-4 (ours, w/o attn) | 12.60 | 0.259$\pm$0.041 | 2.37 | 0.145$\pm$0.084 |
| LaMa [88]$^{\dagger}$ | 12.31 | 0.243$\pm$0.038 | 2.23 | 0.134$\pm$0.080 |
| LaMa [88] | 12.0 | 0.24 | 2.21 | 0.14 |
| CoModGAN [107] | 10.4 | 0.26 | 1.82 | 0.15 |
| RegionWise [52] | 21.3 | 0.27 | 4.75 | 0.15 |
| DeepFill v2 [104] | 22.1 | 0.28 | 5.20 | 0.16 |
| EdgeConnect [58] | 30.5 | 0.28 | 8.37 | 0.16 |
5 Limitations & Societal Impact
Limitations
While LDMs significantly reduce computational requirements compared to pixel-based approaches, their sequential sampling process is still slower than that of GANs. Moreover, the use of LDMs can be questionable when high precision is required: although the loss of image quality is very small in our $f=4$ autoencoding models (see Fig. 1), their reconstruction capability can become a bottleneck for tasks that require fine-grained accuracy in pixel space. We assume that our super-resolution models (Sec. 4.4) are already somewhat limited in this respect.
Societal Impact
Generative models for media like imagery are a double-edged sword: On the one hand, they enable various creative applications, and in particular approaches like ours that reduce the cost of training and inference have the potential to facilitate access to this technology and democratize its exploration. On the other hand, it also means that it becomes easier to create and disseminate manipulated data or spread misinformation and spam. In particular, the deliberate manipulation of images (“deep fakes”) is a common problem in this context, and women in particular are disproportionately affected by it [13, 24].
Generative models can also reveal their training data [5, 90], which is of great concern when the data contain sensitive or personal information and were collected without explicit consent. However, the extent to which this also applies to DMs of images is not yet fully understood.
Finally, deep learning modules tend to reproduce or exacerbate biases that are already present in the data [91, 38, 22]. While diffusion models achieve better coverage of the data distribution than e.g. GAN-based approaches, the extent to which our two-stage approach that combines adversarial training and a likelihood-based objective misrepresents the data remains an important research question.
For a more general, detailed discussion of the ethical considerations of deep generative models, see e.g. [13].
6 Conclusion
We have presented latent diffusion models, a simple and efficient way to significantly improve both the training and sampling efficiency of denoising diffusion models without degrading their quality. Based on this and our cross-attention conditioning mechanism, our experiments could demonstrate favorable results compared to state-of-the-art methods across a wide range of conditional image synthesis tasks without task-specific architectures.
This work has been supported by the German Federal Ministry for Economic Affairs and Energy within the project ’KI-Absicherung - Safe AI for automated driving’ and by the German Research Foundation (DFG) project 421703927.
References
 [1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1122–1131. IEEE Computer Society, 2017.
 [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan, 2017.
 [3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Int. Conf. Learn. Represent., 2019.
 [4] Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1209–1218. Computer Vision Foundation / IEEE Computer Society, 2018.
 [5] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel HerbertVoss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021.
 [6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 1691–1703. PMLR, 2020.
 [7] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In ICLR. OpenReview.net, 2021.
 [8] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolution. In NeurIPS, 2020.
 [9] Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. CoRR, abs/2011.10650, 2020.
 [10] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019.
 [11] Bin Dai and David P. Wipf. Diagnosing and enhancing VAE models. In ICLR (Poster). OpenReview.net, 2019.
 [12] Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and FeiFei Li. Imagenet: A largescale hierarchical image database. In CVPR, pages 248–255. IEEE Computer Society, 2009.
 [13] Emily Denton. Ethical considerations of generative ai. AI for Content Creation Workshop, CVPR, 2021.
 [14] Jacob Devlin, MingWei Chang, Kenton Lee, and Kristina Toutanova. BERT: pretraining of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
 [15] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. CoRR, abs/2105.05233, 2021.
 [16] Sander Dieleman. Musings on typicality, 2020.
 [17] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering texttoimage generation via transformers. CoRR, abs/2105.13290, 2021.
 [18] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Nonlinear independent components estimation, 2015.
 [19] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
 [20] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Adv. Neural Inform. Process. Syst., pages 658–666, 2016.
 [21] Patrick Esser, Robin Rombach, Andreas Blattmann, and Björn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. CoRR, abs/2108.08827, 2021.
 [22] Patrick Esser, Robin Rombach, and Björn Ommer. A note on data biases in generative models. arXiv preprint arXiv:2012.02516, 2020.
 [23] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for highresolution image synthesis. CoRR, abs/2012.09841, 2020.
 [24] Mary Anne Franks and Ari Ezra Waldman. Sex, lies, and videotape: Deep fakes and free speech delusions. Md. L. Rev., 78:892, 2018.
 [25] Kevin Frans, Lisa B. Soros, and Olaf Witkowski. Clipdraw: Exploring texttodrawing synthesis through languageimage encoders. ArXiv, abs/2106.14843, 2021.
 [26] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Makeascene: Scenebased texttoimage generation with human priors. CoRR, abs/2203.13131, 2022.
 [27] Ian J. Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, 2014.
 [28] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans, 2017.
 [29] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two timescale update rule converge to a local nash equilibrium. In Adv. Neural Inform. Process. Syst., pages 6626–6637, 2017.
 [30] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
 [31] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. CoRR, abs/2106.15282, 2021.
 [32] Jonathan Ho and Tim Salimans. Classifierfree diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
 [33] Phillip Isola, JunYan Zhu, Tinghui Zhou, and Alexei A. Efros. Imagetoimage translation with conditional adversarial networks. In CVPR, pages 5967–5976. IEEE Computer Society, 2017.
 [34] Phillip Isola, JunYan Zhu, Tinghui Zhou, and Alexei A. Efros. Imagetoimage translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, 2017.
 [35] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver IO: A general architecture for structured inputs & outputs. CoRR, abs/2107.14795, 2021.
 [36] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and João Carreira. Perceiver: General perception with iterative attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 1824 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 4651–4664. PMLR, 2021.
 [37] Manuel Jahn, Robin Rombach, and Björn Ommer. Highresolution complex scene synthesis with transformers. CoRR, abs/2105.06458, 2021.
 [38] Niharika Jain, Alberto Olmo, Sailik Sengupta, Lydia Manikonda, and Subbarao Kambhampati. Imperfect imaganation: Implications of gans exacerbating biases on facial data augmentation and snapchat selfie lenses. arXiv preprint arXiv:2001.09528, 2020.
 [39] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017.
 [40] Tero Karras, Samuli Laine, and Timo Aila. A stylebased generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4401–4410, 2019.
 [41] T. Karras, S. Laine, and T. Aila. A stylebased generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
 [42] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. CoRR, abs/1912.04958, 2019.
 [43] Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and IlChul Moon. Score matching model for unbounded data score. CoRR, abs/2106.05527, 2021.
 [44] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, 2018.
 [45] Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. CoRR, abs/2107.00630, 2021.
 [46] Diederik P. Kingma and Max Welling. AutoEncoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR, 2014.
 [47] Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. CoRR, abs/2106.00132, 2021.
 [48] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In ICLR. OpenReview.net, 2021.
 [49] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi PontTuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982, 2018.
 [50] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. CoRR, abs/1904.06991, 2019.
 [51] TsungYi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.
 [52] Yuqing Ma, Xianglong Liu, Shihao Bai, LeYi Wang, Aishan Liu, Dacheng Tao, and Edwin Hancock. Regionwise generative adversarial imageinpainting for large missing areas. ArXiv, abs/1909.12507, 2019.
 [53] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, JunYan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. CoRR, abs/2108.01073, 2021.
 [54] Lars M. Mescheder. On the convergence properties of GAN training. CoRR, abs/1801.04406, 2018.
 [55] Luke Metz, Ben Poole, David Pfau, and Jascha SohlDickstein. Unrolled generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 2426, 2017, Conference Track Proceedings. OpenReview.net, 2017.
 [56] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
 [57] Gautam Mittal, Jesse H. Engel, Curtis Hawthorne, and Ian Simon. Symbolic music generation with diffusion models. CoRR, abs/2103.16091, 2021.
 [58] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, and Mehran Ebrahimi. Edgeconnect: Generative image inpainting with adversarial edge learning. ArXiv, abs/1901.00212, 2019.
 [59] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with textguided diffusion models. CoRR, abs/2112.10741, 2021.
 [60] Anton Obukhov, Maximilian Seitzer, PoWei Wu, Semen Zhydenko, Jonathan Kyl, and Elvis YuJing Lin. Highfidelity performance metrics for generative models in pytorch, 2020. Version: 0.3.0, DOI: 10.5281/zenodo.4957738.
 [61] Taesung Park, MingYu Liu, TingChun Wang, and JunYan Zhu. Semantic image synthesis with spatiallyadaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
 [62] Taesung Park, MingYu Liu, TingChun Wang, and JunYan Zhu. Semantic image synthesis with spatiallyadaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [63] Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 1925, 2021, pages 823–832. Computer Vision Foundation / IEEE, 2021.
 [64] Gaurav Parmar, Richard Zhang, and JunYan Zhu. On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:2104.11222, 2021.
 [65] David A. Patterson, Joseph Gonzalez, Quoc V. Le, Chen Liang, LluisMiquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. CoRR, abs/2104.10350, 2021.
 [66] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zeroshot texttoimage generation. CoRR, abs/2102.12092, 2021.
 [67] Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse highfidelity images with VQVAE2. In NeurIPS, pages 14837–14847, 2019.
 [68] Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
 [69] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML, 2014.
 [70] Robin Rombach, Patrick Esser, and Björn Ommer. Networktonetwork translation with conditional invertible neural networks. In NeurIPS, 2020.
 [71] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. Unet: Convolutional networks for biomedical image segmentation. In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pages 234–241. Springer, 2015.
 [72] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image superresolution via iterative refinement. CoRR, abs/2104.07636, 2021.
 [73] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. CoRR, abs/1701.05517, 2017.
 [74] Dave Salvator. NVIDIA Developer Blog. https://developer.nvidia.com/blog/gettingimmediatespeedupswitha100tf32, 2020.
 [75] Robin SanRoman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. CoRR, abs/2104.02600, 2021.
 [76] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected gans converge faster. CoRR, abs/2111.01007, 2021.
 [77] Edgar Schönfeld, Bernt Schiele, and Anna Khoreva. A unet based discriminator for generative adversarial networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 1319, 2020, pages 8204–8213. Computer Vision Foundation / IEEE, 2020.
 [78] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion400m: Open dataset of clipfiltered 400 million imagetext pairs, 2021.
 [79] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. In Yoshua Bengio and Yann LeCun, editors, Int. Conf. Learn. Represent., 2015.
 [80] Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2C: diffusiondenoising models for fewshot conditional generation. CoRR, abs/2106.06819, 2021.
 [81] Charlie Snell. Alien Dreams: An Emerging Art Scene. https://ml.berkeley.edu/blog/posts/clipart/, 2021. [Online; accessed November2021].
 [82] Jascha SohlDickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. CoRR, abs/1503.03585, 2015.
 [83] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
 [84] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR. OpenReview.net, 2021.
 [85] Yang Song, Jascha SohlDickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Scorebased generative modeling through stochastic differential equations. CoRR, abs/2011.13456, 2020.
 [86] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for modern deep learning research. In The ThirtyFourth AAAI Conference on Artificial Intelligence, AAAI 2020, The ThirtySecond Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 712, 2020, pages 13693–13696. AAAI Press, 2020.
 [87] Wei Sun and Tianfu Wu. Learning layout and style reconfigurable gans for controllable image synthesis. CoRR, abs/2003.11571, 2020.
 [88] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor S. Lempitsky. Resolutionrobust large mask inpainting with fourier convolutions. ArXiv, abs/2109.07161, 2021.
 [89] Tristan Sylvain, Pengchuan Zhang, Yoshua Bengio, R. Devon Hjelm, and Shikhar Sharma. Objectcentric image generation from layouts. In ThirtyFifth AAAI Conference on Artificial Intelligence, AAAI 2021, ThirtyThird Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 29, 2021, pages 2647–2655. AAAI Press, 2021.
 [90] Patrick Tinsley, Adam Czajka, and Patrick Flynn. This face does not exist… but it might be yours! identity leakage in generative models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1320–1328, 2021.
 [91] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528. IEEE, 2011.
 [92] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In NeurIPS, 2020.
 [93] Arash Vahdat, Karsten Kreis, and Jan Kautz. Scorebased generative modeling in latent space. CoRR, abs/2106.05931, 2021.
 [94] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, koray kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, 2016.
 [95] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. CoRR, abs/1601.06759, 2016.
 [96] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NIPS, pages 6306–6315, 2017.
 [97] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
 [98] Rivers Have Wings. Tweet on Classifierfree guidance for autoregressive models. https://twitter.com/RiversHaveWings/status/1478093658716966912, 2022.
 [99] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface’s transformers: Stateoftheart natural language processing. CoRR, abs/1910.03771, 2019.
 [100] Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A symbiosis between variational autoencoders and energybased models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 37, 2021. OpenReview.net, 2021.
 [101] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using VQVAE and transformers. CoRR, abs/2104.10157, 2021.
 [102] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: construction of a largescale image dataset using deep learning with humans in the loop. CoRR, abs/1506.03365, 2015.
 [103] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vectorquantized image modeling with improved vqgan, 2021.
 [104] Jiahui Yu, Zhe L. Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Freeform image inpainting with gated convolution. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4470–4479, 2019.
 [105] K. Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image superresolution. ArXiv, abs/2103.14006, 2021.
 [106] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [107] Shengyu Zhao, Jianwei Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric IChao Chang, and Yan Xu. Large scale image completion via comodulated generative adversarial networks. ArXiv, abs/2103.10428, 2021.
 [108] Bolei Zhou, Àgata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:1452–1464, 2018.
 [109] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. LAFITE: towards languagefree training for texttoimage generation. CoRR, abs/2111.13792, 2021.
Appendix
[Figure: text-to-image samples for the prompts ’A painting of the last supper by Picasso.’, ’An oil painting of a latent space.’, ’An epic painting of Gandalf the Black summoning thunder and lightning in the mountains.’ and ’A sunset over a mountain range, vector image.’]
Appendix A Changelog
Here we list changes between this version (https://arxiv.org/abs/2112.10752v2) of the paper and the previous version, i.e. https://arxiv.org/abs/2112.10752v1.

• We updated the results on text-to-image synthesis in Sec. 4.3 which were obtained by training a new, larger model (1.45B parameters). This also includes a new comparison to very recent competing methods on this task that were published on arXiv at the same time as ([59, 109]) or after ([26]) the publication of our work.

• We updated results on class-conditional synthesis on ImageNet in Sec. 4.1, Tab. 3 (see also Sec. D.4) obtained by retraining the model with a larger batch size. The corresponding qualitative results in Fig. 26 and Fig. 27 were also updated. Both the updated text-to-image and the class-conditional model now use classifier-free guidance [32] as a measure to increase visual fidelity.
Appendix B Detailed Information on Denoising Diffusion Models
Diffusion models can be specified in terms of a signal-to-noise ratio $\text{SNR}(t)=\frac{{\alpha}_{t}^{2}}{{\sigma}_{t}^{2}}$ consisting of sequences ${({\alpha}_{t})}_{t=1}^{T}$ and ${({\sigma}_{t})}_{t=1}^{T}$ which, starting from a data sample ${x}_{0}$, define a forward diffusion process $q$ as
$$q({x}_{t}|{x}_{0})=\mathcal{N}({x}_{t}|{\alpha}_{t}{x}_{0},{\sigma}_{t}^{2}\mathbb{I})$$  (4)
with the Markov structure for $s<t$:
$$q({x}_{t}|{x}_{s})=\mathcal{N}({x}_{t}|{\alpha}_{t|s}{x}_{s},{\sigma}_{t|s}^{2}\mathbb{I})$$  (5)
$${\alpha}_{t|s}=\frac{{\alpha}_{t}}{{\alpha}_{s}}$$  (6)
$${\sigma}_{t|s}^{2}={\sigma}_{t}^{2}-{\alpha}_{t|s}^{2}{\sigma}_{s}^{2}$$  (7)
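The consistency of Eqs. (4)-(7) can be checked numerically: composing $q({x}_{s}|{x}_{0})$ with $q({x}_{t}|{x}_{s})$ should reproduce the marginal $q({x}_{t}|{x}_{0})$. The sketch below assumes a variance-preserving cosine-like schedule with ${\alpha}_{t}^{2}+{\sigma}_{t}^{2}=1$, which is only one possible choice and is not prescribed by the text; the helper `transition` is an illustrative name.

```python
import numpy as np

T = 1000
t_grid = np.arange(1, T + 1)
# Assumed schedule (illustrative only): alpha_t^2 + sigma_t^2 = 1.
alpha = np.cos(0.5 * np.pi * t_grid / (T + 1))
sigma = np.sqrt(1.0 - alpha**2)

def transition(s, t):
    """Return alpha_{t|s} and sigma^2_{t|s} for 1 <= s < t <= T (Eqs. 6-7)."""
    a_ts = alpha[t - 1] / alpha[s - 1]
    var_ts = sigma[t - 1] ** 2 - a_ts**2 * sigma[s - 1] ** 2
    return a_ts, var_ts

s, t = 200, 700
a_ts, var_ts = transition(s, t)

# x_t = a_ts * x_s + noise, with x_s = alpha_s * x_0 + sigma_s * noise.
mean_coeff = a_ts * alpha[s - 1]                   # should equal alpha_t
total_var = a_ts**2 * sigma[s - 1] ** 2 + var_ts   # should equal sigma_t^2
print(np.isclose(mean_coeff, alpha[t - 1]), np.isclose(total_var, sigma[t - 1] ** 2))
```

Eq. (7) is exactly the variance that makes the two-step process agree with the one-step marginal, which is what gives the forward process its Markov structure.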
Denoising diffusion models are generative models $p({x}_{0})$ which revert this process with a similar Markov structure running backward in time, i.e. they are specified as
$$p({x}_{0})={\int}_{z}p({x}_{T})\prod_{t=1}^{T}p({x}_{t-1}|{x}_{t})$$  (8)
The evidence lower bound (ELBO) associated with this model then decomposes over the discrete time steps as
$$-\log p({x}_{0})\le \mathbb{KL}(q({x}_{T}|{x}_{0})|p({x}_{T}))+\sum_{t=1}^{T}{\mathbb{E}}_{q({x}_{t}|{x}_{0})}\mathbb{KL}(q({x}_{t-1}|{x}_{t},{x}_{0})|p({x}_{t-1}|{x}_{t}))$$  (9)
The prior $p({x}_{T})$ is typically chosen as a standard normal distribution and the first term of the ELBO then depends only on the final signal-to-noise ratio $\text{SNR}(T)$. To minimize the remaining terms, a common choice to parameterize $p({x}_{t-1}|{x}_{t})$ is to specify it in terms of the true posterior $q({x}_{t-1}|{x}_{t},{x}_{0})$ but with the unknown ${x}_{0}$ replaced by an estimate ${x}_{\theta}({x}_{t},t)$ based on the current step ${x}_{t}$. This gives [45]
$$p({x}_{t-1}|{x}_{t}):=q({x}_{t-1}|{x}_{t},{x}_{\theta}({x}_{t},t))$$  (10)
$$=\mathcal{N}({x}_{t-1}|{\mu}_{\theta}({x}_{t},t),{\sigma}_{t|t-1}^{2}\frac{{\sigma}_{t-1}^{2}}{{\sigma}_{t}^{2}}\mathbb{I}),$$  (11)
where the mean can be expressed as
$${\mu}_{\theta}({x}_{t},t)=\frac{{\alpha}_{t|t-1}{\sigma}_{t-1}^{2}}{{\sigma}_{t}^{2}}{x}_{t}+\frac{{\alpha}_{t-1}{\sigma}_{t|t-1}^{2}}{{\sigma}_{t}^{2}}{x}_{\theta}({x}_{t},t).$$  (12)
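The coefficients in Eqs. (11) and (12) follow from standard Gaussian conditioning. As a hedged sanity check, the sketch below compares them with the posterior obtained by multiplying the two Gaussians $q({x}_{s}|{x}_{0})$ and $q({x}_{t}|{x}_{s})$ in ${x}_{s}$, for a general $s<t$ (Eq. (12) is the case $s=t-1$). The numerical values and function names are illustrative assumptions.

```python
import numpy as np

def posterior_closed_form(alpha_s, sigma_s, alpha_t, sigma_t):
    """Coefficients of q(x_s | x_t, x_0), written as in Eqs. (11)-(12)."""
    a_ts = alpha_t / alpha_s
    var_ts = sigma_t**2 - a_ts**2 * sigma_s**2
    coeff_xt = a_ts * sigma_s**2 / sigma_t**2
    coeff_x0 = alpha_s * var_ts / sigma_t**2
    var = var_ts * sigma_s**2 / sigma_t**2
    return coeff_xt, coeff_x0, var

def posterior_by_conditioning(alpha_s, sigma_s, alpha_t, sigma_t):
    """Same quantities from the product of two Gaussian factors in x_s."""
    a_ts = alpha_t / alpha_s
    var_ts = sigma_t**2 - a_ts**2 * sigma_s**2
    precision = 1.0 / sigma_s**2 + a_ts**2 / var_ts
    var = 1.0 / precision
    return var * a_ts / var_ts, var * alpha_s / sigma_s**2, var

# Illustrative values with alpha^2 + sigma^2 = 1 at both steps.
args = (0.9, np.sqrt(1 - 0.9**2), 0.6, 0.8)
closed = posterior_closed_form(*args)
bayes = posterior_by_conditioning(*args)
print(np.allclose(closed, bayes))
```

Replacing ${x}_{0}$ by the estimate ${x}_{\theta}({x}_{t},t)$ in the posterior mean then gives ${\mu}_{\theta}$.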
In this case, the sum in the ELBO simplifies to
$$\sum_{t=1}^{T}{\mathbb{E}}_{q({x}_{t}|{x}_{0})}\mathbb{KL}(q({x}_{t-1}|{x}_{t},{x}_{0})|p({x}_{t-1}|{x}_{t}))=\sum_{t=1}^{T}{\mathbb{E}}_{\mathcal{N}(\epsilon|0,\mathbb{I})}\frac{1}{2}(\text{SNR}(t-1)-\text{SNR}(t)){\Vert {x}_{0}-{x}_{\theta}({\alpha}_{t}{x}_{0}+{\sigma}_{t}\epsilon,t)\Vert}^{2}$$  (13)
Following [30], we use the reparameterization
$${\epsilon}_{\theta}({x}_{t},t)=({x}_{t}-{\alpha}_{t}{x}_{\theta}({x}_{t},t))/{\sigma}_{t}$$  (14)
to express the reconstruction term as a denoising objective,
$${\Vert {x}_{0}-{x}_{\theta}({\alpha}_{t}{x}_{0}+{\sigma}_{t}\epsilon,t)\Vert}^{2}=\frac{{\sigma}_{t}^{2}}{{\alpha}_{t}^{2}}{\Vert \epsilon-{\epsilon}_{\theta}({\alpha}_{t}{x}_{0}+{\sigma}_{t}\epsilon,t)\Vert}^{2}$$  (15)
and the reweighting, which assigns each of the terms the same weight and results in Eq. (1).
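The identity in Eq. (15) follows directly from substituting Eq. (14). A minimal Python sketch that checks it numerically is given below; the values of $\alpha_t$, $\sigma_t$ and the perturbed estimate `x0_hat` are hypothetical and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_t, sigma_t = 0.8, 0.6                  # illustrative values, not from the paper
x0 = rng.normal(size=16)
eps = rng.normal(size=16)
x_t = alpha_t * x0 + sigma_t * eps           # noised sample

x0_hat = x0 + 0.1 * rng.normal(size=16)      # stand-in for x_theta(x_t, t)
eps_hat = (x_t - alpha_t * x0_hat) / sigma_t # Eq. (14)

lhs = np.sum((x0 - x0_hat) ** 2)                                 # reconstruction error
rhs = (sigma_t**2 / alpha_t**2) * np.sum((eps - eps_hat) ** 2)   # rescaled noise error
print(lhs, rhs)
```

The two errors differ only by the factor $\sigma_t^2/\alpha_t^2$, which is why a reweighting that drops all per-step factors yields a plain noise-prediction loss.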
Appendix C Image Guiding Mechanisms
Figure panels (left to right): Samples ${256}^{2}$, Guided Convolutional Samples ${512}^{2}$, Convolutional Samples ${512}^{2}$

An intriguing feature of diffusion models is that unconditional models can be conditioned at test time [85, 82, 15]. In particular, [15] presented an algorithm to guide both unconditional and conditional models trained on the ImageNet dataset with a classifier $\mathrm{log}\,{p}_{\mathrm{\Phi}}(y|{x}_{t})$, trained on each ${x}_{t}$ of the diffusion process. We directly build on this formulation and introduce post-hoc image-guiding:
For an epsilon-parameterized model with fixed variance, the guiding algorithm as introduced in [15] reads:
$$\widehat{\epsilon}\leftarrow {\epsilon}_{\theta}({z}_{t},t)+\sqrt{1-{\alpha}_{t}^{2}}\,{\nabla}_{{z}_{t}}\mathrm{log}\,{p}_{\mathrm{\Phi}}(y|{z}_{t}).$$  (16) 
This can be interpreted as an update correcting the “score” ${\epsilon}_{\theta}$ with a conditional distribution $\mathrm{log}\,{p}_{\mathrm{\Phi}}(y|{z}_{t})$.
So far, this scenario has only been applied to single-class classification models. We reinterpret the guiding distribution ${p}_{\mathrm{\Phi}}(y|T(\mathcal{D}({z}_{0}({z}_{t}))))$ as a general-purpose image-to-image translation task given a target image $y$, where $T$ can be any differentiable transformation adapted to the image-to-image translation task at hand, such as the identity, a downsampling operation or similar.
As an example, we can assume a Gaussian guider with fixed variance ${\sigma}^{2}=1$, such that
$$\mathrm{log}\,{p}_{\mathrm{\Phi}}(y|{z}_{t})=-\frac{1}{2}{\Vert y-T(\mathcal{D}({z}_{0}({z}_{t})))\Vert}_{2}^{2}$$  (17) 
becomes an ${L}_{2}$ regression objective.
Fig. 14 demonstrates how this formulation can serve as an upsampling mechanism of an unconditional model trained on ${256}^{2}$ images, where unconditional samples of size ${256}^{2}$ guide the convolutional synthesis of ${512}^{2}$ images and $T$ is a $2\times $ bicubic downsampling. Following this motivation, we also experiment with a perceptual similarity guiding and replace the ${L}_{2}$ objective with the LPIPS [106] metric, see Sec. 4.4.
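To make the Gaussian guider of Eq. (17) concrete, the following Python sketch evaluates the log-density and its gradient for a linear transformation $T$. It is a simplified illustration under stated assumptions: $T$ is a $2\times$ average pooling (a stand-in for the bicubic downsampling used for Fig. 14), and the gradient is taken with respect to the decoded estimate $\mathcal{D}({z}_{0}({z}_{t}))$ rather than ${z}_{t}$. Propagating it further through $\mathcal{D}$ and ${z}_{0}(\cdot)$ would in practice be left to automatic differentiation. All function names are hypothetical.

```python
import numpy as np

def downsample2x(img):
    """T: 2x average pooling of an (H, W) array; a stand-in for bicubic downsampling."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def downsample2x_adjoint(res):
    """Adjoint (transpose) of downsample2x, needed for the gradient."""
    return np.repeat(np.repeat(res, 2, axis=0), 2, axis=1) / 4.0

def log_p_guider(y, x):
    """Eq. (17) with unit variance: -1/2 * ||y - T(x)||^2."""
    return -0.5 * np.sum((y - downsample2x(x)) ** 2)

def grad_log_p_guider(y, x):
    """Gradient of log_p_guider with respect to x, i.e. T^T (y - T(x))."""
    return downsample2x_adjoint(y - downsample2x(x))

rng = np.random.default_rng(2)
y = rng.normal(size=(4, 4))   # low-resolution target image
x = rng.normal(size=(8, 8))   # stands in for the decoded estimate D(z_0(z_t))
g = grad_log_p_guider(y, x)

# Central finite difference for one pixel as a check of the analytic gradient.
h = 1e-6
e = np.zeros_like(x)
e[3, 5] = h
fd = (log_p_guider(y, x + e) - log_p_guider(y, x - e)) / (2 * h)
print(g[3, 5], fd)
```

The gradient pushes the high-resolution estimate towards images whose downsampled version matches the target $y$, which is the mechanism behind the guided upsampling described above.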
Appendix D Additional Results
D.1 Choosing the Signal-to-Noise Ratio for High-Resolution Synthesis
Figure panels (left to right): KL-reg, w/o rescaling; KL-reg, w/ rescaling; VQ-reg, w/o rescaling

As discussed in Sec. 4.3.2, the signal-to-noise ratio induced by the variance of the latent space (i.e. $\text{Var}(z)/{\sigma}_{t}^{2}$) significantly affects the results for convolutional sampling. For example, when training an LDM directly in the latent space of a KL-regularized model (see Tab. 8), this ratio is very high, such that the model allocates a lot of semantic detail early on in the reverse denoising process. In contrast, when rescaling the latent space by the component-wise standard deviation of the latents as described in Sec. G, the SNR is decreased. We illustrate the effect on convolutional sampling for semantic image synthesis in Fig. 15. Note that the VQ-regularized space has a variance close to $1$, such that it does not have to be rescaled.
D.2 Full List of all First Stage Models
We provide a complete list of various autoencoding models trained on the OpenImages dataset in Tab. 8.

| $f$ | $\lvert\mathcal{Z}\rvert$ | $c$ | R-FID $\downarrow$ | R-IS $\uparrow$ | PSNR $\uparrow$ | PSIM $\downarrow$ | SSIM $\uparrow$ |
|---|---|---|---|---|---|---|---|
| 16 VQGAN [23] | 16384 | 256 | 4.98 | – | 19.9 $\pm 3.4$ | 1.83 $\pm 0.42$ | 0.51 $\pm 0.18$ |
| 16 VQGAN [23] | 1024 | 256 | 7.94 | – | 19.4 $\pm 3.3$ | 1.98 $\pm 0.43$ | 0.50 $\pm 0.18$ |
| 8 DALL-E [66] | 8192 | - | 32.01 | – | 22.8 $\pm 2.1$ | 1.95 $\pm 0.51$ | 0.73 $\pm 0.13$ |
| 32 | 16384 | 16 | 31.83 | 40.40 $\pm 1.07$ | 17.45 $\pm 2.90$ | 2.58 $\pm 0.48$ | 0.41 $\pm 0.18$ |
| 16 | 16384 | 8 | 5.15 | 144.55 $\pm 3.74$ | 20.83 $\pm 3.61$ | 1.73 $\pm 0.43$ | 0.54 $\pm 0.18$ |
| 8 | 16384 | 4 | 1.14 | 201.92 $\pm 3.97$ | 23.07 $\pm 3.99$ | 1.17 $\pm 0.36$ | 0.65 $\pm 0.16$ |
| 8 | 256 | 4 | 1.49 | 194.20 $\pm 3.87$ | 22.35 $\pm 3.81$ | 1.26 $\pm 0.37$ | 0.62 $\pm 0.16$ |
| 4 | 8192 | 3 | 0.58 | 224.78 $\pm 5.35$ | 27.43 $\pm 4.26$ | 0.53 $\pm 0.21$ | 0.82 $\pm 0.10$ |
| 4^{†} | 8192 | 3 | 1.06 | 221.94 $\pm 4.58$ | 25.21 $\pm 4.17$ | 0.72 $\pm 0.26$ | 0.76 $\pm 0.12$ |
| 4 | 256 | 3 | 0.47 | 223.81 $\pm 4.58$ | 26.43 $\pm 4.22$ | 0.62 $\pm 0.24$ | 0.80 $\pm 0.11$ |
| 2 | 2048 | 2 | 0.16 | 232.75 $\pm 5.09$ | 30.85 $\pm 4.12$ | 0.27 $\pm 0.12$ | 0.91 $\pm 0.05$ |
| 2 | 64 | 2 | 0.40 | 226.62 $\pm 4.83$ | 29.13 $\pm 3.46$ | 0.38 $\pm 0.13$ | 0.90 $\pm 0.05$ |
| 32 | KL | 64 | 2.04 | 189.53 $\pm 3.68$ | 22.27 $\pm 3.93$ | 1.41 $\pm 0.40$ | 0.61 $\pm 0.17$ |
| 32 | KL | 16 | 7.3 | 132.75 $\pm 2.71$ | 20.38 $\pm 3.56$ | 1.88 $\pm 0.45$ | 0.53 $\pm 0.18$ |
| 16 | KL | 16 | 0.87 | 210.31 $\pm 3.97$ | 24.08 $\pm 4.22$ | 1.07 $\pm 0.36$ | 0.68 $\pm 0.15$ |
| 16 | KL | 8 | 2.63 | 178.68 $\pm 4.08$ | 21.94 $\pm 3.92$ | 1.49 $\pm 0.42$ | 0.59 $\pm 0.17$ |
| 8 | KL | 4 | 0.90 | 209.90 $\pm 4.92$ | 24.19 $\pm 4.19$ | 1.02 $\pm 0.35$ | 0.69 $\pm 0.15$ |
| 4 | KL | 3 | 0.27 | 227.57 $\pm 4.89$ | 27.53 $\pm 4.54$ | 0.55 $\pm 0.24$ | 0.82 $\pm 0.11$ |
| 2 | KL | 2 | 0.086 | 232.66 $\pm 5.16$ | 32.47 $\pm 4.19$ | 0.20 $\pm 0.09$ | 0.93 $\pm 0.04$ |
D.3 Layout-to-Image Synthesis
layout-to-image synthesis on the COCO dataset

| Method | COCO $256\times 256$ FID $\downarrow$ | OpenImages $256\times 256$ FID $\downarrow$ | OpenImages $512\times 512$ FID $\downarrow$ |
|---|---|---|---|
| LostGAN-V2 [87] | 42.55 | - | - |
| OC-GAN [89] | 41.65 | - | - |
| SPADE [62] | 41.11 | - | - |
| VQGAN+T [37] | 56.58 | 45.33 | 48.11 |
| LDM-8 (100 steps, ours) | 42.06^{†} | - | - |
| LDM-4 (200 steps, ours) | 40.91^{∗} | 32.02 | 35.80 |

Here we provide the quantitative evaluation and additional samples for our layout-to-image models from Sec. 4.3.1. We train a model on the COCO [4] and one on the OpenImages [49] dataset, which we subsequently additionally finetune on COCO. Tab. 9 shows the result. Our COCO model reaches the performance of recent state-of-the-art models in layout-to-image synthesis, when following their training and evaluation protocol [89]. When finetuning from the OpenImages model, we surpass these works. Our OpenImages model surpasses the results of Jahn et al. [37] by a margin of nearly 11 in terms of FID. In Fig. 16 we show additional samples of the model finetuned on COCO.
D.4 Class-Conditional Image Synthesis on ImageNet
Tab. 10 contains the results for our class-conditional LDM measured in FID and Inception score (IS). LDM-8 requires significantly fewer parameters and less compute (see Tab. 18) to achieve very competitive performance. Similar to previous work, we can further boost the performance by training a classifier on each noise scale and guiding with it, see Sec. C. Unlike the pixel-based methods, this classifier is trained very cheaply in latent space. For additional qualitative results, see Fig. 26 and Fig. 27.

| Method | FID $\downarrow$ | IS $\uparrow$ | Precision $\uparrow$ | Recall $\uparrow$ | ${N}_{\text{params}}$ | Notes |
|---|---|---|---|---|---|---|
| SR3 [72] | 11.30 | - | - | - | 625M | - |
| ImageBART [21] | 21.19 | - | - | - | 3.5B | - |
| ImageBART [21] | 7.44 | - | - | - | 3.5B | 0.05 acc. rate^{∗} |
| VQGAN+T [23] | 17.04 | 70.6 $\pm 1.8$ | - | - | 1.3B | - |
| VQGAN+T [23] | 5.88 | 304.8 $\pm 3.6$ | - | - | 1.3B | 0.05 acc. rate^{∗} |
| BigGAN-deep [3] | 6.95 | 203.6 $\pm 2.6$ | 0.87 | 0.28 | 340M | - |
| ADM [15] | 10.94 | 100.98 | 0.69 | 0.63 | 554M | 250 DDIM steps |
| ADM-G [15] | 4.59 | 186.7 | 0.82 | 0.52 | 608M | 250 DDIM steps |
| ADM-G, ADM-U [15] | 3.85 | 221.72 | 0.84 | 0.53 | n/a | 2 $\times$ 250 DDIM steps |
| CDM [31] | 4.88 | 158.71 $\pm 2.26$ | - | - | n/a | 2 $\times$ 100 DDIM steps |
| LDM-8 (ours) | 17.41 | 72.92 $\pm 2.6$ | 0.65 | 0.62 | 395M | 200 DDIM steps, 2.9M train steps, batch size 64 |
| LDM-8-G (ours) | 8.11 | 190.43 $\pm 2.60$ | 0.83 | 0.36 | 506M | 200 DDIM steps, classifier scale 10, 2.9M train steps, batch size 64 |
| LDM-8 (ours) | 15.51 | 79.03 $\pm 1.03$ | 0.65 | 0.63 | 395M | 200 DDIM steps, 4.8M train steps, batch size 64 |
| LDM-8-G (ours) | 7.76 | 209.52 $\pm 4.24$ | 0.84 | 0.35 | 506M | 200 DDIM steps, classifier scale 10, 4.8M train steps, batch size 64 |
| LDM-4 (ours) | 10.56 | 103.49 $\pm 1.24$ | 0.71 | 0.62 | 400M | 250 DDIM steps, 178K train steps, batch size 1200 |
| LDM-4-G (ours) | 3.95 | 178.22 $\pm 2.43$ | 0.81 | 0.55 | 400M | 250 DDIM steps, unconditional guidance [32] scale 1.25, 178K train steps, batch size 1200 |
| LDM-4-G (ours) | 3.60 | 247.67 $\pm 5.59$ | 0.87 | 0.48 | 400M | 250 DDIM steps, unconditional guidance [32] scale 1.5, 178K train steps, batch size 1200 |
D.5 Sample Quality vs. V100 Days (Continued from Sec. 4.1)
For the assessment of sample quality over the training progress in Sec. 4.1, we reported FID and IS scores as a function of train steps. Another possibility is to report these metrics over the used resources in V100 days. Such an analysis is additionally provided in Fig. 17, showing qualitatively similar results.
D.6 Super-Resolution

| Method | FID $\downarrow$ | IS $\uparrow$ | PSNR $\uparrow$ | SSIM $\uparrow$ |
|---|---|---|---|---|
| Image Regression [72] | 15.2 | 121.1 | 27.9 | 0.801 |
| SR3 [72] | 5.2 | 180.1 | 26.4 | 0.762 |
| LDM-4 (ours, 100 steps) | 2.8^{†} / 4.8^{‡} | 166.3 | 24.4 $\pm 3.8$ | 0.69 $\pm 0.14$ |
| LDM-4 (ours, 50 steps, guiding) | 4.4^{†} / 6.4^{‡} | 153.7 | 25.8 $\pm 3.7$ | 0.74 $\pm 0.12$ |
| LDM-4 (ours, 100 steps, guiding) | 4.4^{†} / 6.4^{‡} | 154.1 | 25.7 $\pm 3.7$ | 0.73 $\pm 0.12$ |
| LDM-4 (ours, 100 steps, +15 ep.) | 2.6^{†} / 4.6^{‡} | 169.76 $\pm 5.03$ | 24.4 $\pm 3.8$ | 0.69 $\pm 0.14$ |
| Pixel-DM (100 steps, +15 ep.) | 5.1^{†} / 7.1^{‡} | 163.06 $\pm 4.67$ | 24.1 $\pm 3.3$ | 0.59 $\pm 0.12$ |

For better comparability between LDMs and diffusion models in pixel space, we extend our analysis from Tab. 5 by comparing a diffusion model trained for the same number of steps and with a number of parameters comparable to our LDM (it is not possible to exactly match both architectures since the diffusion model operates in pixel space). The results of this comparison are shown in the last two rows of Tab. 11 and demonstrate that LDM achieves better performance while allowing for significantly faster sampling. A qualitative comparison is given in Fig. 20, which shows random samples from both LDM and the diffusion model in pixel space.
D.6.1 LDM-BSR: General Purpose SR Model via Diverse Image Degradation
Figure panels (left to right): bicubic, LDM-SR, LDM-BSR

To evaluate generalization of our LDM-SR, we apply it both on synthetic LDM samples from a class-conditional ImageNet model (Sec. 4.1) and on images crawled from the internet. Interestingly, we observe that LDM-SR, trained only with a bicubically downsampled conditioning as in [72], does not generalize well to images which do not follow this preprocessing. Hence, to obtain a super-resolution model for a wide range of real-world images, which can contain complex superpositions of camera noise, compression artifacts, blur and interpolations, we replace the bicubic downsampling operation in LDM-SR with the degradation pipeline from [105]. The BSR-degradation process is a degradation pipeline which applies JPEG compression noise, camera sensor noise, different image interpolations for downsampling, Gaussian blur kernels and Gaussian noise in a random order to an image. We found that using the BSR-degradation process with the original parameters as in [105] leads to a very strong degradation process. Since a more moderate degradation process seemed appropriate for our application, we adapted the parameters of the BSR-degradation (our adapted degradation process can be found in our code base at https://github.com/CompVis/latent-diffusion). Fig. 18 illustrates the effectiveness of this approach by directly comparing LDM-SR with LDM-BSR. The latter produces images much sharper than the models confined to a fixed preprocessing, making it suitable for real-world applications. Further results of LDM-BSR are shown on LSUN-cows in Fig. 19.
Figure panels (left to right): bicubic, LDM-BSR

Figure panels (left to right): input, GT, Pixel Baseline #1, Pixel Baseline #2, LDM #1, LDM #2

Appendix E Implementation Details and Hyperparameters
E.1 Hyperparameters
We provide an overview of the hyperparameters of all trained LDM models in Tab. 12, Tab. 13, Tab. 14 and Tab. 15.
| | CelebA-HQ $256\times 256$ | FFHQ $256\times 256$ | LSUN-Churches $256\times 256$ | LSUN-Bedrooms $256\times 256$ |
|---|---|---|---|---|
| $f$ | 4 | 4 | 8 | 4 |
| $z$-shape | $64\times 64\times 3$ | $64\times 64\times 3$ | - | $64\times 64\times 3$ |
| $\lvert\mathcal{Z}\rvert$ | 8192 | 8192 | - | 8192 |
| Diffusion steps | 1000 | 1000 | 1000 | 1000 |
| Noise Schedule | linear | linear | linear | linear |
| ${N}_{\text{params}}$ | 274M | 274M | 294M | 274M |
| Channels | 224 | 224 | 192 | 224 |
| Depth | 2 | 2 | 2 | 2 |
| Channel Multiplier | 1,2,3,4 | 1,2,3,4 | 1,2,2,4,4 | 1,2,3,4 |
| Attention resolutions | 32, 16, 8 | 32, 16, 8 | 32, 16, 8, 4 | 32, 16, 8 |
| Head Channels | 32 | 32 | 24 | 32 |
| Batch Size | 48 | 42 | 96 | 48 |
| Iterations^{∗} | 410k | 635k | 500k | 1.9M |
| Learning Rate | 9.6e-5 | 8.4e-5 | 5.e-5 | 9.6e-5 |
| | LDM-1 | LDM-2 | LDM-4 | LDM-8 | LDM-16 | LDM-32 |
|---|---|---|---|---|---|---|
| $z$-shape | $256\times 256\times 3$ | $128\times 128\times 2$ | $64\times 64\times 3$ | $32\times 32\times 4$ | $16\times 16\times 8$ | $8\times 8\times 32$ |
| $\lvert\mathcal{Z}\rvert$ | - | 2048 | 8192 | 16384 | 16384 | 16384 |
| Diffusion steps | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| Noise Schedule | linear | linear | linear | linear | linear | linear |
| Model Size | 396M | 391M | 391M | 395M | 395M | 395M |
| Channels | 192 | 192 | 192 | 256 | 256 | 256 |
| Depth | 2 | 2 | 2 | 2 | 2 | 2 |
| Channel Multiplier | 1,1,2,2,4,4 | 1,2,2,4,4 | 1,2,3,5 | 1,2,4 | 1,2,4 | 1,2,4 |
| Number of Heads | 1 | 1 | 1 | 1 | 1 | 1 |
| Batch Size | 7 | 9 | 40 | 64 | 112 | 112 |
| Iterations | 2M | 2M | 2M | 2M | 2M | 2M |
| Learning Rate | 4.9e-5 | 6.3e-5 | 8e-5 | 6.4e-5 | 4.5e-5 | 4.5e-5 |
| Conditioning | CA | CA | CA | CA | CA | CA |
| CA-resolutions | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | 16, 8, 4 | 8, 4, 2 |
| Embedding Dimension | 512 | 512 | 512 | 512 | 512 | 512 |
| Transformers Depth | 1 | 1 | 1 | 1 | 1 | 1 |
| | LDM-1 | LDM-2 | LDM-4 | LDM-8 | LDM-16 | LDM-32 |
|---|---|---|---|---|---|---|
| $z$-shape | $256\times 256\times 3$ | $128\times 128\times 2$ | $64\times 64\times 3$ | $32\times 32\times 4$ | $16\times 16\times 8$ | $8\times 8\times 32$ |
| $\lvert\mathcal{Z}\rvert$ | - | 2048 | 8192 | 16384 | 16384 | 16384 |
| Diffusion steps | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| Noise Schedule | linear | linear | linear | linear | linear | linear |
| Model Size | 270M | 265M | 274M | 258M | 260M | 258M |
| Channels | 192 | 192 | 224 | 256 | 256 | 256 |
| Depth | 2 | 2 | 2 | 2 | 2 | 2 |
| Channel Multiplier | 1,1,2,2,4,4 | 1,2,2,4,4 | 1,2,3,4 | 1,2,4 | 1,2,4 | 1,2,4 |
| Attention resolutions | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | 16, 8, 4 | 8, 4, 2 |
| Head Channels | 32 | 32 | 32 | 32 | 32 | 32 |
| Batch Size | 9 | 11 | 48 | 96 | 128 | 128 |
| Iterations^{∗} | 500k | 500k | 500k | 500k | 500k | 500k |
| Learning Rate | 9e-5 | 1.1e-4 | 9.6e-5 | 9.6e-5 | 1.3e-4 | 1.3e-4 |
| Task | Text-to-Image | Layout-to-Image | Layout-to-Image | Class-Label-to-Image | Super Resolution | Inpainting | Semantic-Map-to-Image |
|---|---|---|---|---|---|---|---|
| Dataset | LAION | OpenImages | COCO | ImageNet | ImageNet | Places | Landscapes |
| $f$ | 8 | 4 | 8 | 4 | 4 | 4 | 8 |
| $z$-shape | $32\times 32\times 4$ | $64\times 64\times 3$ | $32\times 32\times 4$ | $64\times 64\times 3$ | $64\times 64\times 3$ | $64\times 64\times 3$ | $32\times 32\times 4$ |
| $\lvert\mathcal{Z}\rvert$ | - | 8192 | 16384 | 8192 | 8192 | 8192 | 16384 |
| Diffusion steps | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| Noise Schedule | linear | linear | linear | linear | linear | linear | linear |
| Model Size | 1.45B | 306M | 345M | 395M | 169M | 215M | 215M |
| Channels | 320 | 128 | 192 | 192 | 160 | 128 | 128 |
| Depth | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Channel Multiplier | 1,2,4,4 | 1,2,3,4 | 1,2,4 | 1,2,3,5 | 1,2,2,4 | 1,4,8 | 1,4,8 |
| Number of Heads | 8 | 1 | 1 | 1 | 1 | 1 | 1 |
| Dropout | - | - | 0.1 | - | - | - | - |
| Batch Size | 680 | 24 | 48 | 1200 | 64 | 128 | 48 |
| Iterations | 390K | 4.4M | 170K | 178K | 860K | 360K | 360K |
| Learning Rate | 1.0e-4 | 4.8e-5 | 4.8e-5 | 1.0e-4 | 6.4e-5 | 1.0e-6 | 4.8e-5 |
| Conditioning | CA | CA | CA | CA | concat | concat | concat |
| (C)A-resolutions | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | - | - | - |
| Embedding Dimension | 1280 | 512 | 512 | 512 | - | - | - |
| Transformer Depth | 1 | 3 | 2 | 1 | - | - | - |
E.2 Implementation Details
E.2.1 Implementations of ${\tau}_{\theta}$ for conditional LDMs
For the experiments on text-to-image and layout-to-image (Sec. 4.3.1) synthesis, we implement the conditioner ${\tau}_{\theta}$ as an unmasked transformer which processes a tokenized version of the input $y$ and produces an output $\zeta :={\tau}_{\theta}(y)$, where $\zeta \in {\mathbb{R}}^{M\times {d}_{\tau}}$. More specifically, the transformer is implemented from $N$ transformer blocks consisting of global self-attention layers, layer normalization and position-wise MLPs as follows (adapted from https://github.com/lucidrains/x-transformers):
$\zeta \leftarrow \text{TokEmb}(y)+\text{PosEmb}(y)$  (18)  
$\text{for } i=1,\mathrm{\dots},N:$  
${\zeta}_{1}\leftarrow \text{LayerNorm}(\zeta )$  (19)  
${\zeta}_{2}\leftarrow \text{MultiHeadSelfAttention}({\zeta}_{1})+\zeta $  (20)  
${\zeta}_{3}\leftarrow \text{LayerNorm}({\zeta}_{2})$  (21)  
$\zeta \leftarrow \text{MLP}({\zeta}_{3})+{\zeta}_{2}$  (22)  
$\zeta \leftarrow \text{LayerNorm}(\zeta )$  (23) 
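The update rules in Eqs. (18)-(23) can be sketched in a few lines of NumPy. This is a minimal illustration, not our implementation: the layer normalization has no learnable affine parameters, the MLP uses a ReLU with a 4x hidden expansion, and all weights, sizes and names (`tau_theta`, `blocks`, `tok_emb`, `pos_emb`) are hypothetical choices made for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its feature dimension (no learned scale/shift)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, p, n_heads):
    """Global (unmasked) multi-head self-attention over M tokens of width d."""
    M, d = x.shape
    dh = d // n_heads
    split = lambda a: a.reshape(M, n_heads, dh).transpose(1, 0, 2)
    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # (heads, M, M)
    out = (att @ v).transpose(1, 0, 2).reshape(M, d)
    return out @ p["Wo"]

def mlp(x, p):
    """Position-wise MLP; ReLU is an assumption made for brevity."""
    return np.maximum(x @ p["W1"], 0.0) @ p["W2"]

def tau_theta(tokens, tok_emb, pos_emb, blocks, n_heads):
    zeta = tok_emb[tokens] + pos_emb[: len(tokens)]     # Eq. (18)
    for p in blocks:                                    # i = 1, ..., N
        z1 = layer_norm(zeta)                           # Eq. (19)
        z2 = self_attention(z1, p, n_heads) + zeta      # Eq. (20)
        z3 = layer_norm(z2)                             # Eq. (21)
        zeta = mlp(z3, p) + z2                          # Eq. (22)
    return layer_norm(zeta)                             # Eq. (23)

rng = np.random.default_rng(3)
vocab, M, d, N, heads = 50, 6, 16, 2, 4                 # toy sizes
init = lambda *shape: 0.1 * rng.normal(size=shape)
blocks = [dict(Wq=init(d, d), Wk=init(d, d), Wv=init(d, d), Wo=init(d, d),
               W1=init(d, 4 * d), W2=init(4 * d, d)) for _ in range(N)]
tokens = rng.integers(0, vocab, size=M)
zeta = tau_theta(tokens, init(vocab, d), init(M, d), blocks, heads)
print(zeta.shape)
```

The output has one row per input token, i.e. $\zeta \in {\mathbb{R}}^{M\times {d}_{\tau}}$ with ${d}_{\tau}=16$ in this toy setting.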
With $\zeta $ available, the conditioning is mapped into the UNet via the cross-attention mechanism as depicted in Fig. 3. We modify the “ablated UNet” [15] architecture and replace the self-attention layer with a shallow (unmasked) transformer consisting of $T$ blocks with alternating layers of (i) self-attention, (ii) a position-wise MLP and (iii) a cross-attention layer; see Tab. 16. Note that without (ii) and (iii), this architecture is equivalent to the “ablated UNet”.
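The cross-attention layer (iii) can be illustrated with the following single-head NumPy sketch, in which queries come from the flattened UNet features and keys and values come from the conditioning $\zeta $. The projection matrices, the feature sizes and the function name `cross_attention` are hypothetical. The multi-head split and the residual connections of the actual block are omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(phi, zeta, Wq, Wk, Wv):
    """phi: (h*w, d) flattened UNet features; zeta: (M, d_tau) conditioning tokens."""
    q = phi @ Wq                                     # queries from image features
    k = zeta @ Wk                                    # keys from the conditioning
    v = zeta @ Wv                                    # values from the conditioning
    att = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (h*w, M) attention weights
    return att @ v, att

rng = np.random.default_rng(4)
h, w, d, M, d_tau = 4, 4, 8, 5, 12                   # toy sizes
phi = rng.normal(size=(h * w, d))
zeta = rng.normal(size=(M, d_tau))
out, att = cross_attention(phi, zeta,
                           rng.normal(size=(d, d)),
                           rng.normal(size=(d_tau, d)),
                           rng.normal(size=(d_tau, d)))
print(out.shape, att.shape)
```

Each spatial location attends over all $M$ conditioning tokens, so the output keeps the spatial layout of the UNet features regardless of the sequence length of $\zeta $.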
While it would be possible to increase the representational power of ${\tau}_{\theta}$ by additionally conditioning on the time step $t$, we do not pursue this choice as it reduces the speed of inference. We leave a more detailed analysis of this modification to future work.
For the text-to-image model, we rely on a publicly available tokenizer [99] (https://huggingface.co/transformers/model_doc/bert.html#berttokenizerfast).
The layout-to-image model discretizes the spatial locations of the bounding boxes and encodes each box as a $(l,b,c)$-tuple, where $l$ denotes the (discrete) top-left and $b$ the bottom-right position. Class information is contained in $c$.
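A bounding box can be turned into such an $(l,b,c)$-tuple as in the following Python sketch. The grid resolution of 16 and the row-major indexing of grid cells are assumptions made for illustration only, and the helper names are hypothetical.

```python
def discretize(x, y, grid):
    """Map normalized coordinates in [0, 1] to one cell index on a grid x grid lattice."""
    col = min(int(x * grid), grid - 1)
    row = min(int(y * grid), grid - 1)
    return row * grid + col          # row-major cell index (assumed convention)

def encode_box(box, class_id, grid=16):
    """box = (x0, y0, x1, y1) in normalized coordinates; returns the (l, b, c) tuple."""
    x0, y0, x1, y1 = box
    l = discretize(x0, y0, grid)     # discrete top-left position
    b = discretize(x1, y1, grid)     # discrete bottom-right position
    return (l, b, class_id)

tokens = encode_box((0.25, 0.5, 0.75, 1.0), class_id=7, grid=16)
print(tokens)  # (132, 252, 7)
```

Concatenating the tuples of all boxes in a layout gives the token sequence that is passed to ${\tau}_{\theta}$.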
See Tab. 17 for the hyperparameters of ${\tau}_{\theta}$ and Tab. 13 for those of the UNet for both of the above tasks.
Note that the class-conditional model as described in Sec. 4.1 is also implemented via cross-attention, where ${\tau}_{\theta}$ is a single learnable embedding layer with a dimensionality of 512, mapping classes $y$ to $\zeta \in {\mathbb{R}}^{1\times 512}$.

| Layer | Output shape |
|---|---|
| input | ${\mathbb{R}}^{h\times w\times c}$ |
| LayerNorm | ${\mathbb{R}}^{h\times w\times c}$ |
| Conv1x1 | ${\mathbb{R}}^{h\times w\times d\cdot {n}_{h}}$ |
| Reshape | ${\mathbb{R}}^{h\cdot w\times d\cdot {n}_{h}}$ |
| SelfAttention (block repeated $\times T$) | ${\mathbb{R}}^{h\cdot w\times d\cdot {n}_{h}}$ |
| MLP (block repeated $\times T$) | ${\mathbb{R}}^{h\cdot w\times d\cdot {n}_{h}}$ |
| CrossAttention (block repeated $\times T$) | ${\mathbb{R}}^{h\cdot w\times d\cdot {n}_{h}}$ |
| Reshape | ${\mathbb{R}}^{h\times w\times d\cdot {n}_{h}}$ |
| Conv1x1 | ${\mathbb{R}}^{h\times w\times c}$ |

| | Text-to-Image | Layout-to-Image |
|---|---|---|
| seq-length | 77 | 92 |
| depth $N$ | 32 | 16 |
| dim | 1280 | 512 |
E.2.2 Inpainting
Figure panels (left to right): input, GT, LaMa [88], LDM #1, LDM #2, LDM #3

Figure panels (left to right): input, result, input, result

For our experiments on image inpainting in Sec. 4.5, we used the code of [88] to generate synthetic masks. We use a fixed set of 2k validation and 30k testing samples from Places [108]. During training, we use random crops of size $256\times 256$ and evaluate on crops of size $512\times 512$. This follows the training and testing protocol in [88] and reproduces their reported metrics (see ^{†} in Tab. 7). We include additional qualitative results of LDM-4, w/ attn in Fig. 21 and of LDM-4, w/o attn, big, w/ ft in Fig. 22.
E.3 Evaluation Details
This section provides additional details on evaluation for the experiments shown in Sec. 4.
E.3.1 Quantitative Results in Unconditional and Class-Conditional Image Synthesis
We follow common practice and estimate the statistics for calculating the FID, Precision and Recall scores [29, 50] shown in Tab. 1 and 10 based on 50k samples from our models and the entire training set of each of the shown datasets. For calculating FID scores we use the torch-fidelity package [60]. However, since different data processing pipelines might lead to different results [64], we also evaluate our models with the script provided by Dhariwal and Nichol [15]. We find that results mainly coincide, except for the ImageNet and LSUN-Bedrooms datasets, where we notice slightly varying scores of 7.76 (torch-fidelity) vs. 7.77 (Dhariwal and Nichol) and 2.95 vs. 3.0. For the future we emphasize the importance of a unified procedure for sample quality assessment. Precision and Recall are also computed by using the script provided by Dhariwal and Nichol.
E.3.2 Text-to-Image Synthesis
Following the evaluation protocol of [66] we compute FID and Inception Score for the Text-to-Image models from Tab. 2 by comparing generated samples with 30000 samples from the validation set of the MS-COCO dataset [51]. FID and Inception Scores are computed with torch-fidelity.
E.3.3 Layout-to-Image Synthesis
For assessing the sample quality of our Layout-to-Image models from Tab. 9 on the COCO dataset, we follow common practice [89, 37, 87] and compute FID scores on the 2048 unaugmented examples of the COCO Segmentation Challenge split. To obtain better comparability, we use the exact same samples as in [37]. For the OpenImages dataset we similarly follow their protocol and use 2048 center-cropped test images from the validation set.
E.3.4 Super Resolution
We evaluate the super-resolution models on ImageNet following the pipeline suggested in [72], i.e. images with a shorter side of less than $256$ px are removed (both for training and evaluation). On ImageNet, the low-resolution images are produced using bicubic interpolation with anti-aliasing. FIDs are evaluated using torch-fidelity [60], and we produce samples on the validation split. For FID scores, we additionally compare to reference features computed on the train split, see Tab. 5 and Tab. 11.
E.3.5 Efficiency Analysis
For efficiency reasons we compute the sample quality metrics plotted in Fig. 6, 17 and 7 based on 5k samples. Therefore, the results might vary from those shown in Tab. 1 and 10. All models have a comparable number of parameters as provided in Tab. 13 and 14. We maximize the learning rates of the individual models such that they still train stably. Therefore, the learning rates slightly vary between different runs cf. Tab. 13 and 14.
E.3.6 User Study
For the results of the user study presented in Tab. 4 we followed the protocol of [72] and used the two-alternative forced-choice paradigm to assess human preference scores for two distinct tasks. In Task-1 subjects were shown a low-resolution/masked image between the corresponding ground-truth high-resolution/unmasked version and a synthesized image, which was generated by using the middle image as conditioning. For super-resolution subjects were asked: 'Which of the two images is a better high quality version of the low resolution image in the middle?'. For inpainting we asked 'Which of the two images contains more realistic inpainted regions of the image in the middle?'. In Task-2, humans were similarly shown the low-res/masked version and asked for preference between two corresponding images generated by the two competing methods. As in [72] humans viewed the images for 3 seconds before responding.
Appendix F Computational Requirements
| Method | Generator Compute | Classifier Compute | Overall Compute | Inference Throughput^{∗} | ${N}_{\text{params}}$ | FID $\downarrow$ | IS $\uparrow$ | Precision $\uparrow$ | Recall $\uparrow$ |
|---|---|---|---|---|---|---|---|---|---|
| LSUN Churches ${256}^{2}$ | | | | | | | | | |
| StyleGAN2 [42]^{†} | 64 | - | 64 | - | 59M | 3.86 | - | - | - |
| LDM-8 (ours, 100 steps, 410K) | 18 | - | 18 | 6.80 | 256M | 4.02 | - | 0.64 | 0.52 |
| LSUN Bedrooms ${256}^{2}$ | | | | | | | | | |
| ADM [15]^{†} (1000 steps) | 232 | - | 232 | 0.03 | 552M | 1.9 | - | 0.66 | 0.51 |
| LDM-4 (ours, 200 steps, 1.9M) | 60 | - | 55 | 1.07 | 274M | 2.95 | - | 0.66 | 0.48 |
| CelebA-HQ ${256}^{2}$ | | | | | | | | | |
| LDM-4 (ours, 500 steps, 410K) | 14.4 | - | 14.4 | 0.43 | 274M | 5.11 | - | 0.72 | 0.49 |
| FFHQ ${256}^{2}$ | | | | | | | | | |
| StyleGAN2 [42] | 32.13^{‡} | - | 32.13^{†} | - | 59M | 3.8 | - | - | - |
| LDM-4 (ours, 200 steps, 635K) | 26 | - | 26 | 1.07 | 274M | 4.98 | - | 0.73 | 0.50 |
| ImageNet ${256}^{2}$ | | | | | | | | | |
| VQGAN-f-4 (ours, first stage) | 29 | - | 29 | - | 55M | 0.58^{††} | - | - | - |
| VQGAN-f-8 (ours, first stage) | 66 | - | 66 | - | 68M | 1.14^{††} | - | - | - |
| BigGAN-deep [3]^{†} | 128-256 | | 128-256 | - | 340M | 6.95 | 203.6 $\pm 2.6$ | 0.87 | 0.28 |
| ADM [15] (250 steps)^{†} | 916 | - | 916 | 0.12 | 554M | 10.94 | 100.98 | 0.69 | 0.63 |
| ADM-G [15] (25 steps)^{†} | 916 | 46 | 962 | 0.7 | 608M | 5.58 | - | 0.81 | 0.49 |
| ADM-G [15] (250 steps)^{†} | 916 | 46 | 962 | 0.07 | 608M | 4.59 | 186.7 | 0.82 | 0.52 |
| ADM-G, ADM-U [15] (250 steps)^{†} | 329 | 30 | 349 | n/a | n/a | 3.85 | 221.72 | 0.84 | 0.53 |
| LDM-8-G (ours, 100, 2.9M) | 79 | 12 | 91 | 1.93 | 506M | 8.11 | 190.4 $\pm 2.6$ | 0.83 | 0.36 |
| LDM-8 (ours, 200 ddim steps 2.9M, batch size 64) | 79 | - | 79 | 1.9 | 395M | 17.41 | 72.92 | 0.65 | 0.62 |
| LDM-4 (ours, 250 ddim steps 178K, batch size 1200) | 271 | - | 271 | 0.7 | 400M | 10.56 | 103.49 $\pm 1.24$ | 0.71 | 0.62 |
| LDM-4-G (ours, 250 ddim steps 178K, batch size 1200, classifier-free guidance [32] scale 1.25) | 271 | - | 271 | 0.4 | 400M | 3.95 | 178.22 $\pm 2.43$ | 0.81 | 0.55 |
| LDM-4-G (ours, 250 ddim steps 178K, batch size 1200, classifier-free guidance [32] scale 1.5) | 271 | - | 271 | 0.4 | 400M | 3.60 | 247.67 $\pm 5.59$ | 0.87 | 0.48 |

In Tab. 18 we provide a more detailed analysis of our used compute resources and compare our best performing models on the CelebA-HQ, FFHQ, LSUN and ImageNet datasets with the recent state-of-the-art models by using their provided numbers, cf. [15]. As they report their used compute in V100 days and we train all our models on a single NVIDIA A100 GPU, we convert the A100 days to V100 days by assuming a $\times 2.2$ speedup of A100 vs V100 [74]. This factor corresponds to the speedup of the A100 over the V100 for a UNet, as defined in Fig. 1 in [74]. To assess sample quality, we additionally report FID scores on the reported datasets. We closely reach the performance of state-of-the-art methods such as StyleGAN2 [42] and ADM [15] while significantly reducing the required compute resources.
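The conversion between the two units is a single multiplication, sketched below in Python. The input of 10 A100 days is a hypothetical example and not a number taken from Tab. 18.

```python
A100_TO_V100 = 2.2  # assumed speedup of A100 over V100 for a UNet, following [74]

def to_v100_days(a100_days):
    """Convert compute measured in A100 days to the equivalent number of V100 days."""
    return a100_days * A100_TO_V100

example = to_v100_days(10.0)   # hypothetical training run of 10 A100 days
print(round(example, 2))       # 22.0
```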
Appendix G Details on Autoencoder Models
We train all our autoencoder models in an adversarial manner following [23], such that a patch-based discriminator ${D}_{\psi}$ is optimized to differentiate original images from reconstructions $\mathcal{D}(\mathcal{E}(x))$.
To avoid arbitrarily scaled latent spaces, we regularize the latent $z$ to be zero-centered and to have small variance by introducing a regularizing loss term ${L}_{reg}$.
We investigate two different regularization methods: (i) a low-weighted Kullback-Leibler term between ${q}_{\mathcal{E}}(z|x)=\mathcal{N}(z;{\mathcal{E}}_{\mu},{\mathcal{E}}_{{\sigma}^{2}})$ and a standard normal distribution $\mathcal{N}(z;0,1)$ as in a standard variational autoencoder [46, 69], and (ii) regularizing the latent space with a vector quantization layer by learning a codebook of $|\mathcal{Z}|$ different exemplars [96].
To obtain high-fidelity reconstructions we only use a very small regularization for both scenarios, i.e. we either weight the $\mathbb{K}\mathbb{L}$ term by a factor $\sim {10}^{-6}$ or choose a high codebook dimensionality $|\mathcal{Z}|$.
The full objective to train the autoencoding model $(\mathcal{E},\mathcal{D})$ reads:
$${L}_{\text{Autoencoder}}=\underset{\mathcal{E},\mathcal{D}}{\mathrm{min}}\,\underset{\psi}{\mathrm{max}}\left({L}_{rec}(x,\mathcal{D}(\mathcal{E}(x)))-{L}_{adv}(\mathcal{D}(\mathcal{E}(x)))+\mathrm{log}\,{D}_{\psi}(x)+{L}_{reg}(x;\mathcal{E},\mathcal{D})\right)$$  (25) 
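For the KL-regularized variant (i), ${L}_{reg}$ has a closed form. The following Python sketch evaluates it for a diagonal Gaussian encoder output and adds it to a reconstruction loss with a small weight. Only the regularizer is shown; the adversarial terms of Eq. (25) are omitted, and the parameterization via `logvar` as well as the function names are assumptions made for illustration.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar)

def regularized_loss(rec_loss, mu, logvar, kl_weight=1e-6):
    """Reconstruction loss plus the low-weighted KL term (adversarial terms omitted)."""
    return rec_loss + kl_weight * kl_to_standard_normal(mu, logvar)

mu = np.array([0.5, -1.0])                 # toy encoder mean
logvar = np.log(np.array([0.25, 4.0]))     # toy encoder log-variance
kl = kl_to_standard_normal(mu, logvar)
total = regularized_loss(rec_loss=0.1, mu=mu, logvar=logvar)
print(kl, total)
```

With a weight on the order of ${10}^{-6}$ the KL term barely changes the total loss, which is consistent with the goal of keeping the reconstructions high-fidelity while only mildly constraining the scale of the latent space.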
DM Training in Latent Space
Note that for training diffusion models on the learned latent space, we again distinguish two cases when learning $p(z)$ or $p(z|y)$ (Sec. 4.3): (i) For a KL-regularized latent space, we sample $z={\mathcal{E}}_{\mu}(x)+{\mathcal{E}}_{\sigma}(x)\cdot \epsilon =:\mathcal{E}(x)$, where $\epsilon \sim \mathcal{N}(0,1)$. When rescaling the latent, we estimate the component-wise variance
$${\widehat{\sigma}}^{2}=\frac{1}{bchw}\sum _{b,c,h,w}{({z}^{b,c,h,w}-\widehat{\mu})}^{2}$$ 
from the first batch in the data, where $\widehat{\mu}=\frac{1}{bchw}{\sum}_{b,c,h,w}{z}^{b,c,h,w}$. The output of $\mathcal{E}$ is scaled such that the rescaled latent has unit standard deviation, i.e. $z\leftarrow \frac{z}{\widehat{\sigma}}=\frac{\mathcal{E}(x)}{\widehat{\sigma}}$. (ii) For a VQ-regularized latent space, we extract $z$ before the quantization layer and absorb the quantization operation into the decoder, i.e. it can be interpreted as the first layer of $\mathcal{D}$.
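The sampling and rescaling in case (i) can be sketched as follows in Python. The encoder outputs are replaced by random arrays, so the shapes and magnitudes are purely illustrative, and the helper names are hypothetical.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Reparameterized sample z = E_mu(x) + E_sigma(x) * eps with eps ~ N(0, 1)."""
    return mu + sigma * rng.normal(size=mu.shape)

def estimate_scale(z_first_batch):
    """Return 1 / sigma_hat, with sigma_hat^2 estimated over all (b, c, h, w) entries."""
    mu_hat = z_first_batch.mean()
    var_hat = np.mean((z_first_batch - mu_hat) ** 2)
    return 1.0 / np.sqrt(var_hat)

rng = np.random.default_rng(5)
mu = 5.0 * rng.normal(size=(8, 4, 16, 16))   # stand-in for E_mu(x), shape (b, c, h, w)
sigma = 0.1 * np.ones_like(mu)               # stand-in for E_sigma(x)
z = sample_latent(mu, sigma, rng)

scale = estimate_scale(z)      # computed once, from the first batch
z_rescaled = scale * z         # applied to every subsequent latent
print(z.std(), z_rescaled.std())
```

After rescaling, the latents of the first batch have unit standard deviation, which lowers the ratio $\text{Var}(z)/{\sigma}_{t}^{2}$ discussed in Sec. D.1.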
Appendix H Additional Qualitative Results
Finally, we provide additional qualitative results for our landscapes model (Fig. 12, 23, 24 and 25), our class-conditional ImageNet model (Fig. 26-27) and our unconditional models for the CelebA-HQ, FFHQ and LSUN datasets (Fig. 28-31). Similarly to the inpainting model in Sec. 4.5, we also finetuned the semantic landscapes model from Sec. 4.3.2 directly on ${512}^{2}$ images and depict qualitative results in Fig. 12 and Fig. 23. For those models trained on comparably small datasets, we additionally show nearest neighbors in VGG [79] feature space for samples from our models in Fig. 32-34.
Semantic Synthesis on Flickr-Landscapes [23] (${512}^{2}$ finetuning) 

Semantic Synthesis on Flickr-Landscapes [23] 

Random class-conditional samples on the ImageNet dataset 
Random class-conditional samples on the ImageNet dataset 
Random samples on the CelebA-HQ dataset 
Random samples on the FFHQ dataset 
Random samples on the LSUN-Churches dataset 
Random samples on the LSUN-Bedrooms dataset 
Nearest Neighbors on the CelebA-HQ dataset 

Nearest Neighbors on the FFHQ dataset 

Nearest Neighbors on the LSUN-Churches dataset 
