Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space

Hengrui Zhang1 Jiani Zhang2  Balasubramaniam Srinivasan2  Zhengyuan Shen2
Xiao Qin2  Christos Faloutsos2  Huzefa Rangwala2,3  George Karypis2
1Computer Science Department, University of Illinois at Chicago  2Amazon Web Services  
3Computer Science, George Mason University  
hzhan55@uic.edu
{zhajiani,srbalasu,donshen}@amazon.com
{drxqin,faloutso,rhuzefa,gkarypis}@amazon.com
Work conducted during an internship at Amazon Web Services.
Corresponding author.
Huzefa Rangwala is on LOA as a Professor of Computer Science at George Mason University. This paper describes work performed at Amazon.
Abstract

Recent advances in tabular data generation have greatly enhanced synthetic data quality. However, extending diffusion models to tabular data is challenging because tabular data exhibits intricately varied distributions and a blend of data types. This paper introduces TabSyn, a methodology that synthesizes tabular data by leveraging a diffusion model within the latent space of a variational autoencoder (VAE). The key advantages of TabSyn include (1) Generality: the ability to handle a broad spectrum of data types by converting them into a single unified space and explicitly capturing inter-column relations; (2) Quality: optimizing the distribution of latent embeddings to enhance the subsequent training of diffusion models, which helps generate high-quality synthetic data; (3) Speed: far fewer reverse steps and faster synthesis than existing diffusion-based methods. Extensive experiments on six datasets with five metrics demonstrate that TabSyn outperforms existing methods. Specifically, it reduces the error rates by 86% and 67% for column-wise distribution and pair-wise column correlation estimations, respectively, compared with the most competitive baselines. Code has been made available at https://github.com/amazon-science/tabsyn.

1 Introduction

Figure 1: TabSyn wins: Our TabSyn method consistently outperforms SOTA tabular data generation methods across five data quality metrics.

Tabular data synthesis has a wide range of applications, such as augmenting training data (Fonseca & Bacao, 2023), protecting private data instances (Assefa et al., 2021; Hernandez et al., 2022), and imputing missing values (Zheng & Charoenphakdee, 2022). Recent developments in tabular data generation have notably enhanced the quality of synthetic data (Xu et al., 2019; Borisov et al., 2023; Liu et al., 2023b), yet synthetic data still falls well short of real data. To further improve generation quality, researchers have explored adapting diffusion models, which have shown strong performance in image synthesis tasks (Ho et al., 2020; Rombach et al., 2022), to tabular data generation (Kim et al., 2022; Kotelnikov et al., 2023; Kim et al., 2023; Lee et al., 2023). Despite the progress made by these methods, tailoring a diffusion model to tabular data poses several challenges. Unlike image data, which comprises purely continuous pixel values with local spatial correlations, tabular data features have complex and varied distributions (Xu et al., 2019), making it hard to learn joint probabilities across multiple columns. Moreover, typical tabular data often contains mixed data types, i.e., continuous (e.g., numerical features) and discrete (e.g., categorical features) variables. The standard diffusion process assumes a continuous input space with Gaussian noise perturbation, which leads to additional challenges with categorical features. Existing solutions either transform categorical features into numerical ones using techniques like one-hot encoding (Kim et al., 2023; Liu et al., 2023b) and analog bit encoding (Zheng & Charoenphakdee, 2022) or resort to two separate diffusion processes for numerical and categorical features (Kotelnikov et al., 2023; Lee et al., 2023). However, simple encoding methods have been shown to lead to suboptimal performance (Lee et al., 2023), and learning separate models for different data types makes it challenging for the model to capture the co-occurrence patterns of different types of data. Therefore, we seek to develop a diffusion model in a joint space of numerical and categorical features that preserves the inter-column correlations.

This paper presents TabSyn, a principled approach for tabular data synthesis. TabSyn first transforms raw tabular data into a continuous embedding space, where well-developed diffusion models with Gaussian noise become feasible. Subsequently, we learn a score-based diffusion model in the embedding space to capture the distribution of latent embeddings. To learn an informative, smoothed latent space while maintaining the decoder’s reconstruction ability, we design a Variational AutoEncoder (VAE; Kingma & Welling, 2013) specifically for tabular-structured data. Our proposed VAE model includes 1) Transformer-architecture encoders and decoders for modeling inter-column relationships and obtaining token-level representations, facilitating token-level tasks, and 2) adaptive loss weighting to dynamically adjust the reconstruction loss weights and KL-divergence weights, allowing the model to improve reconstruction performance gradually while maintaining a regularized embedding space. Finally, when applying diffusion models in the latent space, we adopt a simplified forward diffusion process, which adds Gaussian noise whose standard deviation grows linearly with time. We demonstrate through theoretical analysis and empirical justifications that this approach can reduce the errors in the reverse process, thus improving sampling speed. The advantages of TabSyn are three-fold: (1) Generality: Mixed-type Feature Handling. TabSyn transforms diverse input features, encompassing numerical, categorical, etc., into a unified embedding space. (2) Quality: High Generation Quality. With tailored designs of the VAE model, the tabular data is mapped into a regularized, well-shaped latent space, e.g., one close to a standard normal distribution. This greatly simplifies training the subsequent diffusion model (Vahdat et al., 2021), making TabSyn more expressive and enabling it to generate high-quality synthetic data. (3) Speed: With the proposed linear noise schedule, TabSyn can generate high-quality synthetic data with fewer than 20 reverse steps, which is significantly fewer than existing methods.

Recognizing the absence of unified and comprehensive evaluations for tabular data synthesis methods, we perform extensive experiments, which involve comparing TabSyn with seven state-of-the-art methods on six mixed-type tabular datasets using over five distinct evaluation metrics. The experimental results demonstrate that TabSyn consistently outperforms previous methods (see Figure 1). Specifically, TabSyn reduces the average errors in column-wise distribution shape estimation (i.e., single density) and pair-wise column correlation estimation (i.e., pair correlation) tasks by 86% and 67%, respectively, compared with the most competitive baselines. Furthermore, we demonstrate that TabSyn achieves competitive performance across two downstream tabular data tasks, machine learning efficiency and missing value imputation. Notably, the well-learned unconditional TabSyn can be applied to missing value imputation without retraining. Moreover, thorough ablation studies and visualization case studies substantiate the rationale and effectiveness of our developed approach.

2 Related Works

Deep Generative Models for Tabular Data Generation.

Generative models for tabular data have become increasingly important and have widespread applications (Assefa et al., 2021; Zheng & Charoenphakdee, 2022; Hernandez et al., 2022). To deal with imbalanced categorical features, Xu et al. (2019) propose CTGAN and TVAE based on the popular Generative Adversarial Networks (Goodfellow et al., 2014) and VAE (Kingma & Welling, 2013), respectively. Multiple advanced methods have been proposed for synthetic tabular data generation in the past year. Specifically, GOGGLE (Liu et al., 2023b) became the first to explicitly model the dependency relationship between columns, proposing a VAE-based model using graph neural networks as the encoder and decoder models. Inspired by the success of large language models in modeling the distribution of natural languages, GReaT (Borisov et al., 2023) transformed each row in a table into a natural sentence and learned sentence-level distributions using auto-regressive GPT2. In recent years, the physical diffusion process has inspired a lot of advanced research in deep learning. For example, DIFFormer (Wu et al., 2023) develops a scalable Transformer model for geometric data via a constrained diffusion process, and denoising diffusion models have achieved great success in image generation (Ho et al., 2020). STaSy (Kim et al., 2023), TabDDPM (Kotelnikov et al., 2023), and CoDi (Lee et al., 2023) concurrently applied the popular diffusion-based generative models to synthetic tabular data generation.

Generative Modeling in the Latent Space.

While generative models in the data space have achieved significant success, latent generative models have demonstrated several advantages, including more compact and disentangled representations, robustness to noise, and greater flexibility in controlling generated styles (van den Oord et al., 2017; Razavi et al., 2019; Esser et al., 2021). For example, the recent GAN literature (Li et al., 2022) has demonstrated superior controllability via adversarial learning in the latent space. Recently, the Latent Diffusion Models (LDM) (Rombach et al., 2022; Vahdat et al., 2021) have achieved great success in image generation as they exhibit better scaling properties and expressivity than the vanilla diffusion models in the data space (Ho et al., 2020; Song et al., 2021b; Karras et al., 2022). The success of LDMs in image generation has also inspired their applications in video (Blattmann et al., 2023) and audio data (Liu et al., 2023a). To the best of our knowledge, the proposed work is the first to explore the application of latent diffusion models for general tabular data generation tasks.

3 Synthetic Tabular Data Generation with TabSyn

Figure 2 gives an overview of TabSyn. In Section 3.1, we first formally define the tabular data generation task. Then, we introduce the design details of TabSyn’s autoencoding and diffusion process in Sections 3.2 and 3.3. We summarize the training and sampling algorithms in Appendix A.

Figure 2: An overview of the proposed TabSyn. Each data row $\boldsymbol{x}$ is mapped to a latent representation $\boldsymbol{z}$ via a column-wise tokenizer and an encoder. A diffusion process $\boldsymbol{z}_0 \rightarrow \boldsymbol{z}_T$ is applied in the latent space. Synthesis $\boldsymbol{z}_T \rightarrow \boldsymbol{z}_0$ starts from the base distribution $p(\boldsymbol{z}_T)$ and generates samples $\boldsymbol{z}_0$ in latent space through a reverse process. These samples are then mapped from latent $\boldsymbol{z}$ to data space $\tilde{\boldsymbol{x}}$ using a decoder and a detokenizer.

3.1 Problem Definition of Tabular Data Generation

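Let $M_{\text{num}}$ and $M_{\text{cat}}$ be the number of numerical columns and categorical columns, respectively. Each row is represented as a vector of numerical features and categorical features $\boldsymbol{x} = [\boldsymbol{x}^{\text{num}}, \boldsymbol{x}^{\text{cat}}]$, where $\boldsymbol{x}^{\text{num}} \in \mathbb{R}^{M_{\text{num}}}$ and $\boldsymbol{x}^{\text{cat}} \in \mathbb{R}^{M_{\text{cat}}}$. Specifically, the $i$-th categorical attribute has $C_i$ finite candidate values, therefore we have $x_i^{\text{cat}} \in \{1, \cdots, C_i\}, \forall i$. This paper focuses on the unconditional generation task. With a tabular dataset $\mathcal{T} = \{\boldsymbol{x}\}$, we aim to learn a parameterized generative model $p_\theta(\mathcal{T})$, with which realistic and diverse synthetic tabular data $\hat{\boldsymbol{x}} \in \hat{\mathcal{T}}$ can be generated.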

3.2 AutoEncoding for Tabular Data

Tabular data is highly structured and consists of mixed-type column features, with different columns having distinct meanings and being highly dependent on each other. These characteristics make it challenging to design an appropriate encoder that models and effectively utilizes the rich relationships between columns. Motivated by the successes of Transformers in classification/regression of tabular data (Gorishniy et al., 2021), we first learn a unique tokenizer for each column, and then the token (column)-wise representations are fed into a Transformer to capture the intricate relationships among columns.

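Feature Tokenizer. The feature tokenizer converts each column (both numerical and categorical) into a $d$-dimensional vector. First, we use one-hot encoding to pre-process categorical features, i.e., $x_i^{\text{cat}} \Rightarrow \boldsymbol{x}_i^{\text{oh}} \in \mathbb{R}^{1 \times C_i}$. Each record is represented as $\boldsymbol{x} = [\boldsymbol{x}^{\text{num}}, \boldsymbol{x}_1^{\text{oh}}, \cdots, \boldsymbol{x}_{M_{\text{cat}}}^{\text{oh}}] \in \mathbb{R}^{M_{\text{num}} + \sum_{i=1}^{M_{\text{cat}}} C_i}$. Then, we apply a linear transformation for numerical columns and create an embedding lookup table for categorical columns, where each category is assigned a learnable $d$-dimensional vector, i.e.,

$$\boldsymbol{e}_i^{\text{num}} = x_i^{\text{num}} \cdot \boldsymbol{w}_i^{\text{num}} + \boldsymbol{b}_i^{\text{num}}, \qquad \boldsymbol{e}_i^{\text{cat}} = \boldsymbol{x}_i^{\text{oh}} \cdot \boldsymbol{W}_i^{\text{cat}} + \boldsymbol{b}_i^{\text{cat}}, \tag{1}$$

where $\boldsymbol{w}_i^{\text{num}}, \boldsymbol{b}_i^{\text{num}}, \boldsymbol{b}_i^{\text{cat}} \in \mathbb{R}^{1 \times d}$ and $\boldsymbol{W}_i^{\text{cat}} \in \mathbb{R}^{C_i \times d}$ are learnable parameters of the tokenizer, and $\boldsymbol{e}_i^{\text{num}}, \boldsymbol{e}_i^{\text{cat}} \in \mathbb{R}^{1 \times d}$. Now, each record is expressed as the stack of the embeddings of all columns

$$\boldsymbol{E} = [\boldsymbol{e}_1^{\text{num}}, \cdots, \boldsymbol{e}_{M_{\text{num}}}^{\text{num}}, \boldsymbol{e}_1^{\text{cat}}, \cdots, \boldsymbol{e}_{M_{\text{cat}}}^{\text{cat}}] \in \mathbb{R}^{M \times d}. \tag{2}$$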
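As an illustration of Eqs. (1) and (2), the tokenizer can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `tokenize_row`, the parameter layout, the toy sizes, and the 0-based category indices (the text uses $1, \cdots, C_i$) are all hypothetical, and the parameters are random rather than learned.

```python
import numpy as np

def tokenize_row(x_num, x_cat, w_num, b_num, W_cat, b_cat):
    """Map one record to an (M, d) token matrix, following Eqs. (1)-(2).

    x_num: (M_num,) floats; x_cat: list of M_cat ints, 0-based (assumption).
    w_num, b_num: (M_num, d); W_cat[i]: (C_i, d); b_cat: (M_cat, d).
    """
    # Numerical columns: scalar value times a per-column weight vector, plus bias.
    e_num = x_num[:, None] * w_num + b_num
    # Categorical columns: one-hot times W_i is simply a row lookup in W_i.
    e_cat = np.stack([W_cat[i][c] + b_cat[i] for i, c in enumerate(x_cat)])
    # Stack all column embeddings into E with M = M_num + M_cat rows.
    return np.concatenate([e_num, e_cat], axis=0)

rng = np.random.default_rng(0)
d, cardinalities = 4, [3, 5]              # hypothetical toy sizes
M_num, M_cat = 2, len(cardinalities)
w_num, b_num = rng.normal(size=(M_num, d)), rng.normal(size=(M_num, d))
W_cat = [rng.normal(size=(C, d)) for C in cardinalities]
b_cat = rng.normal(size=(M_cat, d))

E = tokenize_row(np.array([0.5, -1.2]), [2, 0], w_num, b_num, W_cat, b_cat)
print(E.shape)  # (4, 4)
```

The row lookup makes explicit why the one-hot product in Eq. (1) behaves as an embedding table: each category owns one learnable $d$-dimensional row.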

Transformer Encoding and Decoding. As with typical VAEs, we use the encoder to obtain the mean and log variance of the latent variable. Then, we acquire the latent embeddings with the reparameterization trick. The latent embeddings are then passed through the decoder to obtain the reconstructed token matrix $\hat{\boldsymbol{E}} \in \mathbb{R}^{M \times d}$. The detailed architectures are in Appendix D.

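Detokenizer. Finally, we apply a detokenizer to the recovered token representation of each column to reconstruct the column values. The design of the detokenizer is symmetrical to that of the tokenizer:

$$\hat{x}_i^{\text{num}} = \hat{\boldsymbol{e}}_i^{\text{num}} \cdot \hat{\boldsymbol{w}}_i^{\text{num}} + \hat{b}_i^{\text{num}}, \qquad \hat{\boldsymbol{x}}_i^{\text{oh}} = \text{Softmax}(\hat{\boldsymbol{e}}_i^{\text{cat}} \cdot \hat{\boldsymbol{W}}_i^{\text{cat}} + \hat{\boldsymbol{b}}_i^{\text{cat}}), \qquad \hat{\boldsymbol{x}} = [\hat{x}_1^{\text{num}}, \cdots, \hat{x}_{M_{\text{num}}}^{\text{num}}, \hat{\boldsymbol{x}}_1^{\text{oh}}, \cdots, \hat{\boldsymbol{x}}_{M_{\text{cat}}}^{\text{oh}}], \tag{3}$$

where $\hat{\boldsymbol{w}}_i^{\text{num}} \in \mathbb{R}^{d \times 1}$, $\hat{b}_i^{\text{num}} \in \mathbb{R}^{1 \times 1}$, $\hat{\boldsymbol{W}}_i^{\text{cat}} \in \mathbb{R}^{d \times C_i}$, and $\hat{\boldsymbol{b}}_i^{\text{cat}} \in \mathbb{R}^{1 \times C_i}$ are the detokenizer’s parameters.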
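The detokenizer in Eq. (3) can be sketched in the same spirit. The names below (`detokenize_row`, `c_hat` for the categorical biases) and the toy sizes are hypothetical, parameters are random rather than learned, and picking the most probable category with `argmax` is an assumption about how a discrete value would be read off the softmax output; the source only defines the softmax itself.

```python
import numpy as np

def softmax(v):
    v = v - v.max()                      # subtract the max for numerical stability
    e = np.exp(v)
    return e / e.sum()

def detokenize_row(E_hat, M_num, w_hat, b_hat, W_hat, c_hat):
    """Recover column values from an (M, d) token matrix, following Eq. (3).

    w_hat: (M_num, d); b_hat: (M_num,); W_hat[i]: (d, C_i); c_hat[i]: (C_i,).
    """
    # Numerical columns: project each token back to a scalar.
    x_num = np.array([E_hat[i] @ w_hat[i] + b_hat[i] for i in range(M_num)])
    # Categorical columns: project to C_i logits, then normalize with softmax.
    probs = [softmax(E_hat[M_num + i] @ W_hat[i] + c_hat[i])
             for i in range(len(W_hat))]
    # Assumption: decode each categorical column as its most probable category.
    x_cat = [int(p.argmax()) for p in probs]
    return x_num, probs, x_cat

rng = np.random.default_rng(0)
d, M_num, cardinalities = 4, 2, [3, 5]   # hypothetical toy sizes
E_hat = rng.normal(size=(M_num + len(cardinalities), d))
w_hat, b_hat = rng.normal(size=(M_num, d)), rng.normal(size=M_num)
W_hat = [rng.normal(size=(d, C)) for C in cardinalities]
c_hat = [rng.normal(size=C) for C in cardinalities]

x_num, probs, x_cat = detokenize_row(E_hat, M_num, w_hat, b_hat, W_hat, c_hat)
print(x_num.shape, [p.shape for p in probs], x_cat)
```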

Training with adaptive weight coefficient.

The VAE model is usually learned with the classical ELBO loss function, but here we use β-VAE (Higgins et al., 2016), where a coefficient $\beta$ balances the importance of the reconstruction loss and the KL-divergence loss

$$\mathcal{L} = \ell_{\text{recon}}(\boldsymbol{x}, \hat{\boldsymbol{x}}) + \beta \ell_{\text{kl}}. \tag{4}$$

$\ell_{\text{recon}}$ is the reconstruction loss between the input data and the reconstructed one, and $\ell_{\text{kl}}$ is the KL divergence loss that regularizes the mean and variance of the latent space. In the vanilla VAE model, $\beta$ is set to 1 because the two loss terms are equally important for generating high-quality synthetic data from Gaussian noise. However, in our model, $\beta$ is expected to be smaller, as we do not require the distribution of the embeddings to precisely follow a standard Gaussian distribution because we have an additional diffusion model. Therefore, we propose to adaptively schedule the scale of $\beta$ in the training process, encouraging the model to achieve lower reconstruction error while maintaining an appropriate embedding shape.

With an initial (maximum) $\beta = \beta_{\max}$, we monitor the epoch-wise reconstruction loss $\ell_{\text{recon}}$. When $\ell_{\text{recon}}$ fails to decrease for a predefined number of epochs (which indicates that the KL-divergence dominates the overall loss), the weight is scheduled by $\beta = \lambda\beta$, $\lambda < 1$. This process continues until $\beta$ approaches a predefined minimum value $\beta_{\min}$. This strategy is simple yet very effective, and we empirically justify the effectiveness of the design in Section 4.
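The scheduling rule can be sketched as a small plateau-based decay loop. This is only an illustration under assumptions: the function name `schedule_beta` and the `patience` parameter (standing in for the "predefined number of epochs") are hypothetical, and the loop replays a fixed list of losses, whereas in real training the loss at each epoch depends on the current $\beta$. The default values are the ones reported in Section 4.4.

```python
def schedule_beta(recon_losses, beta_max=0.01, beta_min=1e-5, lam=0.7, patience=10):
    """Return the beta used at each epoch, given epoch-wise reconstruction losses."""
    beta, best, stale = beta_max, float("inf"), 0
    betas = []
    for loss in recon_losses:
        betas.append(beta)                 # beta in effect during this epoch
        if loss < best:
            best, stale = loss, 0          # reconstruction improved: reset counter
        else:
            stale += 1                     # no improvement this epoch
            if stale >= patience and beta > beta_min:
                beta = max(beta * lam, beta_min)   # decay, floored at beta_min
                stale = 0
    return betas

# Toy usage: the loss improves once, then plateaus.
betas = schedule_beta([1.0, 0.9] + [0.9] * 6, patience=3)
print(betas)
```

Flooring at `beta_min` keeps a small amount of KL regularization in place even after a long plateau, matching the stopping condition described above.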

3.3 Score-based Generative Modeling in the Latent Space

Training and sampling via denoising.

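After the VAE model is well-learned, we extract the latent embeddings through the encoder and flatten the encoder’s output as $\boldsymbol{z} = \text{Flatten}(\text{Encoder}(\boldsymbol{x})) \in \mathbb{R}^{1 \times Md}$ such that the embedding of a record is a vector rather than a matrix. To learn the underlying distribution of embeddings $p(\boldsymbol{z})$, we consider the following forward diffusion process and reverse sampling process (Song et al., 2021b; Karras et al., 2022):

$$\boldsymbol{z}_t = \boldsymbol{z}_0 + \sigma(t)\boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}), \qquad \text{(Forward Process)} \tag{5}$$

$$\mathrm{d}\boldsymbol{z}_t = -2\dot{\sigma}(t)\sigma(t)\nabla_{\boldsymbol{z}_t}\log p(\boldsymbol{z}_t)\,\mathrm{d}t + \sqrt{2\dot{\sigma}(t)\sigma(t)}\,\mathrm{d}\boldsymbol{\omega}_t, \qquad \text{(Reverse Process)} \tag{6}$$

where $\boldsymbol{z}_0 = \boldsymbol{z}$ is the initial embedding from the encoder, $\boldsymbol{z}_t$ is the diffused embedding at time $t$, and $\sigma(t)$ is the noise level. In the reverse process, $\nabla_{\boldsymbol{z}_t}\log p(\boldsymbol{z}_t)$ is the score function of $\boldsymbol{z}_t$, and $\boldsymbol{\omega}_t$ is the standard Wiener process. The training of the diffusion model is achieved via denoising score matching (Karras et al., 2022):

$$\mathcal{L} = \mathbb{E}_{\boldsymbol{z}_0 \sim p(\boldsymbol{z}_0)}\,\mathbb{E}_{t \sim p(t)}\,\mathbb{E}_{\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\,\|\boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, t) - \boldsymbol{\varepsilon}\|_2^2, \quad \text{where } \boldsymbol{z}_t = \boldsymbol{z}_0 + \sigma(t)\boldsymbol{\varepsilon}, \tag{7}$$

where $\boldsymbol{\epsilon}_\theta$ is a neural network (named the denoising function) that approximates the Gaussian noise using the perturbed data $\boldsymbol{z}_t$ and the time $t$. Then $\nabla_{\boldsymbol{z}_t}\log p(\boldsymbol{z}_t) = -\boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, t)/\sigma(t)$. After the model is trained, synthetic data can be obtained via the reverse process in Eq. 6. The detailed algorithm description of TabSyn is provided in Appendix A. Detailed derivations are in Appendix B.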
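A Monte Carlo estimate of the denoising objective in Eq. (7) can be sketched as follows. Everything here is illustrative: the name `denoising_loss`, the uniform choice of $p(t)$ on $[0.01, 1]$, and the toy latents $\boldsymbol{z}_0 \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$ are assumptions, and the two "models" are closed-form stand-ins rather than a trained network. For standard normal latents the best possible noise predictor is $\sigma(t)\boldsymbol{z}_t/(1+\sigma(t)^2)$, which gives a reference point for the loss.

```python
import numpy as np

def denoising_loss(z0, eps_model, sigma, rng):
    """Monte Carlo estimate of Eq. (7) for a batch of latents z0 of shape (N, D)."""
    t = rng.uniform(0.01, 1.0, size=(z0.shape[0], 1))   # hypothetical p(t)
    eps = rng.standard_normal(z0.shape)                 # Gaussian noise
    z_t = z0 + sigma(t) * eps                           # forward process, Eq. (5)
    residual = eps_model(z_t, t) - eps
    return float(np.mean(np.sum(residual ** 2, axis=1)))

rng = np.random.default_rng(0)
z0 = rng.standard_normal((20000, 8))     # toy latents, not real encoder outputs
sigma = lambda t: t                      # linear noise level

zero_model = lambda z_t, t: np.zeros_like(z_t)
# Closed-form optimum for z0 ~ N(0, I): E[eps | z_t] = sigma * z_t / (1 + sigma^2).
optimal_model = lambda z_t, t: sigma(t) * z_t / (1.0 + sigma(t) ** 2)

loss_zero = denoising_loss(z0, zero_model, sigma, rng)
loss_opt = denoising_loss(z0, optimal_model, sigma, rng)
print(loss_zero, loss_opt)
```

A predictor that always outputs zero pays roughly the full noise energy (about the latent dimension, 8 here), while the closed-form optimum pays less.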

Schedule of noise level σ(t).

The noise level $\sigma(t)$ defines the scale of the noise used to perturb the data at different time steps and significantly affects the final differential equation solution trajectories (Song et al., 2021b; Karras et al., 2022). Following the recommendations in Karras et al. (2022), we set the noise level $\sigma(t) = t$, which is linear w.r.t. time. We show in Proposition 1 that the linear noise level schedule leads to the smallest approximation errors in the reverse process:

Proposition 1.

Consider the reverse diffusion process in Equation (6) from $\boldsymbol{z}_{t_b}$ to $\boldsymbol{z}_{t_a}$ ($t_b > t_a$). The numerical solution $\hat{\boldsymbol{z}}_{t_a}$ has the smallest approximation error to $\boldsymbol{z}_{t_a}$ when $\sigma(t) = t$.

See proof in Appendix C. A natural corollary of Proposition 1 is that a small approximation error allows us to increase the interval between two timesteps, thereby reducing the overall number of sampling steps and accelerating sampling. In Section 4, we demonstrate that with this design, TabSyn can generate high-quality synthetic tabular data with fewer than 20 NFEs (number of function evaluations), far fewer than other diffusion-based tabular data synthesis methods (Kim et al., 2023; Kotelnikov et al., 2023).
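To make the role of $\sigma(t) = t$ concrete, the following toy sketch integrates a reverse process with Euler steps. It is not the sampler of Appendix A and rests on several assumptions: it uses the deterministic probability-flow ODE of Karras et al. (2022) rather than the SDE in Eq. (6); with $\sigma(t) = t$ and a score of $-\boldsymbol{\epsilon}_\theta/\sigma(t)$, that ODE reduces to $\mathrm{d}\boldsymbol{z}_t/\mathrm{d}t = \boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, t)$. The 1-D Gaussian data, the closed-form noise predictor `eps_star`, the geometric time grid, and the values `t_max=80`, `t_min=0.002`, and 50 steps are all hypothetical choices made for a self-contained demo.

```python
import numpy as np

def eps_star(z, t, mu=2.0, s=1.0):
    """Exact noise predictor for toy 1-D data z0 ~ N(mu, s^2) under z_t = z0 + t * eps."""
    return t * (z - mu) / (s ** 2 + t ** 2)

def euler_sample(n, num_steps=50, t_max=80.0, t_min=0.002, seed=0):
    rng = np.random.default_rng(seed)
    # Geometric grid from t_max down to t_min, followed by a final jump to t = 0.
    ts = np.append(np.geomspace(t_max, t_min, num_steps), 0.0)
    z = t_max * rng.standard_normal(n)          # start from z_T ~ N(0, t_max^2)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        # One Euler step of dz/dt = eps(z, t); each step costs one function evaluation.
        z = z + (t_next - t_cur) * eps_star(z, t_cur)
    return z

samples = euler_sample(20000)
print(samples.mean(), samples.std())   # should land near the toy target N(2, 1)
```

In this toy setting the Euler steps are nearly exact wherever $t$ is much larger than the data scale, because the trajectory is close to a straight line in $t$ there; the discretization error concentrates around $t \approx s$.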

4 Benchmarking Synthetic Tabular Data Generation Algorithms

4.1 Experimental Setups

Datasets. We select six real-world tabular datasets consisting of both numerical and categorical attributes: Adult, Default, Shoppers, Magic, Beijing, and News. Table 6 provides the overall statistics of these datasets, and the detailed descriptions can be found in Appendix E.1.

Baselines. We compare the proposed TabSyn with seven existing synthetic tabular data generation methods. The first two are classical GAN and VAE models: CTGAN (Xu et al., 2019) and TVAE (Xu et al., 2019). Additionally, we evaluate five SOTA methods introduced recently: GOGGLE (Liu et al., 2023b), a VAE-based method; GReaT (Borisov et al., 2023), a language model variant; and three diffusion-based methods: STaSy (Kim et al., 2023), TabDDPM (Kotelnikov et al., 2023), and CoDi (Lee et al., 2023). Notably, these approaches were introduced nearly simultaneously, limiting opportunities for extensive comparison. Our paper fills this gap by providing the first comprehensive evaluation of their performance in a standardized setting. For reference, we also compare with the representative interpolation-based method SMOTE (Chawla et al., 2002).

Evaluation Methods. We evaluate the quality of the synthetic data from three aspects: 1) Low-order statistics: column-wise density estimation and pair-wise column correlation, estimating the density of every single column and the correlation between every column pair (Section 4.2). We also evaluate density estimation performance by testing whether the synthetic data can be distinguished from the real data by a machine learning model (Appendix F.3). 2) High-order metrics: α-precision and β-recall scores (Alaa et al., 2022) that measure the overall fidelity and diversity of synthetic data (the results are deferred to Appendix F.2). 3) Performance on downstream tasks: machine learning efficiency (MLE) and missing value imputation. MLE compares the testing accuracy on real data of models trained on synthetically generated tabular datasets. The performance on privacy protection is measured by MLE tasks that have been widely adopted in previous literature (Section 4.3.1). We also extend TabSyn to the missing value imputation task, which aims to fill in missing features/labels given partial column values (Appendix F.4).

Implementation details. The reported results are averaged over 20 randomly sampled synthetic datasets. The implementation details are in Appendix E.

4.2 Estimating Low-order Statistics of Data Density

Metrics. We employ the Kolmogorov-Smirnov Test (KST) for numerical columns and the Total Variation Distance (TVD) for categorical columns to quantify column-wise density estimation. For pair-wise column correlation, we use Pearson correlation for numerical columns and contingency similarity for categorical columns. The performance is measured by the difference between the correlations computed from real data and synthetic data. For the correlation between numerical and categorical columns, we first group numerical values into categorical ones via bucketing, then calculate the corresponding contingency similarity. Further details on these metrics are in Appendix E.3.
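The two column-wise statistics can be sketched directly from their textbook definitions. The function names below are hypothetical, and the exact way these statistics are aggregated into the error rates reported in the tables is specified in Appendix E.3, so this sketch only shows the per-column quantities: the two-sample Kolmogorov-Smirnov statistic (largest gap between empirical CDFs) and the total variation distance (half the summed absolute difference in category frequencies).

```python
import numpy as np

def ks_statistic(real, synth):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    real, synth = np.sort(real), np.sort(synth)
    grid = np.concatenate([real, synth])      # the gap can only change at data points
    cdf_real = np.searchsorted(real, grid, side="right") / len(real)
    cdf_synth = np.searchsorted(synth, grid, side="right") / len(synth)
    return float(np.max(np.abs(cdf_real - cdf_synth)))

def total_variation(real, synth):
    """TVD between the category frequencies of two categorical columns."""
    real, synth = list(real), list(synth)
    cats = set(real) | set(synth)
    return 0.5 * sum(abs(real.count(c) / len(real) - synth.count(c) / len(synth))
                     for c in cats)

print(ks_statistic([1, 2, 3], [4, 5, 6]))                               # 1.0
print(total_variation(["a", "a", "b", "b"], ["a", "a", "a", "b"]))      # 0.25
```

Both quantities lie in $[0, 1]$, with 0 meaning the synthetic column matches the real column exactly under that statistic.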

Table 1: Error rate (%) of column-wise density estimation. Bold Face represents the best score on each dataset. Lower values indicate more accurate estimation (superior results). TabSyn outperforms the best generative baseline model by 86.0% on average.

| Method | Adult | Default | Shoppers | Magic | Beijing | News | Average |
|---|---|---|---|---|---|---|---|
| SMOTE | 1.60±0.23 | 1.48±0.15 | 2.68±0.19 | 0.91±0.05 | 1.85±0.21 | 5.31±0.46 | 2.30 |
| CTGAN | 16.84±0.03 | 16.83±0.04 | 21.15±0.10 | 9.81±0.08 | 21.39±0.05 | 16.09±0.02 | 17.02 |
| TVAE | 14.22±0.08 | 10.17±0.05 | 24.51±0.06 | 8.25±0.06 | 19.16±0.06 | 16.62±0.03 | 15.49 |
| GOGGLE¹ | 16.97 | 17.02 | 22.33 | 1.90 | 16.93 | 25.32 | 16.74 |
| GReaT² | 12.12±0.04 | 19.94±0.06 | 14.51±0.12 | 16.16±0.09 | 8.25±0.12 | OOM | 14.20 |
| STaSy | 11.29±0.06 | 5.77±0.06 | 9.37±0.09 | 6.29±0.13 | 6.71±0.03 | 6.89±0.03 | 7.72 |
| CoDi | 21.38±0.06 | 15.77±0.07 | 31.84±0.05 | 11.56±0.26 | 16.94±0.02 | 32.27±0.04 | 21.63 |
| TabDDPM³ | 1.75±0.03 | 1.57±0.08 | 2.72±0.13 | 1.01±0.09 | 1.30±0.03 | 78.75±0.01 | 14.52 |
| TabSyn | 0.58±0.06 | 0.85±0.04 | 1.43±0.24 | 0.88±0.09 | 1.12±0.05 | 1.64±0.04 | 1.08 |
| Improv. | 66.9% | 45.9% | 47.4% | 12.9% | 13.8% | 76.2% | 86.0% |

¹ GOGGLE fixes the random seed during sampling in the official codes, and we follow it for consistency.
² GReaT cannot be applied on News because of the maximum length limit.
³ TabDDPM fails to generate meaningful content on the News dataset.

Table 2: Error rate (%) of pair-wise column correlation score. Bold Face represents the best score on each dataset. TabSyn outperforms the best baseline model by 67.6% on average.

| Method | Adult | Default | Shoppers | Magic | Beijing | News | Average |
|---|---|---|---|---|---|---|---|
| SMOTE | 3.28±0.29 | 8.41±0.38 | 3.56±0.22 | 3.16±0.41 | 2.39±0.35 | 5.38±0.76 | 4.36 |
| CTGAN | 20.23±1.20 | 26.95±0.93 | 13.08±0.16 | 7.00±0.19 | 22.95±0.08 | 5.37±0.05 | 15.93 |
| TVAE | 14.15±0.88 | 19.50±0.95 | 18.67±0.38 | 5.82±0.49 | 18.01±0.08 | 6.17±0.09 | 13.72 |
| GOGGLE | 45.29 | 21.94 | 23.90 | 9.47 | 45.94 | 23.19 | 28.28 |
| GReaT | 17.59±0.22 | 70.02±0.12 | 45.16±0.18 | 10.23±0.40 | 59.60±0.55 | OOM | 44.24 |
| STaSy | 14.51±0.25 | 5.96±0.26 | 8.49±0.15 | 6.61±0.53 | 8.00±0.10 | 3.07±0.04 | 7.77 |
| CoDi | 22.49±0.08 | 68.41±0.05 | 17.78±0.11 | 6.53±0.25 | 7.07±0.15 | 11.10±0.01 | 22.23 |
| TabDDPM | 3.01±0.25 | 4.89±0.10 | 6.61±0.16 | 1.70±0.22 | 2.71±0.09 | 13.16±0.11 | 5.34 |
| TabSyn | 1.54±0.27 | 2.05±0.12 | 2.07±0.21 | 1.06±0.31 | 2.24±0.28 | 1.44±0.03 | 1.73 |
| Improv. | 48.8% | 58.1% | 68.7% | 37.6% | 17.3% | 53.1% | 67.6% |

Column-wise distribution density estimation.

In Table 1, we note that TabSyn consistently outperforms baseline methods in the column-wise distribution density estimation task. On average, TabSyn surpasses the most competitive baselines by 86.0%. While STaSy and TabDDPM perform well, STaSy is sub-optimal because it treats one-hot embeddings of categorical columns as continuous features. Additionally, TabDDPM exhibits unstable performance across datasets, failing to generate meaningful content on the News dataset despite a standard training process.
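The reported improvement figures appear to be relative error reductions against the best generative baseline (SMOTE, included only for reference, is excluded). The short check below reproduces three of them from the table entries; the helper name `relative_improvement` is hypothetical, and the formula is an inference from the numbers rather than a definition stated in the text.

```python
def relative_improvement(best_baseline, ours):
    """Relative error reduction in percent."""
    return 100.0 * (best_baseline - ours) / best_baseline

# Table 1, Average column: best generative baseline is STaSy (7.72), TabSyn is 1.08.
print(round(relative_improvement(7.72, 1.08), 1))  # 86.0
# Table 1, Adult column: best generative baseline is TabDDPM (1.75), TabSyn is 0.58.
print(round(relative_improvement(1.75, 0.58), 1))  # 66.9
# Table 2, Average column: best generative baseline is TabDDPM (5.34), TabSyn is 1.73.
print(round(relative_improvement(5.34, 1.73), 1))  # 67.6
```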

Pair-wise column correlations.

Table 2 displays the results of pair-wise column correlations. TabSyn outperforms the best baselines by an average of 67.6%. Notably, the performance of GReaT is significantly poorer in this task than in the column-wise task. This indicates the limitations of autoregressive language models in density estimation, particularly in capturing the joint probability distributions between columns.

4.3 Performance on Downstream Tasks

4.3.1 Machine Learning Efficiency

We then assess the quality of synthetic data through its performance on Machine Learning Efficiency tasks. Following established settings (Kotelnikov et al., 2023; Kim et al., 2023; Lee et al., 2023), we first split a real table into a real training set and a real testing set. The generative models are learned on the real training set, from which a synthetic set of equivalent size is sampled. This synthetic data is then used to train a classification/regression model (XGBoost Classifier and XGBoost Regressor (Chen & Guestrin, 2016)), which is evaluated using the real testing set. The performance of MLE is measured by the AUC score for classification tasks and RMSE for regression tasks. The detailed settings of the MLE evaluations are in Appendix E.4.

In Table 3, we demonstrate that TabSyn consistently outperforms all the baseline methods. The performance gap between methods is smaller compared to column-wise density and pair-wise column correlation estimation tasks (Tables 1 and 2). This suggests that some columns may not significantly impact the classification/regression tasks, allowing methods with lower performance in previous tasks to show competitive results in MLE (e.g., GReaT on Default dataset). This underscores the need for a comprehensive evaluation approach beyond just MLE metrics. As shown above, we have incorporated low-order and high-order statistics for a more robust assessment.

Table 3: AUC (classification task) and RMSE (regression task) scores of Machine Learning Efficiency. ↑ (↓) indicates that the higher (lower) the score, the better the performance. TabSyn consistently outperforms all others across all datasets.

| Methods | Adult (AUC ↑) | Default (AUC ↑) | Shoppers (AUC ↑) | Magic (AUC ↑) | Beijing (RMSE ↓) | News¹ (RMSE ↓) | Average Gap (%) |
|---|---|---|---|---|---|---|---|
| Real | .927±.000 | .770±.005 | .926±.001 | .946±.001 | .423±.003 | .842±.002 | 0% |
| SMOTE | .899±.007 | .741±.009 | .911±.012 | .934±.008 | .593±.011 | .897±.036 | 9.39% |
| CTGAN | .886±.002 | .696±.005 | .875±.009 | .855±.006 | .902±.019 | .880±.016 | 24.5% |
| TVAE | .878±.004 | .724±.005 | .871±.006 | .887±.003 | .770±.011 | 1.01±.016 | 20.9% |
| GOGGLE | .778±.012 | .584±.005 | .658±.052 | .654±.024 | 1.09±.025 | .877±.002 | 43.6% |
| GReaT | .913±.003 | .755±.006 | .902±.005 | .888±.008 | .653±.013 | OOM | 13.3% |
| STaSy | .906±.001 | .752±.006 | .914±.005 | .934±.003 | .656±.014 | .871±.002 | 10.9% |
| CoDi | .871±.006 | .525±.006 | .865±.006 | .932±.003 | .818±.021 | 1.21±.005 | 30.5% |
| TabDDPM² | .907±.001 | .758±.004 | .918±.005 | .935±.003 | .592±.011 | 4.86±3.04 | 9.14%² |
| TabSyn | .915±.002 | .764±.004 | .920±.005 | .938±.002 | .582±.008 | .861±.027 | 7.23% |

¹ Following CoDi (Lee et al., 2023), the continuous targets are standardized to prevent large values.
² TabDDPM collapses on News, leading to an extremely high error on this dataset. We exclude this dataset when computing the average gap of TabDDPM.


4.3.2 Missing Value Imputation and Privacy Protection

One advantage of the diffusion model is that a well-trained unconditional model can be directly used for data imputation (e.g., image inpainting (Song et al., 2021b; Lugmayr et al., 2022)) without additional training. This paper explores adapting TabSyn for missing value imputation, a crucial task in real-world tabular data. Due to space limitations, the detailed algorithms for missing value imputation and the results are deferred to Appendix F.4. We also evaluate whether the synthetic data is randomly sampled according to the distribution density rather than copied from the training data (for privacy protection) via Distance to Closest Records (DCR) in Appendix F.6.

4.4 Ablation Studies

The effect of adaptive β-VAE.

We assess the effectiveness of scheduling the weighting coefficient $\beta$ in the VAE model. Figure 3 presents the trends of the reconstruction loss and the KL-divergence loss with the scheduled $\beta$ and constant $\beta$ values (from $10^{-1}$ to $10^{-5}$) across 4,000 training epochs. Notably, a large $\beta$ value leads to subpar reconstruction, while a small $\beta$ value results in a large divergence between the embedding distribution and the standard Gaussian, making the balance hard to achieve. In contrast, by dynamically scheduling $\beta$ during training ($\beta_{\max} = 0.01$, $\beta_{\min} = 10^{-5}$, $\lambda = 0.7$), we not only prevent excessive KL divergence but also enhance quality. Table 4 further evaluates the learned embeddings from various $\beta$ values of the VAE model via synthetic data quality (single-column density and pair-wise column correlation estimation tasks). This demonstrates the superior performance of our proposed scheduled $\beta$ approach to train the VAE model.

Figure 3: The trends of the validation reconstruction (left) and KL-divergence (right) losses on the Adult dataset, with varying constant $\beta$, and our proposed scheduled $\beta$ ($\beta_{\max} = 0.01$, $\beta_{\min} = 10^{-5}$, $\lambda = 0.7$). The proposed scheduled $\beta$ obtains the lowest reconstruction loss with a fairly low KL-divergence loss.

Table 4: The results of single-column density and pair-wise column correlation estimation with different $\beta$ values on the Adult dataset.

| $\beta$ | Single | Pair |
|---|---|---|
| $10^{-1}$ | 1.24 | 3.05 |
| $10^{-2}$ | 0.87 | 2.79 |
| $10^{-3}$ | 0.72 | 2.25 |
| $10^{-4}$ | 0.69 | 2.01 |
| $10^{-5}$ | 41.96 | 69.17 |
| Scheduled $\beta$ | 0.58 | 1.54 |

The effect of linear noise levels.

We evaluate the effectiveness of using linear noise levels, $\sigma(t) = t$, in the diffusion process. As Section 3.3 outlines, linear noise levels lead to linear trajectories and faster sampling speed. Consequently, we compare TabSyn and two other diffusion models (STaSy and TabDDPM) in terms of the single-column density and pair-wise column correlation estimation errors relative to the number of function evaluations (NFEs), i.e., the number of denoising steps used to generate the data. As continuous-time diffusion models, the proposed TabSyn and STaSy are flexible in choosing NFEs. For TabDDPM, we use the DDIM sampler (Song et al., 2021a) to adjust NFEs. Figure 4 shows that TabSyn not only significantly improves the sampling speed but also consistently yields better performance (with fewer than 20 NFEs for optimal results). In contrast, STaSy requires 50-200 NFEs, varying by dataset, and achieves sub-optimal performance. TabDDPM achieves competitive performance with 1,000 NFEs, but its performance drops significantly when NFEs are reduced.

Figure 4: Quality of synthetic data as a function of NFEs on STaSy, TabDDPM, and TabSyn. TabSyn can generate synthetic data of the best quality with fewer NFEs (indicating faster sampling speed).
Table 5: Performance of TabSyn’s variants on the Adult dataset, on low-order statistics estimation tasks.

| Variants | Single | Pair |
|---|---|---|
| TabDDPM | 1.75 | 3.01 |
| TabSyn-OneHot | 5.59 | 6.92 |
| TabSyn-DDPM | 1.02 | 2.15 |
| TabSyn | 0.58 | 1.54 |

Comparing different encoding/diffusion methods.

We assess the effectiveness of learning the diffusion model in the latent space learned by VAE by creating two TabSyn variants: 1) TabSyn-OneHot: replacing VAE with one-hot encodings of categorical variables and 2) TabSyn-DDPM: substituting the diffusion process in Equation (5) with DDPM as used in TabDDPM. Results in Table 5 demonstrate: 1) One-hot encodings for categorical variables plus continuous diffusion models lead to the worst performance, indicating that it is not appropriate to treat categorical columns simply as continuous features; 2) TabSyn-DDPM in the latent space outperforms TabDDPM in the data space, highlighting the benefit of learning high-quality latent embeddings for improved diffusion modeling; 3) TabSyn surpasses TabSyn-DDPM, indicating the advantage of employing tailored diffusion models in the continuous latent space for better data distribution learning.

4.5 Visualization

In Figure 5, we compare column density across eight columns from four datasets (one numerical and one categorical per dataset). TabDDPM matches TabSyn’s accuracy on numerical columns but falls short on categorical ones. Figure 6 displays the divergence heatmap between estimated pair-wise column correlations and real correlations. TabSyn gives the most accurate correlation estimation, while other methods exhibit suboptimal performance. These results justify that employing generative modeling in latent space enhances the learning of categorical features and joint column distributions.

Figure 5: Visualization of synthetic data’s single column distribution density (from STaSy, TabDDPM, and TabSyn) vs. the real data. Upper: numerical columns; Lower: categorical columns. Note that on numerical columns TabSyn performs competitively with the baselines, while it excels in estimating categorical column distributions.
Figure 6: Heatmaps of the pair-wise column correlation of synthetic data vs. the real data. Each value represents the absolute divergence between the real and estimated correlations (the lighter, the better). TabSyn gives the most accurate column correlation estimation.

5 Conclusions

In this paper, we have proposed TabSyn for synthetic tabular data generation. The TabSyn framework leverages a VAE to map tabular data into a latent space, followed by utilizing a diffusion-based generative model to learn the latent distribution. This approach presents the dual advantages of accommodating numerical and categorical features within a unified latent space, thus facilitating a more comprehensive understanding of their interrelationships and enabling the utilization of advanced generative models in a continuous embedding space. To address potential challenges, TabSyn proposes a model design and training methods, resulting in a highly stable generative model. In addition, TabSyn rectifies the deficiency in prior research by employing a diverse set of evaluation metrics to comprehensively compare the proposed method with existing approaches, showcasing the remarkable quality and fidelity of the generated samples in capturing the original data distribution.

References

  • Alaa et al. (2022) Ahmed Alaa, Boris Van Breugel, Evgeny S Saveliev, and Mihaela van der Schaar. How faithful is your synthetic data? sample-level metrics for evaluating and auditing generative models. In International Conference on Machine Learning, pp. 290–306. PMLR, 2022.
  • Assefa et al. (2021) Samuel A. Assefa, Danial Dervovic, Mahmoud Mahfouz, Robert E. Tillman, Prashant Reddy, and Manuela Veloso. Generating synthetic data in finance: Opportunities, challenges and pitfalls. In Proceedings of the First ACM International Conference on AI in Finance, ICAIF ’20. Association for Computing Machinery, 2021. ISBN 9781450375849.
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Blattmann et al. (2023) Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023.
  • Borisov et al. (2023) Vadim Borisov, Kathrin Sessler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Language models are realistic tabular data generators. In The Eleventh International Conference on Learning Representations, 2023.
  • Chawla et al. (2002) Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.
  • Chen & Guestrin (2016) Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.
  • Esser et al. (2021) Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12873–12883, 2021.
  • Fonseca & Bacao (2023) Joao Fonseca and Fernando Bacao. Tabular and latent space synthetic data generation: a literature review. Journal of Big Data, 10(1):115, 2023.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pp. 2672–2680, 2014.
  • Gorishniy et al. (2021) Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 18932–18943, 2021.
  • Hernandez et al. (2022) Mikel Hernandez, Gorka Epelde, Ane Alberdi, Rodrigo Cilla, and Debbie Rankin. Synthetic data generation for tabular health records: A systematic review. Neurocomputing, 493:28–45, 2022.
  • Higgins et al. (2016) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In The Forth International Conference on Learning Representations, 2016.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 6840–6851, 2020.
  • Hoogeboom et al. (2021) Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 12454–12465, 2021.
  • Karras et al. (2022) Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 26565–26577, 2022.
  • Kim et al. (2022) Jayoung Kim, Chaejeong Lee, Yehjin Shin, Sewon Park, Minjung Kim, Noseong Park, and Jihoon Cho. Sos: Score-based oversampling for tabular data. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 762–772, 2022.
  • Kim et al. (2023) Jayoung Kim, Chaejeong Lee, and Noseong Park. Stasy: Score-based tabular data synthesis. In The Eleventh International Conference on Learning Representations, 2023.
  • Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kotelnikov et al. (2023) Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. Tabddpm: Modelling tabular data with diffusion models. In International Conference on Machine Learning, pp. 17564–17579. PMLR, 2023.
  • Lee et al. (2023) Chaejeong Lee, Jayoung Kim, and Noseong Park. Codi: Co-evolving contrastive diffusion models for mixed-type tabular synthesis. In International Conference on Machine Learning, pp. 18940–18956. PMLR, 2023.
  • Li et al. (2022) Yang Li, Yichuan Mo, Liangliang Shi, and Junchi Yan. Improving generative adversarial networks via adversarial learning in latent space. Advances in Neural Information Processing Systems, 35:8868–8881, 2022.
  • Liu et al. (2023a) Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023a.
  • Liu et al. (2023b) Tennison Liu, Zhaozhi Qian, Jeroen Berrevoets, and Mihaela van der Schaar. Goggle: Generative modelling for tabular data by learning relational structure. In The Eleventh International Conference on Learning Representations, 2023b.
  • Lugmayr et al. (2022) Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471, 2022.
  • Razavi et al. (2019) Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 14866–14876, 2019.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
  • Song et al. (2021a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In The Ninth International Conference on Learning Representations, 2021a.
  • Song et al. (2021b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In The Ninth International Conference on Learning Representations, 2021b.
  • Vahdat et al. (2021) Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 11287–11302, 2021.
  • van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6309–6318, 2017.
  • Wu et al. (2023) Qitian Wu, Chenxiao Yang, Wentao Zhao, Yixuan He, David Wipf, and Junchi Yan. Difformer: Scalable (graph) transformers induced by energy constrained diffusion. In The Eleventh International Conference on Learning Representations, 2023.
  • Xu et al. (2019) Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using conditional gan. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 7335–7345, 2019.
  • Zheng & Charoenphakdee (2022) Shuhan Zheng and Nontawat Charoenphakdee. Diffusion models for missing value imputation in tabular data. arXiv preprint arXiv:2210.17128, 2022.

Appendix A Algorithms

In this section, we provide an algorithmic illustration of the proposed TabSyn. Algorithm 1 and Algorithm 2 present the algorithms of the VAE and Diffusion phases of the training process of TabSyn, respectively. Algorithm 3 presents TabSyn’s sampling algorithm.

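Algorithm 1 TabSyn: Training of VAE
1: Sample $\boldsymbol{z} = (\boldsymbol{z}^{\text{num}}, \boldsymbol{z}^{\text{cat}}) \sim p(\mathcal{T})$
2: Get tokenized feature $\boldsymbol{e}$ via Eq. 1
3: Get $\boldsymbol{\mu}$ and $\log \boldsymbol{\sigma}$ via VAE’s Transformer encoder
4: Reparameterization: $\hat{\boldsymbol{z}} = \boldsymbol{\mu} + \boldsymbol{\varepsilon} \odot \boldsymbol{\sigma}$, where $\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$
5: Get $\hat{\boldsymbol{e}}$ via VAE’s Transformer decoder
6: Get detokenized feature $\hat{\boldsymbol{z}}$ via Eq. 3
7: Calculate loss $\mathcal{L} = \ell_{\text{recon}}(\boldsymbol{z}, \hat{\boldsymbol{z}}) + \beta\, \ell_{\text{kl}}(\boldsymbol{\mu}, \boldsymbol{\sigma})$
8: Update the network parameter via Adam optimizer
9: if $\ell_{\text{recon}}$ fails to decrease for $S$ steps then
10:   $\beta \leftarrow \lambda \beta$
11: end if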
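
Algorithm 2 TabSyn: Training of Diffusion
1: Sample the embedding $\boldsymbol{z}_0$ from $p(\boldsymbol{z}) = p(\boldsymbol{\mu})$
2: Sample time steps $t$ from $p(t)$, then get $\sigma(t)$
3: Sample noise vectors $\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \sigma^2(t) \boldsymbol{I})$
4: Get perturbed data $\boldsymbol{z}_t = \boldsymbol{z}_0 + \boldsymbol{\varepsilon}$
5: Calculate loss $\ell(\theta) = \| \boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, t) - \boldsymbol{\varepsilon} \|_2^2$
6: Update the network parameter $\theta$ via Adam optimizer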
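
Algorithm 3 TabSyn: Sampling
1: Sample $\boldsymbol{z}_T \sim \mathcal{N}(\boldsymbol{0}, \sigma^2(T) \boldsymbol{I})$, $t_{\max} = T$
2: for $i = \max, \ldots, 1$ do
3:   $\nabla_{\boldsymbol{z}_{t_i}} \log p(\boldsymbol{z}_{t_i}) = -\boldsymbol{\epsilon}_\theta(\boldsymbol{z}_{t_i}, t_i) / \sigma(t_i)$
4:   Get $\boldsymbol{z}_{t_{i-1}}$ via solving the SDE in Eq. 6
5: end for
6: Put $\boldsymbol{z}_0$ as input of the VAE’s Transformer decoder, then acquire $\hat{\boldsymbol{e}}$
7: Get detokenized feature $\hat{\boldsymbol{z}}$ via Eq. 3
8: $\hat{\boldsymbol{z}}$ is the sampled synthetic data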
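
Steps 9 to 11 of Algorithm 1 shrink the KL weight β whenever the reconstruction loss stalls. The following is a minimal Python sketch of one way such a rule could be implemented. The function name `update_beta`, the stall criterion (nothing in the last S steps beats the best earlier loss), and the example values of S and λ are assumptions made for illustration; they are not taken from the released code.

```python
def update_beta(beta, recon_losses, patience, decay):
    """Return the KL weight after checking for a reconstruction-loss plateau.

    recon_losses: reconstruction losses observed so far (oldest first).
    patience:     S in Algorithm 1, the number of recent steps to inspect.
    decay:        lambda in Algorithm 1, the multiplicative factor for beta.
    """
    if len(recon_losses) <= patience:
        return beta  # not enough history to judge a plateau
    best_before = min(recon_losses[:-patience])
    best_recent = min(recon_losses[-patience:])
    # "Fails to decrease": nothing in the last S steps beats the earlier best.
    if best_recent >= best_before:
        return decay * beta
    return beta


# Hypothetical values: S = 3 and lambda = 0.5.
stalled = update_beta(1.0, [1.0, 0.9, 0.95, 0.92, 0.91], patience=3, decay=0.5)
improving = update_beta(1.0, [1.0, 0.9, 0.8, 0.7, 0.6], patience=3, decay=0.5)
print(stalled, improving)
```

In the first call the last three losses never go below the earlier best of 0.9, so β is multiplied by λ; in the second call the loss keeps decreasing and β is left unchanged.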

Appendix B Diffusion Models Basics

Diffusion models are often presented as a pair of two processes.

  • A fixed forward process governs the training of the model, which adds Gaussian noises of increasing scales to the original data.

  • A corresponding backward process involves utilizing the trained model iteratively to denoise the samples starting from a fully noisy prior distribution.

B.1 Forward Process

Although there are different mathematical formulations (discrete or continuous) of the diffusion model, Song et al. (2021b) provides a unified formulation via the Stochastic Differential Equation (SDE) and defines the forward process of Diffusion as (note that in this paper, the independent variable is denoted as 𝒛)

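$$\mathrm{d}\boldsymbol{z} = \boldsymbol{f}(\boldsymbol{z}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\boldsymbol{w}_t, \tag{8}$$

where $\boldsymbol{f}(\cdot)$ and $g(\cdot)$ are the drift and diffusion coefficients, which are selected differently for different diffusion processes, e.g., the variance preserving (VP) and variance exploding (VE) formulations, and $\boldsymbol{w}_t$ is the standard Wiener process. Usually, $\boldsymbol{f}(\cdot)$ is of the form $\boldsymbol{f}(\boldsymbol{z}, t) = f(t)\,\boldsymbol{z}$. Thus, the SDE can be equivalently written as

$$\mathrm{d}\boldsymbol{z} = f(t)\,\boldsymbol{z}\,\mathrm{d}t + g(t)\,\mathrm{d}\boldsymbol{w}_t. \tag{9}$$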

Let 𝒛 be a function of the time t, i.e., 𝒛t=𝒛(t), then the conditional distribution of 𝒛t given 𝒛0 (named as the perturbation kernel of the SDE) could be formulated as:

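$$p(\boldsymbol{z}_t \mid \boldsymbol{z}_0) = \mathcal{N}\left(\boldsymbol{z}_t;\ s(t)\,\boldsymbol{z}_0,\ s^2(t)\,\sigma^2(t)\,\boldsymbol{I}\right), \tag{10}$$

where

$$s(t) = \exp\left(\int_0^t f(\xi)\,\mathrm{d}\xi\right), \quad \text{and} \quad \sigma(t) = \sqrt{\int_0^t \frac{g^2(\xi)}{s^2(\xi)}\,\mathrm{d}\xi}. \tag{11}$$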

Therefore, the forward diffusion process could be equivalently formulated by defining the perturbation kernels (via defining appropriate s(t) and σ(t)).

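Variance Preserving (VP) implements the perturbation kernel Eq. 10 by setting $s(t) = \sqrt{1 - \beta(t)}$ and $\sigma(t) = \sqrt{\beta(t) / (1 - \beta(t))}$, so that $s^2(t) + s^2(t)\sigma^2(t) = 1$. Denoising Diffusion Probabilistic Models (DDPM, Ho et al. (2020)) belong to VP-SDE by using discrete finite time steps and giving specific functional definitions of $\beta(t)$.

Variance Exploding (VE) implements the perturbation kernel Eq. 10 by setting $s(t) = 1$, indicating that the noise is directly added to the data rather than mixed with it by weighting. Therefore, the noise variance (the noise level) is entirely determined by $\sigma(t)$. The diffusion model used in TabSyn belongs to VE-SDE, but we use a linear noise level (i.e., $\sigma(t) = t$) rather than $\sigma(t) = \sqrt{t}$ as in the vanilla VE-SDE (Song et al., 2021b). When $s(t) = 1$, the perturbation kernel becomes:

$$p(\boldsymbol{z}_t \mid \boldsymbol{z}_0) = \mathcal{N}\left(\boldsymbol{z}_t;\ \boldsymbol{z}_0,\ \sigma^2(t)\,\boldsymbol{I}\right) \ \Rightarrow\ \boldsymbol{z}_t = \boldsymbol{z}_0 + \sigma(t)\,\boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}), \tag{12}$$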

which aligns with the forward diffusion process in Eq. 5.
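
The VE perturbation in Eq. 12 and the noise-prediction loss of Algorithm 2 can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the helper names `perturb` and `noise_prediction_loss`, the toy latent vector, and the value of t are hypothetical, and a real implementation would use a trained network in place of the exact noise recovered here.

```python
import random


def perturb(z0, sigma_t, rng):
    """Draw eps ~ N(0, I) and return (z_t, eps) with z_t = z_0 + sigma(t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [z + sigma_t * e for z, e in zip(z0, eps)]
    return z_t, eps


def noise_prediction_loss(eps_pred, eps):
    """Squared L2 distance between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps))


rng = random.Random(0)   # fixed seed so the sketch is repeatable
z0 = [0.5, -1.0, 2.0]    # hypothetical latent embedding
t = 0.8
sigma_t = t              # linear noise level, sigma(t) = t
z_t, eps = perturb(z0, sigma_t, rng)

# A perfect denoiser would output eps itself; here it is recovered from z_t and z_0.
recovered = [(zt - z) / sigma_t for zt, z in zip(z_t, z0)]
print(noise_prediction_loss(recovered, eps))
```

Because s(t) = 1, the clean latent is not rescaled: the noise is simply added on top of it, and the loss vanishes (up to rounding) when the predicted noise matches the sampled one.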

B.2 Reverse Process

The sampling process of diffusion models is defined by a corresponding reverse SDE:

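$$\mathrm{d}\boldsymbol{z} = \left[\boldsymbol{f}(\boldsymbol{z}, t) - g^2(t)\,\nabla_{\boldsymbol{z}} \log p_t(\boldsymbol{z})\right]\mathrm{d}t + g(t)\,\mathrm{d}\boldsymbol{w}_t. \tag{13}$$

For VE-SDE, $s(t) = 1 \Leftrightarrow \boldsymbol{f}(\boldsymbol{z}, t) = f(t)\,\boldsymbol{z} = \boldsymbol{0}$, and

$$\sigma(t) = \sqrt{\int_0^t g^2(\xi)\,\mathrm{d}\xi} \ \Rightarrow\ \int_0^t g^2(\xi)\,\mathrm{d}\xi = \sigma^2(t), \quad g^2(t) = \frac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t} = 2\sigma(t)\dot{\sigma}(t), \quad g(t) = \sqrt{2\sigma(t)\dot{\sigma}(t)}. \tag{14}$$

Plugging $g(t)$ into Eq. 13, the reverse process in Eq. 6 is recovered:

$$\mathrm{d}\boldsymbol{z} = -2\sigma(t)\dot{\sigma}(t)\,\nabla_{\boldsymbol{z}} \log p_t(\boldsymbol{z})\,\mathrm{d}t + \sqrt{2\sigma(t)\dot{\sigma}(t)}\,\mathrm{d}\boldsymbol{\omega}_t. \tag{15}$$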
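
As a worked instance of Eq. 14 and Eq. 15, the block below substitutes the linear noise level σ(t) = t used by TabSyn. It is a direct substitution rather than a new result, and it assumes t > 0 so that the square root is real.

```latex
\sigma(t) = t
\;\Rightarrow\;
\dot{\sigma}(t) = 1,
\qquad
g^2(t) = 2\sigma(t)\dot{\sigma}(t) = 2t,
\qquad
g(t) = \sqrt{2t},
\\[4pt]
\mathrm{d}\boldsymbol{z}
  = -2t\,\nabla_{\boldsymbol{z}} \log p_t(\boldsymbol{z})\,\mathrm{d}t
  + \sqrt{2t}\,\mathrm{d}\boldsymbol{\omega}_t .
```

Both the drift and the diffusion coefficient shrink as t approaches 0, so the injected noise fades out towards the end of sampling.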

B.3 Training

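As $\boldsymbol{f}(\boldsymbol{z}, t)$, $g(t)$, and $\boldsymbol{w}_t$ are all known, if $\nabla_{\boldsymbol{z}} \log p_t(\boldsymbol{z})$ (named the score function) is also available, we can sample synthetic data via the reverse process from random noises. Diffusion models train a neural network (named the denoising function) $D_\theta(\boldsymbol{z}_t, t)$ to approximate $\nabla_{\boldsymbol{z}} \log p_t(\boldsymbol{z})$. However, $\nabla_{\boldsymbol{z}} \log p_t(\boldsymbol{z})$ itself is intractable, as the marginal distribution $p_t(\boldsymbol{z}) = p(\boldsymbol{z}_t)$ is intractable. Fortunately, the conditional distribution $p(\boldsymbol{z}_t \mid \boldsymbol{z}_0)$ is tractable. Therefore, we can instead train the denoising function to approximate the conditional score function $\nabla_{\boldsymbol{z}_t} \log p(\boldsymbol{z}_t \mid \boldsymbol{z}_0)$, and the training process is called denoising score matching:

$$\min\ \mathbb{E}_{\boldsymbol{z}_0 \sim p(\boldsymbol{z}_0)}\, \mathbb{E}_{\boldsymbol{z}_t \sim p(\boldsymbol{z}_t \mid \boldsymbol{z}_0)} \left\| D_\theta(\boldsymbol{z}_t, t) - \nabla_{\boldsymbol{z}_t} \log p(\boldsymbol{z}_t \mid \boldsymbol{z}_0) \right\|_2^2, \tag{16}$$

where $\nabla_{\boldsymbol{z}_t} \log p(\boldsymbol{z}_t \mid \boldsymbol{z}_0)$ has an analytical solution according to Eq. 10:

$$\begin{aligned}
\nabla_{\boldsymbol{z}_t} \log p(\boldsymbol{z}_t \mid \boldsymbol{z}_0)
&= \frac{1}{p(\boldsymbol{z}_t \mid \boldsymbol{z}_0)}\, \nabla_{\boldsymbol{z}_t} p(\boldsymbol{z}_t \mid \boldsymbol{z}_0) \\
&= \frac{1}{p(\boldsymbol{z}_t \mid \boldsymbol{z}_0)} \cdot \left(-\frac{1}{s^2(t)\sigma^2(t)} \left(\boldsymbol{z}_t - s(t)\boldsymbol{z}_0\right)\right) \cdot p(\boldsymbol{z}_t \mid \boldsymbol{z}_0) \\
&= -\frac{1}{s^2(t)\sigma^2(t)} \left(\boldsymbol{z}_t - s(t)\boldsymbol{z}_0\right) \\
&= -\frac{1}{s^2(t)\sigma^2(t)} \left(s(t)\boldsymbol{z}_0 + s(t)\sigma(t)\boldsymbol{\varepsilon} - s(t)\boldsymbol{z}_0\right) \\
&= -\frac{\boldsymbol{\varepsilon}}{s(t)\sigma(t)}.
\end{aligned} \tag{17}$$

Therefore, Eq. 16 becomes

$$\min\ \mathbb{E}_{\boldsymbol{z}_0 \sim p(\boldsymbol{z}_0)}\, \mathbb{E}_{\boldsymbol{z}_t \sim p(\boldsymbol{z}_t \mid \boldsymbol{z}_0)} \left\| D_\theta(\boldsymbol{z}_t, t) + \frac{\boldsymbol{\varepsilon}}{s(t)\sigma(t)} \right\|_2^2 \ \Rightarrow\ \min\ \mathbb{E}_{\boldsymbol{z}_0 \sim p(\boldsymbol{z}_0)}\, \mathbb{E}_{\boldsymbol{z}_t \sim p(\boldsymbol{z}_t \mid \boldsymbol{z}_0)} \left\| \boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, t) - \boldsymbol{\varepsilon} \right\|_2^2, \tag{18}$$

where $D_\theta(\boldsymbol{z}_t, t) = -\dfrac{\boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, t)}{s(t)\sigma(t)}$. After the training ends, sampling is enabled by solving Eq. 13 (replacing $\nabla_{\boldsymbol{z}} \log p_t(\boldsymbol{z})$ with $D_\theta(\boldsymbol{z}_t, t)$).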
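
The closed form of the conditional score in Eq. 17 can be checked numerically in one dimension. The sketch below compares a central finite difference of the Gaussian log-density of Eq. 10 against the analytic value; the function names and the particular values of s, σ, 𝒛0, and 𝜺 are arbitrary choices made for the check.

```python
import math


def log_kernel(zt, z0, s, sigma):
    """Log-density of N(zt; s * z0, (s * sigma)^2) in one dimension (Eq. 10)."""
    var = (s * sigma) ** 2
    return -0.5 * math.log(2 * math.pi * var) - (zt - s * z0) ** 2 / (2 * var)


def score_fd(zt, z0, s, sigma, h=1e-5):
    """Central finite-difference estimate of d/dzt log p(zt | z0)."""
    return (log_kernel(zt + h, z0, s, sigma) - log_kernel(zt - h, z0, s, sigma)) / (2 * h)


z0, s, sigma, eps = 0.3, 0.8, 1.5, -0.7   # arbitrary test values
zt = s * z0 + s * sigma * eps             # sample from the perturbation kernel
analytic = -eps / (s * sigma)             # closed form from Eq. 17
numeric = score_fd(zt, z0, s, sigma)
print(analytic, numeric)
```

The two values agree up to floating-point error, which is the identity that lets the network regress the noise instead of the score.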

Appendix C Proofs

C.1 Proof for Proposition 1

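We first introduce Lemma 1 (from Karras et al. (2022)), which describes a family of SDEs that share the same solution trajectories but have different forward processes:

Lemma 1.

Let $g(t)$ be a free function of $t$. Then the following family of (forward) SDEs has the same marginal distributions of the solution trajectories with noise levels $\sigma(t)$ for any choice of $g(t)$:

$$\mathrm{d}\boldsymbol{z} = \left(\frac{1}{2} g^2(t) - \dot{\sigma}(t)\sigma(t)\right) \nabla_{\boldsymbol{z}} \log p(\boldsymbol{z}; \sigma(t))\,\mathrm{d}t + g(t)\,\mathrm{d}\omega_t. \tag{19}$$

The reverse family of SDEs of Eq. 19 is given by changing the sign of the first term:

$$\mathrm{d}\boldsymbol{z} = -\frac{1}{2} g^2(t)\,\nabla_{\boldsymbol{z}} \log p(\boldsymbol{z}; \sigma(t))\,\mathrm{d}t - \dot{\sigma}(t)\sigma(t)\,\nabla_{\boldsymbol{z}} \log p(\boldsymbol{z}; \sigma(t))\,\mathrm{d}t + g(t)\,\mathrm{d}\omega_t. \tag{20}$$

Lemma 1 indicates that for a specific (forward) SDE following Eq. 19, we can obtain its solution trajectory by solving Eq. 20 for any choice of $g(t)$.

Since our forward diffusion process (Eq. 5) lets $g(t) = \sqrt{2\dot{\sigma}(t)\sigma(t)}$ (see derivations in Appendix B), its solution trajectory can be obtained by letting $g(t) = 0$ in Eq. 20:

$$\mathrm{d}\boldsymbol{z} = -\dot{\sigma}(t)\sigma(t)\,\nabla_{\boldsymbol{z}} \log p(\boldsymbol{z}; \sigma(t))\,\mathrm{d}t, \qquad \frac{\mathrm{d}\boldsymbol{z}}{\mathrm{d}t} = -\dot{\sigma}(t)\sigma(t)\,\nabla_{\boldsymbol{z}} \log p(\boldsymbol{z}; \sigma(t)). \tag{21}$$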

Eq. 21 is usually named the Probability-Flow ODE, since it depicts a deterministic reverse process without noise terms. Based on Lemma 1, we can study the solution of Eq. 6 using Eq. 21.

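To prove Proposition 1, we only have to study the absolute error between the ground-truth $\boldsymbol{z}_{t - \Delta t}$ and the approximated one obtained by solving Eq. 6 from $\boldsymbol{z}_t$, where $\Delta t \to 0$:

$$\left\| \boldsymbol{z}_{t - \Delta t} - \boldsymbol{z}_{t - \Delta t}^{\text{Euler}} \right\|. \tag{22}$$

Since $\boldsymbol{z}_t = \boldsymbol{z}_0 + \sigma(t)\boldsymbol{\varepsilon}$, there is $\boldsymbol{z}_{t - \Delta t} = \boldsymbol{z}_t - \boldsymbol{\varepsilon}\left(\sigma(t) - \sigma(t - \Delta t)\right)$.

The solution of $\boldsymbol{z}_{t - \Delta t}$ from $\boldsymbol{z}_t$ can be obtained using the 1st-order Euler method:

$$\begin{aligned}
\boldsymbol{z}_{t - \Delta t}^{\text{Euler}}
&= \boldsymbol{z}_t - \Delta t\, \frac{\mathrm{d}\boldsymbol{z}_t}{\mathrm{d}t} \\
&= \boldsymbol{z}_t + \Delta t\, \dot{\sigma}(t)\sigma(t)\, \nabla_{\boldsymbol{z}_t} \log p(\boldsymbol{z}_t) \\
&= \boldsymbol{z}_t + \Delta t\, \dot{\sigma}(t)\sigma(t) \left(-\boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, t) / \sigma(t)\right) \\
&= \boldsymbol{z}_t - \dot{\sigma}(t)\, \boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, t)\, \Delta t \\
&= \boldsymbol{z}_t - \dot{\sigma}(t)\, \boldsymbol{\varepsilon}\, \Delta t,
\end{aligned} \tag{23}$$

$$\left\| \boldsymbol{z}_{t - \Delta t} - \boldsymbol{z}_{t - \Delta t}^{\text{Euler}} \right\| = \left\| \boldsymbol{\varepsilon}\left(\sigma(t) - \sigma(t - \Delta t)\right) - \dot{\sigma}(t)\, \boldsymbol{\varepsilon}\, \Delta t \right\| = \left\| \boldsymbol{\varepsilon}\left(\sigma(t) - \sigma(t - \Delta t) - \dot{\sigma}(t)\Delta t\right) \right\|. \tag{24}$$

Specifically, if $\sigma(t) = t$ and hence $\dot{\sigma}(t) = 1$, there is

$$\left\| \boldsymbol{z}_{t - \Delta t} - \boldsymbol{z}_{t - \Delta t}^{\text{Euler}} \right\| = 0. \tag{25}$$

In comparison, setting $\sigma(t) = \sqrt{t}$ (as in VE-SDE, Song et al. (2021b)) leads to

$$\left\| \boldsymbol{z}_{t - \Delta t} - \boldsymbol{z}_{t - \Delta t}^{\text{Euler}} \right\| = \left\| \boldsymbol{\varepsilon}\left(\sqrt{t} - \sqrt{t - \Delta t} - \frac{\Delta t}{2\sqrt{t}}\right) \right\| \neq 0. \tag{26}$$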

Therefore, the proof is complete.
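
The scalar factor in Eq. 24 is easy to evaluate numerically for both noise schedules. The sketch below is a sanity check rather than part of the proof; the helper name `euler_gap` and the chosen values t = 1.0 and Δt = 0.1 are arbitrary.

```python
import math


def euler_gap(sigma, sigma_dot, t, dt):
    """|sigma(t) - sigma(t - dt) - sigma_dot(t) * dt|, the scalar factor in Eq. 24."""
    return abs(sigma(t) - sigma(t - dt) - sigma_dot(t) * dt)


# Linear schedule sigma(t) = t: the Euler step is exact (Eq. 25).
linear_gap = euler_gap(lambda t: t, lambda t: 1.0, t=1.0, dt=0.1)

# Square-root schedule sigma(t) = sqrt(t): a nonzero residual remains (Eq. 26).
sqrt_gap = euler_gap(math.sqrt, lambda t: 0.5 / math.sqrt(t), t=1.0, dt=0.1)

print(linear_gap, sqrt_gap)
```

The linear schedule gives a gap that is zero up to floating-point rounding, while the square-root schedule leaves a residual on the order of 1e-3 for this step size.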

Appendix D Network Architectures

D.1 Architectures of VAE

In Sec. 3.2, we introduce the general architecture of our VAE module, which consists of a tokenizer, a Transformer encoder, a Transformer decoder, and a detokenizer. Figure 7 is a detailed illustration of the architectures of the Transformer encoder and decoder.

Figure 7: Architectures of VAE’s Encoder and Decoder. Both the encoder and decoder are implemented as a two-layer Transformer of identical architectures.

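The VAE’s encoder takes the tokenizer’s output $\boldsymbol{E} \in \mathbb{R}^{M \times d}$ as input. As we are using the Variational AutoEncoder, the encoder module consists of a $\boldsymbol{\mu}$ encoder and a $\log \boldsymbol{\sigma}$ encoder of the same architecture. Each encoder is implemented as a two-layer Transformer, each layer with a Self-Attention (without multi-head) module and a Feed Forward Neural Network (FFNN). The FFNN used in TabSyn is a simple two-layer MLP with ReLU activation (the input of the FFNN is denoted by $\boldsymbol{H}_0$):

$$\boldsymbol{H}_1 = \text{ReLU}(\text{FC}(\boldsymbol{H}_0)) \in \mathbb{R}^{M \times D}, \qquad \boldsymbol{H}_2 = \text{FC}(\boldsymbol{H}_1) \in \mathbb{R}^{M \times d}, \tag{27}$$

where FC denotes the fully-connected layer, and $D$ is the FFNN’s hidden dimension. In this paper, we set $d = 4$ and $D = 128$ for all datasets. "Add & Norm" in Figure 7 denotes residual connection and Layer Normalization (Ba et al., 2016), respectively.

The VAE encoder outputs two matrices: the mean matrix $\boldsymbol{\mu} \in \mathbb{R}^{M \times d}$ and the log standard deviation matrix $\log \boldsymbol{\sigma} \in \mathbb{R}^{M \times d}$. Then, the latent variables are obtained via the reparameterization trick:

$$\boldsymbol{Z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}). \tag{28}$$

The VAE’s decoder is another two-layer Transformer of the same architecture as the encoder, and it takes $\boldsymbol{Z}$ as input. The decoder is expected to output $\hat{\boldsymbol{E}} \in \mathbb{R}^{M \times d}$ for the detokenizer.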
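
The reparameterization step that turns the encoder outputs into latent variables can be sketched with plain Python lists. This is a minimal illustration, assuming the noise is applied element-wise and that the encoder emits log 𝝈 (so 𝝈 is recovered by exponentiation); the function name and the toy values of 𝝁 and log 𝝈 are hypothetical, and M = 3 is chosen only to keep the example small.

```python
import math
import random


def reparameterize(mu, log_sigma, rng):
    """Return Z = mu + sigma * eps element-wise, with eps ~ N(0, I)."""
    return [
        [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu_row, ls_row)]
        for mu_row, ls_row in zip(mu, log_sigma)
    ]


M, d = 3, 4  # M columns (tokens); d = 4 as in the text
rng = random.Random(0)
mu = [[0.1 * (i + j) for j in range(d)] for i in range(M)]
log_sigma = [[-1.0] * d for _ in range(M)]

Z = reparameterize(mu, log_sigma, rng)
print(len(Z), len(Z[0]))  # Z keeps the M x d shape of mu
```

Sampling through 𝜺 rather than from the distribution directly keeps 𝒁 a differentiable function of 𝝁 and log 𝝈, which is what allows gradient-based training of the encoder.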

D.2 Architectures of Denoising MLP

In Figure 8, we present the architecture of the denoising neural network $\boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, t)$ in Eq. 7, which is a simple MLP with an architecture similar to the one used in TabDDPM (Kotelnikov et al., 2023).

Figure 8: Architecture of the denoising function $\boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, t)$. The denoising function is a 5-layer MLP with SiLU activations. $\boldsymbol{t}_{\text{emb}}$ is the sinusoidal timestep embedding.

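The denoising MLP takes the current time step $t$ and the corresponding latent vector $\boldsymbol{z}_t \in \mathbb{R}^{1 \times Md}$ as input. First, $\boldsymbol{z}_t$ is fed into a linear projection layer that converts the vector dimension to $d_{\text{hidden}}$:

$$\boldsymbol{h}_0 = \text{FC}_{\text{in}}(\boldsymbol{z}_t) \in \mathbb{R}^{1 \times d_{\text{hidden}}}, \tag{29}$$

where $\boldsymbol{h}_0$ is the transformed vector, and $d_{\text{hidden}}$ is the output dimension of the input layer.

Then, following the practice in TabDDPM (Kotelnikov et al., 2023), the sinusoidal timestep embedding $\boldsymbol{t}_{\text{emb}} \in \mathbb{R}^{1 \times d_{\text{hidden}}}$ is added to $\boldsymbol{h}_0$ to obtain the input vector $\boldsymbol{h}_{\text{in}}$:

$$\boldsymbol{h}_{\text{in}} = \boldsymbol{h}_0 + \boldsymbol{t}_{\text{emb}}. \tag{30}$$

The hidden layers are three fully connected layers of sizes $d_{\text{hidden}} \to 2d_{\text{hidden}} \to 2d_{\text{hidden}} \to d_{\text{hidden}}$, with SiLU activation functions (consistent with TabDDPM (Kotelnikov et al., 2023)):

$$\boldsymbol{h}_1 = \text{SiLU}(\text{FC}_1(\boldsymbol{h}_{\text{in}})) \in \mathbb{R}^{1 \times 2d_{\text{hidden}}}, \quad \boldsymbol{h}_2 = \text{SiLU}(\text{FC}_2(\boldsymbol{h}_1)) \in \mathbb{R}^{1 \times 2d_{\text{hidden}}}, \quad \boldsymbol{h}_3 = \text{SiLU}(\text{FC}_3(\boldsymbol{h}_2)) \in \mathbb{R}^{1 \times d_{\text{hidden}}}. \tag{31}$$

The estimated score is obtained via the last linear layer:

$$\boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t, t) = \boldsymbol{h}_{\text{out}} = \text{FC}_{\text{out}}(\boldsymbol{h}_3) \in \mathbb{R}^{1 \times d_{\text{in}}}. \tag{32}$$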

Finally, 𝜺θ(𝒛t,t) is applied to Eq. 7 for model training.
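
The layer sizes described in this subsection can be tabulated with a short sketch, which also makes the "5-layer" count explicit (one input projection, three hidden layers, one output layer). The sketch assumes that the output dimension equals the flattened latent length M·d, since the predicted noise must match the shape of 𝒛t; the function names and the example sizes (M = 15, which is the column count of Adult in Table 6, and a hidden width of 1024) are hypothetical choices for illustration.

```python
def denoiser_layer_shapes(num_tokens, token_dim, hidden_dim):
    """List (name, in_features, out_features) for each fully connected layer."""
    d_in = num_tokens * token_dim  # flattened latent z_t has length M * d
    return [
        ("FC_in", d_in, hidden_dim),
        ("FC_1", hidden_dim, 2 * hidden_dim),
        ("FC_2", 2 * hidden_dim, 2 * hidden_dim),
        ("FC_3", 2 * hidden_dim, hidden_dim),
        ("FC_out", hidden_dim, d_in),
    ]


def count_parameters(shapes):
    """Weights plus biases of every fully connected layer."""
    return sum(n_in * n_out + n_out for _, n_in, n_out in shapes)


# Hypothetical sizes: M = 15 columns, d = 4, hidden width 1024.
shapes = denoiser_layer_shapes(num_tokens=15, token_dim=4, hidden_dim=1024)
for name, n_in, n_out in shapes:
    print(f"{name}: {n_in} -> {n_out}")
print("parameters:", count_parameters(shapes))
```

Because the timestep embedding is added to the output of the input projection, it needs no extra input features, so the table above covers every weight matrix in the MLP itself.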

Appendix E Details of Experimental Setups

We implement TabSyn and all the baseline methods with PyTorch. All the methods are optimized with Adam (Kingma & Ba, 2015) optimizer. All the experiments are conducted on an Nvidia RTX 4090 GPU with 24G memory.

E.1 Datasets

We use 6 tabular datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/datasets): Adult, Default, Shoppers, Magic, Beijing, and News, where each tabular dataset is associated with a machine-learning task. Classification: Adult, Default, Magic, and Shoppers. Regression: Beijing and News. The statistics of the datasets are presented in Table 6.

Table 6: Statistics of datasets. # Num stands for the number of numerical columns, and # Cat stands for the number of categorical columns.
Dataset # Rows # Num # Cat # Train # Validation # Test Task
Adult 48,842 6 9 28,943 3,618 16,281 Classification
Default 30,000 14 11 24,000 3,000 3,000 Classification
Shoppers 12,330 10 8 9,864 1,233 1,233 Classification
Magic 19,019 10 1 15,215 1,902 1,902 Classification
Beijing 43,824 7 5 35,058 4,383 4,383 Regression
News 39,644 46 2 31,714 3,965 3,965 Regression

In Table 6, # Rows denotes the number of rows (records) in the table. # Num and # Cat denote the number of numerical features and categorical features, respectively. Note that the target column is counted as either a numerical or a categorical feature, depending on the task type. Specifically, the target column is counted as a categorical column if the task is classification; otherwise, it is counted as a numerical column. Each dataset is split into training, validation, and testing sets for the Machine Learning Efficiency experiments. As Adult has its official testing set, we directly use it as the testing set. The original training set of Adult is further split into training and validation sets with the ratio 8:1. The remaining datasets are split into training/validation/testing sets with the ratio 8:1:1 with a fixed seed.
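
The split sizes in Table 6 can be reproduced with simple integer arithmetic. The sketch below uses one rounding convention (validation and test each receive the ceiling of one tenth of the rows, training receives the rest) that happens to match the reported counts; this convention and the function names are assumptions, and the authors' splitting code and random seed may differ.

```python
def split_sizes(n_rows):
    """8:1:1 split: validation and test get ceil(n / 10) rows each, train the rest."""
    n_holdout = -(-n_rows // 10)  # integer ceiling of n_rows / 10
    return n_rows - 2 * n_holdout, n_holdout, n_holdout


def adult_split_sizes(n_official_train):
    """8:1 split of Adult's official training set into train and validation."""
    n_val = -(-n_official_train // 9)  # integer ceiling of n / 9
    return n_official_train - n_val, n_val


# (total rows, (train, validation, test)) as reported in Table 6.
table6 = {
    "Default": (30000, (24000, 3000, 3000)),
    "Shoppers": (12330, (9864, 1233, 1233)),
    "Magic": (19019, (15215, 1902, 1902)),
    "Beijing": (43824, (35058, 4383, 4383)),
    "News": (39644, (31714, 3965, 3965)),
}
for name, (n_rows, reported) in table6.items():
    print(name, split_sizes(n_rows), reported)

# Adult: 48,842 rows in total, of which 16,281 form the official test set.
print("Adult", adult_split_sizes(48842 - 16281))
```

For every dataset the computed triple equals the reported one, and the Adult helper returns the 28,943 / 3,618 train and validation counts listed in the table.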

Below is a detailed introduction to each dataset:

E.2 Baselines

In this section, we introduce and compare the properties of the baseline methods used in this paper.

  • CTGAN and TVAE are two methods for synthetic tabular data generation proposed in one paper (Xu et al., 2019), using the same techniques proposed but based on different basic generative models – GAN for CTGAN while VAE for TVAE. The two methods contain two important designs: 1) Mode-specific Normalization to deal with numerical columns with complicated distributions. 2) Conditional Generation of numerical columns based on categorical columns to deal with class imbalance problems.

  • GOGGLE (Liu et al., 2023b) is a recently proposed synthetic tabular data generation model based on VAE. The primary motivation of GOGGLE is that the complicated relationships between different columns are hardly exploited by previous literature. Therefore, it proposes to learn a graph adjacency matrix to model the dependency relationships between different columns. The encoder and decoder of the VAE model are both implemented as Graph Neural Networks (GNNs), and the graph adjacency matrix is jointly optimized with the GNNs’ parameters.

  • GReaT (Borisov et al., 2023) treats a row of tabular data as a sentence and applies the Auto-regressive GPT model to learn the sentence-level row distributions. GReaT involves a well-designed serialization process to transform a row into a natural language sentence of a specific format and a corresponding deserialization process to transform a sentence back to the table format. To ensure the permutation invariant property of tabular data, GReaT shuffles each row several times before serialization.

  • STaSy (Kim et al., 2023) is a recent diffusion-based model for synthetic tabular data generation. STaSy treats the one-hot encoding of categorical columns as continuous features, which are then processed together with the numerical columns. STaSy adopts the VP/VE SDEs from Song et al. (2021b) as the diffusion process to learn the distribution of tabular data. STaSy further proposes several training strategies, including self-paced learning and fine-tuning, to stabilize the training process, increasing sampling quality and diversity.

  • CoDi (Lee et al., 2023) proposes to utilize two diffusion models for numerical and categorical columns, respectively. For numerical columns, it uses the DDPM (Ho et al., 2020) model with Gaussian noises. For categorical columns, it uses the multinomial diffusion model (Hoogeboom et al., 2021) with categorical noises. The two diffusion processes are inter-conditioned on each other to model the joint distribution of numerical and categorical columns. In addition, CoDi adopts contrastive learning methods to further bind the two diffusion methods.

  • TabDDPM (Kotelnikov et al., 2023). Like CoDi, TabDDPM introduces two diffusion processes: DDPM with Gaussian noises for numerical columns and multinomial diffusion with categorical noises for categorical columns. Unlike CoDi, which introduces many additional techniques, such as co-evolved learning via inter-conditioning and contrastive learning, TabDDPM concatenates the numerical and categorical features as input and output of the denoising function (an MLP). Despite its simplicity, our experiments have shown that TabDDPM performs even better than CoDi.

We further compare the properties of these baseline methods and the proposed TabSyn in Table 7. The compared properties include: 1) Compatibility: whether the method can deal with mixed-type data columns, e.g., numerical and categorical. 2) Robustness: whether the method has stable performance across different datasets, measured by whether the standard deviation of its scores on different datasets (from Table 1 and Table 2) is at most 10%. 3) Quality: whether the synthetic data can pass the column-wise Chi-Squared Test (p ≥ 0.95). 4) Efficiency: whether the method can generate synthetic tabular data of satisfying quality within fewer than 20 steps.

Table 7: A comparison of the properties of different generative models for tabular data. Base model denotes the base generative model type: Generative Adversarial Networks (GAN), Variational AutoEncoders (VAE), Auto-Regressive Language Models (AR), and Diffusion.
Method Base Model Compatibility Robustness Quality Efficiency
CTGAN GAN  ✗  ✗
TVAE VAE  ✗
GOGGLE VAE  ✗  ✗  ✗
GReaT AR  ✗  ✗  ✗
STaSy Diffusion  ✗  ✗
CoDi Diffusion  ✗  ✗  ✗
TabDDPM Diffusion  ✗  ✗
TabSyn Diffusion

E.3 Metrics of Low-order Statistics

In this section, we give a detailed introduction of the metrics used in Sec. 4.2.

E.3.1 Column-wise Density Estimation

Kolmogorov-Smirnov Test (KST): Given two (continuous) distributions $p_r(x)$ and $p_s(x)$ ($r$ denotes real and $s$ denotes synthetic), KST quantifies the distance between the two distributions using the supremum (least upper bound) of the discrepancy between the two corresponding Cumulative Distribution Functions (CDFs):

$$\text{KST} = \sup_x \left| F_r(x) - F_s(x) \right|, \tag{33}$$

where $F_r(x)$ and $F_s(x)$ are the CDFs of $p_r(x)$ and $p_s(x)$, respectively:

$$F(x) = \int_{-\infty}^{x} p(x)\,\mathrm{d}x. \tag{34}$$

Total Variation Distance (TVD): TVD computes the frequency of each category value and expresses it as a probability. Then, the TVD score is half the sum of the absolute differences between the real and synthetic probabilities over all categories:

$$\text{TVD} = \frac{1}{2} \sum_{\omega \in \Omega} \left| R(\omega) - S(\omega) \right|, \tag{35}$$

where $\omega$ ranges over all possible categories $\Omega$ of a column, and $R(\cdot)$ and $S(\cdot)$ denote the real and synthetic frequencies of these categories.
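
Both column-wise metrics can be computed directly from samples. The following is a minimal standard-library sketch of Eq. 33 (using empirical CDFs) and Eq. 35; it is not the SDMetrics implementation, and the function names and toy inputs are chosen only for illustration.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synth):
    """Two-sample KS statistic: sup_x |F_r(x) - F_s(x)| over empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    gap = 0.0
    # Empirical CDFs are step functions, so the supremum is reached at a sample point.
    for x in set(real_sorted) | set(synth_sorted):
        f_r = bisect_right(real_sorted, x) / len(real_sorted)
        f_s = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(f_r - f_s))
    return gap


def total_variation_distance(real, synth):
    """Half the sum of absolute differences between category frequencies."""
    r, s = Counter(real), Counter(synth)
    categories = set(r) | set(s)
    return 0.5 * sum(abs(r[c] / len(real) - s[c] / len(synth)) for c in categories)


print(ks_statistic([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(ks_statistic([1.0, 2.0], [3.0, 4.0]))
print(total_variation_distance(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Identical samples give a KS statistic of 0, fully separated samples give 1, and both metrics lie in [0, 1] with smaller values indicating a closer match.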

E.3.2 Pair-wise Column Correlation

Pearson Correlation Coefficient: The Pearson correlation coefficient measures whether two continuous distributions are linearly correlated and is computed as:

$$\rho_{x,y} = \frac{\text{Cov}(x, y)}{\sigma_x \sigma_y}, \tag{36}$$

where $x$ and $y$ are two continuous columns, Cov is the covariance, and $\sigma$ is the standard deviation.

Then, the performance of correlation estimation is measured by the average difference between the real data’s correlations and the synthetic data’s correlations:

$$\text{Pearson Score} = \frac{1}{2}\, \mathbb{E}_{x,y} \left| \rho^R(x, y) - \rho^S(x, y) \right|, \tag{37}$$

where $\rho^R(x, y)$ and $\rho^S(x, y)$ denote the Pearson correlation coefficient between column $x$ and column $y$ of the real data and synthetic data, respectively. As $\rho \in [-1, 1]$, the average score is divided by 2 to ensure that it falls in the range of $[0, 1]$; the smaller the score, the better the estimation.

Contingency similarity: For a pair of categorical columns $A$ and $B$, the contingency similarity score computes the difference between the contingency tables using the Total Variation Distance. The process is summarized by the formula below:

$$\text{Contingency Score} = \frac{1}{2} \sum_{\alpha \in A} \sum_{\beta \in B} \left| R_{\alpha,\beta} - S_{\alpha,\beta} \right|, \tag{38}$$

where $\alpha$ and $\beta$ range over all the possible categories in column $A$ and column $B$, respectively. $R_{\alpha,\beta}$ and $S_{\alpha,\beta}$ are the joint frequencies of $\alpha$ and $\beta$ in the real data and synthetic data, respectively.
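
The two pair-wise metrics can be sketched for a single pair of columns as follows. This is an illustrative standard-library version of Eq. 36 to Eq. 38, not the SDMetrics implementation; the function names and toy inputs are assumptions, and the full Pearson Score would average `pearson_score` over all pairs of numerical columns.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)  # the 1/n factors cancel


def pearson_score(real_x, real_y, synth_x, synth_y):
    """Half the absolute gap between real and synthetic correlations (one pair)."""
    return 0.5 * abs(pearson(real_x, real_y) - pearson(synth_x, synth_y))


def contingency_score(real_pairs, synth_pairs):
    """Total variation distance between two joint category frequency tables."""
    r, s = Counter(real_pairs), Counter(synth_pairs)
    cells = set(r) | set(s)
    return 0.5 * sum(abs(r[c] / len(real_pairs) - s[c] / len(synth_pairs)) for c in cells)


# Real columns are perfectly correlated, synthetic ones perfectly anti-correlated.
print(pearson_score([1, 2, 3], [2, 4, 6], [1, 2, 3], [6, 4, 2]))
# Joint categories that never overlap between real and synthetic data.
print(contingency_score([("a", "x"), ("b", "y")], [("a", "y"), ("b", "x")]))
```

Both toy cases are worst-case mismatches, so each score sits at the upper end of its [0, 1] range.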

E.4 Details of Machine Learning Efficiency Experiments

As preliminarily illustrated in Sec. 4.3.1 and Appendix E.1, each dataset is first split into the real training and testing set. The generative models are learned on the real training set. After the models are learned, a synthetic set of equivalent size is sampled.

The performance of synthetic data on MLE tasks is evaluated based on the divergence of test scores using the real and synthetic training data. Therefore, we first train the machine learning model on the real training set, split into training and validation sets with an 8:1 ratio. The classifier/regressor is trained on the training set, and the optimal hyperparameter setting is selected according to the performance on the validation set. After the optimal hyperparameter setting is obtained, the corresponding classifier/regressor is retrained on the training set and evaluated on the real testing set. We create 20 random splits for training and validation sets, and the performance reported in Table 3 is the mean and standard deviation of the AUC/RMSE score over the 20 random trials. The performance of synthetic data is obtained in the same way.

Below is the hyperparameter search space of the XGBoost Classifier/Regressor used in MLE tasks, and we select the optimal hyperparameters via grid search.

  • Number of estimators: [10, 50, 100]

  • Minimum child weight: [5, 10, 20].

  • Maximum tree depth: [1,10].

  • Gamma: [0.0, 1.0].
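
The grid implied by the search space above can be enumerated with `itertools.product`. The sketch assumes that each bracketed list gives the explicit candidate values (so, for example, the maximum tree depth is either 1 or 10) and maps the four items to the XGBoost parameter names `n_estimators`, `min_child_weight`, `max_depth`, and `gamma`; the helper name `enumerate_grid` is hypothetical.

```python
from itertools import product

# Candidate values copied from the list above.
search_space = {
    "n_estimators": [10, 50, 100],
    "min_child_weight": [5, 10, 20],
    "max_depth": [1, 10],
    "gamma": [0.0, 1.0],
}


def enumerate_grid(space):
    """Return one dict per combination of candidate values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


grid = enumerate_grid(search_space)
print(len(grid))
print(grid[0])
```

Under this reading the grid contains 3 × 3 × 2 × 2 combinations, each of which would be trained on the training split and scored on the validation split.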

We use the implementations of these metrics from SDMetrics (https://docs.sdv.dev/sdmetrics).

Appendix F Additional Experimental Results

In this section, we compare the training and sampling time of different methods, taking the Adult dataset as an example.

F.1 Training / Sampling Time

Table 8: Comparison of training and sampling time of different methods, on Adult dataset. TabSyn’s training time is the summation of VAE’s and Diffusion’s training time.
Method Training Sampling
CTGAN 1029.8s 0.8621s
TVAE 352.6s 0.5118s
GOGGLE 1h 34min 5.342s
GReaT 2h 27min 2min 19s
STaSy 2283s 8.941s
CoDi 2h 56min 4.616s
TabDDPM 1031s 28.92s
TabSyn 1758s + 664s 1.784s

As shown in Table 8, despite having an additional VAE training process, the proposed TabSyn has a training time similar to that of most of the baseline methods. With regard to the sampling time, TabSyn requires merely 1.784s to generate synthetic data of the same size as Adult’s training data, which is close to the one-step sampling methods CTGAN and TVAE. Other diffusion-based methods take much longer to sample. For example, the most competitive method, TabDDPM (Kotelnikov et al., 2023), takes 28.92s for sampling. The proposed TabSyn reduces the sampling time by 93% and achieves even better synthesis quality.
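
For reference, the quoted figures follow from the entries of Table 8 by simple arithmetic; the block below only restates that computation and introduces no new measurements.

```latex
\text{TabSyn total training time} = 1758\,\text{s} + 664\,\text{s} = 2422\,\text{s},
\\[4pt]
\text{relative sampling-time reduction vs. TabDDPM}
  = \frac{28.92\,\text{s} - 1.784\,\text{s}}{28.92\,\text{s}}
  \approx 0.938 .
```

The second value corresponds to the reduction of roughly 93% reported in the text.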

F.2 Sample-wise Quality Score of Synthetic Data (α-Precision and β-Recall)

Experiments in Sec. 4 have evaluated the performance of synthetic data generated from different models using low-order statistics, including the column-wise density estimation (Table 1) and pair-wise column correlation estimation (Table 2). However, these results are insufficient to evaluate synthetic data’s overall density estimation performance, as the generative model may simply learn to estimate the density of each single column individually rather than the joint probability of all columns. Furthermore, the MLE tasks also cannot reflect the overall density estimation performance since some unimportant columns might be overlooked. Therefore, in this section, we adopt higher-order metrics that focus more on the entire data distribution, i.e., the joint distribution of all the columns.

Following Liu et al. (2023b) and Alaa et al. (2022), we adopt the α-Precision and β-Recall proposed in Alaa et al. (2022), two sample-level metrics that quantify how faithful the synthetic data is. In general, α-Precision evaluates the fidelity of synthetic data, i.e., whether each synthetic example comes from the real-data distribution, while β-Recall evaluates the coverage of the synthetic data, i.e., whether the synthetic data can cover the entire distribution of the real data (in other words, whether a real data sample is close to the synthetic data). Liu et al. (2023b) also adopt a third metric, Authenticity, which measures whether a synthetic sample is generated randomly or simply copied from the real data. However, we found that the Authenticity score and β-Recall exhibit a predominantly negative correlation: their sum is nearly constant, and an improvement in β-Recall is typically accompanied by a decrease in the Authenticity score (we believe this is the reason for the relatively small differences in quality scores among the various models in GOGGLE (Liu et al., 2023b)). Therefore, we believe that β-Recall and Authenticity are not suitable for simultaneous use.

In Table 9 and Table 10 we report the scores of α-Precision and β-Recall, respectively. TabSyn obtains the best α-Precision score on five of the six datasets and is within 0.2 points of the best method (CTGAN) on News, giving it the highest average score and demonstrating its superior capacity to generate synthetic data that is close to the real data. In Table 10, we observe that TabSyn consistently achieves high β-Recall scores across all six datasets. Although some baseline methods obtain higher β-Recall scores on specific datasets, it can hardly be concluded that the synthetic data generated by these methods is of better quality, since 1) their synthetic data has poor α-Precision scores (e.g., GReaT on Adult, and STaSy on Magic), indicating that the synthetic data is far from the real data’s manifold; 2) they fail to demonstrate stably competitive performance on other datasets (e.g., GReaT has high β-Recall scores on Adult but poor scores on Magic). We believe that to assess the quality of generation, the first consideration is whether the generated data is sufficiently authentic (α-Precision), and the second is whether the generated data can cover all the modes of the real dataset (β-Recall). According to this criterion, the quality of data generated by TabSyn is the highest. It not only possesses the highest average fidelity score but also consistently demonstrates high coverage on every dataset.

Table 9: Comparison of α-Precision scores. Bold Face represents the best score on each dataset. Higher values indicate superior results. TabSyn achieves the best average score and ranks first among all methods.
Methods Adult Default Shoppers Magic Beijing News Average Ranking
CTGAN 77.74±0.15 62.08±0.08 76.97±0.39 86.90±0.22 96.27±0.14 96.96±0.17 82.82 5
TVAE 98.17±0.17 85.57±0.34 58.19±0.26 86.19±0.48 97.20±0.10 86.41±0.17 85.29 4
GOGGLE 50.68 68.89 86.95 90.88 88.81 86.41 78.77 8
GReaT 55.79±0.03 85.90±0.17 78.88±0.13 85.46±0.54 98.32±0.22 - 80.87 6
STaSy 82.87±0.26 90.48±0.11 89.65±0.25 86.56±0.19 89.16±0.12 94.76±0.33 88.91 2
CoDi 77.58±0.45 82.38±0.15 94.95±0.35 85.01±0.36 98.13±0.38 87.15±0.12 87.03 3
TabDDPM 96.36±0.20 97.59±0.36 88.55±0.68 98.59±0.17 97.93±0.30 0.00±0.00 79.83 7
TabSyn 99.52±0.10 99.26±0.27 99.16±0.22 99.38±0.27 98.47±0.10 96.80±0.25 98.67 1
Table 10: Comparison of β-Recall scores. Bold Face represents the best score on each dataset. Higher values indicate superior results. TabSyn gives consistently high β-recall scores, indicating that the synthetic data covers a wide range of the real distribution. Although some baseline methods obtain higher scores on specific datasets, they fail to demonstrate stably competitive performance on other datasets.
Methods Adult Default Shoppers Magic Beijing News Average Ranking
CTGAN 30.80±0.20 18.22±0.17 31.80±0.35 11.75±0.20 34.80±0.10 24.97±0.29 25.39 7
TVAE 38.87±0.31 23.13±0.11 19.78±0.10 32.44±0.35 28.45±0.08 29.66±0.21 28.72 6
GOGGLE 8.80 14.38 9.79 9.88 19.87 2.03 10.79 8
GReaT 49.12±0.18 42.04±0.19 44.90±0.17 34.91±0.28 43.34±0.31 - 43.34 2
STaSy 29.21±0.34 39.31±0.39 37.24±0.45 53.97±0.57 54.79±0.18 39.42±0.32 42.32 3
CoDi 9.20±0.15 19.94±0.22 20.82±0.23 50.56±0.31 52.19±0.12 34.40±0.31 31.19 5
TabDDPM 47.05±0.25 47.83±0.35 47.79±0.25 48.46±0.42 56.92±0.13 0.00±0.00 41.34 4
TabSyn 47.56±0.22 48.00±0.35 48.95±0.28 48.03±0.23 55.84±0.19 45.04±0.34 48.90 1

F.3 Detection: Classifier Two Sample Tests (C2ST)

We further study how difficult it is to tell apart the real data from the synthetic data, thereby evaluating whether the synthetic data can recover the real data distribution. To this end, we employ the detection metric provided by sdmetrics (https://docs.sdv.dev/sdmetrics/metrics/metrics-in-beta/detection-single-table). In Table 11, we present the results obtained using logistic regression as the detection method.

Table 11: Detection score (C2ST) using logistic regression classifier. Higher scores indicate better performance.
Method Adult Default Shoppers Magic Beijing News
SMOTE 0.9710 0.9274 0.9086 0.9961 0.9888 0.9344
CTGAN 0.5949 0.4875 0.7488 0.6728 0.7531 0.6947
TVAE 0.6315 0.6547 0.2962 0.7706 0.8659 0.4076
GOGGLE 0.1114 0.5163 0.1418 0.9526 0.4779 0.0745
GReaT 0.5376 0.4710 0.4285 0.4326 0.6893 -
STaSy 0.4054 0.6814 0.5482 0.6939 0.7922 0.5287
CoDi 0.2077 0.4595 0.2784 0.7206 0.7177 0.0201
TabDDPM 0.9755 0.9712 0.8349 0.9998 0.9513 0.0002
TabSyn 0.9986 0.9870 0.9740 0.9732 0.9603 0.9749

As indicated in Table 11, the detection score exhibits superior discriminative power compared to other metrics, such as single-column density estimation, pair-wise column correlation estimation, and MLE: it varies significantly across the different models for synthetic data generation. The proposed TabSyn consistently achieves notably high scores across all datasets. SMOTE (Chawla et al., 2002) directly interpolates within the training set, so it is not surprising that it also achieves high scores on the detection metric.
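To make the metric concrete, the sketch below shows how a detection score of this kind can be derived from a real-vs-synthetic classifier. It assumes, to the best of our understanding of the sdmetrics documentation, that the classifier's ROC AUC is mapped to a score via 1 − (max(AUC, 0.5) · 2 − 1); the helper names and the toy classifier outputs are hypothetical, and the logistic regression training itself is omitted.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison (Mann-Whitney U); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def detection_score(auc):
    """Map AUC to [0, 1]; 1 means real and synthetic rows are indistinguishable.

    Assumed mapping (not taken from the paper): 1 - (max(AUC, 0.5) * 2 - 1).
    """
    return 1.0 - (max(auc, 0.5) * 2.0 - 1.0)


# Hypothetical classifier outputs: label 1 = real row, label 0 = synthetic row.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.5, 0.3, 0.7]
auc = roc_auc(labels, scores)
score = detection_score(auc)
print(f"AUC = {auc:.3f}, detection score = {score:.3f}")
```

Under this mapping, a classifier that separates the two sets perfectly (AUC = 1) yields a score of 0, and a classifier at chance level (AUC = 0.5) yields a score of 1.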

F.4 Missing Value Imputation

Adapting TabSyn for missing value imputation

An essential advantage of diffusion models is that an unconditional model can be directly used for missing data imputation tasks (e.g., image inpainting (Song et al., 2021b; Lugmayr et al., 2022) and missing value imputation) without retraining. Following the inpainting method proposed in Lugmayr et al. (2022), we apply the proposed TabSyn to missing value imputation tasks.

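For a row of tabular data $\bm{z}_i = [\bm{z}_i^{\rm num}, \bm{z}_{i,1}^{\rm oh}, \cdots, \bm{z}_{i,M_{\rm cat}}^{\rm oh}]$, with $\bm{z}_i^{\rm num} \in \mathbb{R}^{1 \times M_{\rm num}}$ and $\bm{z}_{i,j}^{\rm oh} \in \mathbb{R}^{1 \times C_j}$, let $m_{\rm num}$ denote the index set of masked numerical columns and $m_{\rm cat}$ the index set of masked categorical columns. We first preprocess the masked columns as follows:

  • The value of a masked numerical column is set to the average value of this column, i.e., $\bm{z}_{i,j}^{\rm num} \Leftarrow {\rm mean}(\bm{z}_{:,j}^{\rm num}), \; \forall j \in m_{\rm num}$.

  • A masked categorical column is set to the uniform vector $\bm{z}_{i,j}^{\rm oh} \Leftarrow [\frac{1}{C_j}, \cdots, \frac{1}{C_j}, \frac{1}{C_j}] \in \mathbb{R}^{1 \times C_j}, \; \forall j \in m_{\rm cat}$.

The updated $\bm{z}_i$ (we omit the subscript in the remaining parts) is fed to TabSyn’s frozen VAE encoder to obtain the embedding $\bm{z} \in \mathbb{R}^{1 \times Md}$. As TabSyn’s VAE employs the Transformer architecture, there is a deterministic mapping from the masked indexes in the data space, $m_{\rm num}$ and $m_{\rm cat}$, to the masked indexes in the latent space. For example, the first numerical column is mapped to dimensions 1 to $d$ of $\bm{z}$. Therefore, we can create a masking vector $\bm{m}$ indicating whether each dimension is masked. Then, the known and unknown parts of $\bm{z}$ can be expressed as $\bm{m} \odot \bm{z}$ and $(1 - \bm{m}) \odot \bm{z}$, respectively.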
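The preprocessing and the construction of the masking vector can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the released implementation: the function names are hypothetical, column indices are 0-based (the text uses 1-based indices), and we assume that the flattened latent stores the numerical columns first, followed by the categorical columns, with $d$ contiguous dimensions per column.

```python
def fill_masked_row(row_num, row_oh, col_means, masked_num, masked_cat):
    """Set masked numerical cells to the column mean and masked
    categorical cells to the uniform vector [1/C_j, ..., 1/C_j]."""
    num = [col_means[j] if j in masked_num else v for j, v in enumerate(row_num)]
    oh = [[1.0 / len(v)] * len(v) if j in masked_cat else list(v)
          for j, v in enumerate(row_oh)]
    return num, oh


def latent_mask(n_num, n_cat, d, masked_num, masked_cat):
    """Mask over the flattened latent of length (n_num + n_cat) * d.
    An entry is 0 if its column is masked (unknown) and 1 if it is known."""
    mask = []
    for col in range(n_num + n_cat):
        if col < n_num:
            is_masked = col in masked_num
        else:
            is_masked = (col - n_num) in masked_cat
        mask.extend([0 if is_masked else 1] * d)
    return mask


# Toy row: 2 numerical columns, 1 categorical column with 3 categories, d = 4.
num, oh = fill_masked_row([3.0, 7.0], [[0, 1, 0]], col_means=[2.5, 5.0],
                          masked_num={1}, masked_cat={0})
mask = latent_mask(n_num=2, n_cat=1, d=4, masked_num={1}, masked_cat={0})
```

With this convention, multiplying the latent elementwise by `mask` keeps the known part, and multiplying by `1 - mask` keeps the part to be imputed.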

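Following Lugmayr et al. (2022), the reverse step is modified as a mixture of the known part’s forwarding and the unknown part’s denoising:

$$
\begin{aligned}
\bm{z}_{t_{i-1}}^{\rm known} &= \bm{z} + \sigma(t_{i-1})\bm{\varepsilon}, \quad \bm{\varepsilon} \sim \mathcal{N}(\bm{0}, \bm{I}), \\
\bm{z}_{t_{i-1}}^{\rm unknown} &= \bm{z}_{t_i} + \int_{t_i}^{t_{i-1}} {\rm d}\bm{z}_{t_i}, \\
\bm{z}_{t_{i-1}} &= \bm{m} \odot \bm{z}_{t_{i-1}}^{\rm known} + (1 - \bm{m}) \odot \bm{z}_{t_{i-1}}^{\rm unknown}, \\
\text{where} \quad {\rm d}\bm{z}_t &= -\dot{\sigma}(t)\sigma(t)\nabla_{\bm{z}_t}\log p(\bm{z}_t)\,{\rm d}t + \sqrt{\dot{\sigma}(t)\sigma(t)}\,{\rm d}\bm{\omega}_t.
\end{aligned} \tag{39}
$$

The reverse imputation from time $t_i$ to $t_{i-1}$ also involves resampling in order to reduce the error introduced by each step (Lugmayr et al., 2022). Resampling means that Eq. 39 is repeated for $U$ steps from $\bm{z}_{t_i}$ to $\bm{z}_{t_{i-1}}$. After completing the reverse steps, we obtain the imputed latent vector $\bm{z}_0$, which is passed to TabSyn’s VAE decoder to recover the original input data.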
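The masked reverse process with resampling can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `denoise_step` is a hypothetical callable standing in for one reverse step of the trained diffusion model, the noise schedule is passed in as a list of noise levels σ(t_0) < … < σ(t_max), and latents are plain Python lists.

```python
import math
import random


def impute_latent(z_clean, mask, sigmas, denoise_step, n_resample=3, seed=0):
    """RePaint-style imputation in latent space.

    z_clean      : encoder output (masked entries hold placeholder values)
    mask         : 1 for known dimensions, 0 for dimensions to impute
    sigmas       : increasing noise levels [sigma(t_0), ..., sigma(t_max)]
    denoise_step : hypothetical callable (z_t, sigma_from, sigma_to) -> list
    """
    rng = random.Random(seed)

    def noisy(values, scale):
        # Add independent Gaussian noise of the given scale to every entry.
        return [v + scale * rng.gauss(0.0, 1.0) for v in values]

    # Start from the clean latent perturbed at the largest noise level.
    z_t = noisy(z_clean, sigmas[-1])
    for i in range(len(sigmas) - 1, 0, -1):
        s_cur, s_prev = sigmas[i], sigmas[i - 1]
        for u in range(1, n_resample + 1):
            # Known part: forward-diffuse the clean latent to level t_{i-1}.
            known = noisy(z_clean, s_prev)
            # Unknown part: one reverse step of the model from t_i to t_{i-1}.
            unknown = denoise_step(z_t, s_cur, s_prev)
            z_prev = [m * k + (1 - m) * q
                      for m, k, q in zip(mask, known, unknown)]
            if u < n_resample and i > 1:
                # Resampling: jump back to t_i by adding the missing noise.
                z_t = noisy(z_prev, math.sqrt(s_cur ** 2 - s_prev ** 2))
        z_t = z_prev
    return z_t


# Toy stand-in for the denoiser (NOT a trained model): always predicts 7.0.
toy_denoiser = lambda z_t, s_from, s_to: [7.0 for _ in z_t]
z0 = impute_latent([1.5, 0.0], [1, 0], [0.0, 0.5, 1.0, 2.0], toy_denoiser)
```

Because the last noise level in this toy schedule is zero, the known dimension of `z0` equals the clean latent exactly, while the masked dimension is whatever the denoiser produced.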

The algorithm for missing value imputation is presented in Algorithm 4.

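Algorithm 4 TabSyn for Missing Value Imputation

1. VAE encoding
    $\bm{z} = [\bm{z}^{\rm num}, \bm{z}_1^{\rm oh}, \cdots, \bm{z}_{M_{\rm cat}}^{\rm oh}]$ is a data sample having missing values.
    $m_{\rm num}$ denotes the missing numerical columns.
    $m_{\rm cat}$ denotes the missing categorical columns.
    $\bm{z}_{i,j}^{\rm num} \Leftarrow {\rm mean}(\bm{z}_{:,j}^{\rm num}), \; \forall j \in m_{\rm num}$.
    $\bm{z}_{i,j}^{\rm oh} \Leftarrow [\frac{1}{C_j}, \cdots, \frac{1}{C_j}, \frac{1}{C_j}] \in \mathbb{R}^{1 \times C_j}, \; \forall j \in m_{\rm cat}$.
    $\bm{z} = {\rm Flatten}({\rm Encoder}(\bm{z})) \in \mathbb{R}^{1 \times Md}$

2. Obtaining the masking vector
    $\bm{m} \in \mathbb{R}^{1 \times Md}$
    for $j = 1, \dots, Md$ do
        if $\lceil j/d \rceil \in m_{\rm num}$ or $(\lceil j/d \rceil - M_{\rm num}) \in m_{\rm cat}$ then
            $\bm{m}_j = 0$
        else
            $\bm{m}_j = 1$
        end if
    end for

3. Missing Value Imputation via Denoising
    $\bm{z}_{t_{\max}} = \bm{z} + \sigma(t_{\max})\bm{\varepsilon}, \; \bm{\varepsilon} \sim \mathcal{N}(\bm{0}, \bm{I})$
    for $i = \max, \dots, 1$ do
        for $u = 1, \dots, U$ do
            $\bm{z}_{t_{i-1}}^{\rm known} = \bm{z} + \sigma(t_{i-1})\bm{\varepsilon}, \; \bm{\varepsilon} \sim \mathcal{N}(\bm{0}, \bm{I})$
            $\bm{z}_{t_{i-1}}^{\rm unknown} = \bm{z}_{t_i} + \int_{t_i}^{t_{i-1}} {\rm d}\bm{z}_{t_i}$   (${\rm d}\bm{z}$ is defined in Eq. 6)
            $\bm{z}_{t_{i-1}} = \bm{m} \odot \bm{z}_{t_{i-1}}^{\rm known} + (1 - \bm{m}) \odot \bm{z}_{t_{i-1}}^{\rm unknown}$
            if $u < U$ and $i > 1$ then
                # Resampling
                $\bm{z}_{t_i} = \bm{z}_{t_{i-1}} + \sqrt{\sigma^2(t_i) - \sigma^2(t_{i-1})}\,\bm{\varepsilon}, \; \bm{\varepsilon} \sim \mathcal{N}(\bm{0}, \bm{I})$
            end if
        end for
    end for
    $\bm{z}_0 = \bm{z}_{t_0}$

4. VAE decoding
    $\hat{\bm{z}} = {\rm Decoder}({\rm Unflatten}(\bm{z}_0))$
    $\hat{\bm{z}}$ is the imputed data
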
Classification/Regression as missing value imputation.

With Algorithm 4, we are able to use TabSyn for imputation with any missing columns. In this section, we consider a more interesting application – treating classification/regression as missing value imputation tasks directly. As illustrated in Section E.1, each dataset is assigned a machine learning task, either a classification or regression on the target column in the dataset. Therefore, we can mask the values of the target columns, then apply TabSyn to impute the masked values, completing the classification or regression tasks.

In Table 12, we present TabSyn’s performance in missing value imputation tasks on the target column of each dataset. The performance is compared with directly training a classifier/regressor that uses the remaining columns to predict the target column (the ‘Real’ row in the Machine Learning Efficiency tasks, Table 3). Surprisingly, imputing with TabSyn shows highly competitive results. On four of the six datasets, TabSyn outperforms training a discriminative ML classifier/regressor on real data. We suspect that the reason for this phenomenon may be that discriminative ML models are more prone to overfitting on the training set. In contrast, by learning the smooth distribution of the entire data, generative models significantly reduce the risk of overfitting. The excellent results on the missing value imputation task further highlight the importance of our proposed TabSyn for real-world applications.

Our TabSyn is not trained conditionally on other columns for the missing value imputation tasks, and the performance can be further improved by training a separate conditional model specifically for this task. We leave it for future work.

Table 12: Performance of TabSyn in Missing Value Imputation tasks, compared with training an XGBoost classifier/regressor using the real data.
Methods Adult Default Shoppers Magic Beijing News
AUC AUC AUC AUC RMSE RMSE
Real with XGBoost 92.7 77.0 92.6 94.6 0.423 0.842
Impute with TabSyn 93.2 87.2 96.6 88.8 0.258 1.253

F.5 Impacts of the quality of VAEs

Intuitively, the performance of TabSyn appears to be highly dependent on the quality of the pre-trained VAE model. Therefore, we conduct further research to investigate how a less trained VAE model would impact the quality of synthetic data generated by TabSyn. To this end, we investigate the quality of synthetic data generated by TabSyn using the embeddings of the VAE obtained at different epochs as the latent space. Figure 9 plots the results of single-column density estimation and pair-wise column correlation estimation on the Adult and Default datasets, with intervals set at 400 epochs. We can observe that increasing the training epochs of the VAE does indeed improve the quality of TabSyn’s generated data. Additionally, even when the VAE is sub-optimal (e.g., training epochs around 2000), TabSyn’s performance is already very close to the optimal ones. Furthermore, even with a relatively low number of VAE training epochs (e.g., 800-1200), TabSyn’s performance approaches or even surpasses the most competitive baseline, TabDDPM. Based on this observation, we recommend thoroughly training the VAE to achieve superior data generation quality when resources are abundant. However, when resources are limited, reducing the VAE training duration still yields decent performance.

Refer to caption
Figure 9: Performance of TabSyn with VAEs trained for different numbers of epochs. By default, TabSyn’s VAE is trained for 4000 epochs.

F.6 Privacy Protection: Distance to Closest Record (DCR)

To study whether the synthetic data causes privacy leakage, we compute the DCRs of the synthetic data. Specifically, we follow the ‘synthetic vs. holdout’ setting (https://www.clearbox.ai/blog/2022-06-07-synthetic-data-for-privacy-preservation-part-2). We first divide the dataset into two equal parts: the first part serves as the training set for our generative model, while the second part is the holdout set, which is not used for training. After completing model training, we sample a synthetic set of the same size as the training set (and the holdout set).

We then calculate the DCR score of each sample in the synthetic set with respect to both the training set and the holdout set, and plot histograms to compare the two resulting distributions. Intuitively, if there is a privacy issue (e.g., if the synthetic set is directly copied from the training set), then the DCR scores with respect to the training set should be closer to 0 than those with respect to the holdout set. Conversely, if there is no privacy issue, the distributions of DCR scores with respect to the training and holdout sets should largely overlap. In Figure 10, we plot the distributions of the synthetic sets’ DCRs with respect to the training set and the holdout set on Default and Shoppers. We can observe that deep generative models such as CoDi, STaSy, TabDDPM, and TabSyn do not suffer from privacy issues, while the interpolation-based method SMOTE might not be able to protect private information.

Refer to caption
Figure 10: Distributions of the DCR scores between the synthetic dataset and the training/holdout datasets. Deep generative models have similar DCR distributions concerning the training set and holdout set, while in the interpolation-based method SMOTE, DCRs concerning the training set are much smaller than DCRs concerning the holdout set.

Additionally, we calculate the probability that a synthetic sample is closer to the training set than to the holdout set. When this probability is close to 50% (i.e., 0.5), the distances between synthetic and training instances are distributed similarly to (or at least are not systematically smaller than) the distances between synthetic and holdout instances, which is a positive indicator in terms of privacy risk. Table 13 displays the results obtained by different models on the Default and Shoppers datasets.
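The two quantities can be sketched as follows. This is a minimal illustration, not the evaluation script used for Table 13: it assumes rows are already numerically encoded, uses the L1 distance (the choice of distance is our assumption), counts ties as "not closer to training", and the function names and toy data are hypothetical.

```python
def dcr(sample, reference):
    """Distance from one row to its closest record in `reference` (L1 distance)."""
    return min(sum(abs(a - b) for a, b in zip(sample, row)) for row in reference)


def closer_to_train_rate(synthetic, train, holdout):
    """Fraction of synthetic rows whose closest training record is nearer
    than their closest holdout record; values near 0.5 are desirable."""
    closer = sum(1 for s in synthetic if dcr(s, train) < dcr(s, holdout))
    return closer / len(synthetic)


# Toy data: two training rows, two holdout rows, four synthetic rows.
train = [[0.0, 0.0], [1.0, 1.0]]
holdout = [[0.5, 0.5], [2.0, 2.0]]
synthetic = [[0.0, 0.1], [2.0, 1.9], [1.0, 1.0], [0.5, 0.6]]
rate = closer_to_train_rate(synthetic, train, holdout)
print(f"closer to training set: {rate:.0%}")
```

A generator that copies or interpolates training rows would push this rate toward 100%, which is the pattern reported for SMOTE.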

Table 13: DCR score: the probability that a synthetic example is closer to the training set than to the holdout set (%; a score closer to 50% is better).
Method Default Shoppers
SMOTE 91.41%±3.42 96.40%±4.70
STaSy 50.23%±0.09 51.53%±0.16
CoDi 51.82%±0.26 51.06%±0.18
TabDDPM 52.15%±0.20 63.23%±0.25
TabSyn 51.20%±0.18 52.90%±0.22

Appendix G Details for Reproduction

In this section, we introduce the details of TabSyn, such as the data preprocessing, training, and hyperparameter details. We also present details of our reproduction for the baseline methods.

G.1 Details of implementing TabSyn

Data Preprocessing.

Real-world tabular data often contain missing values, and each column’s data may have a distinct scale. Therefore, we need to preprocess the data. Following TabDDPM (Kotelnikov et al., 2023), missing cells are filled with the column’s average for numerical columns. For categorical columns, missing cells are treated as one additional category. Then, each numerical column is transformed using the QuantileTransformer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html) and each categorical column using the OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) from scikit-learn.
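The missing-value handling described above can be illustrated with a small standard-library sketch. It is not the released preprocessing code: the helper names and the `"<missing>"` token are hypothetical, missing cells are represented by `None`, and the subsequent quantile transformation of numerical columns (done with scikit-learn in the paper) is deliberately left out.

```python
def fill_numerical(column):
    """Replace missing cells (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def one_hot_with_missing(column, missing_token="<missing>"):
    """One-hot encode a categorical column, treating None as its own category."""
    filled = [missing_token if v is None else v for v in column]
    categories = sorted(set(filled))
    index = {c: k for k, c in enumerate(categories)}
    encoded = [[1 if index[v] == k else 0 for k in range(len(categories))]
               for v in filled]
    return encoded, categories


ages = fill_numerical([1.0, None, 3.0])
encoded, categories = one_hot_with_missing(["a", None, "b", "a"])
```

Treating a missing category as an extra class keeps the one-hot width fixed per column, so the same encoder can be reused at sampling time.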

Hyperparameter settings.

TabSyn uses the same set of hyperparameters for all datasets (except βmax for Shoppers). The detailed architectures of the VAE and the diffusion model of TabSyn are presented in Appendix D.1 and Appendix D.2, respectively. Below are the detailed hyperparameter settings.

Hyperparameters of VAE:

  • Token dimension $d = 4$,

  • Number of layers of the VAE’s encoder/decoder: 2,

  • Hidden dimension of the Transformer’s FFN: $4 \times 32 = 128$,

  • $\beta_{\max} = 0.01$ ($\beta_{\max} = 0.001$ for Shoppers),

  • $\beta_{\min} = 10^{-5}$,

  • $\lambda = 0.7$.

Hyperparameters of Diffusion:

  • MLP’s hidden dimension $d_{\rm hidden} = 1024$.
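For reference, the settings listed above can be collected in a small configuration object. The class and field names below are hypothetical and do not correspond to the released code; only the values are taken from the lists above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TabSynConfig:
    # VAE hyperparameters
    token_dim: int = 4          # d
    num_layers: int = 2         # depth of the VAE encoder and decoder
    ffn_factor: int = 32        # FFN hidden dim = token_dim * ffn_factor
    beta_max: float = 1e-2
    beta_min: float = 1e-5
    lam: float = 0.7            # lambda
    # Diffusion hyperparameters
    mlp_hidden_dim: int = 1024  # d_hidden

    @property
    def ffn_dim(self) -> int:
        """Hidden dimension of the Transformer's FFN (4 * 32 = 128 by default)."""
        return self.token_dim * self.ffn_factor


default_cfg = TabSynConfig()
# Shoppers is the only dataset with a different beta_max.
shoppers_cfg = replace(default_cfg, beta_max=1e-3)
```

Keeping the per-dataset exception as an explicit override makes it clear that everything else is shared across datasets.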

Unlike the cumbersome hyperparameter search process in some current methods (Kotelnikov et al., 2023; Kim et al., 2023; Lee et al., 2023) to obtain the optimal hyperparameters, TabSyn consistently generates high-quality data without the need for meticulous hyperparameter tuning.

G.2 Details of implementing baselines

Since different methods have adopted distinct neural network architectures, it is inappropriate to compare the performance of different methods using identical networks. For fair comparison, we adjust the hidden dimensions of different methods, ensuring that the numbers of trainable parameters are close (around 10 million). Note that enlarging the model size does lead to better performance for the baseline methods. Under these conditions, we reproduced the baseline methods based on the official codes, and our reproduced codes are provided in the supplementary. Below are the detailed implementations of the baselines.

CTGAN and TVAE (Xu et al., 2019): For CTGAN, we follow the implementation in the official code (https://github.com/sdv-dev/CTGAN), where the hyperparameters are already specified. Since the default discriminator/generator MLPs are small, we enlarge them to be the same size as TabSyn for fair comparison. The interface for TVAE is not provided, so we simply use the default hyperparameters defined in the TVAE module. The sizes of TVAE’s encoder/decoder are enlarged as well.

GOGGLE (Liu et al., 2023b): We follow the official implementation (https://github.com/vanderschaarlab/GOGGLE). In GOGGLE’s official implementation, each node is a column, and the node feature is the 1-dimensional numerical value of this column. However, GOGGLE does not explain how to handle categorical columns (https://github.com/tennisonliu/GOGGLE/issues/2). To extend GOGGLE to mixed-type tabular data, we first transform each categorical column into its $C$-dimensional one-hot encoding. Then, each single dimension of 0/1 binary values becomes a graph node. Consequently, for mixed-type tabular data with $M_{\rm num}$ numerical columns and $M_{\rm cat}$ categorical columns, where the $i$-th categorical column has $C_i$ categories, GOGGLE’s graph has $M_{\rm num} + \sum_i C_i$ nodes.
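The resulting graph size is easy to compute for a given schema. The helper below is a hypothetical illustration of the node count only, and the example schema is made up rather than taken from any of the benchmark datasets.

```python
def goggle_num_nodes(num_numerical, categories_per_column):
    """One node per numerical column plus one node per one-hot dimension
    of every categorical column, i.e. M_num + sum_i C_i."""
    return num_numerical + sum(categories_per_column)


# Hypothetical table: 3 numerical columns and categorical columns
# with 2, 5, and 4 categories.
n_nodes = goggle_num_nodes(3, [2, 5, 4])
print(n_nodes)
```

Since the node count grows with the total number of categories, tables with high-cardinality categorical columns lead to considerably larger graphs under this extension.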

GReaT: We follow the official implementation (https://github.com/kathrinse/be_great/tree/main). During our reproduction, we found that training GReaT is memory- and time-consuming (because it fine-tunes a large language model). Typically, the batch size is limited to 32 on the Adult dataset, and training for 200 epochs takes over 2 hours. In addition, since GReaT is text-based, the generated content is not guaranteed to follow the format of the given table. Therefore, additional post-processing has to be applied.

STaSy (Kim et al., 2023): In STaSy, the categorical columns are encoded with one-hot encoding and then fed into the continuous diffusion model together with the numerical columns. We follow the default hyperparameters given by the official code (https://github.com/JayoungKim408/STaSy/tree/main), except for the denoising function’s size, which is enlarged for a fair comparison.

CoDi (Lee et al., 2023): We follow the default hyperparameters given by the official code (https://github.com/ChaejeongLee/CoDi/tree/main). Similarly, the denoising U-Nets used by CoDi are also enlarged to ensure a similar number of model parameters.

TabDDPM (Kotelnikov et al., 2023): The official code of TabDDPM (https://github.com/yandex-research/tab-ddpm) is for conditional generation tasks, where the non-target columns are generated conditioned on the target column(s). We slightly modify the code so that it can be applied to unconditional generation.