Data Synthesizer: How I built a tabular data generator

Oct 21, 2025

3 min read

Build a lightweight web app that lets you upload a CSV, train a generative model (GaussianCopula / CTGAN / TVAE), and download a synthetic dataset, all running on CPU on Hugging Face Spaces.


In many real projects, it is not possible to share data due to privacy or compliance issues. Synthetic data allows:

  • Publishing demos without exposing sensitive information.
  • Augmenting training data and balancing rare classes.
  • Testing pipelines/ETL and dashboards without using the original dataset.

I wanted to build a simple, explainable, and portable tool for my portfolio: upload a CSV, learn its distribution, and generate a new CSV that preserves global patterns (distributions and correlations), without copying records.


How does it work?

  1. Mandatory upload of a CSV (no preloaded data).
  2. Detection of column types and construction of metadata.
  3. Training of a synthesizer (GaussianCopula / CTGAN / TVAE) with lightweight parameters.
  4. Generation of N synthetic rows and download (synthetic.csv).
  5. Quality evaluation (0–1) with sdmetrics if available.

Architecture and stack

  • Interface: Streamlit (tabs + cards + custom CSS styles)
  • Synthetic modeling: SDV single_table (with fallback to sdv.tabular for compatibility)
  • Metrics: sdmetrics.QualityReport
  • Optional ML: PyTorch (only for CTGAN/TVAE)
  • Manipulation: Pandas / NumPy
Browser ──▶ Streamlit UI
              │
              ├─▶ Upload CSV ──▶ preview + light cleaning
              │
              ├─▶ SDV.metadata (type detection)
              │     ├─▶ GaussianCopula Synthesizer (CPU, fast)
              │     ├─▶ CTGAN Synthesizer (GAN, requires torch; pac=1)
              │     └─▶ TVAE Synthesizer (VAE, requires torch)
              │
              ├─▶ sample(N) ──▶ synthetic.csv (download)
              └─▶ sdmetrics.QualityReport (score 0–1)
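The "preview + light cleaning" step plus the demo guardrails can be approximated with a small pandas helper. The function name and exact rules here are my illustration, not the app's actual code:

```python
import pandas as pd

MAX_COLUMNS = 25   # demo guardrail from the app
MAX_UPLOAD_MB = 5  # enforced in the UI and via server.maxUploadSize

def light_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Basic hygiene before training: enforce limits, drop useless data."""
    if df.shape[1] > MAX_COLUMNS:
        raise ValueError(f"Too many columns ({df.shape[1]} > {MAX_COLUMNS})")
    # Drop columns that are entirely empty
    df = df.dropna(axis=1, how="all")
    # Strip whitespace from string columns so categories don't fragment
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    # Remove exact duplicate rows
    return df.drop_duplicates().reset_index(drop=True)

raw = pd.DataFrame({"a": [1, 1, 2], "b": [" x", "x ", "y"], "c": [None, None, None]})
clean = light_clean(raw)
print(list(clean.columns))  # ['a', 'b']
```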

Models and "when to use them"

  • GaussianCopula (CPU, no PyTorch): fast and stable baseline; ideal for demos and small-to-medium datasets with mildly non-linear relationships.
  • CTGAN (PyTorch): better for mixed data with many categorical or high-cardinality columns and multimodal distributions; it captured minority classes better in my tests. I forced pac=1 to avoid PacGAN assertion errors with common batch sizes.
  • TVAE (PyTorch): stable training and good global structure, especially when numerical columns predominate; it sometimes "smooths" out extremes.

I included a recommender that, based on the dataset's shape and PyTorch availability, suggests a default model and parameters (epochs, batch size, sample size).
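The recommender is essentially a set of heuristics like the following; the thresholds and parameter values here are illustrative, not the exact ones in the app:

```python
import pandas as pd

def recommend_model(df: pd.DataFrame, torch_available: bool) -> dict:
    """Suggest a synthesizer and lightweight training parameters."""
    n_rows = len(df)
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    # High-cardinality categoricals and multimodal data favor CTGAN
    high_card = any(df[c].nunique() > 20 for c in cat_cols)
    mostly_numeric = len(cat_cols) < df.shape[1] / 2

    if not torch_available:
        model = "GaussianCopula"   # only option without PyTorch
    elif high_card:
        model = "CTGAN"            # handles high-cardinality categoricals
    elif mostly_numeric:
        model = "TVAE"             # stable when numeric columns dominate
    else:
        model = "GaussianCopula"

    # Keep epochs low on CPU; scale the sample size with the input
    epochs = 0 if model == "GaussianCopula" else (100 if n_rows < 5000 else 50)
    return {"model": model, "epochs": epochs, "batch_size": 500,
            "sample_rows": min(n_rows, 10000)}

df = pd.DataFrame({"x": range(100), "y": ["a"] * 50 + ["b"] * 50})
print(recommend_model(df, torch_available=False)["model"])  # GaussianCopula
```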


Key technical decisions

  • Lazy import of models: CTGAN/TVAE are only imported when the user chooses them, so the app runs with GaussianCopula even without PyTorch.
  • New/legacy SDV compatibility: first I try sdv.single_table (CTGANSynthesizer/TVAESynthesizer/GaussianCopulaSynthesizer) and, if it fails, I fallback to sdv.tabular.
  • Demo guardrails: limit of 25 columns, reasonable sampling, and low epochs on CPU.
  • Robust evaluation: sdmetrics.QualityReport with metadata.to_dict() when available.
  • Upload limit: 5 MB from the app + server.maxUploadSize in .streamlit/config.toml.
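The lazy-import-with-fallback pattern behind the first two bullets is generic. A sketch (the SDV module paths are real, but the helper itself is my illustration):

```python
import importlib

def load_first(candidates):
    """Try importing module paths in order; return the first that works.

    This keeps heavy dependencies (the torch-backed CTGAN/TVAE) out of
    startup: the import only runs when the user actually picks that model.
    """
    errors = []
    for path in candidates:
        try:
            return importlib.import_module(path)
        except ImportError as exc:
            errors.append(f"{path}: {exc}")
    raise ImportError("No candidate module available:\n" + "\n".join(errors))

# New SDV layout first, legacy layout as a fallback:
# sdv_mod = load_first(["sdv.single_table", "sdv.tabular"])
# GaussianCopula = getattr(sdv_mod, "GaussianCopulaSynthesizer", None)
```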

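The server-side half of the 5 MB upload limit lives in .streamlit/config.toml; maxUploadSize is expressed in megabytes:

```toml
# .streamlit/config.toml
[server]
maxUploadSize = 5  # MB; complements the in-app size check
```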
Deployment on Hugging Face Spaces (CPU, free)

README front matter (or Settings → Runtime):

sdk: streamlit
app_file: app.py
python_version: 3.10

requirements.txt

streamlit>=1.36.0
pandas>=2.1.0
numpy>=1.24.0
sdv>=1.8.0,<2.0.0
sdmetrics>=0.12.0
scikit-learn>=1.3.0
scipy>=1.10.0
plotly>=5.20.0
torch>=2.1,<3.0   # optional: only if you will use CTGAN/TVAE

Persistence: on the free plan, storage is ephemeral; generated data is downloaded locally.


Local use

# 1) Create environment
conda create -n synth python=3.10 -y
conda activate synth

# 2) Install dependencies
pip install -r requirements.txt

# 3) Run
streamlit run app.py

If you are going to use CTGAN/TVAE on CPU:

pip install "torch>=2.1,<3.0" --index-url https://download.pytorch.org/whl/cpu

Results and lessons

  • In mixed datasets with high-cardinality categorical variables, CTGAN outperformed TVAE and GaussianCopula.
  • TVAE was the most stable with few epochs on CPU.
  • GaussianCopula is ideal as a baseline and for public demos where you don't want to depend on PyTorch.
  • The limits (columns/epochs) were key to avoiding timeouts and offering a smooth experience in Spaces.