
Data Synthesizer: How I built a tabular data generator

Oct 21, 2025

3 min read

A lightweight web app that lets you upload a CSV, train a generative model (GaussianCopula / CTGAN / TVAE), and download a synthetic dataset, all running on CPU inside Hugging Face Spaces.


Data Synthesizer

In many real projects, data cannot be shared because of privacy or compliance constraints. Synthetic data allows:

  • Publishing demos without exposing sensitive information.
  • Augmenting training data and balancing rare classes.
  • Testing pipelines/ETL and dashboards without using the original dataset.

I wanted to build a simple, explainable, and portable tool for my portfolio: you upload a CSV, the app learns its distribution, and it generates a new CSV that preserves global patterns (distributions and correlations) without copying records.


What does it do?

  1. Mandatory upload of a CSV (no preloaded data; a UI sketch follows this list).
  2. Detection of column types and construction of metadata.
  3. Training of a synthesizer (GaussianCopula / CTGAN / TVAE) with lightweight parameters.
  4. Generation of N synthetic rows and download (synthetic.csv).
  5. Quality evaluation (0–1) with sdmetrics if available.
  6. Modern UI with gradient header, cards, and tabs (Load Data · Train & Evaluate).
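
To make the flow concrete, here is a minimal sketch of the Streamlit side of steps 1 and 4. The train_and_sample helper is hypothetical, standing in for the SDV code shown later:

import pandas as pd
import streamlit as st

st.title("Data Synthesizer")

uploaded = st.file_uploader("Upload a CSV (max 5 MB)", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.dataframe(df.head())  # quick preview before training

    n_rows = st.number_input("Synthetic rows", min_value=1, value=len(df))
    if st.button("Train & generate"):
        synthetic = train_and_sample(df, num_rows=int(n_rows))  # hypothetical helper
        st.download_button(
            "Download synthetic.csv",
            synthetic.to_csv(index=False).encode("utf-8"),
            file_name="synthetic.csv",
            mime="text/csv",
        )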

Architecture and stack

  • Interface: Streamlit (tabs + cards + custom CSS styles)
  • Synthetic modeling: SDV single_table (with fallback to sdv.tabular for compatibility)
  • Metrics: sdmetrics.QualityReport
  • Optional ML: PyTorch (only for CTGAN/TVAE)
  • Data manipulation: Pandas / NumPy
Browser ──▶ Streamlit UI
              │
              ├─▶ Upload CSV ──▶ preview + light cleaning
              │
              ├─▶ SDV metadata (type detection)
              │     ├─▶ GaussianCopula Synthesizer (CPU, fast)
              │     ├─▶ CTGAN Synthesizer (GAN, requires torch; pac=1)
              │     └─▶ TVAE Synthesizer (VAE, requires torch)
              │
              ├─▶ sample(N) ──▶ synthetic.csv (download)
              └─▶ sdmetrics QualityReport (score 0–1)
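
Stripped of the UI, the core of that diagram is only a few SDV calls. Here is a minimal sketch using the SDV 1.x single_table API plus the sdmetrics report; the file names and row count are illustrative:

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdmetrics.reports.single_table import QualityReport

df = pd.read_csv("input.csv")

# Detect column types and build metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)

# Train the CPU-friendly baseline
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(df)

# Generate N synthetic rows and save them
synthetic = synthesizer.sample(num_rows=1000)
synthetic.to_csv("synthetic.csv", index=False)

# Score real vs. synthetic (0–1)
report = QualityReport()
report.generate(df, synthetic, metadata.to_dict())
print(report.get_score())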

Models and "when to use them"

  • GaussianCopula (CPU, no PyTorch): a fast, stable baseline; ideal for demos and small/medium datasets with mildly non-linear relationships.
  • CTGAN (PyTorch): better for mixed data with many high-cardinality categorical columns and multimodal distributions; in my tests it captured minority classes better. I forced pac=1 to avoid PacGAN assertion errors with common batch sizes.
  • TVAE (PyTorch): stable training and good global structure, especially when numerical data predominates; it sometimes "smooths" extremes.

I included a recommender that, depending on the dataset and PyTorch availability, suggests a default model and parameters (epochs, batch size, sample size), as sketched below.
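
As a rough illustration of that recommender, here is a sketch of the logic; the heuristics, thresholds, and defaults below are hypothetical, not the app's exact rules:

import importlib.util
import pandas as pd

def recommend_model(df: pd.DataFrame) -> dict:
    has_torch = importlib.util.find_spec("torch") is not None
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    high_cardinality = any(df[c].nunique() > 20 for c in cat_cols)

    if not has_torch:
        # Without PyTorch, only the copula baseline is available
        return {"model": "GaussianCopula"}
    if len(cat_cols) > 0 and high_cardinality:
        # Mixed data with high-cardinality categoricals favors CTGAN
        return {"model": "CTGAN", "epochs": 50, "batch_size": 500, "pac": 1}
    # Mostly numerical data: TVAE trains stably with few epochs on CPU
    return {"model": "TVAE", "epochs": 50, "batch_size": 500}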


Key technical decisions

  • Lazy import of models: CTGAN/TVAE are imported only when the user chooses them, so the app runs with GaussianCopula even without PyTorch.
  • New/legacy SDV compatibility: I first try sdv.single_table (CTGANSynthesizer / TVAESynthesizer / GaussianCopulaSynthesizer) and fall back to sdv.tabular if that fails. Both decisions are sketched after this list.
  • Demo guardrails: limit of 25 columns, reasonable sampling, and low epochs on CPU.
  • Robust evaluation: sdmetrics.QualityReport with metadata.to_dict() when available.
  • Upload limit: 5 MB from the app + server.maxUploadSize in .streamlit/config.toml.
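
A condensed sketch of the first two decisions; the module paths match the real SDV packages, but the error handling is simplified:

def load_synthesizer_class(name: str):
    # Import the chosen model only when the user selects it, so a
    # missing torch install never breaks the GaussianCopula path.
    if name == "CTGAN":
        try:
            from sdv.single_table import CTGANSynthesizer as cls  # new API (SDV >= 1.0)
        except ImportError:
            from sdv.tabular import CTGAN as cls  # legacy fallback
    elif name == "TVAE":
        try:
            from sdv.single_table import TVAESynthesizer as cls
        except ImportError:
            from sdv.tabular import TVAE as cls
    else:
        try:
            from sdv.single_table import GaussianCopulaSynthesizer as cls
        except ImportError:
            from sdv.tabular import GaussianCopula as cls
    return cls

The 5 MB cap lives both in the app's own check and in Streamlit's server config:

# .streamlit/config.toml
[server]
maxUploadSize = 5  # size in MB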

Deployment on Hugging Face Spaces (CPU, free)

README Front-matter (or Settings → Runtime):

---
sdk: streamlit
app_file: app.py
python_version: "3.10"
---

requirements.txt

streamlit>=1.36.0
pandas>=2.1.0
numpy>=1.24.0
sdv>=1.8.0,<2.0.0
sdmetrics>=0.12.0
scikit-learn>=1.3.0
scipy>=1.10.0
plotly>=5.20.0
torch>=2.1,<3.0   # optional: only if you will use CTGAN/TVAE

Persistence: on the free plan, storage is ephemeral, so generated data must be downloaded locally rather than kept on the Space.


Local use

# 1) Create environment
conda create -n synth python=3.10 -y
conda activate synth

# 2) Install dependencies
pip install -r requirements.txt

# 3) Run
streamlit run app.py

If you are going to use CTGAN/TVAE on CPU:

pip install "torch>=2.1,<3.0" --index-url https://download.pytorch.org/whl/cpu

Results and lessons

  • In mixed datasets with high-cardinality categorical variables, CTGAN performed better than TVAE and GaussianCopula.
  • TVAE was the most stable with few epochs on CPU.
  • GaussianCopula is ideal as a baseline and for public demos where you don't want to depend on PyTorch.
  • The limits (columns/epochs) were key to avoiding timeouts and offering a smooth experience in Spaces.