Data Synthesizer: How I built a tabular data generator
Oct 21, 2025
3 min read
A lightweight web app that lets you upload a CSV, train a generative model (GaussianCopula / CTGAN / TVAE), and download a synthetic dataset, all running on CPU within Hugging Face Spaces.
In many real projects, it is not possible to share data due to privacy or compliance issues. Synthetic data allows:
- Publishing demos without exposing sensitive information.
- Augmenting data and balancing rare classes.
- Testing pipelines/ETL and dashboards without using the original dataset.
I wanted to build a simple, explainable, and portable tool for my portfolio: upload a CSV, learn its distribution, and generate a new CSV that preserves global patterns (distributions and correlations), without copying records.
What does it do?
- Mandatory upload of a CSV (no preloaded data).
- Detection of column types and construction of metadata.
- Training of a synthesizer (GaussianCopula / CTGAN / TVAE) with lightweight parameters.
- Generation of N synthetic rows and download (`synthetic.csv`).
- Quality evaluation (0–1) with `sdmetrics`, if available.
- Modern UI with gradient header, cards, and tabs (Load Data · Train & Evaluate).
Architecture and stack
- Interface: Streamlit (tabs + cards + custom CSS styles)
- Synthetic modeling: SDV `single_table` (with fallback to `sdv.tabular` for compatibility)
- Metrics: `sdmetrics.QualityReport`
- Optional ML: PyTorch (only for CTGAN/TVAE)
- Data manipulation: Pandas / NumPy
Browser ──▶ Streamlit UI
│
├─▶ Upload CSV → preview + light cleaning
│
├─▶ SDV.metadata (type detection)
│ ├─▶ GaussianCopula Synthesizer (CPU, fast)
│ ├─▶ CTGAN Synthesizer (GAN, requires torch; pac=1)
│ └─▶ TVAE Synthesizer (VAE, requires torch)
│
├─▶ sample(N) → synthetic.csv (download)
└─▶ sdmetrics.QualityReport (score 0–1)
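A minimal sketch of this flow with the SDV 1.x API (the 1,000-row sample size is a placeholder; error handling and the styled UI are omitted):

```python
import pandas as pd
import streamlit as st
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

uploaded = st.file_uploader("Upload a CSV", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.dataframe(df.head())  # quick preview

    # Detect column types and build single-table metadata
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(data=df)

    # CPU-friendly baseline; CTGAN/TVAE would slot in here instead
    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(df)

    # Sample N rows and serve them straight from memory as a download
    synthetic = synthesizer.sample(num_rows=1000)
    st.download_button(
        "Download synthetic.csv",
        data=synthetic.to_csv(index=False).encode("utf-8"),
        file_name="synthetic.csv",
        mime="text/csv",
    )
```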
Models and "when to use them"
- GaussianCopula (CPU, no PyTorch): fast and stable baseline; ideal for demos and small/medium datasets with only mildly non-linear relationships.
- CTGAN (PyTorch): better for mixed data with many high-cardinality categorical columns and multimodal distributions; it captured minority classes better. I forced `pac=1` to avoid PacGAN assertion errors with common batch sizes.
- TVAE (PyTorch): stable training and good global structure, especially when numerical columns predominate; it sometimes "smooths" extremes.
I included a recommender that, depending on the dataset and PyTorch availability, suggests a default model and parameters (epochs, batch size, sample size).
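The recommendation logic boils down to a few heuristics. Here is an illustrative sketch; the `recommend_model` name, the cardinality threshold, and the parameter values are my placeholders, not the app's exact rules:

```python
import importlib.util

import pandas as pd


def recommend_model(df: pd.DataFrame) -> dict:
    """Suggest a default synthesizer plus lightweight CPU parameters."""
    # Check for PyTorch without actually importing it
    has_torch = importlib.util.find_spec("torch") is not None
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    high_cardinality = any(df[c].nunique() > 20 for c in cat_cols)

    if not has_torch:
        # Without PyTorch, GaussianCopula is the only (and a solid) option
        return {"model": "GaussianCopula", "sample": len(df)}
    if high_cardinality:
        # Mixed data with high-cardinality categoricals: CTGAN tends to win
        return {"model": "CTGAN", "epochs": 50, "batch": 500, "sample": len(df)}
    # Mostly numerical data: TVAE stays stable even with few epochs
    return {"model": "TVAE", "epochs": 50, "batch": 500, "sample": len(df)}
```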
Key technical decisions
- Lazy import of models: CTGAN/TVAE are only imported when the user chooses them, so the app runs with GaussianCopula even without PyTorch.
- New/legacy SDV compatibility: first I try
sdv.single_table(CTGANSynthesizer/TVAESynthesizer/GaussianCopulaSynthesizer) and, if it fails, I fallback tosdv.tabular. - Demo guardrails: limit of 25 columns, reasonable sampling, and low epochs on CPU.
- Robust evaluation:
sdmetrics.QualityReportwithmetadata.to_dict()when available. - Upload limit: 5 MB from the app +
server.maxUploadSizein.streamlit/config.toml.
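A combined sketch of the lazy imports, the new/legacy fallback, and the evaluation, assuming hypothetical helper names (`build_synthesizer`, `quality_score`); the nested imports are what keep the app runnable without PyTorch or `sdmetrics`:

```python
def build_synthesizer(model_name, metadata, epochs=100, batch_size=500):
    """Import the chosen model lazily, preferring sdv.single_table."""
    if model_name == "GaussianCopula":
        from sdv.single_table import GaussianCopulaSynthesizer
        return GaussianCopulaSynthesizer(metadata)
    if model_name == "CTGAN":
        try:
            from sdv.single_table import CTGANSynthesizer  # sdv >= 1.0
            # batch_size must be divisible by pac, so pac=1 sidesteps
            # PacGAN asserts for any batch size
            return CTGANSynthesizer(metadata, epochs=epochs,
                                    batch_size=batch_size, pac=1)
        except ImportError:
            from sdv.tabular import CTGAN  # legacy sdv < 1.0
            return CTGAN(epochs=epochs, batch_size=batch_size)
    if model_name == "TVAE":
        try:
            from sdv.single_table import TVAESynthesizer  # sdv >= 1.0
            return TVAESynthesizer(metadata, epochs=epochs,
                                   batch_size=batch_size)
        except ImportError:
            from sdv.tabular import TVAE  # legacy sdv < 1.0
            return TVAE(epochs=epochs, batch_size=batch_size)
    raise ValueError(f"Unknown model: {model_name}")


def quality_score(real_df, synthetic_df, metadata):
    """Overall 0-1 score via sdmetrics, imported only if installed."""
    from sdmetrics.reports.single_table import QualityReport

    report = QualityReport()
    report.generate(real_df, synthetic_df, metadata.to_dict())
    return report.get_score()
```

And the server-side half of the upload limit (`server.maxUploadSize` is in megabytes):

```toml
# .streamlit/config.toml
[server]
maxUploadSize = 5
```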
Deployment on Hugging Face Spaces (CPU, free)
README Front-matter (or Settings → Runtime):
sdk: streamlit
app_file: app.py
python_version: 3.10
requirements.txt
streamlit>=1.36.0
pandas>=2.1.0
numpy>=1.24.0
sdv>=1.8.0,<2.0.0
sdmetrics>=0.12.0
scikit-learn>=1.3.0
scipy>=1.10.0
plotly>=5.20.0
torch>=2.1,<3.0 # optional: only if you will use CTGAN/TVAE
Persistence: on the free plan, storage is ephemeral; generated data is downloaded locally.
Local use
# 1) Create environment
conda create -n synth python=3.10 -y
conda activate synth
# 2) Install dependencies
pip install -r requirements.txt
# 3) Run
streamlit run app.py
If you are going to use CTGAN/TVAE on CPU:
pip install "torch>=2.1,<3.0" --index-url https://download.pytorch.org/whl/cpu
Results and lessons
- In mixed datasets with high-cardinality categorical variables, CTGAN performed better than TVAE and GaussianCopula.
- TVAE was the most stable with few epochs on CPU.
- GaussianCopula is ideal as a baseline and for public demos where you don't want to depend on PyTorch.
- The limits (columns/epochs) were key to avoiding timeouts and offering a smooth experience in Spaces.