Databricks Labs - PySpark Synthetic Data Generator
dbldatagen (Databricks Labs Data Generator) is an open-source Python library for generating synthetic data at scale within Apache Spark and Databricks environments. It allows users to define complex data schemas with various constraints, distributions, and inter-column relationships to create realistic datasets for testing, benchmarking, and machine learning model development. The library is currently at version 0.4.0.post1 and has an active development and release cadence.
Common errors
- ModuleNotFoundError: No module named 'jmespath'
cause: Some environments, particularly non-Databricks Spark setups such as Google Colab, may not have `jmespath` pre-installed, leading to an import error when `dbldatagen` loads.
fix: Install the `jmespath` package: `pip install jmespath`.
- Error: Attempting to add a column named id which conflicts with the internal seed column.
cause: The `id` column name is reserved internally by `dbldatagen` for its seeding mechanism; defining a column with this name causes a conflict.
fix: Rename your column or, if you specifically need an 'id' column in your data, use the `seedColumnName` parameter of `DataGenerator` to assign a different internal seed column name (e.g., `DataGenerator(spark, ..., seedColumnName='_internal_seed_id')`).
- pyspark.sql.utils.AnalysisException: Cannot resolve 'element_at(`array_column`, 0)' due to data type mismatch.
cause: Older PySpark versions or specific Databricks Runtimes can be incompatible with direct array indexing or `element_at` usage. This was addressed in `dbldatagen` v0.3.6.
fix: Use `dbldatagen` 0.3.6 or later with a compatible PySpark version (>=3.2.1 is required for 0.4.0+). If manually constructing Spark SQL expressions, consult the PySpark documentation for array access syntax compatible with your runtime.
Warnings
- breaking Version 0.4.0 increased the minimum `pyspark` version to 3.2.1 and requires Databricks runtime 10.4 LTS or later. Older PySpark versions or Databricks runtimes will not be compatible.
- gotcha Spark SQL column names are case-insensitive. Defining new columns with the same name but different casing than existing ones may lead to unexpected behavior or errors in downstream operations.
- gotcha When using `dbldatagen.constraints.UniqueCombinations` with streaming dataframes, deduplication is performed only within a batch. For full stream-wide deduplication, you must implement explicit watermarking and deduplication logic on the resultant DataFrame, which can be resource-intensive for high-volume streams.
- gotcha The column name 'id' is reserved internally by `dbldatagen` as the seed column for data generation. If your generated data requires a column named 'id' with different semantics, it will conflict with this internal mechanism.
- gotcha When running on Databricks Unity Catalog enabled environments with Runtimes prior to 13.2, `dbldatagen` requires 'Single User' or 'No Isolation Shared' access modes. 'Shared' access mode in these older runtimes lacks necessary features (e.g., 3rd party libraries, Python UDFs) for `dbldatagen` to function correctly. This limitation is resolved in Databricks Runtimes 13.2 and newer.
Install
- pip install dbldatagen
- %pip install dbldatagen (from within a Databricks notebook)
Imports
- DataGenerator
from dbldatagen import DataGenerator
- dg
import dbldatagen as dg
- Datasets
from dbldatagen.datasets import Datasets
import dbldatagen as dg
df = dg.Datasets(spark, "basic/user").get().build()
Quickstart
from pyspark.sql import SparkSession
import dbldatagen as dg
# Initialize SparkSession (if not in Databricks already)
try:
spark
except NameError:
spark = SparkSession.builder.appName("dbldatagen_quickstart").getOrCreate()
# Generate a basic user dataset using standard datasets feature
# This creates 1 million rows and 4 partitions
print("Generating a basic user dataset...")
df = dg.Datasets(spark, "basic/user").get(rows=1_000_000, partitions=4).build()
# Display schema and a few rows
print("Schema:")
df.printSchema()
print("Sample data:")
df.show(5, truncate=False)
# Stop SparkSession if it was created here (optional)
# spark.stop()