Databricks Labs - PySpark Synthetic Data Generator

0.4.0.post1 · active · verified Thu Apr 16

dbldatagen (Databricks Labs Data Generator) is an open-source Python library for generating synthetic data at scale in Apache Spark and Databricks environments. Users define data schemas with constraints, distributions, and inter-column relationships to create realistic datasets for testing, benchmarking, and machine learning model development. The current version is 0.4.0.post1, and the project is under active development.

Quickstart

This quickstart demonstrates how to generate a synthetic dataset using `dbldatagen`'s `Datasets` class, which provides pre-configured data generation recipes. It initializes a SparkSession (if not already present), creates a 'basic/user' dataset with 1 million rows and 4 partitions, then displays its schema and the first few rows. This approach is recommended for quickly generating common synthetic data patterns.

from pyspark.sql import SparkSession
import dbldatagen as dg

# Reuse the existing `spark` session if one is defined (as on Databricks); otherwise create one
try:
    spark
except NameError:
    spark = SparkSession.builder.appName("dbldatagen_quickstart").getOrCreate()

# Generate a basic user dataset using the standard datasets feature
# This requests 1 million rows spread across 4 partitions
print("Generating a basic user dataset...")
df = dg.Datasets(spark, "basic/user").get(rows=1_000_000, partitions=4).build()

# Display schema and a few rows
print("Schema:")
df.printSchema()

print("Sample data:")
df.show(5, truncate=False)

# Stop SparkSession if it was created here (optional)
# spark.stop()
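
The overview mentions defining schemas with constraints, distributions, and inter-column relationships. As a minimal sketch of what that might look like outside the pre-configured recipes, the block below keeps a column layout as plain data and applies it through `DataGenerator.withColumn`. The column names, value lists, weights, and the helper `build_spec` are hypothetical and chosen only for illustration; they do not come from this page. The `withColumn` options used (`minValue`, `maxValue`, `values`, `weights`, `random`, `expr`, `baseColumn`) are assumed to behave as in the dbldatagen documentation, so check them against your installed version.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical column layout for illustration: (name, Spark SQL type, options).
# Each options dict is passed as keyword arguments to DataGenerator.withColumn.
COLUMN_SPECS: List[Tuple[str, str, Dict[str, Any]]] = [
    # Constraint: integer ages drawn at random from a bounded range.
    ("age", "int", {"minValue": 18, "maxValue": 90, "random": True}),
    # Distribution: categorical values with relative weights 7:2:1.
    ("plan", "string", {
        "values": ["free", "pro", "enterprise"],
        "weights": [7, 2, 1],
        "random": True,
    }),
    # Inter-column relationship: fee derived from the plan column via SQL.
    ("monthly_fee", "float", {
        "expr": "case plan when 'free' then 0.0 when 'pro' then 9.99 else 49.99 end",
        "baseColumn": "plan",
    }),
]


def build_spec(spark, rows: int = 100_000, partitions: int = 4):
    """Return a dbldatagen DataGenerator configured from COLUMN_SPECS."""
    # Imported lazily so the layout above can be inspected without Spark installed.
    import dbldatagen as dg

    spec = dg.DataGenerator(
        sparkSession=spark, name="custom_users", rows=rows, partitions=partitions
    )
    for name, col_type, options in COLUMN_SPECS:
        spec = spec.withColumn(name, col_type, **options)
    return spec
```

With a session in scope, `build_spec(spark).build()` should return a DataFrame that can be inspected with `printSchema()` and `show()` in the same way as the quickstart output. Keeping the layout as data makes it easy to review or extend columns without touching the Spark code.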
