Databricks Labs - PySpark Synthetic Data Generator
dbldatagen (Databricks Labs Data Generator) is an open-source Python library for generating synthetic data at scale within Apache Spark and Databricks environments. It allows users to define complex data schemas with various constraints, distributions, and inter-column relationships to create realistic datasets for testing, benchmarking, and machine learning model development. The library is currently at version 0.4.0.post1 and has an active development and release cadence.
Common errors
- ModuleNotFoundError: No module named 'jmespath'
cause: Some environments, particularly non-Databricks Spark setups such as Google Colab, may not have `jmespath` pre-installed, leading to an import error when `dbldatagen` loads.
fix: Install the `jmespath` package: `pip install jmespath`.
- Error: Attempting to add a column named id which conflicts with the internal seed column.
cause: The `id` column name is reserved internally by `dbldatagen` for its seeding mechanism; defining a column with this name causes a conflict.
fix: Rename your column or, if you specifically need an 'id' column in your data, use the `seedColumnName` parameter of `DataGenerator` to assign a different internal seed column name (e.g., `DataGenerator(spark, ..., seedColumnName='_internal_seed_id')`).
- pyspark.sql.utils.AnalysisException: Cannot resolve 'element_at(`array_column`, 0)' due to data type mismatch.
cause: Older PySpark versions or specific Databricks Runtimes can be incompatible with direct array indexing or `element_at` usage. This was addressed in `dbldatagen` v0.3.6.
fix: Use `dbldatagen` 0.3.6 or later with a compatible PySpark version (>=3.2.1 is required for 0.4.0+). If manually constructing Spark SQL expressions, consult the PySpark documentation for array access syntax compatible with your runtime.
Warnings
- breaking Version 0.4.0 increased the minimum `pyspark` version to 3.2.1 and requires Databricks runtime 10.4 LTS or later. Older PySpark versions or Databricks runtimes will not be compatible.
- gotcha Spark SQL column names are case-insensitive. Defining new columns with the same name but different casing than existing ones may lead to unexpected behavior or errors in downstream operations.
- gotcha When using `dbldatagen.constraints.UniqueCombinations` with streaming dataframes, deduplication is performed only within a batch. For full stream-wide deduplication, you must implement explicit watermarking and deduplication logic on the resultant DataFrame, which can be resource-intensive for high-volume streams.
- gotcha The column name 'id' is reserved internally by `dbldatagen` as the seed column for data generation. If your generated data requires a column named 'id' with different semantics, it will conflict with this internal mechanism.
- gotcha When running on Databricks Unity Catalog enabled environments with Runtimes prior to 13.2, `dbldatagen` requires 'Single User' or 'No Isolation Shared' access modes. 'Shared' access mode in these older runtimes lacks necessary features (e.g., 3rd party libraries, Python UDFs) for `dbldatagen` to function correctly. This limitation is resolved in Databricks Runtimes 13.2 and newer.
Install
- pip install dbldatagen
- %pip install dbldatagen (from within a Databricks notebook)
Imports
- DataGenerator
from dbldatagen import DataGenerator
- dg
import dbldatagen as dg
- Datasets
from dbldatagen.datasets import Datasets
import dbldatagen as dg
df = dg.Datasets(spark, "basic/user").get().build()
Quickstart
from pyspark.sql import SparkSession
import dbldatagen as dg
# Initialize SparkSession (if not in Databricks already)
try:
spark
except NameError:
spark = SparkSession.builder.appName("dbldatagen_quickstart").getOrCreate()
# Generate a basic user dataset using standard datasets feature
# This creates 1 million rows and 4 partitions
print("Generating a basic user dataset...")
df = dg.Datasets(spark, "basic/user").get(rows=1_000_000, partitions=4).build()
# Display schema and a few rows
print("Schema:")
df.printSchema()
print("Sample data:")
df.show(5, truncate=False)
# Stop SparkSession if it was created here (optional)
# spark.stop()