{"id":7144,"library":"dbldatagen","title":"Databricks Labs - PySpark Synthetic Data Generator","description":"dbldatagen (Databricks Labs Data Generator) is an open-source Python library for generating synthetic data at scale within Apache Spark and Databricks environments. It allows users to define complex data schemas with various constraints, distributions, and inter-column relationships to create realistic datasets for testing, benchmarking, and machine learning model development. The library is currently at version 0.4.0.post1 and has an active development and release cadence.","status":"active","version":"0.4.0.post1","language":"en","source_language":"en","source_url":"https://github.com/databrickslabs/data-generator","tags":["pyspark","databricks","synthetic-data","data-generation","testing","benchmarking"],"install":[{"cmd":"pip install dbldatagen","lang":"bash","label":"Standard pip install"},{"cmd":"%pip install dbldatagen","lang":"bash","label":"Databricks Notebook"}],"dependencies":[{"reason":"Core dependency for Spark DataFrame operations, requires >=3.2.1 for dbldatagen v0.4.0.","package":"pyspark","optional":false},{"reason":"May be an implicit dependency in some environments (e.g., Google Colab) to avoid import errors related to JSON processing.","package":"jmespath","optional":true}],"imports":[{"note":"The primary class for defining custom data generation specifications. Prefer aliased import for brevity: import dbldatagen as dg.","wrong":"import dbldatagen.DataGenerator","symbol":"DataGenerator","correct":"from dbldatagen import DataGenerator"},{"note":"Standard aliased import for the dbldatagen library, commonly used with DataGenerator or Datasets classes.","symbol":"dg","correct":"import dbldatagen as dg"},{"note":"While technically possible, directly importing submodules like `Datasets` is less common. 
The recommended pattern is `dg.Datasets` for accessing standard datasets.","wrong":"from dbldatagen.datasets import Datasets","symbol":"Datasets","correct":"import dbldatagen as dg\ndf = dg.Datasets(spark, \"basic/user\").get().build()"}],"quickstart":{"code":"from pyspark.sql import SparkSession\nimport dbldatagen as dg\n\n# Initialize SparkSession (if not in Databricks already)\ntry:\n    spark\nexcept NameError:\n    spark = SparkSession.builder.appName(\"dbldatagen_quickstart\").getOrCreate()\n\n# Generate a basic user dataset using the standard datasets feature\n# This creates 1 million rows across 4 partitions\nprint(\"Generating a basic user dataset...\")\ndf = dg.Datasets(spark, \"basic/user\").get(rows=1_000_000, partitions=4).build()\n\n# Display schema and a few rows\nprint(\"Schema:\")\ndf.printSchema()\n\nprint(\"Sample data:\")\ndf.show(5, truncate=False)\n\n# Stop SparkSession if it was created here (optional)\n# spark.stop()","lang":"python","description":"This quickstart demonstrates how to generate a synthetic dataset using `dbldatagen`'s `Datasets` class, which provides pre-configured data generation recipes. It initializes a SparkSession (if not already present), creates a 'basic/user' dataset with 1 million rows and 4 partitions, then displays its schema and the first few rows. This approach is recommended for quickly generating common synthetic data patterns."},"warnings":[{"fix":"Upgrade your PySpark installation (`pip install 'pyspark>=3.2.1'`) and ensure your Databricks Runtime is 10.4 LTS or newer.","message":"Version 0.4.0 increased the minimum `pyspark` version to 3.2.1 and requires Databricks Runtime 10.4 LTS or later. Older PySpark versions or Databricks Runtimes are not compatible.","severity":"breaking","affected_versions":"0.4.0+"},{"fix":"Ensure consistent casing for column names throughout your data generation specifications to avoid conflicts.","message":"Spark SQL column names are case-insensitive by default. 
Defining new columns with the same name but different casing from existing ones may lead to unexpected behavior or errors in downstream operations.","severity":"gotcha","affected_versions":"All"},{"fix":"For stateful deduplication across an entire stream, apply watermarking and deduplication using Spark's native streaming APIs on the DataFrame produced by `build()`.","message":"When using `dbldatagen.constraints.UniqueCombinations` with streaming DataFrames, deduplication is performed only within a batch. For full stream-wide deduplication, you must implement explicit watermarking and deduplication logic on the resultant DataFrame, which can be resource-intensive for high-volume streams.","severity":"gotcha","affected_versions":"All"},{"fix":"Customize the internal seed column name by setting the `seedColumnName` parameter when creating the `DataGenerator` instance (e.g., `DataGenerator(..., seedColumnName=\"_internal_id\")`).","message":"The column name 'id' is reserved internally by `dbldatagen` as the seed column for data generation. If your generated data requires a column named 'id' with different semantics, it will conflict with this internal mechanism.","severity":"gotcha","affected_versions":"All"},{"fix":"Use Databricks Runtime 13.2 or later, or configure your cluster to use 'Single User' or 'No Isolation Shared' access modes if using older runtimes.","message":"When running in Databricks Unity Catalog-enabled environments with Runtimes prior to 13.2, `dbldatagen` requires the 'Single User' or 'No Isolation Shared' access mode. The 'Shared' access mode in these older runtimes lacks the features (e.g., third-party libraries, Python UDFs) necessary for `dbldatagen` to function correctly. 
This limitation is resolved in Databricks Runtimes 13.2 and newer.","severity":"gotcha","affected_versions":"<13.2 Databricks Runtimes on Unity Catalog"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Install the `jmespath` package: `pip install jmespath`.","cause":"Some environments, particularly non-Databricks Spark setups like Google Colab, might not have `jmespath` pre-installed, leading to import errors when `dbldatagen` attempts to load.","error":"ModuleNotFoundError: No module named 'jmespath'"},{"fix":"Rename your column or, if you specifically need an 'id' column for your data, use the `seedColumnName` parameter in `DataGenerator` to assign a different internal seed column name (e.g., `DataGenerator(spark, ..., seedColumnName='_internal_seed_id')`).","cause":"The `id` column name is reserved internally by `dbldatagen` for its seeding mechanism. If you define a column with this name, it causes a conflict.","error":"Error: Attempting to add a column named id which conflicts with the internal seed column."},{"fix":"Ensure you are using `dbldatagen` version 0.3.6 or later, and a compatible PySpark version (>=3.2.1 recommended for 0.4.0+). If manually constructing Spark SQL expressions, consult PySpark documentation for array access compatible with your runtime.","cause":"Older versions of PySpark or specific Databricks Runtimes might have incompatibilities with direct array indexing or `element_at` function usage. This was specifically addressed in `dbldatagen` v0.3.6.","error":"pyspark.sql.utils.AnalysisException: Cannot resolve 'element_at(`array_column`, 0)' due to data type mismatch."}]}