DBND Spark


DBND Spark integrates Databand's data orchestration framework with Apache Spark. It enables tracking, monitoring, and logging of Spark jobs, including data metrics, lineage, and execution context. The library wraps SparkSession to automatically capture logs and telemetry. Version 1.0.34.1 is the latest stable release, with monthly updates. 'dbnd-spark' is part of the 'dbnd' ecosystem but is installed separately. Maintained by Databand (now part of IBM).

pip install dbnd-spark
error ModuleNotFoundError: No module named 'dbnd_spark'
cause dbnd-spark is a separate package from dbnd.
fix
Run 'pip install dbnd-spark' in addition to dbnd.
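Both installs can be done in one step, and the import verified immediately afterwards; a sketch, assuming pip and network access:

```shell
# Install the core package plus the separately-packaged Spark integration
pip install dbnd dbnd-spark

# Verify the module resolves (guards against the ModuleNotFoundError above)
python -c "import dbnd_spark"
```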
error AttributeError: 'SparkSession' object has no attribute 'dbnd_tracking'
cause Spark session not wrapped by DBND; import missing or SparkContext not initialized properly.
fix
Ensure you import dbnd_spark before creating SparkSession, or use DbndSparkSessionBuilder.
deprecated The 'dbnd-spark' package is being deprecated in favor of 'dbnd' unified package. New versions of dbnd include Spark support internally.
fix Migrate to the unified 'dbnd' package; 'from dbnd_spark import ...' imports continue to work from within it.
gotcha SparkSession must be created inside a DBND task. Creating it at module level breaks tracking.
fix Always create SparkSession inside a @task-decorated function.
breaking Changed from camelCase to snake_case for configuration attributes in v1.0.20.
fix Use underscore style: app_name instead of appName.
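When migrating configuration written against pre-1.0.20 versions, a small helper can mechanically rename the old camelCase attributes; a sketch (`camel_to_snake` is illustrative, not part of dbnd):

```python
import re

def camel_to_snake(name: str) -> str:
    # Insert an underscore before each interior capital, then lowercase:
    # appName -> app_name, webappUrl -> webapp_url
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

old_conf = {"appName": "test", "webappUrl": ""}
new_conf = {camel_to_snake(k): v for k, v in old_conf.items()}
print(new_conf)  # {'app_name': 'test', 'webapp_url': ''}
```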

Define a Spark job as a DBND task, configure tracking URL (optional), and run via dbnd_run.

import os

from dbnd import dbnd_config, task
from dbnd_spark import DbndSparkConfig

@task
def my_spark_job():
    # Create the SparkSession inside the @task function so DBND can wrap it
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("test").getOrCreate()
    df = spark.range(10)
    df.show()
    spark.stop()

if __name__ == "__main__":
    # Optional: point tracking at the Databand webapp
    dbnd_config.set(DbndSparkConfig.webapp_url, os.environ.get("DBND_WEBAPP_URL", ""))
    my_spark_job.dbnd_run()