Dagster Spark

0.29.0 · active · verified Tue Apr 14

Dagster Spark is a Python library that provides integration components for orchestrating Apache Spark jobs within the Dagster data orchestration platform. It lets users define, run, and monitor Spark-based data pipelines using Dagster's declarative programming model, with support for data management, lineage, and observability. The library is actively maintained and is typically released in sync with the core dagster package.

Warnings

Install

pip install dagster-spark

Imports

from dagster_spark import create_spark_op, define_spark_config, spark_resource

Quickstart

This quickstart defines a Dagster job that executes a simple Spark application (SparkPi) using `create_spark_op`. The op wraps a `spark-submit` invocation: you pass the op a name and the application's main class, and supply the remaining Spark settings (master URL, application JAR, Spark conf) through Dagster run config, validated against the schema returned by `define_spark_config()`. Replace `path/to/your/spark-examples.jar` with the actual path to your Spark application JAR.

from dagster import Definitions, job
from dagster_spark import create_spark_op, spark_resource

# Create an op that wraps `spark-submit` for the SparkPi example class.
# create_spark_op takes the op name and the Spark application's main class;
# all other settings are provided through run config.
sparkpi_op = create_spark_op(
    name="sparkpi",
    main_class="org.apache.spark.examples.SparkPi",
)

# The op requires a "spark" resource, which launches the spark-submit process.
@job(resource_defs={"spark": spark_resource})
def spark_pi_job():
    sparkpi_op()

# Spark settings are supplied at launch time via run config, validated
# against the schema returned by define_spark_config():
run_config = {
    "ops": {
        "sparkpi": {
            "config": {
                "master_url": "local[*]",
                "application_jar": "path/to/your/spark-examples.jar",  # Replace with your Spark job JAR path
                "spark_conf": {"spark": {"app": {"name": "dagster-spark-example"}}},
            }
        }
    }
}

# For asset-based workflows, prefer dagster-pyspark (PySparkResource) or
# Dagster Pipes to run Spark logic inside an @asset.

defs = Definitions(jobs=[spark_pi_job])
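Under the hood, the op shells out to `spark-submit`. The following is a minimal, illustrative sketch of how config values like those above map onto a `spark-submit` command line; the helper name `build_spark_submit_command` is hypothetical and this is not dagster-spark's internal implementation.

```python
def build_spark_submit_command(main_class, application_jar, master_url, spark_conf=None):
    # Hypothetical helper, for illustration only: assembles a spark-submit
    # invocation from the kinds of values supplied in Dagster run config.
    command = ["spark-submit", "--class", main_class, "--master", master_url]
    # Each flat "spark.*" key/value pair becomes a --conf argument.
    for key, value in sorted((spark_conf or {}).items()):
        command += ["--conf", f"{key}={value}"]
    # The application JAR comes last, before any application arguments.
    command.append(application_jar)
    return command

command = build_spark_submit_command(
    main_class="org.apache.spark.examples.SparkPi",
    application_jar="path/to/your/spark-examples.jar",
    master_url="local[*]",
    spark_conf={"spark.app.name": "dagster-spark-example"},
)
print(" ".join(command))
# → spark-submit --class org.apache.spark.examples.SparkPi --master local[*]
#   --conf spark.app.name=dagster-spark-example path/to/your/spark-examples.jar
```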
