{"id":5895,"library":"dagster-spark","title":"Dagster Spark","description":"Dagster Spark is a Python library that provides integration components for orchestrating Apache Spark jobs within the Dagster data platform. It enables users to define, run, and monitor Spark-based data pipelines with Dagster's declarative programming model, offering capabilities for data management, lineage, and observability. The library is actively maintained and typically releases in sync with the core Dagster library.","status":"active","version":"0.29.0","language":"en","source_language":"en","source_url":"https://github.com/dagster-io/dagster/tree/master/python_modules/libraries/dagster-spark","tags":["dagster","spark","etl","orchestration","data-pipeline"],"install":[{"cmd":"pip install dagster dagster-spark","lang":"bash","label":"Install Dagster Spark"}],"dependencies":[{"reason":"Core orchestration framework dependency.","package":"dagster","optional":false},{"reason":"Required for defining and executing PySpark jobs.","package":"pyspark","optional":true},{"reason":"Required by Dagster core.","package":"pydantic>=2","optional":false}],"imports":[{"symbol":"create_spark_op","correct":"from dagster_spark import create_spark_op"},{"symbol":"define_spark_config","correct":"from dagster_spark import define_spark_config"},{"note":"This API is currently in feature preview and not considered ready for production use; it may have breaking changes in patch releases.","symbol":"SparkDeclarativePipelineComponent","correct":"from dagster_spark.components.spark_declarative_pipeline import SparkDeclarativePipelineComponent"}],"quickstart":{"code":"from dagster import job, Definitions\nfrom dagster_spark import create_spark_op, spark_resource\n\n# Wrap a spark-submit invocation of the given main class as a Dagster op.\n# The application JAR and Spark settings are supplied via run config;\n# the op requires a `spark` resource such as `spark_resource`.\nmy_spark_pi_op = create_spark_op(\n    name=\"my_spark_pi_op\",\n    main_class=\"org.apache.spark.examples.SparkPi\",\n)\n\n@job(resource_defs={\"spark\": spark_resource})\ndef spark_pi_job():\n    my_spark_pi_op()\n\n# For asset-based workflows, consider the PySparkResource from\n# dagster-pyspark or Dagster Pipes rather than wrapping a spark-submit op.\n\ndefs = Definitions(jobs=[spark_pi_job])\n\n# Example run config for this op (replace the JAR path with your own):\n# ops:\n#   my_spark_pi_op:\n#     config:\n#       application_jar: path/to/your/spark-examples.jar\n#       master_url: local[*]\n#       deploy_mode: client\n#       spark_conf:\n#         spark:\n#           app:\n#             name: dagster-spark-example","lang":"python","description":"This quickstart defines a Dagster job that executes a simple Spark application (SparkPi) using `create_spark_op` together with the `spark_resource`. `create_spark_op` accepts the op name and the Spark application's main class; the application JAR, master URL, and other Spark settings (the schema produced by `define_spark_config`) are supplied through run config, as shown in the commented example. Replace `path/to/your/spark-examples.jar` with the actual path to your Spark application JAR."},"warnings":[{"fix":"Replace `SparkSolidDefinition` with `create_spark_op`.","message":"`SparkSolidDefinition` has been removed. Users should migrate to `create_spark_op` for defining Spark-based operations.","severity":"breaking","affected_versions":"<=0.6.0"},{"fix":"Consider migrating to Dagster Pipes for more lightweight and flexible Spark job orchestration, especially for new projects or feature development.","message":"Spark Step Launchers are superseded by Dagster Pipes and are no longer the recommended method for launching external code from Dagster ops and assets. While still available, they will not receive new features or active development.","severity":"deprecated","affected_versions":"All versions"},{"fix":"Use with caution in production environments. Monitor Dagster release notes for updates on SDP API stability.","message":"The Spark Declarative Pipeline (SDP) integration components (`SparkDeclarativePipelineComponent`, `SparkPipelinesResource`) are in feature preview. This API may have breaking changes in patch version releases and is not considered ready for production use.","severity":"gotcha","affected_versions":"0.28.21 and later"},{"fix":"Refer to Dagster's official documentation for recommended compatibility matrices. Test deployments thoroughly across environments.","message":"Ensure compatibility between your `dagster-spark` version, the `pyspark` library version (if used), and your Apache Spark cluster version. Specific Hadoop/AWS Java SDK versions might also be critical for integrations like S3.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Upgrade your Python environment to 3.10 or newer and ensure `pydantic` is version 2 or greater.","message":"Dagster core (a dependency of `dagster-spark`) no longer supports Python 3.8 and requires `pydantic>=2`.","severity":"breaking","affected_versions":"Dagster 1.12.0 (Dagster Spark 0.28.21) and later"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z"}