{"id":4423,"library":"apache-airflow-providers-apache-spark","title":"Apache Airflow Apache Spark Provider","description":"This provider package enables Apache Airflow to interact with Apache Spark, allowing for the orchestration and scheduling of Spark jobs. It includes operators and hooks for submitting Spark applications, executing Spark SQL queries, and performing data transfers. It's an active provider package, with version 6.0.0 released on March 28, 2026. Airflow providers are released independently of Airflow core, typically with a regular cadence to support new features and bug fixes.","status":"active","version":"6.0.0","language":"en","source_language":"en","source_url":"https://github.com/apache/airflow/tree/main/airflow/providers/apache/spark","tags":["Apache Airflow","Spark","Data Processing","ETL","Provider","Orchestration","Big Data"],"install":[{"cmd":"pip install apache-airflow-providers-apache-spark","lang":"bash","label":"Base installation"},{"cmd":"pip install apache-airflow-providers-apache-spark[pyspark]","lang":"bash","label":"With PySpark (for non-Spark Connect connections)"},{"cmd":"pip install apache-airflow-providers-apache-spark[cncf.kubernetes]","lang":"bash","label":"With Kubernetes (for Spark on Kubernetes)"}],"dependencies":[{"reason":"Core Airflow functionality is required. Version 6.x.x of this provider requires Airflow >=2.11.0.","package":"apache-airflow","optional":false},{"reason":"Required for Spark Connect functionality. Minimum version 4.0.0.","package":"pyspark-client","optional":false},{"reason":"Required for Spark Connect functionality. Minimum version 1.67.0.","package":"grpcio-status","optional":false},{"reason":"Optional. Required if using Spark connection types other than 'spark-connect'. No longer included by default since 6.0.0.","package":"pyspark","optional":true},{"reason":"Optional. 
Required for submitting Spark jobs to Kubernetes.","package":"apache-airflow-providers-cncf-kubernetes","optional":true}],"imports":[{"symbol":"SparkSubmitOperator","correct":"from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator"},{"symbol":"SparkSqlOperator","correct":"from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator"},{"symbol":"PySparkOperator","correct":"from airflow.providers.apache.spark.operators.pyspark import PySparkOperator"},{"symbol":"SparkJDBCOperator","correct":"from airflow.providers.apache.spark.operators.spark_jdbc import SparkJDBCOperator"}],"quickstart":{"code":"from __future__ import annotations\n\nimport pendulum\n\nfrom airflow.models.dag import DAG\nfrom airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator\n\n# For local testing, ensure a Spark Connection 'spark_default' is configured in Airflow UI.\n# Example: Host: spark://localhost:7077 (or similar Spark Master URL)\n# For a PySpark job, you might need a local 'pyspark_job.py' file.\n# Example pyspark_job.py content:\n# from pyspark.sql import SparkSession\n# spark = SparkSession.builder.appName('SimpleSparkApp').getOrCreate()\n# data = [('Alice', 1), ('Bob', 2), ('Charlie', 3)]\n# df = spark.createDataFrame(data, ['Name', 'Age'])\n# df.show()\n# spark.stop()\n\nwith DAG(\n    dag_id=\"spark_submit_example_dag\",\n    start_date=pendulum.datetime(2023, 1, 1, tz=\"UTC\"),\n    catchup=False,\n    schedule=None,\n    tags=[\"spark\", \"example\"],\n) as dag:\n    submit_pyspark_job = SparkSubmitOperator(\n        task_id=\"submit_pyspark_job\",\n        conn_id=\"spark_default\",  # Ensure this Spark connection is configured in Airflow UI\n        application=\"/opt/airflow/dags/pyspark_job.py\",  # Path to your PySpark script\n        name=\"airflow_pyspark_job\",\n        conf={\n            \"spark.executor.memory\": \"2g\",\n            \"spark.driver.memory\": 
\"1g\"\n        },\n        verbose=True,\n        # For more options, see SparkSubmitOperator documentation\n        # application_args=[\"--input\", \"/path/to/input.csv\", \"--output\", \"/path/to/output.csv\"]\n    )\n","lang":"python","description":"This example demonstrates a basic Airflow DAG using the `SparkSubmitOperator` to submit a PySpark application to a Spark cluster. Before running, ensure you have a 'Spark' connection (e.g., `spark_default`) configured in your Airflow UI with the appropriate Spark master URL. The `application` parameter should point to your PySpark script accessible by the Airflow worker."},"warnings":[{"fix":"Install the provider with `pip install apache-airflow-providers-apache-spark[pyspark]` if you need non-Spark Connect functionality.","message":"The `pyspark` package is no longer included by default in `apache-airflow-providers-apache-spark` starting from version 6.0.0. Only 'spark-connect' type connections work by default. For other Spark connection types (e.g., submitting PySpark jobs locally), you must install the provider with the `[pyspark]` extra.","severity":"breaking","affected_versions":">=6.0.0"},{"fix":"Ensure your `pyspark` and `spark-connect` installations are at least version 4.0.0.","message":"The minimum required versions for `pyspark` and `spark-connect` are now 4.0.0.","severity":"breaking","affected_versions":">=6.0.0"},{"fix":"Upgrade your Apache Airflow instance to version 2.11.0 or higher to use the latest provider functionalities. Refer to the specific provider version's changelog for exact Airflow compatibility.","message":"This provider version (6.x.x) requires Apache Airflow 2.11.0 or newer. 
Older provider versions had their own minimum Airflow requirements (e.g., 5.x.x required >=2.11.0, 3.x.x required >=2.2.0, 2.x.x required >=2.1.0).","severity":"breaking","affected_versions":">=5.4.0"},{"fix":"Ensure `JAVA_HOME` and `SPARK_HOME` environment variables are correctly set on the Airflow worker machines and that `spark-submit` and other Spark binaries are accessible in the PATH. This often involves customizing your Docker image for Airflow deployments.","message":"To run Spark jobs via Airflow (especially `SparkSubmitOperator` or `SparkSqlOperator`), the Airflow worker executing the task must have Java installed and correctly configured with `JAVA_HOME`, and Spark binaries must be available in the system's `PATH` (e.g., `SPARK_HOME/bin`).","severity":"gotcha","affected_versions":"All"},{"fix":"Install the Kubernetes provider: `pip install apache-airflow-providers-apache-spark[cncf.kubernetes]`.","message":"When running Spark jobs on Kubernetes, the `apache-airflow-providers-cncf-kubernetes` provider must be installed separately to enable the necessary integration.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}