Apache Airflow Apache Spark Provider
Version 6.0.0, verified Sun Apr 12, auth: no, python
This provider package lets Apache Airflow orchestrate and schedule Apache Spark jobs. It includes operators and hooks for submitting Spark applications, executing Spark SQL queries, and performing data transfers. The package is actively maintained; version 6.0.0 was released on March 28, 2026. Airflow providers are released independently of Airflow core, typically on a regular cadence, so new features and bug fixes ship without waiting for a core release.
pip install apache-airflow-providers-apache-spark
Common errors
error ModuleNotFoundError: No module named 'airflow.providers.apache.spark'
cause The `apache-airflow-providers-apache-spark` package is not installed.
fix Install the package using `pip install apache-airflow-providers-apache-spark`.
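One way to confirm that the interpreter running Airflow actually has the provider is to query the packaging metadata. The following is a minimal sketch using only the standard library; the helper name `provider_version` is illustrative and is not part of Airflow.

```python
from __future__ import annotations

from importlib import metadata

PROVIDER_DIST = "apache-airflow-providers-apache-spark"


def provider_version(dist_name: str = PROVIDER_DIST) -> str | None:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = provider_version()
    if version is None:
        print(f"{PROVIDER_DIST} is not installed in this environment")
    else:
        print(f"{PROVIDER_DIST} {version} is installed")
```

Run it with the same Python executable that the scheduler and workers use; a result of `None` there, while `pip` reports the package elsewhere, usually points to a different virtual environment or image.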
error Exception when importing 'airflow.providers.apache.spark.hooks.spark_jdbc.SparkJDBCHook' from 'apache-airflow-providers-apache-spark' package: name 'client' is not defined
cause The `kubernetes` Python package is missing, which is required by the Spark provider.
fix Install the `kubernetes` package using `pip install kubernetes`.
error Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
cause The Spark cluster lacks available resources to run the job.
fix Ensure that the Spark cluster has sufficient resources and that workers are properly registered.
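When the cluster is healthy but small, a job that asks for more memory or cores than any worker offers can wait with this message indefinitely. The sketch below, which assumes a standalone cluster, builds a conservative properties dictionary that could be passed as the `conf` argument of `SparkSubmitOperator`. The helper name and the default values are illustrative; suitable numbers depend on your cluster.

```python
from __future__ import annotations


def modest_resource_conf(
    executor_memory: str = "1g",
    executor_cores: int = 1,
    max_cores: int = 2,
) -> dict[str, str]:
    """Build Spark properties that keep a job's resource request small.

    `spark.cores.max` caps the total cores a job may claim on a standalone cluster.
    """
    if executor_cores < 1 or max_cores < executor_cores:
        raise ValueError("need 1 <= executor_cores <= max_cores")
    return {
        "spark.executor.memory": executor_memory,
        "spark.executor.cores": str(executor_cores),
        "spark.cores.max": str(max_cores),
    }
```

Compare the requested values with the per-worker memory and cores shown in the cluster UI before raising them again.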
error ModuleNotFoundError: No module named 'airflow.providers.apache'
cause The `apache-airflow-providers-apache-spark` package, or its parent `apache` provider directory, is not installed or not accessible within the Airflow environment where the DAG is being parsed or executed.
fix Ensure the provider package is installed correctly in the Airflow environment (e.g., `pip install apache-airflow-providers-apache-spark`). If using Docker, rebuild the Docker image after adding the installation command to the Dockerfile.
error airflow.exceptions.AirflowException: Cannot execute: spark-submit ... Error code is: ...
cause This error often indicates that the `spark-submit` command is not found in the PATH of the Airflow worker, or that a problem with the Spark installation or the application itself prevents `spark-submit` from executing successfully.
fix Verify that the `spark-submit` binary is available in the system's PATH where the Airflow worker is running. Ensure Spark is correctly installed and configured, and check the full Spark logs for more specific errors.
Warnings
breaking The `pyspark` package is no longer included by default in `apache-airflow-providers-apache-spark` starting from version 6.0.0. Only 'spark-connect' type connections work by default. For other Spark connection types (e.g., submitting PySpark jobs locally), you must install the provider with the `[pyspark]` extra.
fix Install the provider with `pip install apache-airflow-providers-apache-spark[pyspark]` if you need non-Spark Connect functionality.
breaking The minimum required versions for `pyspark` and `spark-connect` are now 4.0.0.
fix Ensure your `pyspark` and `spark-connect` installations are at least version 4.0.0.
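The version floor can be checked programmatically before a deployment. This is a rough sketch using only the standard library; the function names are illustrative. It compares only the leading numeric release components, so a pre-release such as `4.0.0rc1` is treated like `4.0.0`. A dedicated version parser such as the `packaging` library handles those cases more carefully.

```python
from __future__ import annotations

from importlib import metadata


def meets_minimum(installed: str, minimum: tuple[int, ...] = (4, 0, 0)) -> bool:
    """Compare the leading numeric release components of a version string."""
    parts: list[int] = []
    for piece in installed.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(digits))
    # Pad short versions such as "4" so they compare as "4.0.0".
    while len(parts) < len(minimum):
        parts.append(0)
    return tuple(parts) >= minimum


def pyspark_is_supported() -> bool | None:
    """Return True/False for the installed pyspark, or None if it is not installed."""
    try:
        return meets_minimum(metadata.version("pyspark"))
    except metadata.PackageNotFoundError:
        return None
```

A `None` result means `pyspark` is not installed at all, which is expected when the provider was installed without the `[pyspark]` extra.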
breaking This provider version (6.x.x) requires Apache Airflow 2.11.0 or newer. Older provider versions had similar minimum Airflow requirements (e.g., 5.x.x required >=2.11.0, 3.x.x required >=2.2.0, 2.x.x required >=2.1.0).
fix Upgrade your Apache Airflow instance to version 2.11.0 or higher to use the latest provider functionalities. Refer to the specific provider version's changelog for exact Airflow compatibility.
gotcha To run Spark jobs via Airflow (especially `SparkSubmitOperator` or `SparkSqlOperator`), the Airflow worker executing the task must have Java installed and correctly configured with `JAVA_HOME`, and Spark binaries must be available in the system's `PATH` (e.g., `SPARK_HOME/bin`).
fix Ensure `JAVA_HOME` and `SPARK_HOME` environment variables are correctly set on the Airflow worker machines and that `spark-submit` and other Spark binaries are accessible in the PATH. This often involves customizing your Docker image for Airflow deployments.
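These prerequisites can be verified from Python on the worker itself. The sketch below uses only the standard library and returns a list of human-readable problems; the function name `spark_env_problems` is illustrative and not part of the provider.

```python
from __future__ import annotations

import os
import shutil
from typing import Mapping


def spark_env_problems(env: Mapping[str, str] | None = None) -> list[str]:
    """List missing Spark prerequisites in the given environment (default: os.environ)."""
    env = os.environ if env is None else env
    problems: list[str] = []
    for var in ("JAVA_HOME", "SPARK_HOME"):
        value = env.get(var)
        if not value:
            problems.append(f"{var} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{var} points to a missing directory: {value}")
    # Search only the PATH from the supplied environment; an empty PATH finds nothing.
    if shutil.which("spark-submit", path=env.get("PATH", "")) is None:
        problems.append("spark-submit was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in spark_env_problems():
        print(problem)
```

Running this inside the worker container, or as a small preflight task, shows whether the image needs customizing before any Spark task is scheduled.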
gotcha When running Spark jobs on Kubernetes, the `apache-airflow-providers-cncf-kubernetes` provider must be installed separately to enable the necessary integration.
fix Install the Kubernetes provider through the Spark provider's extra: `pip install apache-airflow-providers-apache-spark[cncf.kubernetes]`.
Install
pip install apache-airflow-providers-apache-spark[pyspark]
pip install apache-airflow-providers-apache-spark[cncf.kubernetes]
Imports
- SparkSubmitOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
- SparkSqlOperator
from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator
- PySparkOperator
from airflow.providers.apache.spark.operators.pyspark import PySparkOperator
- SparkJDBCOperator
from airflow.providers.apache.spark.operators.spark_jdbc import SparkJDBCOperator
Quickstart
from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# For local testing, ensure a Spark Connection 'spark_default' is configured in Airflow UI.
# Example: Host: spark://localhost:7077 (or similar Spark Master URL)
# For a PySpark job, you might need a local 'pyspark_job.py' file.
# Example pyspark_job.py content:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName('SimpleSparkApp').getOrCreate()
# data = [('Alice', 1), ('Bob', 2), ('Charlie', 3)]
# df = spark.createDataFrame(data, ['Name', 'Age'])
# df.show()
# spark.stop()

with DAG(
    dag_id="spark_submit_example_dag",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    schedule=None,  # No schedule: trigger the DAG manually
    tags=["spark", "example"],
) as dag:
    submit_pyspark_job = SparkSubmitOperator(
        task_id="submit_pyspark_job",
        # Ensure this Spark connection is configured in Airflow UI.
        # Pass conn_id only once: a repeated keyword argument is a SyntaxError.
        conn_id="spark_default",
        application="/opt/airflow/dags/pyspark_job.py",  # Path to your PySpark script
        name="airflow_pyspark_job",
        conf={
            "spark.executor.memory": "2g",
            "spark.driver.memory": "1g",
        },
        verbose=True,
        # For more options, see SparkSubmitOperator documentation
        # application_args=["--input", "/path/to/input.csv", "--output", "/path/to/output.csv"]
    )