Apache Airflow Apache Spark Provider
This provider package enables Apache Airflow to interact with Apache Spark, allowing for the orchestration and scheduling of Spark jobs. It includes operators and hooks for submitting Spark applications, executing Spark SQL queries, and performing data transfers. It's an active provider package, with version 6.0.0 released on March 28, 2026. Airflow providers are released independently of Airflow core, typically with a regular cadence to support new features and bug fixes.
Warnings
- breaking The `pyspark` package is no longer included by default in `apache-airflow-providers-apache-spark` starting from version 6.0.0. Only `spark-connect` type connections work by default. For other Spark connection types (e.g., submitting PySpark jobs locally), you must install the provider with the `[pyspark]` extra.
- breaking The minimum required versions for `pyspark` and `spark-connect` are now 4.0.0.
- breaking This provider version (6.x.x) requires Apache Airflow 2.11.0 or newer. Older provider versions had similar minimum Airflow requirements (e.g., 5.x.x required >=2.11.0, 3.x.x required >=2.2.0, 2.x.x required >=2.1.0).
- gotcha To run Spark jobs via Airflow (especially `SparkSubmitOperator` or `SparkSqlOperator`), the Airflow worker executing the task must have Java installed and correctly configured with `JAVA_HOME`, and Spark binaries must be available in the system's `PATH` (e.g., `SPARK_HOME/bin`).
- gotcha When running Spark jobs on Kubernetes, the `apache-airflow-providers-cncf-kubernetes` provider must be installed separately to enable the necessary integration.
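The worker-side prerequisites above can be sanity-checked with a small shell sketch (a rough helper, not part of the provider; `java` and `spark-submit` are the standard binary names, while `JAVA_HOME` and install paths vary by environment):

```shell
#!/bin/sh
# Sketch: report whether the binaries an Airflow Spark task needs are on PATH.
check() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}
check java
check spark-submit
# JAVA_HOME must point at the JDK/JRE the worker should use
echo "JAVA_HOME=${JAVA_HOME:-<unset>}"
```

Run this on the Airflow worker (not the scheduler) that will execute the task, since that is where `SparkSubmitOperator` spawns the `spark-submit` process.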
Install
- pip install apache-airflow-providers-apache-spark
- pip install "apache-airflow-providers-apache-spark[pyspark]"
- pip install "apache-airflow-providers-apache-spark[cncf.kubernetes]"
Imports
- SparkSubmitOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
- SparkSqlOperator
from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator
- PySparkOperator
from airflow.providers.apache.spark.operators.pyspark import PySparkOperator
- SparkJDBCOperator
from airflow.providers.apache.spark.operators.spark_jdbc import SparkJDBCOperator
Quickstart
from __future__ import annotations
import pendulum
from airflow.models.dag import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
# For local testing, ensure a Spark Connection 'spark_default' is configured in Airflow UI.
# Example: Host: spark://localhost:7077 (or similar Spark Master URL)
# For a PySpark job, you might need a local 'pyspark_job.py' file.
# Example pyspark_job.py content:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName('SimpleSparkApp').getOrCreate()
# data = [('Alice', 1), ('Bob', 2), ('Charlie', 3)]
# df = spark.createDataFrame(data, ['Name', 'Age'])
# df.show()
# spark.stop()
with DAG(
    dag_id="spark_submit_example_dag",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    schedule=None,
    tags=["spark", "example"],
) as dag:
    submit_pyspark_job = SparkSubmitOperator(
        task_id="submit_pyspark_job",
        conn_id="spark_default",  # Ensure this Spark connection is configured in the Airflow UI
        application="/opt/airflow/dags/pyspark_job.py",  # Path to your PySpark script
        name="airflow_pyspark_job",
        conf={
            "spark.executor.memory": "2g",
            "spark.driver.memory": "1g",
        },
        verbose=True,
        # For more options, see the SparkSubmitOperator documentation
        # application_args=["--input", "/path/to/input.csv", "--output", "/path/to/output.csv"]
    )
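Under the hood, `SparkSubmitOperator` turns its arguments into a `spark-submit` command line executed on the worker. A minimal sketch of that mapping (a simplified illustration, not the provider's actual hook, which also handles the master URL, deploy mode, jars, and more):

```python
def build_spark_submit_cmd(application, conf=None, name=None, verbose=False):
    # Simplified sketch: each conf entry becomes a --conf key=value flag,
    # and the application path goes last, as spark-submit expects.
    cmd = ["spark-submit"]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    if name:
        cmd += ["--name", name]
    if verbose:
        cmd.append("--verbose")
    cmd.append(application)
    return cmd

print(build_spark_submit_cmd(
    "/opt/airflow/dags/pyspark_job.py",
    conf={"spark.executor.memory": "2g", "spark.driver.memory": "1g"},
    name="airflow_pyspark_job",
    verbose=True,
))
```

This is why the gotchas above matter: the resulting command only works if `spark-submit` is on the worker's `PATH` and Java is configured.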