Apache Airflow Livy Provider

4.5.5 · active · verified Tue Apr 14

The Apache Airflow Livy Provider package enables Apache Airflow to interact with Apache Livy, an open-source REST service for submitting and managing Spark jobs on a cluster over HTTP. It includes operators and hooks that orchestrate Spark applications from within Airflow DAGs. The current version is 4.5.5; provider packages are released independently of core Airflow, but each release is kept compatible with supported Airflow versions.
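Under the hood, submitting a batch means POSTing a JSON body to Livy's `/batches` endpoint. A minimal sketch of how operator parameters map onto that body, using field names from the Livy batch REST API (the helper function and the jar path are illustrative, not part of the provider):

```python
# Sketch: assemble the JSON body an Airflow task would POST to Livy's
# /batches endpoint. Keys ("file", "className", "args", ...) follow the
# Livy batch REST API; build_livy_batch_payload is a hypothetical helper.
def build_livy_batch_payload(file, class_name=None, args=None,
                             driver_memory=None, executor_memory=None,
                             num_executors=None):
    """Build a Livy batch submission body, omitting any unset fields."""
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = args
    if driver_memory:
        payload["driverMemory"] = driver_memory
    if executor_memory:
        payload["executorMemory"] = executor_memory
    if num_executors:
        payload["numExecutors"] = num_executors
    return payload


# Example: the payload for the Spark Pi job used in the quickstart below.
payload = build_livy_batch_payload(
    file="/opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar",
    class_name="org.apache.spark.examples.SparkPi",
    args=["10"],
    driver_memory="1g",
)
```

Fields left unset are omitted entirely, so the Livy server falls back to its own cluster defaults rather than receiving explicit nulls.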

Install

pip install apache-airflow-providers-apache-livy

Imports

from airflow.providers.apache.livy.operators.livy import LivyOperator
from airflow.providers.apache.livy.hooks.livy import LivyHook

Quickstart

This quickstart demonstrates a basic Airflow DAG that uses the `LivyOperator` to submit a Spark Pi calculation to a Livy server. Ensure a `livy_default` connection is configured in the Airflow UI pointing to your Livy instance, and that the `file` parameter points to the Spark examples JAR at a path accessible from your cluster.

from __future__ import annotations

import os
from datetime import datetime

from airflow.models.dag import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator


with DAG(
    dag_id='livy_spark_pi_example',
    schedule=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
    tags=['livy', 'spark', 'example'],
) as dag:
    # Ensure 'livy_default' connection is configured in Airflow UI (Admin -> Connections)
    # with appropriate host and port for your Livy server.
    # Example: livy_default, Host: localhost, Port: 8998

    submit_spark_pi_job = LivyOperator(
        task_id='submit_spark_pi_job',
        file=os.getenv('LIVY_SPARK_PI_JAR', '/opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar'),
        class_name='org.apache.spark.examples.SparkPi',
        args=['10'],  # SparkPi argument: number of partitions (slices) to use
        livy_conn_id='livy_default',
        driver_memory='1g',
        executor_memory='1g',
        num_executors=1
    )
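When `polling_interval` is set to a positive value, the `LivyOperator` repeatedly checks the batch state and finishes only once the batch reaches a terminal state. A rough sketch of that state check, assuming Livy's documented batch lifecycle states (the exact terminal set the provider uses may differ, and treating "error" as terminal here is an assumption):

```python
# Livy batch lifecycle states include not_started, starting, running,
# busy, shutting_down, error, dead, killed, and success. The states
# below are treated as terminal for polling purposes (an assumption;
# check the provider's BatchState enum for the authoritative set).
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def is_batch_finished(state: str) -> bool:
    """Return True when polling can stop for the given batch state."""
    return state.lower() in TERMINAL_STATES


def batch_succeeded(state: str) -> bool:
    """Return True only when the batch completed successfully."""
    return state.lower() == "success"
```

With `polling_interval=0` (the default), the operator only submits the batch and returns immediately; pair it with a `LivySensor` if you need a separate task to wait for completion.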
