Apache Airflow Papermill Provider
The Apache Airflow Papermill Provider integrates Papermill with Apache Airflow, letting you parameterize and execute Jupyter notebooks as tasks in your Airflow DAGs. This enables automated, reproducible, and scalable notebook execution within data pipelines. The current version is 3.12.3; releases follow the Apache Airflow provider cadence, with updates typically aligned with Airflow releases or shipped independently for fixes and features.
Warnings
- breaking Provider versions 3.0.0 and above require Apache Airflow 2.2+. Earlier provider versions (2.0.0) required Airflow 2.1.0+. Ensure your Airflow installation meets the minimum version requirement for the provider you are installing.
- breaking Python 3.7 support was dropped as of provider version 3.2.1. Ensure you are using a supported Python version.
- gotcha The `PapermillOperator` executes notebooks locally within the Airflow worker's environment. You must ensure that the notebook's kernel (e.g., `ipykernel`) and any other dependencies required by your notebook code are installed in the Airflow worker's environment.
- gotcha Notebooks used with `PapermillOperator` must have a cell explicitly tagged `parameters` if you intend to pass parameters from Airflow. If the tag is missing, Papermill injects the parameters in a new cell at the top of the notebook, which may not be the desired behavior.
- gotcha A known bug with some `papermill` versions can cause 'No such file or directory' errors when writing grammar tables. This typically manifests as `Writing failed: [Errno 2] No such file or directory: '/home/astro/.cache/black/21.7b0/tmpzpsclowd'`.
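Because a missing `parameters` tag fails silently (parameters are simply injected at the top), it can be worth validating notebooks before deploying them. A minimal sketch using only the standard library; the helper name `has_parameters_cell` is hypothetical, but the `metadata.tags` field it inspects is what Papermill looks for:

```python
import json
import tempfile

def has_parameters_cell(nb_path: str) -> bool:
    """Return True if any cell carries the 'parameters' tag Papermill looks for."""
    with open(nb_path) as f:
        nb = json.load(f)
    return any(
        "parameters" in cell.get("metadata", {}).get("tags", [])
        for cell in nb.get("cells", [])
    )

# Minimal nbformat-4 notebook with one tagged cell, written to a temp file for the check.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": ['msg = "Default message"'],
            "outputs": [],
            "execution_count": None,
        }
    ],
}

with tempfile.NamedTemporaryFile("w", suffix=".ipynb", delete=False) as f:
    json.dump(notebook, f)
    path = f.name

print(has_parameters_cell(path))  # True for this notebook
```

A check like this could run in CI or in a DAG validation step so a mistagged notebook is caught before Airflow executes it.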
Install
- pip install apache-airflow-providers-papermill
Imports
- PapermillOperator
from airflow.providers.papermill.operators.papermill import PapermillOperator
Quickstart
from __future__ import annotations
import pendulum
from airflow.models.dag import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator
# For a real-world scenario, ensure 'hello_world.ipynb' exists in your DAGs folder
# or a location accessible by Airflow, with a 'parameters' tagged cell.
# Example 'hello_world.ipynb':
# # In a cell, add tag 'parameters'
# msg = "Default message"
# print(f"Hello, {msg}!")
with DAG(
    dag_id="example_papermill_notebook",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
    tags=["papermill", "example"],
) as dag:
    run_notebook = PapermillOperator(
        task_id="run_hello_world_notebook",
        input_nb="/tmp/hello_world.ipynb",  # Replace with an Airflow-accessible path
        output_nb="/tmp/out-{{ ds }}.ipynb",
        parameters={
            "msg": "Ran from Airflow at {{ ds }}!"  # Name must match the variable in the 'parameters' cell
        },
    )
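The example notebook described in the comments above can be generated with only the standard library, so the quickstart is runnable end to end. A minimal sketch (`nbformat` would be the more idiomatic choice if it is available in your environment):

```python
import json
import os
import tempfile

# Build the two-cell hello_world.ipynb sketched in the quickstart comments.
cells = [
    {   # This cell carries the 'parameters' tag, so Papermill overrides `msg` at run time.
        "cell_type": "code",
        "metadata": {"tags": ["parameters"]},
        "source": ['msg = "Default message"'],
        "outputs": [],
        "execution_count": None,
    },
    {
        "cell_type": "code",
        "metadata": {},
        "source": ['print(f"Hello, {msg}!")'],
        "outputs": [],
        "execution_count": None,
    },
]
notebook = {"nbformat": 4, "nbformat_minor": 5, "metadata": {}, "cells": cells}

# tempfile.gettempdir() resolves to /tmp on Linux, matching the DAG's input_nb.
path = os.path.join(tempfile.gettempdir(), "hello_world.ipynb")
with open(path, "w") as f:
    json.dump(notebook, f)
print(path)
```

Once the file exists, the notebook can also be smoke-tested outside Airflow with the Papermill CLI (`papermill <input> <output> -p msg "test"`) before wiring it into the DAG.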