Apache Airflow Microsoft Azure Provider
The `apache-airflow-providers-microsoft-azure` package provides Apache Airflow hooks, operators, and sensors for integrating with Microsoft Azure services, including Blob Storage, Data Lake Storage Gen2, Cosmos DB, and Azure SQL Database. Currently at version 13.1.0, the provider follows the Apache Airflow project's release cadence, receiving regular updates and new features as part of the broader Airflow ecosystem. It requires Python >= 3.10.
Warnings
- breaking Migrating from Airflow 1.x to Airflow 2.x+ provider packages means replacing `airflow.contrib` imports: all Azure-related hooks, operators, and sensors moved from paths such as `airflow.contrib.hooks.wasb_hook` to `airflow.providers.microsoft.azure.*`.
- gotcha Azure connection authentication is easy to misconfigure. The provider supports Service Principal (tenant_id, client_id, client_secret), Managed Identity, Account Key, and SAS Token; which method is used depends on which Connection fields (login, password) and which keys in the JSON 'Extra' field are populated.
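To see how those fields map onto a concrete connection: since Airflow 2.3, a connection can be supplied as JSON through an `AIRFLOW_CONN_<CONN_ID>` environment variable instead of the UI. A minimal sketch for service-principal auth (all IDs are placeholders, and the `tenant_id` extra key is an assumption; check the provider's connection docs for the exact keys your version expects):

```python
import json
import os

# Placeholder service-principal credentials -- not real values.
conn = {
    "conn_type": "wasb",  # the Azure Blob Storage connection type
    "login": "00000000-0000-0000-0000-000000000000",  # client_id
    "password": "client-secret-placeholder",          # client_secret
    "host": "mystorageaccount.blob.core.windows.net",
    "extra": {"tenant_id": "11111111-1111-1111-1111-111111111111"},
}

# The suffix after AIRFLOW_CONN_ is the upper-cased connection id.
os.environ["AIRFLOW_CONN_WASB_DEFAULT"] = json.dumps(conn)
print(os.environ["AIRFLOW_CONN_WASB_DEFAULT"])
```

Hooks pointed at the corresponding connection id resolve this variable before consulting the metadata database.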
- gotcha The Azure SDK client libraries (e.g. `azure-storage-blob`, `azure-storage-file-datalake`, `azure-cosmos`) are declared as dependencies of the provider rather than optional extras. If you encounter `ModuleNotFoundError` for an `azure-*` library, the provider was most likely installed into a different Python environment than the one running Airflow, or a conflicting pin downgraded or removed the SDK.
- breaking Airflow 2.0 changed the `BaseOperator` interface, including how XComs are handled. The old `xcom_push=True` operator argument is gone; use `do_xcom_push` (default `True`), which pushes an operator's `execute()` return value to XCom under the `return_value` key.
- gotcha The Data Lake Storage Gen2 hook (`AzureDataLakeStorageV2Hook`) and related operators require the `azure-storage-file-datalake` and `azure-identity` client libraries, which are installed with the provider. The older `azure-datalake-store` SDK targets Gen1 Data Lake only and will not work with the Gen2 hook.
Install
pip install apache-airflow-providers-microsoft-azure
pip install 'apache-airflow[microsoft.azure]'  # or install Airflow together with this provider
Imports
- WasbHook (Azure Blob Storage)
from airflow.providers.microsoft.azure.hooks.wasb import WasbHook
- WasbDeleteBlobOperator
from airflow.providers.microsoft.azure.operators.wasb import WasbDeleteBlobOperator
- AzureDataLakeStorageV2Hook (Data Lake Storage Gen2)
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeStorageV2Hook
Quickstart
from __future__ import annotations
import pendulum
from airflow.models.dag import DAG
from airflow.decorators import task
from airflow.providers.microsoft.azure.hooks.wasb import WasbHook

# Configure an Airflow connection named 'wasb_default' (the hook's default)
# with an account key, SAS token, or service principal credentials.
# For example, in Airflow UI: Admin -> Connections -> Add a new Connection
# Conn Id: wasb_default
# Conn Type: Azure Blob Storage
# Host: <your-azure-storage-account-name>.blob.core.windows.net
# Login: <service-principal-client-id>
# Password: <service-principal-client-secret>
# Extra: {"tenant_id": "<azure-tenant-id>"}
# Or supply the connection via an environment variable such as AIRFLOW_CONN_WASB_DEFAULT
with DAG(
    dag_id="azure_blob_storage_list_example",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
    tags=["azure", "blob_storage", "example"],
) as dag:

    @task
    def list_blobs_in_container() -> list[str]:
        hook = WasbHook(wasb_conn_id="wasb_default")
        # Optional: pass prefix="my-folder/" to narrow the listing
        return hook.get_blobs_list(container_name="your-container-name")

    list_blobs_in_container()
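Downstream tasks typically post-process the returned blob names. A stand-alone sketch of such client-side filtering (plain Python, no Airflow required; the blob names and folder layout are made up):

```python
def filter_blobs(blob_names: list[str], prefix: str = "", suffix: str = "") -> list[str]:
    """Keep only blob names that match an optional prefix and suffix."""
    return [name for name in blob_names if name.startswith(prefix) and name.endswith(suffix)]

# Example listing as the task above might return it (fabricated names).
blobs = ["my-folder/a.csv", "my-folder/b.json", "other/c.csv"]
print(filter_blobs(blobs, prefix="my-folder/", suffix=".csv"))  # ['my-folder/a.csv']
```

Filtering after the call keeps the storage round-trip simple; for large containers, prefer passing `prefix` to the listing call itself so the server narrows the results.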