Apache Airflow Provider for Apache HDFS


A provider package for Apache Airflow that integrates with Apache HDFS, providing hooks and operators for HDFS file operations. The current version, 4.11.5, requires Python >=3.10 and follows the regular Apache Airflow providers release cycle.

pip install apache-airflow-providers-apache-hdfs
error ModuleNotFoundError: No module named 'airflow.hooks.hdfs'
cause Using old import path before provider split.
fix Change the import to `from airflow.providers.apache.hdfs.hooks.hdfs import HDFSHook`.
error ImportError: cannot import name 'HdfsPutFileOperator' from 'airflow.operators.hdfs'
cause Operator moved to provider package.
fix Use `from airflow.providers.apache.hdfs.operators.hdfs import HdfsPutFileOperator`.
error airflow.exceptions.AirflowException: Connection 'hdfs_default' not found
cause No Airflow connection configured for HDFS.
fix Create an Airflow connection with conn_id='hdfs_default', connection type HDFS, and the NameNode host/port details.
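Connections can also be supplied without the UI or metadata database. A minimal sketch using Airflow's `AIRFLOW_CONN_<CONN_ID>` environment-variable convention (the NameNode host and port below are placeholders):

```python
# Sketch: supplying the hdfs_default connection via an environment variable,
# following Airflow's AIRFLOW_CONN_<CONN_ID> naming convention.
# The host and port are placeholder values for illustration.
import os

conn_id = "hdfs_default"
uri = "hdfs://namenode.example.com:8020"  # <conn-type>://<host>:<port>
os.environ[f"AIRFLOW_CONN_{conn_id.upper()}"] = uri

print(os.environ["AIRFLOW_CONN_HDFS_DEFAULT"])
```

Connections defined via environment variables take precedence over entries in the metadata database, which makes this convenient in containerized deployments.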
breaking HDFSHook and operators were moved from `airflow.hooks.hdfs` and `airflow.operators.hdfs` to `airflow.providers.apache.hdfs` in version 2.0.0. Old imports will break.
fix Use new import paths: `from airflow.providers.apache.hdfs.hooks.hdfs import HDFSHook`
gotcha The hook's get_conn() returns an HDFS client that may require explicit authentication; if Kerberos is not configured, it falls back to an unauthenticated connection.
fix Ensure your Airflow connection (and Airflow's core security settings) are configured for Kerberos if your cluster requires it.
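For Kerberized clusters, Airflow's own Kerberos support must be enabled as well. A sketch of the relevant airflow.cfg sections (principal, keytab, and ccache values are placeholders):

```ini
[core]
security = kerberos

[kerberos]
# Placeholder values; the ticket renewer (`airflow kerberos`) reads these
principal = airflow@EXAMPLE.COM
keytab = /etc/security/keytabs/airflow.keytab
ccache = /tmp/airflow_krb5_ccache
```

Run the `airflow kerberos` ticket renewer alongside the scheduler so that tickets stay valid for long-running deployments.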
breaking The snakebite-based HDFSHook and the HdfsSensor family were removed in provider version 4.0.0, since snakebite has no Python 3 support.
fix Use `WebHDFSHook` and `WebHdfsSensor` from `airflow.providers.apache.hdfs`, or implement custom logic.

Basic usage of WebHDFSHook to list files (in provider 4.x, WebHDFSHook is the supported hook)

from airflow.providers.apache.hdfs.hooks.webhdfs import WebHDFSHook

hook = WebHDFSHook(webhdfs_conn_id='webhdfs_default')
# get_conn() returns an hdfs-library client (InsecureClient, or a Kerberos client when security is enabled)
client = hook.get_conn()
# List entries in a directory
files = client.list('/tmp')
print(files)
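Under the hood, the web-based hook talks to the NameNode's WebHDFS REST API. A minimal sketch of the URL targeted for a directory listing (host and port are illustrative; 9870 is the Hadoop 3 NameNode HTTP default):

```python
# Sketch: the WebHDFS REST endpoint behind a directory listing.
# Host and port are illustrative placeholders.
def webhdfs_url(host: str, port: int, path: str, op: str) -> str:
    # WebHDFS prefixes every path with /webhdfs/v1 and selects the action via ?op=
    return f"http://{host}:{port}/webhdfs/v1{path}?op={op}"

print(webhdfs_url("namenode.example.com", 9870, "/tmp", "LISTSTATUS"))
# http://namenode.example.com:9870/webhdfs/v1/tmp?op=LISTSTATUS
```

This is why the connection only needs a host and HTTP port: each file operation maps to a single `op=` query parameter (LISTSTATUS, OPEN, MKDIRS, and so on).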