Azure Storage File DataLake Client Library
The Microsoft Azure File DataLake Storage client library for Python provides APIs for interacting with Azure Data Lake Storage Gen2, which adds a hierarchical namespace on top of Azure Blob Storage. The library lets developers manage file systems, directories, and files, including creating, renaming, and deleting them and managing access control lists (ACLs). Azure SDKs typically receive frequent updates, often monthly or bi-monthly, delivering new features, bug fixes, and alignment with new service API versions.
Warnings
- breaking This library (`azure-storage-file-datalake`) is specifically for Azure Data Lake Storage Gen2. Its API differs significantly from the older Gen1 Data Lake Store library (`azure-datalake-store`) and from general Blob Storage (`azure-storage-blob`) when performing hierarchical operations. Migrating from Gen1 libraries, or from code that relies solely on Blob APIs against a Gen2 account, may require substantial changes to take full advantage of hierarchical namespace capabilities.
- gotcha Using account keys or connection strings directly for authentication is less secure and not recommended for production environments. These methods embed credentials in code or environment variables, posing a security risk if compromised. Prefer token-based credentials such as `DefaultAzureCredential` from `azure-identity`.
- gotcha When writing data to a file with `DataLakeFileClient.append_data()`, the data is only staged; it is not committed or visible until `DataLakeFileClient.flush_data()` is explicitly called with the total length of data written so far. Forgetting to call `flush_data()` results in an empty or incomplete file.
- gotcha Azure Data Lake Storage Gen2 supports multi-protocol access, allowing both Blob APIs and Data Lake APIs. However, for operations unique to hierarchical namespaces (like atomic directory renames, creating directories, and fine-grained ACLs), it is crucial to use the `azure-storage-file-datalake` APIs. Using Blob APIs for these specific tasks may result in incorrect behavior, errors, or a lack of functionality.
Install
- pip install azure-storage-file-datalake
Imports
- DataLakeServiceClient
from azure.storage.filedatalake import DataLakeServiceClient
- FileSystemClient
from azure.storage.filedatalake import FileSystemClient
- DataLakeDirectoryClient
from azure.storage.filedatalake import DataLakeDirectoryClient
- DataLakeFileClient
from azure.storage.filedatalake import DataLakeFileClient
- DefaultAzureCredential
from azure.identity import DefaultAzureCredential
Quickstart
import os
from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import DefaultAzureCredential
# Ensure environment variables are set for authentication and account URL:
# AZURE_STORAGE_ACCOUNT_NAME: Name of your Azure Data Lake Storage Gen2 account
# AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET for DefaultAzureCredential
try:
    account_name = os.environ.get("AZURE_STORAGE_ACCOUNT_NAME")
    if not account_name:
        raise ValueError("AZURE_STORAGE_ACCOUNT_NAME environment variable not set.")

    # Construct the account URL for Data Lake Storage Gen2.
    # Note: .dfs.core.windows.net is the Data Lake Storage Gen2 endpoint.
    account_url = f"https://{account_name}.dfs.core.windows.net"

    # Authenticate using DefaultAzureCredential (recommended for production).
    # DefaultAzureCredential tries several methods in turn: environment
    # variables, managed identity, Azure CLI, and more.
    credential = DefaultAzureCredential()

    # Create a DataLakeServiceClient
    service_client = DataLakeServiceClient(account_url, credential=credential)

    print(f"Listing file systems in account: {account_name}")
    for fs in service_client.list_file_systems():
        print(f"- {fs.name}")
except Exception as e:
    print(f"An error occurred: {e}")
    print("Please ensure AZURE_STORAGE_ACCOUNT_NAME and authentication credentials "
          "(e.g., AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET for a "
          "service principal) are correctly configured.")