Azure ML Data Preparation SDK
The `azureml-dataprep` library is part of the Azure ML Python SDK v1 and provides capabilities to load, transform, and write data for machine learning workflows in that ecosystem. As of version 5.4.3, it is primarily used for creating `Dataset` objects that integrate with Azure ML workspaces (v1). It still receives maintenance updates, but for new development it is largely superseded by the Azure ML SDK v2 (`azure-ai-ml`), which uses a different data handling model.
Warnings
- deprecated The `azureml-dataprep` library is part of the Azure ML Python SDK v1, which is largely superseded by the v2 SDK (`azure-ai-ml`) for new development. Microsoft recommends migrating to the v2 SDK for modern Azure ML workflows.
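- gotcha Operations on `Dataflow` objects are lazily evaluated. Transformations are not applied until an action such as `to_pandas_dataframe()`, `head()`, or `run_local()` is called, so an error in an early step may surface only when the action runs. Note that `write_to_csv()` only adds a write step; the files are written when the resulting `Dataflow` is executed, for example with `run_local()`.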
- gotcha `azureml-dataprep` is tightly coupled with `azureml-core` (v1 SDK) and may have version conflicts if other `azureml` packages, especially from the v2 SDK (`azure-ai-ml`), are installed in the same environment.
- breaking Requires Python 3.8 or higher. Older Python versions (e.g., 3.7) are not supported by recent `azureml-dataprep` releases.
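The lazy-evaluation gotcha above is easier to see in a toy model. The sketch below is plain Python and does not use `azureml-dataprep`. `LazyFlow` and its `to_list()` action are hypothetical names invented for illustration, and the real library may handle a missing column differently. It only shows the general pattern: a transformation call records a step, and a mistake in that step stays hidden until an action runs the pipeline.

```python
class LazyFlow:
    """Toy stand-in for a lazily evaluated pipeline (hypothetical, not the library's class)."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def keep_columns(self, columns):
        # Only record the step; nothing is validated or executed here.
        def step(rows):
            return [{c: row[c] for c in columns} for row in rows]

        return LazyFlow(self._rows, self._steps + (step,))

    def to_list(self):
        # The "action": every recorded step runs now, in order.
        rows = self._rows
        for step in self._steps:
            rows = step(rows)
        return rows


rows = [{"id": 1, "name": "apple"}, {"id": 2, "name": "banana"}]

good = LazyFlow(rows).keep_columns(["name"])
bad = LazyFlow(rows).keep_columns(["nmae"])  # typo in the column name, no error yet

good_result = good.to_list()

error_seen = None
try:
    bad.to_list()  # the typo is only detected here, when the action runs
except KeyError as exc:
    error_seen = exc

print(good_result)
print("error surfaced at action time:", repr(error_seen))
```

With the real library, a practical consequence is to call a cheap action such as `head()` soon after each new transformation while developing, so that problems show up close to the step that caused them.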
Install
pip install azureml-dataprep
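Before or after installing, it can help to check the environment against the warnings above. The following is a minimal sketch using only the standard library. The Python 3.8 floor and the package names come from the warnings in this document. The helper names `installed_version` and `check_environment` are made up for this example, and the check only reports that v1 and v2 packages coexist, which is not by itself proof of a conflict.

```python
import sys

# Package names as given in the warnings above.
V1_PACKAGES = ("azureml-dataprep", "azureml-core")
V2_PACKAGES = ("azure-ai-ml",)


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    # Imported here so the Python-version check still works on old interpreters.
    from importlib import metadata

    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(min_python=(3, 8)):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        problems.append(
            "Python %d.%d found, but %d.%d or higher is required"
            % (sys.version_info[0], sys.version_info[1], min_python[0], min_python[1])
        )
        return problems  # importlib.metadata may be unavailable, so stop here

    v1 = [n for n in V1_PACKAGES if installed_version(n) is not None]
    v2 = [n for n in V2_PACKAGES if installed_version(n) is not None]
    if v1 and v2:
        problems.append(
            "v1 (%s) and v2 (%s) packages share this environment"
            % (", ".join(v1), ", ".join(v2))
        )
    return problems


problems = check_environment()
for line in problems or ["no problems detected"]:
    print(line)
```

If the check reports mixed packages, a separate virtual environment for v1 work is the simplest way to avoid the version conflicts described above.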
Imports
- Dataflow
from azureml.dataprep import Dataflow
- read_csv
import azureml.dataprep as dprep
dprep.read_csv(...)
Quickstart
import azureml.dataprep as dprep
import pandas as pd
import os
# Create a dummy CSV file for demonstration
file_path = "quickstart_data.csv"
with open(file_path, "w") as f:
    f.write("id,name,value\n")
    f.write("1,apple,100\n")
    f.write("2,banana,200\n")
    f.write("3,orange,150\n")

try:
    # Read the CSV into a Dataflow object
    dataflow = dprep.read_csv(file_path)
    print("Original Dataflow (first 5 rows):")
    print(dataflow.head(5))

    # Perform a simple transformation: select specific columns
    transformed_dataflow = dataflow.keep_columns(columns=['name', 'value'])
    print("\nTransformed Dataflow (name, value columns, first 5 rows):")
    print(transformed_dataflow.head(5))

    # Convert the Dataflow to a Pandas DataFrame for local processing
    pandas_df = transformed_dataflow.to_pandas_dataframe()
    print("\nConverted to Pandas DataFrame:")
    print(pandas_df)
finally:
    # Clean up the dummy file
    if os.path.exists(file_path):
        os.remove(file_path)