Azure ML Pipeline Steps
The `azureml-pipeline-steps` library, part of the Azure Machine Learning Python SDK, provides classes to define individual computational units (steps) within an Azure ML pipeline. These steps can encapsulate Python scripts, data transfers, AutoML runs, and more, enabling the construction of complex MLOps workflows. The current version is 1.62.0, and it follows the release cadence of the broader Azure ML SDK, with frequent updates.
Common errors
- ModuleNotFoundError: No module named 'azureml.core'
  cause: The `azureml-core` package, which provides fundamental Azure ML SDK functionality such as `Workspace` and `Environment`, is not installed.
  fix: Install the `azureml-core` package: `pip install azureml-core`.
- azureml.exceptions.UserErrorException: Workspace not found for subscription ID...
  cause: The Azure ML Workspace could not be found or authenticated. This usually indicates an incorrect subscription, resource group, or workspace name, or missing authentication credentials.
  fix: Verify that your `config.json` is present and correct, or provide explicit `subscription_id`, `resource_group`, and `workspace_name` parameters along with an `auth` object when calling `Workspace.get()` or `Workspace.from_config()`.
- ScriptExecutionException: User program failed with exit code 1
  cause: This generic error indicates a failure within the Python script executed by the pipeline step. Common causes include missing dependencies in the step's environment, errors in script logic, or incorrect input/output paths.
  fix: Examine the detailed logs of the failed step in Azure ML Studio to identify the specific error message from your script. Ensure all required packages are specified in the step's `Environment` definition and that script paths and arguments are correct.
- UserErrorException: The specified environment is not found or cannot be created.
  cause: The `Environment` object specified for the pipeline step either doesn't exist, has an invalid definition (e.g., an incorrect Conda dependencies file), or lacks the permissions needed to be created or accessed.
  fix: Verify the `Environment` object's name and definition. If creating from a file, ensure the file path is correct and the Conda/Docker specification is valid. Consider using curated environments for simplicity where applicable.
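The `ScriptExecutionException` fix above is easier to apply when the step script itself fails fast with an explicit message. A minimal, locally runnable skeleton of such a script (all names are illustrative, not part of any SDK):

```python
import argparse
import os
import sys

def main(argv=None):
    """Entry point for a hypothetical pipeline-step script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", type=str, required=True)
    parser.add_argument("--output_data", type=str, required=True)
    args = parser.parse_args(argv)

    if not os.path.exists(args.input_data):
        # This message is what you would find in the failed step's logs
        # in Azure ML Studio, alongside the nonzero exit code.
        print(f"ERROR: input path does not exist: {args.input_data}", file=sys.stderr)
        return 1

    # Output paths handed to a step are plain strings and the directory
    # may not exist yet, so create it before writing.
    os.makedirs(args.output_data, exist_ok=True)
    with open(os.path.join(args.output_data, "output.txt"), "w") as out_f:
        out_f.write("Processed data!")
    return 0

# In a real step script: sys.exit(main())
```

Returning a nonzero exit code (rather than swallowing the error) is what makes the failure visible as a failed step instead of a silently wrong result.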
Warnings
- gotcha: Authentication is crucial and often a source of failure. Ensure your local environment is authenticated to Azure (e.g., via `az login`), or provide explicit credentials using `ServicePrincipalAuthentication` or `InteractiveLoginAuthentication` when instantiating `Workspace`.
- deprecated: Older methods of defining execution environments, such as `RunConfiguration` with directly attached `CondaDependencies`, have largely been deprecated in favor of explicit `Environment` objects. Mixing old and new approaches can lead to errors.
- gotcha: Data transfer between steps via `PipelineData` or `DataReference` requires careful path management. Incorrectly specified paths, or attempting to access data before it's materialized, can cause `FileNotFoundError` or `PathNotFoundException` within your step's script.
- breaking: The `azureml-sdk` components, including `azureml-pipeline-steps`, often introduce breaking changes between major or significant minor versions, particularly around environment definitions, data APIs, and compute targets. Incompatible `azureml-core` and `azureml-pipeline-steps` versions can also cause issues.
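The deprecation note above points toward explicit `Environment` objects. A minimal conda specification that `Environment.from_conda_specification` can consume might look like the following (file name and package list are illustrative):

```yaml
# my_env.yml: hypothetical conda specification for an explicit Environment,
# loaded via Environment.from_conda_specification(name="my_env", file_path="my_env.yml")
name: my_env
dependencies:
  - python=3.8
  - pip
  - pip:
      - azureml-defaults   # run-time hooks expected by remote pipeline steps
      - pandas             # illustrative script dependency
```

Pinning the Python version and key packages in the specification keeps remote step environments reproducible across runs.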
Install
- pip install azureml-pipeline-steps azureml-core
Imports
- PythonScriptStep
from azureml.pipeline.steps import PythonScriptStep
- DataTransferStep
from azureml.pipeline.steps import DataTransferStep
Quickstart
import os
from azureml.core import Workspace, Environment
from azureml.data.datareference import DataReference
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep
# NOTE: For actual execution, ensure Azure ML workspace is configured
# using a config.json file or environment variables for service principal.
# Example: os.environ['AZUREML_ARM_SUBSCRIPTION'] = '...'
# workspace = Workspace.from_config()
# Define a dummy script file (must exist for PythonScriptStep to be valid)
with open("process_data.py", "w") as f:
    f.write("import argparse\n")
    f.write("import os\n")
    f.write("parser = argparse.ArgumentParser()\n")
    f.write("parser.add_argument('--input_data', type=str)\n")
    f.write("parser.add_argument('--output_data', type=str)\n")
    f.write("args = parser.parse_args()\n")
    f.write("print(f'Processing data from {args.input_data} to {args.output_data}')\n")
    f.write("os.makedirs(args.output_data, exist_ok=True)\n")
    f.write("with open(os.path.join(args.output_data, 'output.txt'), 'w') as out_f:\n")
    f.write("    out_f.write('Processed data!')\n")
# Create a simple environment (using a curated environment is recommended in production)
# environment = Environment.from_conda_specification("myenv", "./myenv.yml")
# For quickstart, a basic environment suffices or assume a default compute's environment.
# Define pipeline inputs and outputs
# Using dummy placeholder for workspace and compute target for demonstration
# In a real scenario, you'd load these from your Azure ML setup
# Placeholder for Workspace and Compute
class MockWorkspace:
    def __init__(self):
        self.name = "mock_ws"
        self.subscription_id = "mock_sub_id"
        self.resource_group = "mock_rg"

class MockComputeTarget:
    def __init__(self, name):
        self.name = name
# Use mock objects for demonstration, replace with actual objects for execution
# workspace = Workspace.from_config() # Real workspace loading
# compute_target = workspace.compute_targets['my-aml-compute'] # Real compute target
mock_workspace = MockWorkspace()
mock_compute = MockComputeTarget('cpu-cluster')
# Define the PipelineData output; with no datastore given, the workspace's
# default datastore is resolved at submission time
processed_data = PipelineData("processed_data")

# Build a run configuration from an explicit Environment (preferred over the
# deprecated direct-CondaDependencies style)
from azureml.core.runconfig import RunConfiguration
run_config = RunConfiguration()
if os.path.exists('.azureml/my_env.yml'):
    run_config.environment = Environment.from_conda_specification(
        name='my_env', file_path='.azureml/my_env.yml')

# Create a PythonScriptStep
step = PythonScriptStep(
    name="process-data-step",
    script_name="process_data.py",
    arguments=["--input_data", "dummy_input_path", "--output_data", processed_data],
    # With a real workspace, wire up an input, e.g.:
    # inputs=[DataReference(datastore=workspace.get_default_datastore(),
    #                       data_reference_name="dummy_input",
    #                       path_on_datastore="/dummy/input")],
    outputs=[processed_data],
    compute_target=mock_compute.name,  # a compute target name string is accepted
    source_directory=".",
    runconfig=run_config,
)
print(f"Successfully created step: {step.name}")
# Clean up dummy script
os.remove("process_data.py")
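With a real workspace and compute target, the step above could be assembled into a pipeline and submitted. The following sketch is guarded so it will not run outside a configured Azure ML environment; the experiment name is illustrative:

```python
# Assemble and submit the step as a pipeline (requires config.json and a
# real compute target, so the submission path is guarded).
SUBMIT = False  # flip to True in a configured Azure ML environment

if SUBMIT:
    from azureml.core import Experiment, Workspace
    from azureml.pipeline.core import Pipeline

    workspace = Workspace.from_config()
    pipeline = Pipeline(workspace=workspace, steps=[step])
    run = Experiment(workspace, "pipeline-steps-quickstart").submit(pipeline)
    run.wait_for_completion(show_output=True)
```

`Pipeline` validates step wiring (inputs, outputs, compute) at build/submit time, which is why many misconfigurations only surface here rather than at step construction.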