Azure ML Pipeline Steps

1.62.0 · active · verified Thu Apr 16

The `azureml-pipeline-steps` library, part of the Azure Machine Learning Python SDK, provides classes to define individual computational units (steps) within an Azure ML pipeline. These steps can encapsulate Python scripts, data transfers, AutoML runs, and more, enabling the construction of complex MLOps workflows. The current version is 1.62.0, and it follows the release cadence of the broader Azure ML SDK, with frequent updates.

Install
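The package is published on PyPI; pip is the usual way to install it. Pinning to 1.62.0 matches the version documented on this page (unpinned installs pull the latest release):

```shell
pip install azureml-pipeline-steps==1.62.0
```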

Imports
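These are the imports used by the quickstart below. The guard is only a convenience so the snippet reports, rather than crashes, when the SDK is not installed in the current environment:

```python
try:
    from azureml.core import Workspace, Environment
    from azureml.data.datareference import DataReference
    from azureml.pipeline.core import PipelineData
    from azureml.pipeline.steps import PythonScriptStep
    sdk_status = "available"
except ImportError:
    sdk_status = "missing"

print(f"azureml SDK: {sdk_status}")
```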

Quickstart

This quickstart demonstrates how to define a `PythonScriptStep`, a fundamental component of `azureml-pipeline-steps`. It shows how to link a Python script, pass arguments, specify inputs and outputs using `PipelineData` and `DataReference`, and associate it with a compute target. Note that for actual execution, you'll need to configure an Azure ML `Workspace`, a `ComputeTarget`, and a proper `Environment`.

import os
from azureml.core import Workspace, Environment
from azureml.data.datareference import DataReference
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

# NOTE: For actual execution, ensure Azure ML workspace is configured
# using a config.json file or environment variables for service principal.
# Example: os.environ['AZUREML_ARM_SUBSCRIPTION'] = '...'
# workspace = Workspace.from_config()

# Define a dummy script file (must exist for PythonScriptStep to be valid)
with open("process_data.py", "w") as f:
    f.write("import argparse\n")
    f.write("import os\n")
    f.write("parser = argparse.ArgumentParser()\n")
    f.write("parser.add_argument('--input_data', type=str)\n")
    f.write("parser.add_argument('--output_data', type=str)\n")
    f.write("args = parser.parse_args()\n")
    f.write("print(f'Processing data from {args.input_data} to {args.output_data}')\n")
    f.write("with open(os.path.join(args.output_data, 'output.txt'), 'w') as out_f:\n")
    f.write("    out_f.write('Processed data!')\n")

# Create a simple environment (using a curated environment is recommended in production)
# environment = Environment.from_conda_specification("myenv", "./myenv.yml")
# For quickstart, a basic environment suffices or assume a default compute's environment.

# Define pipeline inputs and outputs.
# Placeholders stand in for the Workspace and ComputeTarget so the example
# runs without a live Azure subscription; replace them with real objects
# loaded from your Azure ML setup for actual execution.

class MockWorkspace:
    def __init__(self):
        self.name = "mock_ws"
        self.subscription_id = "mock_sub_id"
        self.resource_group = "mock_rg"

class MockComputeTarget:
    def __init__(self, name):
        self.name = name

# Real equivalents:
# workspace = Workspace.from_config()
# compute_target = workspace.compute_targets['my-aml-compute']

mock_workspace = MockWorkspace()
mock_compute = MockComputeTarget('cpu-cluster')

# Define the PipelineData output. With no datastore argument, PipelineData
# resolves to the workspace's default datastore at submission time.
processed_data = PipelineData("processed_data")

# Create a PythonScriptStep
step = PythonScriptStep(
    name="process-data-step",
    script_name="process_data.py",
    arguments=["--input_data", "dummy_input_path", "--output_data", processed_data],
    # In a real run, pass inputs backed by an actual datastore, e.g.:
    # inputs=[DataReference(datastore=workspace.get_default_datastore(),
    #                       data_reference_name="dummy_input",
    #                       path_on_datastore="/dummy/input")],
    outputs=[processed_data],
    compute_target=mock_compute.name,  # a compute target name or ComputeTarget object
    source_directory=".",
    # To pin the execution environment, build a RunConfiguration and pass it:
    # from azureml.core.runconfig import RunConfiguration
    # run_config = RunConfiguration()
    # run_config.environment = Environment.from_conda_specification(
    #     name='my_env', file_path='.azureml/my_env.yml')
    # then add: runconfig=run_config
)

print(f"Successfully created step: {step.name}")

# Clean up dummy script
os.remove("process_data.py")
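Before wiring a script into a `PythonScriptStep`, it can help to exercise it locally with the same argument contract the step will use. This sketch uses only the standard library and no Azure connection; it reproduces the quickstart's script body in a temporary directory and invokes it the way the step's `arguments` list would (the `dummy_input_path` value is illustrative):

```python
import os
import subprocess
import sys
import tempfile

# The same script body the quickstart generates, reproduced here so the
# check is self-contained.
SCRIPT = """\
import argparse
import os
parser = argparse.ArgumentParser()
parser.add_argument('--input_data', type=str)
parser.add_argument('--output_data', type=str)
args = parser.parse_args()
print(f'Processing data from {args.input_data} to {args.output_data}')
with open(os.path.join(args.output_data, 'output.txt'), 'w') as out_f:
    out_f.write('Processed data!')
"""

with tempfile.TemporaryDirectory() as tmp:
    script_path = os.path.join(tmp, "process_data.py")
    with open(script_path, "w") as f:
        f.write(SCRIPT)
    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    # Invoke the script exactly as the step's `arguments` list would.
    result = subprocess.run(
        [sys.executable, script_path,
         "--input_data", "dummy_input_path", "--output_data", out_dir],
        capture_output=True, text=True, check=True,
    )
    with open(os.path.join(out_dir, "output.txt")) as f:
        contents = f.read()
    print(result.stdout.strip())
    print(contents)
```

If the script parses its arguments and writes its output here, the same contract will hold when Azure ML mounts real input and output paths.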

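A single step is not runnable on its own: it is assembled into a `Pipeline` and submitted as an experiment run. The sketch below shows that last mile; the calls are commented out because they require a live `Workspace`, and the experiment name is an assumption:

```python
# Assembling and submitting the step (requires a real Workspace):
# from azureml.core import Experiment, Workspace
# from azureml.pipeline.core import Pipeline
#
# workspace = Workspace.from_config()
# pipeline = Pipeline(workspace=workspace, steps=[step])
# run = Experiment(workspace, "pipeline-quickstart").submit(pipeline)
# run.wait_for_completion(show_output=True)

submission_note = "Pipeline(workspace, steps=[step]) -> Experiment.submit"
print(submission_note)
```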