Amazon SageMaker Data Wrangler Library
Amazon SageMaker Data Wrangler is a feature within Amazon SageMaker Studio Classic (and now integrated into SageMaker Canvas) that provides a visual interface for end-to-end data preparation for machine learning. It allows users to import, prepare, transform, featurize, and analyze data with little to no coding, offering over 300 built-in transformations. Users build 'data flows' graphically, which can then be exported as Python code, SageMaker Pipelines, or Data Wrangler jobs for automated ML workflows. The current PyPI version is 0.4.3, and it typically releases new features and updates aligned with SageMaker Studio and Canvas releases.
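The `.flow` files produced by the visual editor are JSON documents describing the nodes of the data flow. As a minimal illustration (the node fields shown here are simplified assumptions; real exported files contain many more version-dependent fields), a flow file can be inspected with nothing but the standard library:

```python
import json
import os
import tempfile

# Hypothetical minimal flow definition; real .flow files exported from the
# Studio/Canvas UI have a richer, version-dependent schema.
flow = {
    "metadata": {"version": 1},
    "nodes": [
        {"node_id": "source-1", "type": "SOURCE"},
        {"node_id": "transform-1", "type": "TRANSFORM"},
        {"node_id": "transform-2", "type": "TRANSFORM"},
    ],
}

# Round-trip through disk, as you would with an exported flow_file.flow.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "flow_file.flow")
    with open(path, "w") as f:
        json.dump(flow, f)
    with open(path) as f:
        loaded = json.load(f)

# Count the transform steps the flow would apply.
transform_count = sum(1 for n in loaded["nodes"] if n["type"] == "TRANSFORM")
print(f"{transform_count} transform step(s)")  # → 2 transform step(s)
```

Treating the flow file as plain JSON like this is handy for quick audits (e.g., counting steps or diffing two exports), but the file should only ever be written by the Data Wrangler UI itself.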
Warnings
- gotcha SageMaker Data Wrangler operations, especially processing jobs, incur AWS costs. Ensure you shut down Data Wrangler instances, Studio applications, and explicitly stop or delete any running processing jobs or endpoints after use to avoid unexpected charges.
- gotcha The `sagemaker-datawrangler` PyPI package provides utilities for interacting with Data Wrangler, but the primary data transformation capabilities reside in the SageMaker Studio/Canvas UI. The package does not expose individual transformations as callable functions the way `pandas` or `numpy` do; instead, you define workflows visually and export them for execution.
- deprecated Amazon SageMaker Data Wrangler has been integrated into Amazon SageMaker Canvas. While it still exists in Studio Classic, new features and a natural language interface are primarily being added to the Canvas experience.
- gotcha When exporting transformed image data from Data Wrangler, the system typically exports the *transformations applied* rather than the actual transformed image files themselves. Attempts to export 'IMAGE' as a file format might result in errors.
Install
pip install sagemaker-datawrangler
Imports
- DataWranglerProcessor
from sagemaker.processing import Processor, ScriptProcessor
# Or, if using the high-level Data Wrangler-specific processor (depends on export method):
# from sagemaker.wrangler.processing import DataWranglerProcessor
Quickstart
import os

import sagemaker
from sagemaker.processing import Processor, ScriptProcessor
from sagemaker.s3 import S3Uploader
# This quickstart demonstrates how to execute an *exported* Data Wrangler flow.
# Data Wrangler flows are typically created and exported from the SageMaker Studio UI.
# The 'flow_file.flow' and 'transformation_script.py' are hypothetical outputs from a Data Wrangler export.
# Set up S3 bucket for input/output and flow file
bucket = os.environ.get('SAGEMAKER_BUCKET', 'your-sagemaker-default-bucket') # Replace with your S3 bucket
role = os.environ.get('SAGEMAKER_ROLE', 'arn:aws:iam::123456789012:role/SageMakerExecutionRole') # Replace with your SageMaker execution role
flow_file_s3_uri = S3Uploader.upload('path/to/local/flow_file.flow', f's3://{bucket}/data-wrangler-flows/')
input_data_s3_uri = S3Uploader.upload('path/to/local/input_data.csv', f's3://{bucket}/data-wrangler-inputs/')
output_data_s3_uri = f's3://{bucket}/data-wrangler-outputs/'
# Option 1: Run a Data Wrangler .flow file directly as a Processing Job
# This requires a Data Wrangler-specific container image.
# You would typically get the image URI from SageMaker documentation or your AWS account.
# dw_image_uri = 'your-data-wrangler-processing-image-uri'
# dw_processor = Processor(
#     role=role,
#     image_uri=dw_image_uri,
#     instance_count=1,
#     instance_type='ml.m5.xlarge',
#     max_runtime_in_seconds=3600
# )
#
# dw_processor.run(
#     inputs=[sagemaker.processing.ProcessingInput(source=input_data_s3_uri, destination='/opt/ml/processing/input')],
#     outputs=[sagemaker.processing.ProcessingOutput(source='/opt/ml/processing/output', destination=output_data_s3_uri)],
#     arguments=['--flow', flow_file_s3_uri, '--output-uri', output_data_s3_uri]
# )
# Option 2: Run a Python script exported from Data Wrangler as a ScriptProcessor
# This assumes Data Wrangler exported a Python script that encapsulates the transformations.
# You would need to ensure the script is self-contained or has necessary dependencies.
# Placeholder for a Python script that would be generated by Data Wrangler export.
# Example content for 'transformation_script.py':
# import argparse
# import pandas as pd
#
# if __name__ == '__main__':
#     parser = argparse.ArgumentParser()
#     parser.add_argument('--input-path', type=str, default='/opt/ml/processing/input/input_data.csv')
#     parser.add_argument('--output-path', type=str, default='/opt/ml/processing/output/transformed_data.csv')
#     args = parser.parse_args()
#
#     df = pd.read_csv(args.input_path)
#     # Apply your Data Wrangler transformations here, e.g.:
#     df['new_feature'] = df['existing_feature'] * 2
#     df.to_csv(args.output_path, index=False)
script_processor = ScriptProcessor(
    role=role,
    image_uri='your-sagemaker-processing-python-image-uri',  # e.g., the scikit-learn image for your region, such as '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3' (account and tag vary by region and version)
    command=['python3'],
    instance_count=1,
    instance_type='ml.m5.xlarge',
    max_runtime_in_seconds=3600
)
# Upload the transformation script
S3Uploader.upload('path/to/local/transformation_script.py', f's3://{bucket}/data-wrangler-scripts/')
script_processor.run(
    code=f's3://{bucket}/data-wrangler-scripts/transformation_script.py',
    inputs=[
        sagemaker.processing.ProcessingInput(source=input_data_s3_uri, destination='/opt/ml/processing/input')
    ],
    outputs=[
        sagemaker.processing.ProcessingOutput(source='/opt/ml/processing/output', destination=output_data_s3_uri)
    ],
    arguments=['--input-path', '/opt/ml/processing/input/input_data.csv', '--output-path', '/opt/ml/processing/output/transformed_data.csv']
)
print(f"Data Wrangler processing job launched. Output will be in: {output_data_s3_uri}")
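Because processing jobs bill per instance-minute (see the cost warning above), it is worth dry-running the exported script locally with the same `--input-path`/`--output-path` contract before launching an `ml.m5.xlarge`. The sketch below uses a stdlib-only stand-in for the exported script (the `existing_feature` column and doubling logic are the hypothetical example from the comments above, reimplemented with `csv` so the dry run does not require pandas):

```python
import csv
import os
import subprocess
import sys
import tempfile
import textwrap

# Stdlib stand-in for the hypothetical exported transformation_script.py:
# same CLI contract, same new_feature = existing_feature * 2 logic.
script = textwrap.dedent("""
    import argparse, csv

    parser = argparse.ArgumentParser()
    parser.add_argument('--input-path')
    parser.add_argument('--output-path')
    args = parser.parse_args()

    with open(args.input_path) as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row['new_feature'] = str(float(row['existing_feature']) * 2)
    with open(args.output_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['existing_feature', 'new_feature'])
        writer.writeheader()
        writer.writerows(rows)
""")

with tempfile.TemporaryDirectory() as tmp:
    # Mirror the container's input/output file layout locally.
    script_path = os.path.join(tmp, "transformation_script.py")
    in_path = os.path.join(tmp, "input_data.csv")
    out_path = os.path.join(tmp, "transformed_data.csv")
    with open(script_path, "w") as f:
        f.write(script)
    with open(in_path, "w") as f:
        f.write("existing_feature\n1\n2\n")

    # Invoke the script exactly as the processing job would.
    subprocess.run(
        [sys.executable, script_path, "--input-path", in_path, "--output-path", out_path],
        check=True,
    )
    with open(out_path) as f:
        result = f.read()

print(result)
```

If the local run produces the expected columns, the same script and arguments can be handed to `ScriptProcessor.run()` with reasonable confidence; in the real job, the paths map to `/opt/ml/processing/input` and `/opt/ml/processing/output` as shown in the quickstart.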