Amazon SageMaker Data Wrangler Library

0.4.3 · active · verified Sun Apr 12

Amazon SageMaker Data Wrangler is a feature within Amazon SageMaker Studio Classic (and now integrated into SageMaker Canvas) that provides a visual interface for end-to-end data preparation for machine learning. It lets users import, prepare, transform, featurize, and analyze data with little to no coding, offering over 300 built-in transformations. Users build 'data flows' graphically, which can then be exported as Python code, SageMaker Pipelines, or Data Wrangler jobs for automated ML workflows. The current PyPI version is 0.4.3; new features and updates typically ship in step with SageMaker Studio and Canvas releases.

Warnings

Install
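The library is published on PyPI under the package name given in the overview above; the SageMaker Python SDK is also needed to run the quickstart below. A minimal install, assuming both package names:

```shell
# Install the Data Wrangler notebook library (package name from the overview above)
pip install sagemaker-datawrangler

# The quickstart below also uses the SageMaker Python SDK
pip install sagemaker
```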

Imports

Quickstart

This quickstart demonstrates how to programmatically execute a data preparation flow defined and exported from SageMaker Data Wrangler. Since Data Wrangler is a UI-driven tool, the Python library `sagemaker-datawrangler` itself doesn't offer direct transformation functions. Instead, you typically export your flow (either as a `.flow` file or a Python script) and then use the SageMaker Python SDK to run it as a SageMaker Processing Job. This example outlines how to set up and run such a processing job, requiring an S3 bucket for inputs, outputs, and the exported flow/script, as well as an appropriate IAM role.

import os

from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
    Processor,
    ScriptProcessor,
)
from sagemaker.s3 import S3Uploader

# This quickstart demonstrates how to execute an *exported* Data Wrangler flow.
# Data Wrangler flows are typically created and exported from the SageMaker Studio UI.
# The 'flow_file.flow' and 'transformation_script.py' are hypothetical outputs from a Data Wrangler export.

# Set up S3 bucket for input/output and flow file
bucket = os.environ.get('SAGEMAKER_BUCKET', 'your-sagemaker-default-bucket') # Replace with your S3 bucket
role = os.environ.get('SAGEMAKER_ROLE', 'arn:aws:iam::123456789012:role/SageMakerExecutionRole') # Replace with your SageMaker execution role

flow_file_s3_uri = S3Uploader.upload('path/to/local/flow_file.flow', f's3://{bucket}/data-wrangler-flows/')
input_data_s3_uri = S3Uploader.upload('path/to/local/input_data.csv', f's3://{bucket}/data-wrangler-inputs/')
output_data_s3_uri = f's3://{bucket}/data-wrangler-outputs/'

# Option 1: Run a Data Wrangler .flow file directly as a Processing Job
# This requires a Data Wrangler-specific container image.
# You would typically get the image URI from SageMaker documentation or your AWS account.
# dw_image_uri = 'your-data-wrangler-processing-image-uri'
# dw_processor = Processor(
#     role=role,
#     image_uri=dw_image_uri,
#     instance_count=1,
#     instance_type='ml.m5.xlarge',
#     max_runtime_in_seconds=3600
# )
# 
# dw_processor.run(
#     inputs=[ProcessingInput(source=input_data_s3_uri, destination='/opt/ml/processing/input')],
#     outputs=[ProcessingOutput(source='/opt/ml/processing/output', destination=output_data_s3_uri)],
#     arguments=['--flow', flow_file_s3_uri, '--output-uri', output_data_s3_uri]
# )
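A `.flow` file is plain JSON, so it can be inspected locally before launching a job. The sketch below uses only the standard library; the `nodes` and `type` keys are assumptions about the export format, not a documented schema:

```python
import json
import os
import tempfile

def summarize_flow(flow_path):
    """List the node types found in a Data Wrangler .flow file.

    The 'nodes' and 'type' keys are assumptions about the export
    format, used here only to illustrate that the file is plain JSON.
    """
    with open(flow_path) as f:
        flow = json.load(f)
    return [node.get('type', 'unknown') for node in flow.get('nodes', [])]

# Try it against a tiny stand-in flow document:
sample_path = os.path.join(tempfile.gettempdir(), 'sample.flow')
with open(sample_path, 'w') as f:
    json.dump({'nodes': [{'type': 'SOURCE'}, {'type': 'TRANSFORM'}]}, f)

print(summarize_flow(sample_path))  # ['SOURCE', 'TRANSFORM']
```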

# Option 2: Run a Python script exported from Data Wrangler as a ScriptProcessor
# This assumes Data Wrangler exported a Python script that encapsulates the transformations.
# You would need to ensure the script is self-contained or has necessary dependencies.

# Placeholder for a Python script that would be generated by Data Wrangler export.
# Example content for 'transformation_script.py':
# import pandas as pd
# import argparse
# import os
# 
# if __name__ == '__main__':
#     parser = argparse.ArgumentParser()
#     parser.add_argument('--input-path', type=str, default='/opt/ml/processing/input/input_data.csv')
#     parser.add_argument('--output-path', type=str, default='/opt/ml/processing/output/transformed_data.csv')
#     args = parser.parse_args()
# 
#     df = pd.read_csv(args.input_path)
#     # Apply your Data Wrangler transformations here, e.g.,
#     df['new_feature'] = df['existing_feature'] * 2
#     df.to_csv(args.output_path, index=False)
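The commented script above assumes pandas is available in the processing container. The same doubling transform can be sketched with only the standard library; the column names here are hypothetical, matching the example above:

```python
import csv
import io

def double_feature(csv_text, source_col='existing_feature', target_col='new_feature'):
    """Add target_col = 2 * source_col to CSV text and return new CSV text.

    A stdlib-only stand-in for the pandas line in the commented script;
    the column names are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        row[target_col] = str(float(row[source_col]) * 2)
        rows.append(row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + [target_col])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

print(double_feature('existing_feature\n1.5\n3\n'))
```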

script_processor = ScriptProcessor(
    role=role,
    # Use a container that has Python (and pandas, if the exported script needs it).
    # One option is a managed framework image, e.g.:
    # image_uri = sagemaker.image_uris.retrieve(framework='sklearn', region='us-east-1', version='1.2-1')
    image_uri='your-sagemaker-processing-python-image-uri',
    command=['python3'],
    instance_count=1,
    instance_type='ml.m5.xlarge',
    max_runtime_in_seconds=3600
)

# Upload the transformation script and capture the S3 URI returned by the uploader
script_s3_uri = S3Uploader.upload(
    'path/to/local/transformation_script.py',
    f's3://{bucket}/data-wrangler-scripts'
)

script_processor.run(
    code=script_s3_uri,
    inputs=[
        ProcessingInput(source=input_data_s3_uri, destination='/opt/ml/processing/input')
    ],
    outputs=[
        ProcessingOutput(source='/opt/ml/processing/output', destination=output_data_s3_uri)
    ],
    arguments=[
        '--input-path', '/opt/ml/processing/input/input_data.csv',
        '--output-path', '/opt/ml/processing/output/transformed_data.csv',
    ]
)

print(f"Data Wrangler processing job launched. Output will be in: {output_data_s3_uri}")
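Because the `arguments` list is only checked inside the container at run time, it can help to dry-run it locally against a parser mirroring the one in the exported script. A sketch, using the same hypothetical flags as the example script above:

```python
import argparse

# Mirror of the hypothetical exported script's CLI, for a local sanity check.
parser = argparse.ArgumentParser()
parser.add_argument('--input-path', type=str, required=True)
parser.add_argument('--output-path', type=str, required=True)

# The same arguments list that would be passed to script_processor.run(...)
job_arguments = [
    '--input-path', '/opt/ml/processing/input/input_data.csv',
    '--output-path', '/opt/ml/processing/output/transformed_data.csv',
]
args = parser.parse_args(job_arguments)
print(args.input_path)  # /opt/ml/processing/input/input_data.csv
```

If a flag is misspelled or missing, `parse_args` fails immediately here instead of minutes later inside the processing job.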
