{"id":4756,"library":"sagemaker-datawrangler","title":"Amazon SageMaker Data Wrangler Library","description":"Amazon SageMaker Data Wrangler is a feature within Amazon SageMaker Studio Classic (and now integrated into SageMaker Canvas) that provides a visual interface for end-to-end data preparation for machine learning. It allows users to import, prepare, transform, featurize, and analyze data with little to no coding, offering over 300 built-in transformations. Users build 'data flows' graphically, which can then be exported as Python code, SageMaker Pipelines, or Data Wrangler jobs for automated ML workflows. The current PyPI version is 0.4.3, and it typically releases new features and updates aligned with SageMaker Studio and Canvas releases.","status":"active","version":"0.4.3","language":"en","source_language":"en","source_url":"https://github.com/aws/sagemaker-datawrangler","tags":["aws","sagemaker","data-preparation","etl","machine-learning","cloud"],"install":[{"cmd":"pip install sagemaker-datawrangler","lang":"bash","label":"Install PyPI package"}],"dependencies":[{"reason":"Required for AWS SDK interactions, especially when running exported Data Wrangler flows or interacting with SageMaker services programmatically.","package":"boto3","optional":false},{"reason":"The SageMaker Python SDK is essential for programmatic interaction with SageMaker Data Wrangler flows, including creating processing jobs from exported flows.","package":"sagemaker","optional":false}],"imports":[{"note":"The `sagemaker-datawrangler` PyPI package primarily provides utility for interacting with Data Wrangler *flows* created in the SageMaker Studio UI, rather than a direct Python API for individual transformations. Transformations are typically part of a generated script/flow that is then executed via a SageMaker Processing Job. 
For executing exported Data Wrangler flows programmatically, you generally use the `sagemaker` SDK's `Processor` or `ScriptProcessor` to run the generated Data Wrangler job or Python script.","wrong":"import sagemaker_datawrangler.transformations as dw_transforms","symbol":"DataWranglerProcessor","correct":"from sagemaker.processing import Processor, ScriptProcessor\n# Or, if using the high-level Data Wrangler-specific processor (depends on export method)\n# from sagemaker.wrangler.processing import DataWranglerProcessor"},"quickstart":{"code":"import os\n\nimport sagemaker  # needed for the fully qualified sagemaker.processing.* references below\nfrom sagemaker.processing import Processor, ScriptProcessor\nfrom sagemaker.s3 import S3Uploader\n\n# This quickstart demonstrates how to execute an *exported* Data Wrangler flow.\n# Data Wrangler flows are typically created and exported from the SageMaker Studio UI.\n# The 'flow_file.flow' and 'transformation_script.py' are hypothetical outputs from a Data Wrangler export.\n\n# Set up S3 bucket for input/output and flow file\nbucket = os.environ.get('SAGEMAKER_BUCKET', 'your-sagemaker-default-bucket')  # Replace with your S3 bucket\nrole = os.environ.get('SAGEMAKER_ROLE', 'arn:aws:iam::123456789012:role/SageMakerExecutionRole')  # Replace with your SageMaker execution role\n\nflow_file_s3_uri = S3Uploader.upload('path/to/local/flow_file.flow', f's3://{bucket}/data-wrangler-flows/')\ninput_data_s3_uri = S3Uploader.upload('path/to/local/input_data.csv', f's3://{bucket}/data-wrangler-inputs/')\noutput_data_s3_uri = f's3://{bucket}/data-wrangler-outputs/'\n\n# Option 1: Run a Data Wrangler .flow file directly as a Processing Job\n# This requires a Data Wrangler-specific container image.\n# You would typically get the image URI from the SageMaker documentation for your region.\n# dw_image_uri = 'your-data-wrangler-processing-image-uri'\n# dw_processor = Processor(\n#     role=role,\n#     image_uri=dw_image_uri,\n#     instance_count=1,\n#     
instance_type='ml.m5.xlarge',\n#     max_runtime_in_seconds=3600\n# )\n# \n# dw_processor.run(\n#     inputs=[sagemaker.processing.ProcessingInput(source=input_data_s3_uri, destination='/opt/ml/processing/input')],\n#     outputs=[sagemaker.processing.ProcessingOutput(source='/opt/ml/processing/output', destination=output_data_s3_uri)],\n#     arguments=['--flow', flow_file_s3_uri, '--output-uri', output_data_s3_uri]\n# )\n\n# Option 2: Run a Python script exported from Data Wrangler with a ScriptProcessor\n# This assumes Data Wrangler exported a Python script that encapsulates the transformations.\n# You would need to ensure the script is self-contained or has the necessary dependencies.\n\n# Placeholder for a Python script that would be generated by a Data Wrangler export.\n# Example content for 'transformation_script.py':\n# import pandas as pd\n# import argparse\n# \n# if __name__ == '__main__':\n#     parser = argparse.ArgumentParser()\n#     parser.add_argument('--input-path', type=str, default='/opt/ml/processing/input/input_data.csv')\n#     parser.add_argument('--output-path', type=str, default='/opt/ml/processing/output/transformed_data.csv')\n#     args = parser.parse_args()\n# \n#     df = pd.read_csv(args.input_path)\n#     # Apply your Data Wrangler transformations here, e.g.,\n#     df['new_feature'] = df['existing_feature'] * 2\n#     df.to_csv(args.output_path, index=False)\n\nscript_processor = ScriptProcessor(\n    role=role,\n    image_uri='your-sagemaker-processing-python-image-uri',  # e.g., resolve one with sagemaker.image_uris.retrieve('sklearn', 'us-east-1', version='1.2-1')\n    command=['python3'],\n    instance_count=1,\n    instance_type='ml.m5.xlarge',\n    max_runtime_in_seconds=3600\n)\n\n# Upload the transformation script; S3Uploader.upload returns the uploaded file's S3 URI\nscript_s3_uri = S3Uploader.upload('path/to/local/transformation_script.py', f's3://{bucket}/data-wrangler-scripts/')\n\nscript_processor.run(\n    code=script_s3_uri,\n    
inputs=[\n        sagemaker.processing.ProcessingInput(source=input_data_s3_uri, destination='/opt/ml/processing/input')\n    ],\n    outputs=[\n        sagemaker.processing.ProcessingOutput(source='/opt/ml/processing/output', destination=output_data_s3_uri)\n    ],\n    arguments=['--input-path', '/opt/ml/processing/input/input_data.csv', '--output-path', '/opt/ml/processing/output/transformed_data.csv']\n)\n\nprint(f\"Data Wrangler processing job launched. Output will be in: {output_data_s3_uri}\")","lang":"python","description":"This quickstart demonstrates how to programmatically execute a data preparation flow defined and exported from SageMaker Data Wrangler. Since Data Wrangler is a UI-driven tool, the Python library `sagemaker-datawrangler` itself doesn't offer direct transformation functions. Instead, you typically export your flow (either as a `.flow` file or a Python script) and then use the SageMaker Python SDK to run it as a SageMaker Processing Job. This example outlines how to set up and run such a processing job, requiring an S3 bucket for inputs, outputs, and the exported flow/script, as well as an appropriate IAM role."},"warnings":[{"fix":"Refer to AWS documentation on 'Shut Down Data Wrangler' and general SageMaker resource management. In SageMaker Studio, check the 'Running instances' and 'Applications' tabs and shut down idle resources.","message":"SageMaker Data Wrangler operations, especially processing jobs, incur AWS costs. Ensure you shut down Data Wrangler instances, Studio applications, and explicitly stop or delete any running processing jobs or endpoints after use to avoid unexpected charges.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Embrace the visual workflow in SageMaker Studio/Canvas. For programmatic execution, export the flow as a SageMaker Processing Job or Python script and use the `sagemaker` SDK to manage its execution. 
Do not expect to import and directly call `sagemaker_datawrangler.transform_data()`.","message":"The `sagemaker-datawrangler` PyPI package provides utilities for interacting with Data Wrangler, but the primary data transformation capabilities reside within the SageMaker Studio/Canvas UI. Direct programmatic access to individual transformation functions (in the style of `pandas` or `numpy`) is not the intended use case. Instead, users define workflows visually and export them for execution.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For the latest features and improved user experience, consider using Data Wrangler within Amazon SageMaker Canvas. Existing workflows in Studio Classic should continue to function but may not receive the newest UI enhancements.","message":"Amazon SageMaker Data Wrangler has been integrated into Amazon SageMaker Canvas. While it still exists in Studio Classic, new features and a natural language interface are primarily being added to the Canvas experience.","severity":"deprecated","affected_versions":"All versions, especially post-2023"},{"fix":"If exporting image transformations, consider creating a destination node in the Data Wrangler flow to specify output, or export the transformations as a Python script to apply them programmatically outside of Data Wrangler. Ensure your SageMaker Studio and Data Wrangler versions are up to date.","message":"When exporting transformed image data from Data Wrangler, the system typically exports the *transformations applied* rather than the actual transformed image files themselves. Attempts to export 'IMAGE' as a file format might result in errors.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}