SageMaker Data Insights

0.4.0 · active · verified Sun Apr 12

SageMaker Data Insights (current version 0.4.0) is an open-source Python library from AWS for analyzing and understanding the data behind SageMaker workloads. It provides utilities for extracting insights from datasets used in SageMaker Labeling Jobs and other data processing tasks, which helps surface potential data quality issues and patterns. The library is still at a 0.x version, so releases can be frequent and the API may change between minor versions.

Warnings

Install

Imports

Quickstart

This example initializes a SageMaker session, instantiates `LabelingJobDataInsights` with a user-provided S3 URI and IAM role, then retrieves and prints a basic summary of the data insights. It requires configured AWS credentials and a valid S3 path to a labeling job manifest file.

import os

import boto3
import sagemaker
from sagemaker_data_insights.labeling_job.data_insights import LabelingJobDataInsights

# Initialize SageMaker session using default boto3 credential chain
# Ensure your AWS credentials are configured (e.g., via AWS CLI, environment variables)
# The region can be specified if not configured globally.
region = os.environ.get('AWS_REGION', 'us-east-1')
boto_session = boto3.Session(region_name=region)
sagemaker_session = sagemaker.Session(boto_session=boto_session)

# Define placeholders for your specific S3 input data and IAM role
# Replace with actual S3 URI to your labeling job manifest file
input_s3_uri = os.environ.get(
    'SAGEMAKER_DATA_INSIGHTS_INPUT_URI',
    's3://your-bucket-name/path/to/manifest-file/output.manifest'
)
# Replace with the ARN of an IAM role with S3 read/write and SageMaker permissions
role_arn = os.environ.get(
    'SAGEMAKER_EXECUTION_ROLE_ARN',
    'arn:aws:iam::123456789012:role/YourSageMakerExecutionRole'  # Placeholder, replace with actual role
)

print(f"Analyzing data from: {input_s3_uri}")
print(f"Using SageMaker execution role: {role_arn}")

try:
    # Instantiate the insights calculator for a Labeling Job
    insights_calculator = LabelingJobDataInsights(
        sagemaker_session=sagemaker_session,
        s3_input_uri=input_s3_uri,
        role_arn=role_arn,
        number_of_samples=10  # Use a small number of samples for a quick demo
    )

    # Get insights (this performs data sampling and analysis).
    # NOTE: This call requires a valid S3 URI, a valid role, and readable data.
    # It may take some time to run and will likely fail if the placeholders are not replaced.
    print("Attempting to get insights (this may take a moment)...")
    insights_result = insights_calculator.get_insights()

    print("\n--- Data Insights Summary ---")
    print(f"Total entries analyzed: {insights_result.number_of_samples}")
    if insights_result.annotation_label_distribution:
        print("Annotation Label Distribution:")
        for label, count in insights_result.annotation_label_distribution.items():
            print(f"  - {label}: {count}")
    else:
        print("No annotation label distribution found (check input data/sampling).")

except Exception as e:
    print(f"\nAn error occurred during insights calculation: {e}")
    print("Please ensure your AWS credentials, S3 input URI, and IAM role are correctly configured and point to valid data.")
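
Because the quickstart falls back to placeholder values when the environment variables are unset, it can help to check the inputs locally before making any AWS call. The following is a minimal sketch using only the standard library. The helper `preflight_problems`, the two regular expressions, and the `PLACEHOLDER_MARKERS` tuple are hypothetical names introduced for illustration and are not part of `sagemaker_data_insights`. The patterns are deliberately loose approximations of S3 URI and IAM role ARN syntax, not a full validation.

```python
import re
from typing import List

# Fragments of the quickstart's default values (hypothetical helper data).
PLACEHOLDER_MARKERS = (
    "your-bucket-name",
    "123456789012",
    "YourSageMakerExecutionRole",
)

# Loose approximations of the expected shapes, not complete validators.
S3_URI_PATTERN = re.compile(r"^s3://[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]/.+$")
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")


def preflight_problems(s3_uri: str, role_arn: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not S3_URI_PATTERN.match(s3_uri):
        problems.append(f"not an s3://bucket/key URI: {s3_uri}")
    if not ROLE_ARN_PATTERN.match(role_arn):
        problems.append(f"not an IAM role ARN: {role_arn}")
    # Flag values that still contain the quickstart placeholders.
    for marker in PLACEHOLDER_MARKERS:
        if marker in s3_uri or marker in role_arn:
            problems.append(f"placeholder value still present: {marker}")
    return problems


if __name__ == "__main__":
    issues = preflight_problems(
        "s3://your-bucket-name/path/to/manifest-file/output.manifest",
        "arn:aws:iam::123456789012:role/YourSageMakerExecutionRole",
    )
    for issue in issues:
        print(f"- {issue}")
```

Calling `preflight_problems(input_s3_uri, role_arn)` before constructing the insights object lets the script stop early with a specific message instead of relying on the generic `except` branch. The check does not confirm that the object exists or that the role grants access; only the AWS call itself can do that.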
