SageMaker Data Insights
The SageMaker Data Insights library (current version 0.4.0) is an open-source Python library from AWS for analyzing and understanding the data used in SageMaker workloads. It provides utilities for extracting insights from datasets used in SageMaker Labeling Jobs and other data processing tasks, which helps surface potential data quality issues and patterns. Because the library is still in the 0.x series, releases are frequent and the API may change between minor versions.
Warnings
- breaking Prior to version 1.0.0, API methods, class constructors, and parameter names may change between minor versions (e.g., 0.3.0 to 0.4.0) without explicit deprecation warnings.
- gotcha The execution role used by the SageMaker session (or provided to data insights classes) must have appropriate S3 read/write permissions for the input/output data and permissions to interact with SageMaker resources (e.g., Labeling Jobs, Feature Store).
- gotcha Input S3 URIs must point to valid data in the expected format (e.g., JSONLines for Labeling Jobs, CSV/Parquet for Tabular Data), and the data must be in the same AWS region as your SageMaker session.
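The format and region warnings above can be checked before any analysis is started. The following is a minimal sketch, not part of the library: the helper names and the suffix mapping are hypothetical, and the only AWS call used is the boto3 S3 client method `get_bucket_location`, with the client passed in by the caller.

```python
from urllib.parse import urlparse

# Hypothetical mapping of workload to accepted file suffixes, based on the
# formats named in the warning above; adjust it to your own data.
EXPECTED_SUFFIXES = {
    "labeling_job": (".manifest", ".jsonl"),
    "tabular": (".csv", ".parquet"),
}


def parse_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not a valid S3 object URI: {uri!r}")
    return parsed.netloc, key


def has_expected_suffix(uri, workload):
    """Return True if the object key ends with a suffix expected for the workload."""
    _, key = parse_s3_uri(uri)
    return key.lower().endswith(EXPECTED_SUFFIXES[workload])


def bucket_region(s3_client, bucket):
    """Return the bucket's region using a boto3 S3 client supplied by the caller."""
    location = s3_client.get_bucket_location(Bucket=bucket)["LocationConstraint"]
    # S3 reports buckets in us-east-1 with a LocationConstraint of None.
    return location or "us-east-1"
```

With a real session, `bucket_region(boto3.client("s3"), bucket)` can be compared against the session's region before the input URI is handed to an insights class.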
Install
pip install sagemaker-data-insights
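Since minor releases may change the API, it can help to pin the series at install time (for example `pip install "sagemaker-data-insights==0.4.*"`) and to verify the installed version at startup. The sketch below uses only the standard library; `TESTED_SERIES` and the helper names are assumptions made for illustration, and the parsing handles plain release versions such as `0.4.0` only.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGE = "sagemaker-data-insights"
TESTED_SERIES = (0, 4)  # assumption: the 0.4.x series named in this document


def major_minor(version_string):
    """Return (major, minor) as integers from a version such as '0.4.0'."""
    parts = version_string.split(".")
    return int(parts[0]), int(parts[1])


def installed_series(package=PACKAGE):
    """Return the installed (major, minor), or None if the package is absent."""
    try:
        return major_minor(version(package))
    except PackageNotFoundError:
        return None


series = installed_series()
if series is None:
    print(f"{PACKAGE} is not installed")
elif series != TESTED_SERIES:
    print(f"Warning: installed series {series} differs from tested {TESTED_SERIES}")
```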
Imports
- LabelingJobDataInsights
from sagemaker_data_insights.labeling_job.data_insights import LabelingJobDataInsights
- TabularDataInsights
from sagemaker_data_insights.tabular_data_insights.data_insights import TabularDataInsights
- LabelingJobDataInsightsResult
from sagemaker_data_insights.labeling_job.data_insights_result import LabelingJobDataInsightsResult
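Because module paths and class names may move between 0.x releases, one option is to resolve imports through a small helper that fails with a clear message. This is a sketch using the standard library's `importlib`; `load_symbol` is a hypothetical helper, not part of the library.

```python
from importlib import import_module


def load_symbol(module_path, name):
    """Import `name` from `module_path`, raising ImportError with a clear message."""
    try:
        module = import_module(module_path)
    except ImportError as exc:
        raise ImportError(
            f"Cannot import {module_path!r}; check the installed library version"
        ) from exc
    try:
        return getattr(module, name)
    except AttributeError as exc:
        raise ImportError(f"{module_path!r} has no attribute {name!r}") from exc


# Example use with the import path listed above (requires the package):
# LabelingJobDataInsights = load_symbol(
#     "sagemaker_data_insights.labeling_job.data_insights",
#     "LabelingJobDataInsights",
# )
```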
Quickstart
import os
import sagemaker
import boto3
from sagemaker_data_insights.labeling_job.data_insights import LabelingJobDataInsights
# Initialize SageMaker session using default boto3 credential chain
# Ensure your AWS credentials are configured (e.g., via AWS CLI, environment variables)
# The region can be specified if not configured globally.
region = os.environ.get('AWS_REGION', 'us-east-1')
boto_session = boto3.Session(region_name=region)
sagemaker_session = sagemaker.Session(boto_session=boto_session)
# Define placeholders for your specific S3 input data and IAM role
# Replace with actual S3 URI to your labeling job manifest file
input_s3_uri = os.environ.get(
    'SAGEMAKER_DATA_INSIGHTS_INPUT_URI',
    's3://your-bucket-name/path/to/manifest-file/output.manifest'
)
# Replace with the ARN of an IAM role with S3 read/write and SageMaker permissions
role_arn = os.environ.get(
    'SAGEMAKER_EXECUTION_ROLE_ARN',
    'arn:aws:iam::123456789012:role/YourSageMakerExecutionRole'  # Placeholder, replace with actual role
)
print(f"Analyzing data from: {input_s3_uri}")
print(f"Using SageMaker execution role: {role_arn}")
try:
    # Instantiate the insights calculator for a Labeling Job
    insights_calculator = LabelingJobDataInsights(
        sagemaker_session=sagemaker_session,
        s3_input_uri=input_s3_uri,
        role_arn=role_arn,
        number_of_samples=10  # Use a small number of samples for quick demo
    )
    # Get insights (this will perform data sampling and analysis)
    # NOTE: This call requires valid S3 URI, role, and data.
    # It might take some time to run and will likely fail if placeholders are not replaced.
    print("Attempting to get insights (this may take a moment)...")
    insights_result = insights_calculator.get_insights()
    print("\n--- Data Insights Summary ---")
    print(f"Total entries analyzed: {insights_result.number_of_samples}")
    if insights_result.annotation_label_distribution:
        print("Annotation Label Distribution:")
        for label, count in insights_result.annotation_label_distribution.items():
            print(f"  - {label}: {count}")
    else:
        print("No annotation label distribution found (check input data/sampling).")
except Exception as e:
    print(f"\nAn error occurred during insights calculation: {e}")
    print("Please ensure your AWS credentials, S3 input URI, and IAM role are correctly configured and point to valid data.")
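Since the quickstart fails if the manifest is not valid JSONLines, a local sanity check on the file before uploading or analyzing it can save a round trip. The sketch below is independent of the library and uses only the standard `json` module; `check_json_lines` is a hypothetical helper, and the sample records are illustrative.

```python
import json


def check_json_lines(lines):
    """Return (valid_count, errors) for an iterable of JSON Lines text."""
    valid = 0
    errors = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped, not counted as errors
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, exc.msg))
            continue
        if isinstance(record, dict):
            valid += 1
        else:
            errors.append((number, "not a JSON object"))
    return valid, errors


# Illustrative sample: one valid record, one blank line, two bad lines.
sample = [
    '{"source-ref": "s3://your-bucket-name/img1.jpg"}',
    '',
    '{"source-ref": ',
    '[1, 2]',
]
valid, errors = check_json_lines(sample)
print(valid, [number for number, _ in errors])  # prints: 1 [3, 4]
```

For a file on disk, passing the open file object (`with open(path) as f: check_json_lines(f)`) works the same way, since the function accepts any iterable of lines.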