{"id":4755,"library":"sagemaker-data-insights","title":"SageMaker Data Insights","description":"The SageMaker Data Insights library (current version 0.4.0) is an open-source Python library designed by AWS to help users analyze and understand their data for various SageMaker workloads. It provides utilities for extracting insights from datasets used in SageMaker Labeling Jobs and other data processing tasks, helping identify potential data quality issues or patterns. Because it is still at a 0.x version, its API may change between minor releases.","status":"active","version":"0.4.0","language":"en","source_language":"en","source_url":"https://github.com/aws/sagemaker-data-insights/","tags":["aws","sagemaker","machine-learning","data-analysis","data-insights","labeling-job"],"install":[{"cmd":"pip install sagemaker-data-insights","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Core SageMaker Python SDK dependency for session management and role resolution.","package":"sagemaker"},{"reason":"AWS SDK for Python, used for interacting with AWS services like S3 and IAM.","package":"boto3"},{"reason":"Used for data manipulation and analysis internally.","package":"pandas"}],"imports":[{"symbol":"LabelingJobDataInsights","correct":"from sagemaker_data_insights.labeling_job.data_insights import LabelingJobDataInsights"},{"symbol":"TabularDataInsights","correct":"from sagemaker_data_insights.tabular_data_insights.data_insights import TabularDataInsights"},{"symbol":"LabelingJobDataInsightsResult","correct":"from sagemaker_data_insights.labeling_job.data_insights_result import LabelingJobDataInsightsResult"}],"quickstart":{"code":"import os\nimport sagemaker\nimport boto3\nfrom sagemaker_data_insights.labeling_job.data_insights import LabelingJobDataInsights\n\n# Initialize SageMaker session using default boto3 credential chain\n# Ensure your AWS credentials are configured (e.g., via AWS CLI, environment variables)\n# The region can be specified if not configured globally.\nregion = os.environ.get('AWS_REGION', 'us-east-1')\nboto_session = boto3.Session(region_name=region)\nsagemaker_session = sagemaker.Session(boto_session=boto_session)\n\n# Define placeholders for your specific S3 input data and IAM role\n# Replace with actual S3 URI to your labeling job manifest file\ninput_s3_uri = os.environ.get(\n    'SAGEMAKER_DATA_INSIGHTS_INPUT_URI',\n    's3://your-bucket-name/path/to/manifest-file/output.manifest'\n)\n# Replace with the ARN of an IAM role with S3 read/write and SageMaker permissions\nrole_arn = os.environ.get(\n    'SAGEMAKER_EXECUTION_ROLE_ARN',\n    'arn:aws:iam::123456789012:role/YourSageMakerExecutionRole' # Placeholder, replace with actual role\n)\n\nprint(f\"Analyzing data from: {input_s3_uri}\")\nprint(f\"Using SageMaker execution role: {role_arn}\")\n\ntry:\n    # Instantiate the insights calculator for a Labeling Job\n    insights_calculator = LabelingJobDataInsights(\n        sagemaker_session=sagemaker_session,\n        s3_input_uri=input_s3_uri,\n        role_arn=role_arn,\n        number_of_samples=10 # Use a small number of samples for quick demo\n    )\n\n    # Get insights (this will perform data sampling and analysis)\n    # NOTE: This call requires valid S3 URI, role, and data.\n    # It might take some time to run and will likely fail if placeholders are not replaced.\n    print(\"Attempting to get insights (this may take a moment)...\")\n    insights_result = insights_calculator.get_insights()\n\n    print(\"\\n--- Data Insights Summary ---\")\n    print(f\"Total entries analyzed: {insights_result.number_of_samples}\")\n    if insights_result.annotation_label_distribution:\n        print(\"Annotation Label Distribution:\")\n        for label, count in insights_result.annotation_label_distribution.items():\n            print(f\"  - {label}: {count}\")\n    else:\n        print(\"No annotation label distribution found (check input data/sampling).\")\n\nexcept Exception as e:\n    print(f\"\\nAn error occurred during insights calculation: {e}\")\n    print(\"Please ensure your AWS credentials, S3 input URI, and IAM role are correctly configured and point to valid data.\")\n","lang":"python","description":"Initializes a SageMaker session, instantiates `LabelingJobDataInsights` with user-provided S3 URI and IAM role, and demonstrates how to retrieve and print a basic summary of the data insights. Requires configured AWS credentials and a valid S3 path to a labeling job manifest file."},"warnings":[{"fix":"Always consult the latest README and examples for your installed version. Pin dependency versions (e.g., `sagemaker-data-insights==0.4.0`) to ensure stability in production environments.","message":"Prior to version 1.0.0, API methods, class constructors, and parameter names may change between minor versions (e.g., 0.3.0 to 0.4.0) without explicit deprecation warnings.","severity":"breaking","affected_versions":"<1.0.0"},{"fix":"Ensure your IAM role has `AmazonSageMakerFullAccess` or a more granular policy granting S3 `GetObject`, `PutObject`, `ListBucket` on relevant buckets, and SageMaker permissions for the specific resource being analyzed.","message":"The execution role used by the SageMaker session (or provided to data insights classes) must have appropriate S3 read/write permissions for the input/output data and permissions to interact with SageMaker resources (e.g., Labeling Jobs, Feature Store).","severity":"gotcha","affected_versions":"All"},{"fix":"Verify S3 paths are correct, the data exists, is accessible by the IAM role, and adheres to the format required by the specific insight class (e.g., `LabelingJobDataInsights` expects JSON Lines manifest files from labeling job outputs).","message":"Input S3 URIs must point to valid data in expected formats (e.g., JSON Lines for Labeling Jobs, CSV/Parquet for Tabular Data) and match the AWS region of your SageMaker session.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}