SageMaker Training Toolkit
The `sagemaker-training` library provides the core toolkit that runs inside Amazon SageMaker training containers. It handles downloading input data, parsing hyperparameters, executing user training scripts, and uploading model artifacts. The current version is 5.1.1. Releases are fairly frequent: minor versions appear every few weeks to months, and major versions less often.
Common errors
-
ModuleNotFoundError: No module named 'sagemaker_training'
cause The `sagemaker-training` library is not installed in the environment where the script is being executed (e.g., a local machine or a custom Docker container that does not install it).
fix Run `pip install sagemaker-training` in your environment. If using a custom Dockerfile, add `RUN pip install sagemaker-training`.
-
AttributeError: module 'google.protobuf.descriptor' has no attribute '_HAS_OPTIONAL_FIELD_ACCESSORS'
cause This error typically occurs when `protobuf` v5+ is installed, but another library (or an older version of `sagemaker-training` itself) expects an API from `protobuf` v3 or v4. This was a common issue after `sagemaker-training` v5.0.0's `protobuf` upgrade.
fix Check which `protobuf` version is installed (e.g., with `pip show protobuf`) and ensure all libraries in your environment are compatible with `protobuf>=5.0.0`. You may need to upgrade other dependencies or explicitly constrain `protobuf` to v5+ (e.g., `protobuf>=5.28.1`).
-
KeyError: 'SM_HP_YOUR_HYPERPARAMETER'
cause Attempting to read a hyperparameter that was not provided to the SageMaker training job, for example via `os.environ['SM_HP_YOUR_HYPERPARAMETER']` or `env.hyperparameters['YOUR_HYPERPARAMETER']`.
fix Always use `.get()` with a default value when accessing hyperparameters (e.g., `hyperparameters.get('your_hyperparameter', default_value)`). Double-check that the hyperparameter name passed to the SageMaker estimator matches the key used in your script.
-
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/training/my_file.csv'
cause The script is trying to access an input file or directory that does not exist at the specified path within the SageMaker container, or the input data channel was not correctly configured.
fix Ensure your SageMaker estimator's `inputs` argument correctly maps your S3 data to the expected channel name (e.g., 'training'). Verify the file exists in your S3 bucket and that your script uses the correct path derived from `env.channel_input_dirs['channel_name']`.
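The hyperparameter and input-file fixes can be sketched with the standard library alone. This is a minimal sketch, assuming the container exposes hyperparameters as a JSON object in the `SM_HPS` environment variable and each channel directory as `SM_CHANNEL_<NAME>`. The helper names `load_hyperparameters` and `resolve_input_file` are hypothetical and not part of `sagemaker-training`.

```python
import json
import os


def load_hyperparameters(defaults):
    """Merge hyperparameters supplied by the job over the script's defaults."""
    # Assumption: SM_HPS holds a JSON object; treat it as empty when unset.
    provided = json.loads(os.environ.get("SM_HPS", "{}"))
    merged = dict(defaults)
    merged.update(provided)
    return merged


def resolve_input_file(channel, filename):
    """Return the path to `filename` inside a channel directory, or fail clearly."""
    # Assumption: the channel directory is exposed as SM_CHANNEL_<NAME>.
    env_name = f"SM_CHANNEL_{channel.upper()}"
    channel_dir = os.environ.get(env_name)
    if channel_dir is None:
        raise RuntimeError(f"{env_name} is not set; is the '{channel}' channel configured?")
    path = os.path.join(channel_dir, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Expected input file is missing: {path}")
    return path
```

Calling `load_hyperparameters({'learning_rate': 0.01})` keeps the default when the job does not pass `learning_rate`, which avoids the `KeyError`. `resolve_input_file('training', 'my_file.csv')` reports a missing channel separately from a missing file.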
Warnings
- breaking Version 5.0.0 updated the required `protobuf` dependency to version 5.28.1. This can cause significant conflicts and runtime errors (`AttributeError`, `TypeError`) if your custom training environment or other dependencies rely on an older (v3 or v4) `protobuf` version.
- gotcha The `sagemaker-training` library is designed to run *inside* the SageMaker training container. Trying to run scripts locally that heavily rely on `sagemaker_training.environment` calls without mocking or setting up the corresponding environment variables will result in errors (e.g., `KeyError` for missing environment variables or incorrect paths).
- gotcha Mismatched `boto3` versions between the SageMaker Training Toolkit and your custom code or base image can lead to issues with S3 interactions (e.g., downloading data, uploading model artifacts) or credential handling.
- gotcha The training toolkit expects your user script to be under `/opt/ml/code` within the container (for example, `/opt/ml/code/your_script.py`). Custom entrypoints or Dockerfiles that deviate from this structure without proper configuration can lead to `FileNotFoundError` or the script not being executed.
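For the local-execution gotcha above, one option is to build a throwaway directory tree and matching environment variables before running the script. The sketch below assumes the script reads the `SM_MODEL_DIR`, `SM_HPS`, and `SM_CHANNEL_<NAME>` variables directly. The helper `make_local_sagemaker_env` is hypothetical, and the toolkit's own environment object may additionally read configuration files under `/opt/ml`, which this sketch does not create.

```python
import json
import os
import tempfile


def make_local_sagemaker_env(hyperparameters, channels=("training",)):
    """Create temporary directories and return SM_* variables that mimic a container."""
    root = tempfile.mkdtemp(prefix="sm-local-")

    # Directory the training script writes model artifacts to.
    model_dir = os.path.join(root, "model")
    os.makedirs(model_dir)

    env = {
        "SM_MODEL_DIR": model_dir,
        "SM_HPS": json.dumps(hyperparameters),
    }

    # One input directory per data channel, e.g. <root>/input/data/training.
    for channel in channels:
        channel_dir = os.path.join(root, "input", "data", channel)
        os.makedirs(channel_dir)
        env[f"SM_CHANNEL_{channel.upper()}"] = channel_dir

    return env
```

A local run could call `os.environ.update(make_local_sagemaker_env({'learning_rate': 0.05}))` and copy sample data into the returned channel directory before invoking the training function.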
Install
-
pip install sagemaker-training
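After installing, it can help to confirm which versions of `sagemaker-training` and `protobuf` ended up in the environment, given the `protobuf` conflict described under Warnings. The following sketch uses only the standard library's `importlib.metadata`. The helper names and the `min_major=5` threshold are illustrative assumptions based on the v5 requirement stated above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version):
    """Extract the leading integer of a version string such as '5.28.1'."""
    return int(version.split(".")[0])


def check_protobuf(min_major=5):
    """Describe whether the installed protobuf meets the assumed minimum major version."""
    version = installed_version("protobuf")
    if version is None:
        return "protobuf is not installed"
    if major_version(version) < min_major:
        return f"protobuf {version} is older than {min_major}.x"
    return f"protobuf {version} meets the {min_major}.x minimum"


# Report what is installed in the current environment.
for name in ("sagemaker-training", "protobuf"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
print(check_protobuf())
```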
Imports
- environment
import sagemaker_training
from sagemaker_training import environment
- get_environment
from sagemaker_training import get_environment
from sagemaker_training.environment import get_environment
- get_hyperparameters
sagemaker_training.get_hyperparameters()
from sagemaker_training.environment import get_hyperparameters
Quickstart
import os

from sagemaker_training import environment


def train():
    # Get SageMaker training environment details
    env = environment.get_environment()

    # Access hyperparameters, falling back to a default if one is not provided
    hyperparameters = env.hyperparameters
    learning_rate = hyperparameters.get('learning_rate', 0.01)

    # Access input data paths
    train_data_path = os.path.join(env.channel_input_dirs['training'], 'data.csv')

    # Access model output path
    model_dir = env.model_dir

    print(f"Learning Rate: {learning_rate}")
    print(f"Training data path: {train_data_path}")
    print(f"Model output directory: {model_dir}")

    # Your training logic here

    # Example: Save a dummy model artifact
    with open(os.path.join(model_dir, 'model.txt'), 'w') as f:
        f.write('My trained model output')


if __name__ == '__main__':
    train()