{"id":8615,"library":"s3torchconnector","title":"S3TorchConnector","description":"S3TorchConnector provides an efficient integration for PyTorch `Dataset` and `DataLoader` to stream data directly from Amazon S3. It enables training machine learning models on S3-resident data without needing to download it locally, optimized for large-scale and distributed workloads. The current version is 1.5.0, with releases typically aligning with PyTorch and AWS SDK updates.","status":"active","version":"1.5.0","language":"en","source_language":"en","source_url":"https://github.com/pytorch/s3torchconnector","tags":["pytorch","aws","s3","data-loading","machine-learning","etl"],"install":[{"cmd":"pip install s3torchconnector","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core PyTorch library, provides Dataset and DataLoader abstractions.","package":"torch","optional":false},{"reason":"AWS SDK for Python, used for interacting with S3.","package":"boto3","optional":false}],"imports":[{"symbol":"S3MapDataset","correct":"from s3torchconnector import S3MapDataset"},{"symbol":"S3IterableDataset","correct":"from s3torchconnector import S3IterableDataset"},{"symbol":"S3FilePipe","correct":"from s3torchconnector import S3FilePipe"},{"note":"The `S3Dataset` class was deprecated in early versions and replaced by `S3MapDataset` and `S3IterableDataset`.","wrong":"from s3torchconnector.s3dataset import S3Dataset","symbol":"S3Dataset","correct":"from s3torchconnector import S3MapDataset # or S3IterableDataset"}],"quickstart":{"code":"import torch\nfrom torch.utils.data import DataLoader\nfrom s3torchconnector import S3MapDataset, S3IterableDataset\nimport os\nimport io\n\n# --- Configuration for S3 Access ---\n# Ensure your AWS credentials are configured (e.g., via AWS CLI, environment variables, or IAM roles).\n# Example environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION.\n#\n# IMPORTANT: Replace 's3torchconnector-example-bucket' and 
'quickstart-data/' with your\n# actual S3 bucket and prefix. The bucket should contain some files (e.g., text files)\n# for the example to successfully load data.\nS3_BUCKET = os.environ.get('S3_QUICKSTART_BUCKET', 's3torchconnector-example-bucket')\nS3_PREFIX = os.environ.get('S3_QUICKSTART_PREFIX', 'quickstart-data/')\nS3_URI = f\"s3://{S3_BUCKET}/{S3_PREFIX}\"\n\nprint(f\"Attempting to connect to S3 URI: {S3_URI}\")\nprint(\"Please ensure this bucket/prefix exists and contains data, and your AWS credentials are configured.\")\nprint(\"If you encounter 'Forbidden' or 'NoCredentialsError', check your AWS setup.\")\n\n# --- S3MapDataset Example ---\n# S3MapDataset first lists all objects under the given S3 URI prefix, then allows indexed access.\n# Suitable when you need a fixed-size dataset and random access.\ntry:\n    print(\"\\n--- S3MapDataset Demonstration ---\")\n    map_dataset = S3MapDataset(S3_URI)\n    print(f\"S3MapDataset initialized. Found {len(map_dataset)} objects.\")\n\n    if len(map_dataset) > 0:\n        # Accessing an item by index\n        item_data = map_dataset[0] # Returns a file-like object (BytesIO by default)\n        if isinstance(item_data, io.BytesIO):\n            content_sample = item_data.read(100).decode('utf-8', errors='ignore') # Read first 100 bytes\n            print(f\"Sample from first item (S3MapDataset): '{content_sample}'...\")\n        else:\n            print(f\"First item type: {type(item_data)}\")\n\n        # Using DataLoader with S3MapDataset\n        map_dataloader = DataLoader(map_dataset, batch_size=2, num_workers=0) # num_workers=0 for simplicity\n        print(\"Iterating through S3MapDataset with DataLoader:\")\n        for i, batch in enumerate(map_dataloader):\n            print(f\"MapDataset Batch {i}: {len(batch)} items.\")\n            if len(batch) > 0 and isinstance(batch[0], io.BytesIO):\n                print(f\"  First item in batch content sample: {batch[0].read(30).decode('utf-8', 
errors='ignore')}...\")\n            if i >= 1: # Limit iterations for a quick example\n                break\n    else:\n        print(\"S3MapDataset found no objects. Please ensure your S3 bucket/prefix contains files.\")\n\nexcept Exception as e:\n    print(f\"Error during S3MapDataset example: {e}\")\n    print(\"Ensure AWS credentials are valid and the S3 path exists and is accessible.\")\n\n# --- S3IterableDataset Example ---\n# S3IterableDataset streams objects one by one as they are iterated.\n# Suitable for very large datasets where listing all objects upfront is too slow or memory intensive.\ntry:\n    print(\"\\n--- S3IterableDataset Demonstration ---\")\n    iterable_dataset = S3IterableDataset(S3_URI)\n\n    # Using DataLoader with S3IterableDataset\n    # For num_workers > 0, consider using a `worker_init_fn` for proper distributed data loading.\n    iterable_dataloader = DataLoader(iterable_dataset, batch_size=2, num_workers=0)\n    print(\"Iterating through S3IterableDataset with DataLoader:\")\n    for i, batch in enumerate(iterable_dataloader):\n        print(f\"IterableDataset Batch {i}: {len(batch)} items.\")\n        if len(batch) > 0 and isinstance(batch[0], io.BytesIO):\n            print(f\"  First item in batch content sample: {batch[0].read(30).decode('utf-8', errors='ignore')}...\")\n        if i >= 1: # Limit iterations\n            break\n    print(\"S3IterableDataset iteration complete (limited for quickstart).\")\n\nexcept Exception as e:\n    print(f\"Error during S3IterableDataset example: {e}\")\n    print(\"Ensure AWS credentials are valid and the S3 path exists and is accessible.\")\n\nprint(\"\\nQuickstart examples concluded.\")\n","lang":"python","description":"This quickstart demonstrates how to use both `S3MapDataset` and `S3IterableDataset` with PyTorch's `DataLoader`. It shows how to initialize them with an S3 URI, iterate through data, and includes error handling for common S3 access issues. 
Ensure your AWS credentials are configured and the specified S3 bucket/prefix contains data."},"warnings":[{"fix":"Migrate to `S3MapDataset` for map-style (indexed, random) access or `S3IterableDataset` for iterable-style streaming access. Both are imported directly from `s3torchconnector`.","message":"The `S3Dataset` class was deprecated in earlier versions (pre-1.0.0) and has been removed in favor of `S3MapDataset` and `S3IterableDataset` for clearer semantics.","severity":"deprecated","affected_versions":"<1.0.0"},{"fix":"Ensure your IAM user/role has `s3:ListBucket` (to list objects under the prefix) and `s3:GetObject` (to read objects) permissions for the target bucket and prefix; note that `HeadObject` calls are authorized by `s3:GetObject`, since IAM has no separate `s3:HeadObject` action. Configure AWS credentials via environment variables (e.g., `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`), AWS CLI config, or IAM instance profiles/roles.","message":"Incorrect AWS IAM permissions or missing credentials lead to `Forbidden` or `NoCredentialsError` when accessing S3 buckets.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Consider combining small files into larger archives (e.g., tar files, TFRecords, Parquet) on S3. If listing all files upfront is too slow, `S3IterableDataset` can stream data without a full listing, but the per-file retrieval overhead remains.","message":"High latency and increased cost can occur when loading a very large number of small files from S3, due to the per-file overhead of each `GetObject` call.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Implement a `worker_init_fn` that uses `torch.utils.data.get_worker_info()` to assign each worker a distinct shard of the data, so that no object is read twice or skipped. 
Refer to the official `s3torchconnector` documentation on worker initialization.","message":"When `S3IterableDataset` is used with `torch.utils.data.DataLoader` and `num_workers > 0`, a `worker_init_fn` is required to ensure proper data distribution and prevent duplicate or missed data across workers.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Review and update your IAM policy to grant `s3:ListBucket` on the bucket (scoped to the prefix) and `s3:GetObject` on objects within the prefix. Also ensure the bucket policy does not explicitly deny access.","cause":"The AWS IAM user/role lacks the necessary permissions (`s3:GetObject`, `s3:ListBucket`) for the specified S3 bucket or objects.","error":"botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden"},{"fix":"Configure AWS credentials. Common methods include setting environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`), configuring the AWS CLI (`aws configure`), or using an IAM role for EC2 instances/EKS pods.","cause":"The AWS SDK for Python (boto3) cannot find any configured AWS credentials to authenticate with S3.","error":"botocore.exceptions.NoCredentialsError: Unable to locate credentials"},{"fix":"Double-check the S3 URI (`s3://bucket/prefix`) for typos, and verify the bucket name and prefix are correct. Ensure your AWS region (e.g., the `AWS_REGION` environment variable or boto3 client configuration) matches the region where the bucket exists.","cause":"The specified S3 bucket name or object prefix is incorrect, or the configured AWS region does not contain the bucket.","error":"botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found"},{"fix":"Replace `S3Dataset` with `S3MapDataset` or `S3IterableDataset` from the top-level `s3torchconnector` package. 
For example, change `from s3torchconnector.s3dataset import S3Dataset` to `from s3torchconnector import S3MapDataset`.","cause":"Attempting to import the old, deprecated `S3Dataset` class which has been removed from the library's public API.","error":"ModuleNotFoundError: No module named 's3torchconnector.s3dataset'"}]}