Smart-open


Smart-open is a Python 3 library (current version 7.5.1) for efficient streaming of very large files from and to various storage systems, including S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, and local filesystems. It provides transparent, on-the-fly (de-)compression for formats like gzip, bz2, and zst, acting as a drop-in replacement for Python's built-in `open()` function. The library is actively maintained with frequent releases, offering a unified Pythonic API to simplify working with remote files and cloud storage services.
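
For example, transparent decompression means a gzip-compressed file reads exactly like plain text. A minimal sketch, assuming a local file named `example.txt.gz` exists (the filename is purely illustrative):

from smart_open import open

# Decompression is selected from the '.gz' extension; no gzip-specific code is needed.
# 'example.txt.gz' is a hypothetical local file used only for illustration.
with open('example.txt.gz', 'r', encoding='utf-8') as fin:
    for line in fin:
        print(line.strip())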

pip install smart-open
error ModuleNotFoundError: No module named 'smart_open'
cause The `smart-open` library has not been installed in your current Python environment.
fix
pip install smart-open
error ImportError: Missing optional dependency 'boto3'. Use pip or conda to install smart-open[s3].
cause You are attempting to open a file from Amazon S3, but the required `boto3` dependency (part of the `s3` extra) is not installed.
fix
pip install 'smart-open[s3]'
error ImportError: Missing optional dependency 'google-cloud-storage'. Use pip or conda to install smart-open[gcs].
cause You are attempting to open a file from Google Cloud Storage, but the required `google-cloud-storage` dependency (part of the `gcs` extra) is not installed.
fix
pip install 'smart-open[gcs]'
error FileNotFoundError: [Errno 2] No such file or directory: 's3://your-bucket/non-existent-file.txt'
cause The specified file path or URI (e.g., S3 object key, GCS blob path, local file path) does not exist in the given storage system.
fix
Verify that the file path or URI is correct, that the file exists at the specified location, and that you have the necessary permissions to access it.
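If missing objects are expected, you can catch the error explicitly; a sketch that assumes, as in the error above, a missing key surfaces as FileNotFoundError (the URI is a placeholder):

from smart_open import open

uri = 's3://your-bucket/maybe-missing.txt'  # placeholder URI
try:
    with open(uri, 'r') as fin:
        print(fin.read())
except FileNotFoundError:
    print(f"No such object (or no access): {uri}")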
breaking As of `smart-open` v7.5.1, the minimum supported Python version is 3.10. Earlier versions (e.g., v2.0.0) supported Python 3.5+.
fix Upgrade your Python environment to 3.10 or newer. If you need to use an older Python version, pin `smart-open` to a compatible version (e.g., `<7.5.1`).
breaking The primary import for the `open` function changed from `from smart_open import smart_open` (pre-v1.8.1) to `from smart_open import open` (post-v1.8.1, solidified in v2.0.0) to align with Python's built-in `open`.
fix Update your import statements from `from smart_open import smart_open` to `from smart_open import open`.
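In practice this is a one-line change:

# Old import (pre-v1.8.1), no longer recommended:
# from smart_open import smart_open

# New import (v1.8.1 and later):
from smart_open import open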
breaking The default read mode for `smart_open.open` changed from 'rb' (read binary) to 'r' (read text) in v1.8.1 to match the behavior of Python's built-in `open`.
fix If your code implicitly relied on 'rb' as the default, explicitly pass `mode='rb'` to `smart_open.open`.
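A minimal sketch with an explicit binary mode (the S3 URI is a placeholder):

from smart_open import open

# Request binary mode explicitly instead of relying on the old 'rb' default.
with open('s3://my-bucket/archive.bin', 'rb') as fin:
    payload = fin.read()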
gotcha `smart-open` does not install cloud or compression library dependencies by default to keep installation size small. Functionality like S3 or GCS will fail if their respective dependencies (`boto3`, `google-cloud-storage`) are not installed.
fix Install `smart-open` with the necessary extras, e.g., `pip install 'smart-open[s3,gcs]'` for S3 and GCS support.
gotcha Cloud storage operations (S3, GCS, Azure) require proper credential configuration. Failing to provide credentials (e.g., via environment variables, SDK defaults, or `transport_params`) will result in authentication errors.
fix Refer to the documentation for your cloud provider's SDK (e.g., boto3 for AWS, google-cloud-storage for GCS) for credential setup. You can also pass client objects or credentials via the `transport_params` argument to `smart_open.open`.
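For S3, a common pattern is to build a boto3 client yourself and hand it to smart-open through `transport_params`; the profile and bucket names below are placeholders:

import boto3
from smart_open import open

# Build a boto3 S3 client from an explicit session (profile name is a placeholder).
session = boto3.Session(profile_name='my-profile')
client = session.client('s3')

with open('s3://my-bucket/example.txt', 'r', transport_params={'client': client}) as fin:
    print(fin.read())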
deprecated Version 7.3.0 was yanked from PyPI because its `pyproject.toml` incorrectly claimed Python 3.7 support, even though it had already been dropped in that release train.
fix Avoid installing or upgrading to `smart-open==7.3.0`. Install a subsequent patch release like `7.3.1` or the latest stable version.
breaking In `smart-open` v7.4.0, the `smart_open.s3.iter_bucket` function was updated to use a single shared `concurrent.futures.ThreadPoolExecutor` and a single shared thread-safe `S3.Client`.
fix Review any existing code that directly used or configured `smart_open.s3.iter_bucket` for custom thread pool or client management, as its internal concurrency model has changed. If you relied on separate clients per thread/process, adjust your logic accordingly.
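A sketch of typical `iter_bucket` usage, with a placeholder bucket and prefix:

from smart_open.s3 import iter_bucket

# Stream (key, content) pairs for every object under the prefix.
# 'workers' sets the degree of parallelism; since v7.4.0 the workers share a
# single thread pool and a single thread-safe S3 client internally.
for key, content in iter_bucket('my-bucket', prefix='logs/', workers=8):
    print(key, len(content))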
pip install 'smart-open[s3,gcs,azure,http,webhdfs,ssh,zst]'
| python | os / libc | variant | wheel | install | import | disk |
|---|---|---|---|---|---|---|
| 3.9 | alpine (musl) | s3,gcs,azure,http,webhdfs,ssh,zst | sdist | - | 2.63s | 100.6M |
| 3.9 | alpine (musl) | smart-open | wheel | - | 0.16s | 18.2M |
| 3.9 | alpine (musl) | s3,gcs,azure,http,webhdfs,ssh,zst | - | - | 2.40s | 99.6M |
| 3.9 | alpine (musl) | smart-open | - | - | 0.18s | 18.2M |
| 3.9 | slim (glibc) | s3,gcs,azure,http,webhdfs,ssh,zst | wheel | 10.9s | 2.53s | 101M |
| 3.9 | slim (glibc) | smart-open | wheel | 2.4s | 0.16s | 19M |
| 3.9 | slim (glibc) | s3,gcs,azure,http,webhdfs,ssh,zst | - | - | 2.18s | 100M |
| 3.9 | slim (glibc) | smart-open | - | - | 0.14s | 19M |
| 3.10 | alpine (musl) | s3,gcs,azure,http,webhdfs,ssh,zst | sdist | - | 2.80s | 100.8M |
| 3.10 | alpine (musl) | smart-open | wheel | - | 0.18s | 18.7M |
| 3.10 | alpine (musl) | s3,gcs,azure,http,webhdfs,ssh,zst | - | - | 2.69s | 99.6M |
| 3.10 | alpine (musl) | smart-open | - | - | 0.20s | 18.7M |
| 3.10 | slim (glibc) | s3,gcs,azure,http,webhdfs,ssh,zst | wheel | 9.2s | 2.20s | 102M |
| 3.10 | slim (glibc) | smart-open | wheel | 2.0s | 0.14s | 19M |
| 3.10 | slim (glibc) | s3,gcs,azure,http,webhdfs,ssh,zst | - | - | 2.01s | 100M |
| 3.10 | slim (glibc) | smart-open | - | - | 0.13s | 19M |
| 3.11 | alpine (musl) | s3,gcs,azure,http,webhdfs,ssh,zst | sdist | - | 3.51s | 108.4M |
| 3.11 | alpine (musl) | smart-open | wheel | - | 0.26s | 20.7M |
| 3.11 | alpine (musl) | s3,gcs,azure,http,webhdfs,ssh,zst | - | - | 4.02s | 107.2M |
| 3.11 | alpine (musl) | smart-open | - | - | 0.29s | 20.7M |
| 3.11 | slim (glibc) | s3,gcs,azure,http,webhdfs,ssh,zst | wheel | 8.3s | 3.06s | 109M |
| 3.11 | slim (glibc) | smart-open | wheel | 1.9s | 0.22s | 21M |
| 3.11 | slim (glibc) | s3,gcs,azure,http,webhdfs,ssh,zst | - | - | 2.84s | 108M |
| 3.11 | slim (glibc) | smart-open | - | - | 0.21s | 21M |
| 3.12 | alpine (musl) | s3,gcs,azure,http,webhdfs,ssh,zst | sdist | - | 3.78s | 99.4M |
| 3.12 | alpine (musl) | smart-open | wheel | - | 0.24s | 12.6M |
| 3.12 | alpine (musl) | s3,gcs,azure,http,webhdfs,ssh,zst | - | - | 4.08s | 98.2M |
| 3.12 | alpine (musl) | smart-open | - | - | 0.25s | 12.5M |
| 3.12 | slim (glibc) | s3,gcs,azure,http,webhdfs,ssh,zst | wheel | 7.0s | 3.45s | 100M |
| 3.12 | slim (glibc) | smart-open | wheel | 1.7s | 0.24s | 13M |
| 3.12 | slim (glibc) | s3,gcs,azure,http,webhdfs,ssh,zst | - | - | 3.75s | 99M |
| 3.12 | slim (glibc) | smart-open | - | - | 0.24s | 13M |
| 3.13 | alpine (musl) | s3,gcs,azure,http,webhdfs,ssh,zst | sdist | - | 3.61s | 98.9M |
| 3.13 | alpine (musl) | smart-open | wheel | - | 0.23s | 12.3M |
| 3.13 | alpine (musl) | s3,gcs,azure,http,webhdfs,ssh,zst | - | - | 4.07s | 97.6M |
| 3.13 | alpine (musl) | smart-open | - | - | 0.25s | 12.2M |
| 3.13 | slim (glibc) | s3,gcs,azure,http,webhdfs,ssh,zst | wheel | 7.2s | 3.37s | 100M |
| 3.13 | slim (glibc) | smart-open | wheel | 1.7s | 0.25s | 13M |
| 3.13 | slim (glibc) | s3,gcs,azure,http,webhdfs,ssh,zst | - | - | 3.72s | 98M |
| 3.13 | slim (glibc) | smart-open | - | - | 0.24s | 13M |

This quickstart demonstrates how to use `smart_open.open` to read from and write to an S3 bucket. smart-open handles transparent compression and decompression based on the file extension and delegates S3 access to the underlying boto3 SDK. Make sure your environment has the appropriate cloud credentials configured.

import os
from smart_open import open

# Example for S3; similar patterns apply to GCS, Azure, etc.
# Ensure AWS credentials are configured (e.g., via environment variables, AWS CLI config, or IAM role).
# For production, consider explicit credential management via transport_params.
S3_BUCKET_NAME = os.environ.get('SMART_OPEN_S3_BUCKET', 'my-smart-open-test-bucket')
S3_KEY = 'example.txt'
S3_URL = f"s3://{S3_BUCKET_NAME}/{S3_KEY}"

# Write to S3
print(f"Writing to {S3_URL}...")
with open(S3_URL, 'w') as fout:
    fout.write('Hello, smart-open from S3!\n')
    fout.write('This is a second line.\n')
print("Write complete.")

# Read from S3
print(f"Reading from {S3_URL}...")
with open(S3_URL, 'r') as fin:
    for line in fin:
        print(f"Read line: {line.strip()}")
print("Read complete.")