Smart-open
Smart-open is a Python 3 library (current version 7.5.1) for efficient streaming of very large files to and from various storage systems, including S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, and the local filesystem. It provides transparent, on-the-fly (de)compression for formats such as gzip, bz2, and zst, acting as a drop-in replacement for Python's built-in `open()` function. The library is actively maintained with frequent releases and offers a unified, Pythonic API that simplifies working with remote files and cloud storage services.
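A minimal sketch of the drop-in behavior on the local filesystem (no cloud dependencies needed): smart-open infers gzip compression from the `.gz` suffix, so a plain `open()` call transparently compresses on write and decompresses on read.

```python
import gzip
import os
import tempfile

from smart_open import open

# Compression is inferred from the ".gz" extension; this writes a real gzip file.
path = os.path.join(tempfile.mkdtemp(), 'data.txt.gz')

with open(path, 'w') as fout:      # compresses transparently on write
    fout.write('line one\nline two\n')

with open(path, 'r') as fin:       # decompresses transparently on read
    text = fin.read()

# The stdlib gzip module confirms the bytes on disk are genuinely gzip-compressed.
with gzip.open(path, 'rt') as fin:
    assert fin.read() == text
```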
Warnings
- breaking As of `smart-open` v7.5.1, the minimum supported Python version is 3.10. Earlier versions (e.g., v2.0.0) supported Python 3.5+.
- breaking The primary import for the `open` function changed from `from smart_open import smart_open` (pre-v1.8.1) to `from smart_open import open` (post-v1.8.1, solidified in v2.0.0) to align with Python's built-in `open`.
- breaking The default read mode for `smart_open.open` changed from 'rb' (read binary) to 'r' (read text) in v1.8.1 to match the behavior of Python's built-in `open`.
- gotcha `smart-open` does not install cloud or compression library dependencies by default to keep installation size small. Functionality like S3 or GCS will fail if their respective dependencies (`boto3`, `google-cloud-storage`) are not installed.
- gotcha Cloud storage operations (S3, GCS, Azure) require proper credential configuration. Failing to provide credentials (e.g., via environment variables, SDK defaults, or `transport_params`) will result in authentication errors.
- deprecated Version 7.3.0 was yanked from PyPI because its `pyproject.toml` incorrectly claimed Python 3.7 support, even though it had already been dropped in that release train.
- breaking In `smart-open` v7.4.0, the `smart_open.s3.iter_bucket` function was updated to use a single shared `concurrent.futures.ThreadPoolExecutor` and a single shared thread-safe `S3.Client`.
Install
- pip install smart-open
- pip install 'smart-open[s3,gcs,azure,http,webhdfs,ssh,zst]'
Imports
- open
from smart_open import open
- smart_open
import smart_open # Access the main function as smart_open.open
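A quick sketch of the module-level import style: `smart_open.open` is the same callable as `from smart_open import open`, shown here on a temporary local file.

```python
import os
import tempfile

import smart_open

path = os.path.join(tempfile.mkdtemp(), 'notes.txt')

# Write and read back through the module-level function.
with smart_open.open(path, 'w') as fout:
    fout.write('hello\n')

with smart_open.open(path) as fin:
    content = fin.read()
```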
Quickstart
import os
from smart_open import open
# Example for S3; similar patterns apply to GCS, Azure, etc.
# Ensure AWS credentials are configured (e.g., via environment variables, AWS CLI config, or IAM role).
# For production, consider explicit credential management via transport_params.
S3_BUCKET_NAME = os.environ.get('SMART_OPEN_S3_BUCKET', 'my-smart-open-test-bucket')
S3_KEY = 'example.txt'
S3_URL = f"s3://{S3_BUCKET_NAME}/{S3_KEY}"
# Write to S3
print(f"Writing to {S3_URL}...")
with open(S3_URL, 'w') as fout:
    fout.write('Hello, smart-open from S3!\n')
    fout.write('This is a second line.\n')
print("Write complete.")
# Read from S3
print(f"Reading from {S3_URL}...")
with open(S3_URL, 'r') as fin:
    for line in fin:
        print(f"Read line: {line.strip()}")
print("Read complete.")