AWS SDK for pandas (awswrangler)
AWS SDK for pandas, also known as awswrangler, extends the popular pandas library to simplify data integration with AWS services. It provides high-level abstractions for common data engineering tasks such as reading and writing data on Amazon S3, querying data in Athena and Redshift, and interacting with AWS Glue, DynamoDB, Timestream, and more. The library is actively maintained with frequent (often monthly) releases; the current version is 3.15.1.
Warnings
- breaking Support for Python 3.9 was dropped in version 3.15.0, and for Python 3.8 in 3.11.0. Ensure your environment uses Python >= 3.10.
- breaking Starting from version 3.0, feature-specific dependencies (e.g., for Redshift, MySQL, OpenSearch) must be installed explicitly using extras syntax (e.g., `pip install 'awswrangler[redshift]'`). Simply installing `awswrangler` will only include core dependencies.
- gotcha AWS SDK for pandas versions `>=3.14.0` default to PyArrow 21.0.0+, which requires CMake 3.25+ to build. This can cause issues in environments with older CMake versions (e.g., Amazon Linux 2 notebook instances).
- gotcha The output format for `wr.dynamodb.read_items` changed in version 3.5.0. It now returns DynamoDB datatypes within the DataFrame, which can break existing parsing logic.
- gotcha AWS Lambda functions using the `awswrangler` layer may need more than the default memory allocation; functions configured with less than 512MB can hit memory-related errors on some data processing workloads.
- gotcha Security vulnerabilities in underlying dependencies (e.g., `aiohttp`, `setuptools`, `pg8000`) are frequently fixed in new `awswrangler` releases. Running older versions might expose you to known CVEs.
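The `wr.dynamodb.read_items` gotcha above means result DataFrames can now contain raw DynamoDB datatypes, e.g. `decimal.Decimal` for numbers. A minimal sketch of post-processing such a frame with plain pandas (no AWS call is made; the column names and data are illustrative):

```python
from decimal import Decimal

import pandas as pd

# Simulated frame resembling read_items output in awswrangler >= 3.5.0:
# DynamoDB numbers arrive as decimal.Decimal rather than int/float.
df = pd.DataFrame({
    "id": [Decimal("1"), Decimal("2")],
    "score": [Decimal("9.5"), Decimal("7.25")],
})

def normalize_dynamodb_numbers(frame: pd.DataFrame) -> pd.DataFrame:
    # Convert all-Decimal columns to native numeric dtypes for downstream code.
    out = frame.copy()
    for col in out.columns:
        if out[col].map(lambda v: isinstance(v, Decimal)).all():
            out[col] = pd.to_numeric(out[col].astype(str))
    return out

clean = normalize_dynamodb_numbers(df)
print(clean.dtypes.tolist())
```

This keeps parsing logic written against pre-3.5.0 output working without pinning an old awswrangler version.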
Install
- pip install awswrangler
- pip install 'awswrangler[redshift,mysql,postgresql]'
Imports
- awswrangler
import awswrangler as wr
- pandas
import pandas as pd
Quickstart
import awswrangler as wr
import pandas as pd
from datetime import datetime
import os
# Ensure you have AWS credentials configured (e.g., via AWS CLI or environment variables)
# For quickstart, ensure the S3_BUCKET is set in your environment
s3_bucket = os.environ.get('S3_BUCKET', 'your-aws-s3-bucket-name')
if s3_bucket == 'your-aws-s3-bucket-name':
    print("WARNING: Please set the S3_BUCKET environment variable or replace 'your-aws-s3-bucket-name' in the code.")
database_name = os.environ.get('ATHENA_DATABASE', 'awswrangler_db')
if database_name == 'awswrangler_db':
    print("WARNING: Using default Athena database 'awswrangler_db'. Consider setting ATHENA_DATABASE env var.")
# Create a sample DataFrame
df = pd.DataFrame({
    "id": [1, 2],
    "value": ["foo", "boo"],
    "timestamp": [datetime.now(), datetime.now()],
})
# 1. Store data on S3 as Parquet and register with Glue Catalog
s3_path = f"s3://{s3_bucket}/awswrangler-quickstart/my_dataset/"
print(f"Writing DataFrame to S3: {s3_path}")
wr.s3.to_parquet(
    df=df,
    path=s3_path,
    dataset=True,
    database=database_name,
    table="my_table_parquet",
    mode="overwrite",
    partition_cols=["value"],
)
print("Data written and cataloged.")
# 2. Retrieve the data directly from Amazon S3
print(f"Reading data from S3: {s3_path}")
df_from_s3 = wr.s3.read_parquet(s3_path, dataset=True)
print(f"Read {len(df_from_s3)} rows from S3:\n{df_from_s3}")
# 3. Retrieve the data from Amazon Athena
print(f"Reading data from Athena table '{database_name}.my_table_parquet'")
df_from_athena = wr.athena.read_sql_query("SELECT * FROM my_table_parquet", database=database_name)
print(f"Read {len(df_from_athena)} rows from Athena:\n{df_from_athena}")
# Example for Redshift (requires awswrangler[redshift] and a Glue connection)
# try:
# # Replace 'my-glue-connection' with your actual Glue connection name
# con = wr.redshift.connect("my-glue-connection")
# df_from_redshift = wr.redshift.read_sql_query("SELECT 1 as example_col", con=con)
# print(f"Read from Redshift:\n{df_from_redshift}")
# con.close()
# except Exception as e:
# print(f"Could not connect to Redshift or run query (this is expected if not configured): {e}")
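In the quickstart, `partition_cols=["value"]` makes `wr.s3.to_parquet` lay out Hive-style `value=<v>/` prefixes under the dataset path. A rough, self-contained sketch of that layout logic (pure pandas, no S3 access; the bucket name is a placeholder):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
base = "s3://example-bucket/awswrangler-quickstart/my_dataset/"  # placeholder path

# One Hive-style prefix per distinct partition value, mirroring how
# dataset=True + partition_cols organizes Parquet objects on S3.
prefixes = sorted(f"{base}value={v}/" for v in df["value"].unique())
print(prefixes)
```

Partition pruning in Athena relies on these prefixes, which is why high-cardinality columns make poor partition keys.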