AWS SDK for pandas (awswrangler)

3.15.1 · active · verified Sat Mar 28

AWS SDK for pandas, also known as awswrangler, extends the popular Pandas library to simplify data integration with AWS services. It provides high-level abstractions for common data engineering tasks such as reading and writing data in Amazon S3, querying data with Athena and Redshift, and interacting with AWS Glue, DynamoDB, Timestream, and more. The library is actively maintained with frequent, often monthly, releases; the current version is 3.15.1.
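The quickstart below covers the S3, Glue, and Athena pieces; the other services follow the same DataFrame-in, DataFrame-out pattern. As a minimal sketch for DynamoDB (the table name "quickstart-items" is a hypothetical placeholder; the table and its key schema must already exist):

import awswrangler as wr
import pandas as pd

# Write DataFrame rows as items to an existing DynamoDB table.
# "quickstart-items" is a placeholder; the table (e.g., with a
# partition key named "id") must already exist.
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
wr.dynamodb.put_df(df=df, table_name="quickstart-items")

# Read the items back as a DataFrame via a PartiQL query
df_back = wr.dynamodb.read_partiql_query('SELECT * FROM "quickstart-items"')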

Warnings

Install
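The base package is published on PyPI:

pip install awswrangler

Optional engines ship as extras; for example, the Redshift example in the quickstart below needs:

pip install 'awswrangler[redshift]'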

Imports
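The conventional alias, used throughout the examples on this page:

import awswrangler as wr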

Quickstart

This quickstart demonstrates how to use `awswrangler` to write a Pandas DataFrame to Amazon S3 as a partitioned Parquet dataset, register it in the AWS Glue Data Catalog, and read it back both directly from S3 and through Amazon Athena. It assumes AWS credentials are configured in the environment (e.g., via `~/.aws/credentials` or environment variables) and requires a writable S3 bucket; the Glue database is created if it does not already exist. Database connection examples (e.g., Redshift) are commented out because they require additional setup.

import awswrangler as wr
import pandas as pd
from datetime import datetime
import os

# Ensure you have AWS credentials configured (e.g., via AWS CLI or environment variables)
# For quickstart, ensure the S3_BUCKET is set in your environment
s3_bucket = os.environ.get('S3_BUCKET', 'your-aws-s3-bucket-name')
if s3_bucket == 'your-aws-s3-bucket-name':
    print("WARNING: Please set the S3_BUCKET environment variable or replace 'your-aws-s3-bucket-name' in the code.")

database_name = os.environ.get('ATHENA_DATABASE', 'awswrangler_db')
if database_name == 'awswrangler_db':
    print("WARNING: Using default Athena database 'awswrangler_db'. Consider setting ATHENA_DATABASE env var.")

# Create a sample DataFrame
df = pd.DataFrame({
    "id": [1, 2],
    "value": ["foo", "boo"],
    "timestamp": [datetime.now(), datetime.now()]
})

# 1. Store data on S3 as Parquet and register with Glue Catalog
s3_path = f"s3://{s3_bucket}/awswrangler-quickstart/my_dataset/"

# Create the Glue database if it does not already exist, so the table
# registration below has somewhere to land
wr.catalog.create_database(database_name, exist_ok=True)

print(f"Writing DataFrame to S3: {s3_path}")
wr.s3.to_parquet(
    df=df,
    path=s3_path,
    dataset=True,
    database=database_name,
    table="my_table_parquet",
    mode="overwrite",
    partition_cols=["value"]
)
print("Data written and cataloged.")

# 2. Retrieve the data directly from Amazon S3
print(f"Reading data from S3: {s3_path}")
df_from_s3 = wr.s3.read_parquet(s3_path, dataset=True)
print(f"Read {len(df_from_s3)} rows from S3:\n{df_from_s3}")

# 3. Retrieve the data from Amazon Athena
print(f"Reading data from Athena table '{database_name}.my_table_parquet'")
df_from_athena = wr.athena.read_sql_query("SELECT * FROM my_table_parquet", database=database_name)
print(f"Read {len(df_from_athena)} rows from Athena:\n{df_from_athena}")

# Example for Redshift (requires awswrangler[redshift] and a Glue connection)
# try:
#     # Replace 'my-glue-connection' with your actual Glue connection name
#     con = wr.redshift.connect("my-glue-connection")
#     df_from_redshift = wr.redshift.read_sql_query("SELECT 1 as example_col", con=con)
#     print(f"Read from Redshift:\n{df_from_redshift}")
#     con.close()
# except Exception as e:
#     print(f"Could not connect to Redshift or run query (this is expected if not configured): {e}")
