AWS SDK for pandas (awswrangler)
AWS SDK for pandas, also known as awswrangler, extends the popular pandas library to simplify data integration with AWS services. It provides high-level abstractions for common data engineering tasks such as reading and writing data on Amazon S3, querying data in Athena and Redshift, and interacting with AWS Glue, DynamoDB, Timestream, and more. The library is actively maintained with frequent (often monthly) releases; the current version is 3.15.1.
Warnings
- breaking Support for Python 3.9 was dropped in version 3.15.0, and for Python 3.8 in 3.11.0. Ensure your environment uses Python >= 3.10.
- breaking Starting from version 3.0, feature-specific dependencies (e.g., for Redshift, MySQL, OpenSearch) must be installed explicitly using extras syntax (e.g., `pip install 'awswrangler[redshift]'`). Simply installing `awswrangler` will only include core dependencies.
- gotcha AWS SDK for pandas versions `>=3.14.0` default to PyArrow 21.0.0+, which requires CMake 3.25+ to build. This can cause issues in environments with older CMake versions (e.g., Amazon Linux 2 notebook instances).
- gotcha The output format for `wr.dynamodb.read_items` changed in version 3.5.0. It now returns DynamoDB datatypes within the DataFrame, which can break existing parsing logic.
- gotcha AWS Lambda functions using the `awswrangler` layer may need more than the default memory allocation; functions configured with less than 512MB can hit memory-related errors on some data processing workloads.
- gotcha Security vulnerabilities in underlying dependencies (e.g., `aiohttp`, `setuptools`, `pg8000`) are frequently fixed in new `awswrangler` releases. Running older versions might expose you to known CVEs.
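The `wr.dynamodb.read_items` gotcha above means result DataFrames can now contain raw DynamoDB datatypes, e.g. `decimal.Decimal` for numbers. A minimal sketch of post-processing such a frame with plain pandas (no AWS call is made; the column names and data are illustrative):

```python
from decimal import Decimal

import pandas as pd

# Simulated frame resembling read_items output in awswrangler >= 3.5.0:
# DynamoDB numbers arrive as decimal.Decimal rather than int/float.
df = pd.DataFrame({
    "id": [Decimal("1"), Decimal("2")],
    "score": [Decimal("9.5"), Decimal("7.25")],
})

def normalize_dynamodb_numbers(frame: pd.DataFrame) -> pd.DataFrame:
    # Convert all-Decimal columns to native numeric dtypes for downstream code.
    out = frame.copy()
    for col in out.columns:
        if out[col].map(lambda v: isinstance(v, Decimal)).all():
            out[col] = pd.to_numeric(out[col].astype(str))
    return out

clean = normalize_dynamodb_numbers(df)
print(clean.dtypes.tolist())
```

This keeps parsing logic written against pre-3.5.0 output working without pinning an old awswrangler version.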
Install
- pip install awswrangler
- pip install 'awswrangler[redshift,mysql,postgresql]'
Imports
- awswrangler
import awswrangler as wr
- pandas
import pandas as pd
Quickstart
import awswrangler as wr
import pandas as pd
from datetime import datetime
import os
# Ensure you have AWS credentials configured (e.g., via AWS CLI or environment variables)
# For quickstart, ensure the S3_BUCKET is set in your environment
s3_bucket = os.environ.get('S3_BUCKET', 'your-aws-s3-bucket-name')
if s3_bucket == 'your-aws-s3-bucket-name':
    print("WARNING: Please set the S3_BUCKET environment variable or replace 'your-aws-s3-bucket-name' in the code.")
database_name = os.environ.get('ATHENA_DATABASE', 'awswrangler_db')
if database_name == 'awswrangler_db':
    print("WARNING: Using default Athena database 'awswrangler_db'. Consider setting ATHENA_DATABASE env var.")
# Create a sample DataFrame
df = pd.DataFrame({
    "id": [1, 2],
    "value": ["foo", "boo"],
    "timestamp": [datetime.now(), datetime.now()],
})
# 1. Store data on S3 as Parquet and register with Glue Catalog
s3_path = f"s3://{s3_bucket}/awswrangler-quickstart/my_dataset/"
print(f"Writing DataFrame to S3: {s3_path}")
wr.s3.to_parquet(
    df=df,
    path=s3_path,
    dataset=True,
    database=database_name,
    table="my_table_parquet",
    mode="overwrite",
    partition_cols=["value"],
)
print("Data written and cataloged.")
# 2. Retrieve the data directly from Amazon S3
print(f"Reading data from S3: {s3_path}")
df_from_s3 = wr.s3.read_parquet(s3_path, dataset=True)
print(f"Read {len(df_from_s3)} rows from S3:\n{df_from_s3}")
# 3. Retrieve the data from Amazon Athena
print(f"Reading data from Athena table '{database_name}.my_table_parquet'")
df_from_athena = wr.athena.read_sql_query("SELECT * FROM my_table_parquet", database=database_name)
print(f"Read {len(df_from_athena)} rows from Athena:\n{df_from_athena}")
# Example for Redshift (requires awswrangler[redshift] and a Glue connection)
# try:
# # Replace 'my-glue-connection' with your actual Glue connection name
# con = wr.redshift.connect("my-glue-connection")
# df_from_redshift = wr.redshift.read_sql_query("SELECT 1 as example_col", con=con)
# print(f"Read from Redshift:\n{df_from_redshift}")
# con.close()
# except Exception as e:
# print(f"Could not connect to Redshift or run query (this is expected if not configured): {e}")
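In the quickstart, `partition_cols=["value"]` makes `wr.s3.to_parquet` lay out Hive-style `value=<v>/` prefixes under the dataset path. A rough, self-contained sketch of that layout logic (pure pandas, no S3 access; the bucket name is a placeholder):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
base = "s3://example-bucket/awswrangler-quickstart/my_dataset/"  # placeholder path

# One Hive-style prefix per distinct partition value, mirroring how
# dataset=True + partition_cols organizes Parquet objects on S3.
prefixes = sorted(f"{base}value={v}/" for v in df["value"].unique())
print(prefixes)
```

Partition pruning in Athena relies on these prefixes, which is why high-cardinality columns make poor partition keys.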