qpd: Query Pandas Using SQL
QPD (Query Pandas Dataframes) is a Python library that allows users to run SQL `SELECT` statements on pandas-like dataframes, including Pandas, Dask, and Ray (via Modin on Ray). It translates SQL directly into dataframe operations, prioritizing correctness and consistent behavior across backends, even handling SQL-specific behaviors like `GROUP BY` with null keys differently than default pandas. The current version is 0.4.4. The library has a sporadic but active release cadence, with multiple minor updates in recent history, primarily focusing on compatibility and bug fixes.
Warnings
- gotcha QPD's SQL implementation for `GROUP BY` follows SQL standards, where `NULL` keys are treated as a distinct group. This differs from default Pandas `groupby()` behavior which often drops `NULL` keys by default. Users should be aware of this semantic difference when migrating SQL queries or expecting Pandas-like output for null values in grouping columns.
- breaking QPD versions prior to 0.4.1 might have compatibility issues with Pandas 2.x. Version 0.4.1 introduced explicit compatibility for Pandas 2.
- gotcha In `qpd` versions 0.2.3 and older, certain set operations could unexpectedly alter the input DataFrame by adding columns. This bug was fixed in version 0.2.4.
- gotcha QPD currently focuses primarily on `SELECT` statements in SQL. Support for other SQL DDL/DML operations (e.g., `INSERT`, `UPDATE`, `DELETE`, `CREATE TABLE`) is not explicitly provided or may be limited.
Install
-
pip install qpd -
pip install qpd[all]
Imports
- run_sql_on_pandas
from qpd_pandas import run_sql_on_pandas
Quickstart
import pandas as pd
from qpd_pandas import run_sql_on_pandas
# Create a sample Pandas DataFrame
data = {
'id': [1, 2, 3, 4, 5, 6, 7, 8],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Heidi'],
'age': [25, 30, 35, 25, 40, 30, 35, 25],
'city': ['New York', 'London', 'Paris', 'London', 'New York', 'Paris', 'London', 'New York']
}
df = pd.DataFrame(data)
# Define an SQL query
sql_query = """
SELECT city, AVG(age) AS avg_age, COUNT(id) AS num_people
FROM df
WHERE age > 25
GROUP BY city
HAVING COUNT(id) > 1
ORDER BY avg_age DESC
"""
# Run the SQL query on the DataFrame
result_df = run_sql_on_pandas(sql_query, df=df)
print(result_df)