Pandasql
Pandasql is a Python library that allows users to query pandas DataFrames using SQL syntax. It functions similarly to `sqldf` in R, leveraging SQLite under the hood to provide a familiar interface for data manipulation and analysis for those comfortable with SQL. It is currently at version 0.7.3 and receives limited updates, with alternatives like DuckDB or Polars SQL often recommended for more active development or performance needs.
Warnings
- gotcha Pandasql uses SQLite as its backend, which has inherent limitations, such as a default limit of around 999 SQL variables/parameters. This can lead to `OperationalError: too many SQL variables` when querying very wide DataFrames (many columns) or complex queries with numerous parameters.
- gotcha For `sqldf` to correctly identify DataFrames, they must be available in the global or local scope where `sqldf` is called. If `sqldf` is used within a function and the DataFrame is not passed into the function's `locals()` or `globals()`, you might encounter `no such table` errors.
- gotcha Pandasql only supports Data Query Language (DQL) operations (e.g., SELECT statements). It does not support Data Manipulation Language (DML) like INSERT, UPDATE, DELETE, or Data Definition Language (DDL) like CREATE, ALTER, DROP for modifying DataFrames or their structure.
- deprecated The `pandasql` library receives limited updates and is less actively maintained compared to other SQL-on-DataFrame solutions. For new projects or performance-critical applications, modern alternatives like `duckdb` (via `duckdb.query_df`), Polars SQL, or even native `pandas.DataFrame.query` are often recommended for better performance and ongoing development.
Install
-
pip install pandasql
Imports
- sqldf
from pandasql import sqldf
Quickstart
import pandas as pd
from pandasql import sqldf
# Create a sample pandas DataFrame
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 22],
'city': ['New York', 'Los Angeles', 'Chicago']
})
# Define an SQL query as a string
query = """
SELECT name, age
FROM df
WHERE age > 23
ORDER BY age DESC
"""
# Execute the SQL query using sqldf
result_df = sqldf(query)
print(result_df)