Apache DataFusion Python

52.3.0 · active · verified Sat Apr 11

A Python library that provides bindings to Apache DataFusion, an Apache Arrow-native query engine written in Rust. It lets users build and execute high-performance queries with SQL or a DataFrame API against a variety of data sources, including CSV, Parquet, JSON, and in-memory data, and it exchanges data with PyArrow efficiently, with zero-copy where possible. The library is actively maintained and typically releases in sync with the core DataFusion project.

Warnings

Install
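The Python bindings are published on PyPI under the package name `datafusion`:

```shell
pip install datafusion
```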

Imports

Quickstart

Demonstrates how to create an in-memory PyArrow table, register it with DataFusion's `SessionContext`, and then query it using both SQL and the DataFrame API. Results are converted to Pandas DataFrames for easy display.

from datafusion import SessionContext, col
import pyarrow as pa

# Create a DataFusion session context
ctx = SessionContext()

# Create an in-memory PyArrow table
data = {
    "id": [1, 2, 3, 4],
    "value": [10, 20, 15, 25],
    "category": ["A", "B", "A", "C"]
}
pyarrow_table = pa.table(data)

# Register the PyArrow table as a DataFusion table
ctx.register_record_batches("my_table", [pyarrow_table.to_batches()])

# Execute a SQL query
df_sql = ctx.sql("SELECT category, SUM(value) AS total_value FROM my_table GROUP BY category ORDER BY category")
print("SQL Query Result:")
print(df_sql.to_pandas())

# Execute the same query with the DataFrame API
from datafusion import functions as f  # aggregate functions live in datafusion.functions

df_dataframe = (
    ctx.table("my_table")
    .aggregate([col("category")], [f.sum(col("value")).alias("total_value")])
    .sort(col("category"))
)
print("\nDataFrame API Query Result:")
print(df_dataframe.to_pandas())
