Apache DataFusion Python
A Python library that provides bindings to Apache DataFusion, an in-memory query engine built on Apache Arrow. It enables users to build and execute high-performance queries using SQL or a DataFrame API against various data sources, including CSV, Parquet, JSON, and in-memory data. Because the engine is written in Rust and operates on Arrow's columnar format, data can be exchanged with PyArrow efficiently and without copies. The library is actively maintained, with a current version of 52.3.0, and typically releases in sync with the core DataFusion project.
Warnings
- breaking Breaking changes to Foreign Function Interface (FFI) for Python extensions (e.g., custom CatalogProvider, TableProvider). Users implementing custom FFI-based providers must now provide `LogicalExtensionCodec` and `TaskContextProvider`, and method signatures have changed.
- gotcha DataFusion's Python bindings are tightly coupled with the core Rust DataFusion library. Downstream libraries (e.g., `deltalake`, `pyiceberg`) that provide DataFusion table providers often require exact version matches. This can lead to dependency conflicts when using multiple such libraries.
- breaking The way schemas are passed to `FileSource` constructors and `FileScanConfigBuilder` has been refactored. File sources now require the schema (including partition columns) at construction, and `FileScanConfigBuilder` no longer accepts a separate schema parameter. Additionally, `FilePruner::try_new()` signature changed.
- deprecated The `SchemaAdapterFactory` has been fully removed from Parquet scanning. This includes the `SchemaAdapter`, `SchemaMapper`, and `DefaultSchemaAdapterFactory` traits/structs.
- gotcha The default value of the `datafusion.execution.collect_statistics` configuration setting changed from `false` to `true`. This means DataFusion will now collect and store statistics by default when a table is first created via `CREATE EXTERNAL TABLE` or DataFrame `register_*` APIs.
- breaking For advanced User-Defined Functions (UDFs), UDF traits now work with `FieldRef` rather than a `DataType` plus nullability directly. `FieldRef` also carries field metadata, which enables extension types.
Install
- pip
pip install datafusion
Imports
- SessionContext
from datafusion import SessionContext
- col
from datafusion import col
- udf
from datafusion import udf
- functions
from datafusion import functions
Quickstart
from datafusion import SessionContext, col
import pyarrow as pa
# Create a DataFusion session context
ctx = SessionContext()
# Create an in-memory PyArrow table
data = {
"id": [1, 2, 3, 4],
"value": [10, 20, 15, 25],
"category": ["A", "B", "A", "C"]
}
pyarrow_table = pa.table(data)
# Register the PyArrow table as a DataFusion table
ctx.register_record_batches("my_table", [pyarrow_table.to_batches()])
# Execute a SQL query
df_sql = ctx.sql("SELECT category, SUM(value) FROM my_table GROUP BY category ORDER BY category")
print("SQL Query Result:")
print(df_sql.to_pandas())
# Execute a DataFrame API query
df_dataframe = ctx.table("my_table")
from datafusion import functions as f  # aggregate functions live in datafusion.functions
df_dataframe = df_dataframe.aggregate(
    [col("category")],
    [f.sum(col("value")).alias("total_value")],
).sort(col("category"))
print("\nDataFrame API Query Result:")
print(df_dataframe.to_pandas())