{"id":2920,"library":"datafusion","title":"Apache DataFusion Python","description":"A Python library that provides bindings to DataFusion, the Apache Arrow in-memory query engine. It enables users to build and execute high-performance queries using SQL or a DataFrame API against various data sources, including CSV, Parquet, JSON, and in-memory data. Its query engine is written in Rust and supports efficient, zero-copy data exchange with PyArrow. The library is actively maintained, with a current version of 52.3.0, and typically releases in sync with the core DataFusion project.","status":"active","version":"52.3.0","language":"en","source_language":"en","source_url":"https://github.com/apache/datafusion-python","tags":["data processing","query engine","SQL","dataframe","apache arrow","rust","etl"],"install":[{"cmd":"pip install datafusion","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core data format and interoperability.","package":"pyarrow","optional":false},{"reason":"Commonly used for converting DataFusion results to Pandas DataFrames.","package":"pandas","optional":true},{"reason":"Required for interacting with Delta Lake tables.","package":"deltalake","optional":true},{"reason":"Required for interacting with Iceberg tables.","package":"pyiceberg","optional":true}],"imports":[{"symbol":"SessionContext","correct":"from datafusion import SessionContext"},{"note":"Used for DataFrame API operations, especially column selection and expressions.","symbol":"col","correct":"from datafusion import col"},{"note":"For defining User-Defined Scalar Functions (UDFs).","symbol":"udf","correct":"from datafusion import udf"},{"note":"Provides access to built-in DataFusion functions like `functions.sum()`.","symbol":"functions","correct":"from datafusion import functions"}],"quickstart":{"code":"from datafusion import SessionContext, col, functions as f\nimport pyarrow as pa\n\n# Create a DataFusion session context\nctx = SessionContext()\n\n# Create an in-memory PyArrow table\ndata = {\n    \"id\": [1, 2, 3, 4],\n    \"value\": [10, 20, 15, 25],\n    \"category\": [\"A\", \"B\", \"A\", \"C\"]\n}\npyarrow_table = pa.table(data)\n\n# Register the PyArrow table as a single-partition DataFusion table\nctx.register_record_batches(\"my_table\", [pyarrow_table.to_batches()])\n\n# Execute a SQL query\ndf_sql = ctx.sql(\"SELECT category, SUM(value) AS total_value FROM my_table GROUP BY category ORDER BY category\")\nprint(\"SQL Query Result:\")\nprint(df_sql.to_pandas())\n\n# The same query via the DataFrame API; aggregate() takes the grouping\n# expressions and the aggregate expressions as two lists\ndf_dataframe = (\n    ctx.table(\"my_table\")\n    .aggregate([col(\"category\")], [f.sum(col(\"value\")).alias(\"total_value\")])\n    .sort(col(\"category\").sort(ascending=True))\n)\nprint(\"\\nDataFrame API Query Result:\")\nprint(df_dataframe.to_pandas())\n","lang":"python","description":"Demonstrates how to create an in-memory PyArrow table, register it with DataFusion's `SessionContext`, and then query it using both SQL and the DataFrame API. Results are converted to Pandas DataFrames (via the optional `pandas` dependency) for easy display."},"warnings":[{"fix":"Update custom FFI implementations to include `LogicalExtensionCodec` and `TaskContextProvider` and adapt to the new function signatures. Refer to the DataFusion Python Extensions documentation for migration details.","message":"Breaking changes to the Foreign Function Interface (FFI) for Python extensions (e.g., custom CatalogProvider, TableProvider). Users implementing custom FFI-based providers must now provide `LogicalExtensionCodec` and `TaskContextProvider`, and method signatures have changed.","severity":"breaking","affected_versions":">= 52.0.0"},{"fix":"Carefully manage dependencies and their DataFusion version requirements. Consider using `pip freeze` and `pip check` to identify conflicts.
Check release notes of downstream libraries for compatible DataFusion versions.","message":"DataFusion's Python bindings are tightly coupled to the core Rust DataFusion library. Downstream libraries (e.g., `deltalake`, `pyiceberg`) that provide DataFusion table providers often require exact version matches, which can lead to dependency conflicts when several such libraries are used together.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Adjust custom `FileSource` and `FileScanConfigBuilder` implementations to provide schemas upfront during construction. Update `FilePruner` usage as described in the migration guide.","message":"The way schemas are passed to `FileSource` constructors and `FileScanConfigBuilder` has been refactored. File sources now require the schema (including partition columns) at construction, `FileScanConfigBuilder` no longer accepts a separate schema parameter, and the `FilePruner::try_new()` signature changed.","severity":"breaking","affected_versions":">= 44.0.0"},{"fix":"Remove reliance on `SchemaAdapterFactory` and related components for Parquet scanning; DataFusion now handles schema adaptation differently.","message":"`SchemaAdapterFactory` has been fully removed from Parquet scanning, including the `SchemaAdapter`, `SchemaMapper`, and `DefaultSchemaAdapterFactory` traits/structs.","severity":"deprecated","affected_versions":">= 49.0.0 (deprecated in 49.0.0, removed later)"},{"fix":"Be aware of potential performance implications from statistics collection on table registration. If undesired, explicitly set `ctx.session_config().with_collect_statistics(False)` or configure via `config.set('datafusion.execution.collect_statistics', 'false')`.","message":"The default value of the `datafusion.execution.collect_statistics` configuration setting changed from `false` to `true`.
This means DataFusion will now collect and store statistics by default when a table is first created via `CREATE EXTERNAL TABLE` or the DataFrame `register_*` APIs.","severity":"gotcha","affected_versions":">= 48.0.0"},{"fix":"Update custom UDF implementations to use `FieldRef` wherever type and nullability information is accessed.","message":"For advanced User-Defined Functions (UDFs), the UDF traits now use `FieldRef` rather than a bare `DataType` plus a nullability flag. `FieldRef` also carries field metadata, which enables extension types.","severity":"breaking","affected_versions":">= 48.0.0"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}