Ibis: The Portable Python Dataframe Library
Ibis is a portable Python dataframe library that provides a Pythonic way to build and execute operations on data in a range of backends, including SQL databases, data warehouses, and data lakes. It offers a familiar dataframe API that compiles to each backend's native query language, so you can iterate locally and deploy remotely by changing a single line of code (the backend connection). The current version is 12.0.0, and new releases ship frequently.
Warnings
- breaking: Python 3.9 is no longer supported. Upgrade to Python 3.10 or newer.
- breaking: PySpark versions older than 3.5 are no longer supported by the PySpark backend.
- breaking: The `DataType.name` attribute has been removed. To get the string name of a data type instance `dtype`, use `dtype.__class__.__name__` (equivalently, `type(dtype).__name__`).
- breaking: Explicit naming of `memtable`s is no longer supported. Use `create_table` or `create_view` for named objects.
- gotcha: Ibis uses lazy evaluation. Operations build an expression graph but do not execute immediately. To get results, you must explicitly call a method such as `.execute()` or `.to_pandas()`.
- gotcha: A different PyPI package, also named `ibis`, is a web templating framework. Install `ibis-framework` to get the dataframe library.
Install
- pip install 'ibis-framework[duckdb,examples]'
- pip install ibis-framework
- pip install 'ibis-framework[all]'
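Because the unrelated `ibis` templating package shares the import name, it can help to check which distribution is actually installed. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The dataframe library is published as `ibis-framework`.
framework_version = installed_version("ibis-framework")
# The templating framework is published as plain `ibis`.
templating_version = installed_version("ibis")

if framework_version is None:
    print("ibis-framework is not installed")
else:
    print(f"ibis-framework {framework_version} is installed")

if templating_version is not None:
    print("Warning: the unrelated `ibis` templating package is also installed")
```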
Imports
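- ibis: `import ibis`
- ibis.options.interactive: `ibis.options.interactive = True` (enables interactive mode, so expressions are executed and displayed when printed in a REPL or notebook)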
Quickstart
import ibis
# Connect to an in-memory DuckDB database (default backend)
con = ibis.duckdb.connect(':memory:')
# Load example data (e.g., the 'penguins' dataset)
t = ibis.examples.penguins.fetch()
# Create a table in the connected database
con.create_table('penguins', t.to_pyarrow(), overwrite=True)
# Get a table expression from the connection
table = con.table('penguins')
# Perform a lazy computation: group by species and calculate mean bill length
result_expr = table.group_by('species').agg(avg_bill_length=table.bill_length_mm.mean())
# Execute the expression and fetch results into a pandas DataFrame
df = result_expr.to_pandas()
print(df)