Dagster Pandas Utilities
dagster-pandas is a library within the Dagster ecosystem that provides utilities for working with Pandas DataFrames. It enhances Dagster's capabilities by offering DataFrame-level validation, summary statistics generation, and reliable serialization/deserialization for Pandas objects. Currently at version 0.29.0, its release cadence is tied closely to the main Dagster core releases.
Warnings
- breaking Dagster core and library versions are tightly coupled. While libraries like `dagster-pandas` follow their own semantic versioning, major changes in Dagster core can necessitate upgrades or adjustments in library usage. Always check the Dagster core changelog for breaking changes relevant to your setup.
- gotcha Features such as `create_dagster_pandas_dataframe_type` and `PandasColumn` are often marked as 'beta' or 'preview' in the Dagster API. This means they might introduce breaking changes in minor versions or have behavior changes in patch releases.
- gotcha Pandas operations, especially on large DataFrames, can be memory-intensive and lead to Out Of Memory (OOM) errors. This is a common pitfall when processing big datasets within Dagster pipelines without proper memory management or architectural considerations.
- gotcha Dagster's layered execution model can sometimes make Python stack traces less developer-friendly, obscuring the direct cause of errors within your Pandas transformation logic.
- gotcha Python version compatibility is crucial. While `dagster-pandas` specifies `Python <3.15, >=3.10`, Dagster core regularly drops support for Python versions that reach End Of Life (EOL). Running on an unsupported Python version can lead to unexpected issues.
Install
-
pip install dagster-pandas
Imports
- asset
from dagster import asset
- Definitions
from dagster import Definitions
- load_assets_from_modules
from dagster import load_assets_from_modules
- create_dagster_pandas_dataframe_type
from dagster_pandas import create_dagster_pandas_dataframe_type
- PandasColumn
from dagster_pandas import PandasColumn
Quickstart
import os
import pandas as pd
from dagster import asset, Definitions, load_assets_from_modules
# --- defs/data/sample_data.csv ---
# name,age,city
# Alice,25,New York
# Bob,35,San Francisco
# Charlie,45,Chicago
# Diana,28,Boston
# --- defs/assets.py ---
@asset
def raw_data_csv() -> pd.DataFrame:
# In a real scenario, this would read from a persistent store, e.g., S3 or a database
# For quickstart, we simulate by creating a DataFrame
data = {
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 35, 45, 28],
'city': ['New York', 'San Francisco', 'Chicago', 'Boston']
}
return pd.DataFrame(data)
@asset
def processed_data(raw_data_csv: pd.DataFrame) -> pd.DataFrame:
return raw_data_csv[raw_data_csv['age'] > 30].copy()
# --- definitions.py ---
# Assuming assets.py is in a 'defs' directory or in the same file for quickstart
all_assets = load_assets_from_modules([__name__]) # Load assets from this file
defs = Definitions(assets=all_assets)
# To run this, save as a Python file (e.g., my_project.py) and run:
# dagster dev -f my_project.py
# Then open http://localhost:3000 and materialize 'processed_data'