Dagster Pandas Utilities

0.29.0 · active · verified Sat Apr 11

dagster-pandas is a library within the Dagster ecosystem that provides utilities for working with Pandas DataFrames. It extends Dagster with DataFrame-level validation, summary statistics generation, and serialization/deserialization for Pandas objects. The current version is 0.29.0, and releases closely track the main Dagster core releases.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates defining two Dagster assets using Pandas. The `raw_data_csv` asset simulates loading data into a DataFrame, and `processed_data` transforms it by filtering based on age. To run, save the code as a Python file, execute `dagster dev -f <your_file.py>`, and then use the Dagster UI to materialize the assets.

import pandas as pd
from dagster import Definitions, asset

# --- defs/data/sample_data.csv ---
# name,age,city
# Alice,25,New York
# Bob,35,San Francisco
# Charlie,45,Chicago
# Diana,28,Boston

# --- defs/assets.py ---
@asset
def raw_data_csv() -> pd.DataFrame:
    # In a real scenario, this would read from a persistent store, e.g., S3 or a database.
    # For the quickstart, we simulate the load by building the DataFrame in memory
    # with the same rows as sample_data.csv above.
    data = {
        'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
        'age': [25, 35, 45, 28],
        'city': ['New York', 'San Francisco', 'Chicago', 'Boston']
    }
    return pd.DataFrame(data)

@asset
def processed_data(raw_data_csv: pd.DataFrame) -> pd.DataFrame:
    # Keep only the rows where age is greater than 30.
    return raw_data_csv[raw_data_csv['age'] > 30].copy()

# --- definitions.py ---
# For the quickstart, the assets are defined in this same file,
# so they are passed to Definitions directly.

defs = Definitions(assets=[raw_data_csv, processed_data])

# To run this, save as a Python file (e.g., my_project.py) and run:
# dagster dev -f my_project.py
# Then open http://localhost:3000 and materialize 'processed_data'
