Dagster GCP Pandas

0.29.0 · active · verified Thu Apr 16

The `dagster-gcp-pandas` library provides an I/O manager, `BigQueryPandasIOManager`, that stores Pandas DataFrames returned by Dagster assets as BigQuery tables and loads those tables back as DataFrames in downstream assets. It builds on `dagster-gcp` and `pandas`. The library is part of the `dagster` ecosystem and is released in lockstep with the core `dagster` library, so the two should be installed at matching releases.

Common errors

Warnings

Install

pip install dagster-gcp-pandas

Imports

from dagster_gcp_pandas import BigQueryPandasIOManager

Quickstart

This quickstart defines an asset that returns a Pandas DataFrame and uses `BigQueryPandasIOManager` to store it as a BigQuery table. It requires a GCP project, a BigQuery dataset name, and working GCP authentication.

import os

import pandas as pd
from dagster import asset, Definitions
from dagster_gcp_pandas import BigQueryPandasIOManager

@asset
def my_pandas_dataframe_asset() -> pd.DataFrame:
    """Produces a Pandas DataFrame."""
    return pd.DataFrame({"value": [1, 2, 3], "label": ["A", "B", "C"]})

# Configure BigQueryPandasIOManager to store DataFrames as tables in a BigQuery dataset.
# The table is named after the asset (here: my_pandas_dataframe_asset).
# Ensure the GCP_PROJECT environment variable is set or replace "your-gcp-project-id".
# "dagster_assets" is a placeholder dataset name.
# You also need appropriate GCP credentials configured (e.g., GOOGLE_APPLICATION_CREDENTIALS).
bigquery_io_manager = BigQueryPandasIOManager(
    project=os.environ.get("GCP_PROJECT", "your-gcp-project-id"),
    dataset="dagster_assets",
)

defs = Definitions(
    assets=[my_pandas_dataframe_asset],
    resources={
        "io_manager": bigquery_io_manager
    }
)

# To run:
# 1. Save this code as a Python file (e.g., my_project/repo.py)
# 2. Set the GCP_PROJECT environment variable:
#    export GCP_PROJECT="your-actual-project-id"
# 3. Ensure GCP credentials are set up (e.g., using `gcloud auth application-default login`
#    or `GOOGLE_APPLICATION_CREDENTIALS` environment variable pointing to a service account key).
# 4. Execute: `dagster dev -f my_project/repo.py`
# 5. Navigate to the Dagster UI (usually http://localhost:3000) and materialize the asset.
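
Because this library's version is coupled to core `dagster`, a mismatched pair of installs is a plausible cause of import or startup failures. The sketch below checks the pairing from installed package metadata. It assumes the release convention in which an integration library at `0.N.P` ships alongside core `dagster` `1.(N-16).P`; that offset and the helper names `expected_core_version` and `check_pairing` are illustrative assumptions, not part of the library.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: integration libraries at 0.N.P pair with core dagster 1.(N-16).P.
MINOR_OFFSET = 16


def expected_core_version(library_version: str) -> str:
    """Map a 0.N.P library version to the core dagster version assumed to pair with it."""
    major, minor, patch = library_version.split(".")[:3]
    if major != "0":
        raise ValueError(f"expected a 0.x library version, got {library_version!r}")
    return f"1.{int(minor) - MINOR_OFFSET}.{patch}"


def check_pairing(library: str = "dagster-gcp-pandas", core: str = "dagster") -> str:
    """Return a short human-readable report on whether the installed versions pair up."""
    try:
        library_version = version(library)
        core_version = version(core)
    except PackageNotFoundError as exc:
        return f"not installed: {exc}"
    expected = expected_core_version(library_version)
    if core_version == expected:
        return f"ok: {library} {library_version} pairs with {core} {core_version}"
    return f"mismatch: {library} {library_version} expects {core} {expected}, found {core_version}"


if __name__ == "__main__":
    print(check_pairing())
```

Running the script in the environment that hosts the Dagster code location prints a single status line. Pinning both packages together in the requirements file avoids the mismatch in the first place.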
