Dagster GCP Pandas
The `dagster-gcp-pandas` library provides an I/O manager for persisting Pandas DataFrames to Google Cloud Storage (GCS) within Dagster assets. It leverages `pandas` and `gcsfs` for efficient data serialization (defaulting to Parquet). This library is part of the `dagster` ecosystem and its versioning is tightly coupled with the core `dagster` library.
Common errors
- ModuleNotFoundError: No module named 'dagster_gcp'
  cause: The `dagster-gcp` library, which `dagster-gcp-pandas` depends on for core GCS functionality, is not installed.
  fix: Run `pip install dagster-gcp-pandas`, which should pull in `dagster-gcp` as a dependency. If it is still missing, run `pip install dagster-gcp` explicitly.
- google.api_core.exceptions.PermissionDenied: 403 GET ... Insufficient Permission
  cause: The authenticated GCP identity (service account or user) does not have the necessary permissions to read from or write to the specified GCS bucket.
  fix: Grant the appropriate IAM roles (e.g., `Storage Object Viewer`, `Storage Object Creator`) to the service account or user for the GCS bucket. Verify that GCP credentials are correctly configured in the execution environment.
- FileNotFoundError: Could not find object at gs://your-gcs-bucket-name/dagster_assets/pandas/my_pandas_dataframe_asset
  cause: The `GCSPandasIOManager` attempted to load an asset that does not exist at the specified GCS path, or the path is incorrect.
  fix: Ensure the asset has been materialized at least once. Verify that the `gcs_bucket` and `gcs_prefix` configured in your `GCSPandasIOManager` match the actual location where the asset was stored or is expected to be found.
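When debugging the `FileNotFoundError` above, it can help to compute the object path you expect and compare it with what is actually in the bucket. The sketch below assumes the layout seen in that error message, `gs://<bucket>/<prefix>/<asset name>`. The helper `expected_gcs_uri` is hypothetical and is not part of `dagster-gcp-pandas`; the real I/O manager may append a file extension or use a different layout.

```python
def expected_gcs_uri(bucket: str, prefix: str, asset_name: str) -> str:
    """Build gs://<bucket>/<prefix>/<asset_name> (hypothetical helper).

    Assumes the layout shown in the FileNotFoundError message; the actual
    I/O manager may add a file extension or lay objects out differently.
    """
    # Strip stray slashes so "dagster_assets/pandas/" and "dagster_assets/pandas"
    # produce the same URI, and skip an empty prefix entirely.
    parts = [p.strip("/") for p in (prefix, asset_name) if p.strip("/")]
    return "gs://" + "/".join([bucket.strip("/"), *parts])


if __name__ == "__main__":
    print(
        expected_gcs_uri(
            "your-gcs-bucket-name", "dagster_assets/pandas", "my_pandas_dataframe_asset"
        )
    )
```

If the printed URI differs from the location of the materialized object, the `gcs_bucket` or `gcs_prefix` setting is the likely culprit.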
Warnings
- breaking: Dagster libraries, including `dagster-gcp-pandas`, are versioned in lockstep with the core `dagster` library. Installing mismatched versions (e.g., `dagster==1.0.0` with `dagster-gcp-pandas==0.15.0`) can lead to `ModuleNotFoundError` or other runtime errors.
- gotcha: `dagster-gcp-pandas` needs valid Google Cloud Platform (GCP) credentials and sufficient IAM permissions to interact with GCS. Missing credentials typically cause authentication failures, and insufficient permissions result in `PermissionDenied` errors.
- gotcha: The `GCSPandasIOManager` defaults to Parquet format for serialization. If you expect or require other formats like CSV, JSON, or Feather, you must explicitly configure the `file_extension` parameter.
- gotcha: Misconfiguring the `gcs_bucket` or `gcs_prefix` parameters can lead to assets not being found when loading, or being written to unexpected locations within GCS.
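To spot a version mismatch like the one described in the first warning, one option is to list the installed versions of the relevant packages side by side. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and it only reports versions. Deciding which library version pairs with which core `dagster` version still requires the Dagster release notes.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    names = ["dagster", "dagster-gcp", "dagster-gcp-pandas"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

Installing all Dagster packages in a single `pip install` command generally lets the resolver pick a compatible set.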
Install
-
pip install dagster-gcp-pandas
Imports
- GCSPandasIOManager
from dagster_gcp_pandas import GCSPandasIOManager
- GCSResource
from dagster_gcp.gcs import GCSResource
Quickstart
```python
import os

import pandas as pd
from dagster import Definitions, asset
from dagster_gcp_pandas import GCSPandasIOManager


@asset
def my_pandas_dataframe_asset() -> pd.DataFrame:
    """Produces a Pandas DataFrame."""
    return pd.DataFrame({"value": [1, 2, 3], "label": ["A", "B", "C"]})


# Configure the GCSPandasIOManager to store DataFrames in a specified GCS bucket.
# Ensure the GCS_BUCKET_NAME environment variable is set or replace "your-gcs-bucket-name".
# You also need appropriate GCP credentials configured (e.g., GOOGLE_APPLICATION_CREDENTIALS).
gcs_io_manager = GCSPandasIOManager(
    gcs_bucket=os.environ.get("GCS_BUCKET_NAME", "your-gcs-bucket-name"),
    gcs_prefix="dagster_assets/pandas",
)

defs = Definitions(
    assets=[my_pandas_dataframe_asset],
    resources={"io_manager": gcs_io_manager},
)

# To run:
# 1. Save this code as a Python file (e.g., my_project/repo.py)
# 2. Set the GCS_BUCKET_NAME environment variable:
#    export GCS_BUCKET_NAME="your-actual-bucket-name"
# 3. Ensure GCP credentials are set up (e.g., using `gcloud auth application-default login`
#    or the `GOOGLE_APPLICATION_CREDENTIALS` environment variable pointing to a service account key).
# 4. Execute: `dagster dev -f my_project/repo.py`
# 5. Navigate to the Dagster UI (usually http://localhost:3000) and materialize the asset.
```
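Before launching `dagster dev`, a small preflight check can catch the two environment problems the run steps mention: an unset `GCS_BUCKET_NAME` and a `GOOGLE_APPLICATION_CREDENTIALS` path that does not exist. The sketch below is illustrative only; `preflight` is a hypothetical helper, it uses just the standard library, and it does not contact GCP or verify IAM permissions. An unset `GOOGLE_APPLICATION_CREDENTIALS` is not flagged, because credentials may come from `gcloud auth application-default login` instead.

```python
import os
from pathlib import Path


def preflight(env=None):
    """Return a list of human-readable problems found in the environment."""
    env = os.environ if env is None else env
    problems = []

    # The quickstart falls back to a placeholder bucket when this is unset.
    if not env.get("GCS_BUCKET_NAME"):
        problems.append("GCS_BUCKET_NAME is not set; the placeholder bucket name will be used.")

    # Only validate the key path if one is configured.
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path and not Path(key_path).is_file():
        problems.append(f"GOOGLE_APPLICATION_CREDENTIALS points to a missing file: {key_path}")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.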