Metaflow
Metaflow is a human-centric framework for building and managing real-life data science projects, from prototyping to production. It enables data scientists and ML engineers to rapidly develop, deploy, and operate ML workflows. `ob-metaflow` is a PyPI distribution that packages the core Metaflow library; its current version is 2.19.21.1, and it follows the frequent release cadence of the main Metaflow project.
Common errors
- ModuleNotFoundError: No module named 'metaflow'
  cause: The `ob-metaflow` PyPI package has not been installed, or it is installed in a different Python environment.
  fix: Install the package with `pip install ob-metaflow` in your active Python environment.
- MetaflowException: You need to specify a S3 bucket or path using METAFLOW_DATATOOLS_S3ROOT or configure a default S3 root in ~/.metaflow/config.json
  cause: Metaflow is trying to store artifacts remotely (e.g., for `start --environment=conda`), but no S3 bucket has been configured.
  fix: Set the `METAFLOW_DATATOOLS_S3ROOT` environment variable (e.g., `export METAFLOW_DATATOOLS_S3ROOT=s3://your-bucket/metaflow`) or configure it in `~/.metaflow/config.json`. Ensure your AWS credentials are also correctly set.
- MetaflowException: Could not find credentials to access S3
  cause: Metaflow requires AWS credentials to access S3 buckets for artifact storage and remote execution.
  fix: Ensure your AWS credentials are configured via environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`), AWS IAM roles (for EC2/EKS), or a `~/.aws/credentials` file.
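A minimal `~/.metaflow/config.json` along these lines addresses both S3-related errors above. This is a sketch: the bucket path is a placeholder, and the exact set of keys you need may vary with your deployment, so verify key names against the Metaflow configuration docs.

```json
{
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://your-bucket/metaflow",
    "METAFLOW_DATATOOLS_S3ROOT": "s3://your-bucket/metaflow/data"
}
```

AWS credentials are resolved separately (environment variables, IAM role, or `~/.aws/credentials`); this file only tells Metaflow where to store artifacts.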
Warnings
- gotcha The PyPI package name is `ob-metaflow`, but the Python module you import is `metaflow`. Ensure you always use `from metaflow import ...` in your code.
- gotcha Metaflow flows are designed to be run via the Metaflow CLI (`python your_flow.py run`), not by simply executing the Python script (`python your_flow.py`). Running directly will not activate Metaflow's tracking, artifact storage, or other features.
- gotcha For robust, resumable, and shareable flows, Metaflow requires external storage (e.g., AWS S3, Google Cloud Storage) for artifacts. Local storage is primarily for development and prototyping and is not recommended for production.
- breaking Metaflow's default serialization engine switched from 'pickle' to 'cloudpickle' in version 2.0. Additionally, the default protocol for 'cloudpickle' was updated in later 2.x versions. This can cause issues when resuming or inspecting old runs created with different Metaflow versions.
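The protocol change above matters because a pickle stream is written at a specific protocol version, and a reader that only understands an older protocol cannot decode a newer stream. A stdlib-only sketch (no Metaflow involved) showing that the same artifact serializes to different byte streams per protocol:

```python
import pickle

# A stand-in for a Metaflow artifact (any picklable Python object).
artifact = {"message": "Hello Metaflow!", "data": [15, 42]}

# Serialize the same object at two protocol versions.
blobs = {proto: pickle.dumps(artifact, protocol=proto) for proto in (2, 4)}

for proto, blob in blobs.items():
    # Round-tripping works when the reader supports the protocol used to write.
    assert pickle.loads(blob) == artifact
    print(f"protocol {proto}: {len(blob)} bytes, header {blob[:2]!r}")
```

The two-byte header encodes the protocol version (`b'\x80\x02'` vs `b'\x80\x04'`), which is why a run serialized under one Metaflow version may not deserialize cleanly under another.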
Install
-
pip install ob-metaflow
Imports
- FlowSpec
from metaflow import FlowSpec
- step
from metaflow import step
- current
from metaflow import current
Quickstart
from metaflow import FlowSpec, step

class MyFirstMetaflowFlow(FlowSpec):
    """
    A simple Metaflow flow demonstrating basic steps.
    """

    @step
    def start(self):
        self.message = "Hello Metaflow!"
        print(f"Starting flow with message: {self.message}")
        self.next(self.process_data)

    @step
    def process_data(self):
        # Attributes assigned to self are persisted as artifacts
        # and available in downstream steps.
        self.data = [len(self.message), 42]
        print(f"Processing data: {self.data}")
        self.next(self.end)

    @step
    def end(self):
        print(f"Flow finished. Final data: {self.data}")

if __name__ == '__main__':
    # Instantiating the flow hands control to the Metaflow CLI.
    # Run with: python your_flow_file.py run
    MyFirstMetaflowFlow()