Metaflow
Metaflow is a human-friendly Python library that helps scientists and engineers build and manage real-life data science projects. Originally developed at Netflix, it provides a unified API to the infrastructure stack required for data science projects, from prototype to production. It is actively maintained with frequent patch releases.
Warnings
- breaking While Metaflow generally prioritizes backward compatibility, minor breaking changes can occur, especially in patch versions addressing bug fixes or internal architectural improvements. Always review the GitHub release notes before upgrading.
- gotcha When scaling Metaflow flows to remote compute environments (e.g., AWS Batch, Kubernetes), third-party dependencies installed locally with `pip install` or `conda install` are not automatically available. Declare them explicitly with the `@pypi` or `@conda` decorators on individual steps, or with `@pypi_base` or `@conda_base` on the flow class, and run the flow with `--environment=pypi` or `--environment=conda` so that the declared environment is used for reproducible execution both locally and remotely.
- gotcha Metaflow's most mature and battle-tested integrations are with Amazon Web Services (AWS), including S3 for storage, Batch for compute, and Step Functions for orchestration. While it supports other cloud providers like Azure and GCP, the level of integration and available features may vary, potentially requiring more manual configuration.
- gotcha Metaflow does not offer native support for Windows. Windows users must install and run Metaflow inside the Windows Subsystem for Linux (WSL), because it relies on a Unix-like environment.
- gotcha Data artifacts (instance attributes assigned on `self` inside a step) are automatically persisted and passed between steps. Global variables and external state modified outside of Metaflow's artifact management are not tracked or restored, so relying on them can lead to non-reproducible runs, especially in distributed or resumed executions.
Install
pip install metaflow
Imports
- FlowSpec
from metaflow import FlowSpec
- step
from metaflow import step
- Parameter
from metaflow import Parameter
- card
from metaflow import card
- pypi
from metaflow import pypi
- conda
from metaflow import conda
Quickstart
from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):
    """A simple flow in which Metaflow prints 'Hi'."""

    @step
    def start(self):
        """This is the 'start' step. All flows must have a step named 'start'."""
        print("HelloFlow is starting.")
        # Assigning to self stores the value as an artifact for later steps.
        self.message = "Metaflow says: Hi!"
        self.next(self.hello)

    @step
    def hello(self):
        """A step for Metaflow to introduce itself."""
        print(self.message)
        self.next(self.end)

    @step
    def end(self):
        """This is the 'end' step. All flows must have an 'end' step."""
        print("HelloFlow is all done.")


if __name__ == "__main__":
    HelloFlow()