Fugue: Abstraction Layer for Distributed Computing

0.9.7 · active · verified Thu Apr 09

Fugue is a Python library that provides a unified interface for defining data workflows, so the same code can run on Pandas, Spark, Dask, Ray, and other distributed computing engines without changes. This makes data pipelines more portable and easier to test. The current version is 0.9.7, and the project is actively maintained.

Quickstart

This quickstart defines a simple Fugue workflow. It creates a DataFrame, applies a transformation function (`map_to_string`) that adds a string copy of the numeric `value` column, and then executes the workflow, printing the result to the console. The `schema` argument to `transform` declares the output schema: `*` keeps every input column, and `value_str:str` adds the new string column.

import pandas as pd
from fugue import FugueWorkflow

def map_to_string(df: pd.DataFrame) -> pd.DataFrame:
    # Add a string copy of the numeric "value" column.
    return df.assign(value_str=df["value"].astype(str))

# Build the workflow; nothing executes until run() is called.
dag = FugueWorkflow()
df = dag.df(pd.DataFrame({"id": [1, 2], "value": [10, 20]}))
result = df.transform(map_to_string, schema="*,value_str:str")
result.show()

# Execute on the default local engine.
dag.run()
