Fast GroupBy operations for Dask Arrays

0.11.2 · active · verified Tue Apr 14

Flox is a Python library that provides strategies for fast GroupBy reductions with dask.array, speeding up operations such as climatologies, resampling, and histogramming. Formerly known as `dask_groupby`, it integrates with xarray to make GroupBy and resampling operations more performant.

Quickstart

This quickstart uses `flox.groupby_reduce` with a Dask array and a NumPy array of group labels to compute the mean of each group. The label array must align with the trailing dimension(s) of the data array. The `expected_groups` argument ensures all groups are present in the output, even if some are empty.

import dask.array as da
from flox import groupby_reduce
import numpy as np

# Create a sample Dask array; the dimension being grouped (length 1000) comes last
data = da.random.random((10, 1000), chunks=(10, 100))

# Create a 'by' array of group labels (categories 0-9), aligned with the last axis of `data`
groups = np.random.randint(0, 10, size=1000)

# Perform a GroupBy reduction (e.g., mean) over the grouped dimension
result_mean, group_labels = groupby_reduce(
    data, groups, func="mean", expected_groups=np.arange(10)
)

# The result has shape (10, 10): the leading dimension of `data`, then one entry per group
print("Grouped Means (first 5 groups):\n", result_mean.compute()[:, :5])
print("Group Labels:\n", group_labels)
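
To make the semantics of a grouped mean concrete, the following is a minimal NumPy-only sketch of what such a reduction computes. It is not how flox is implemented. The helper name `grouped_mean_reference` is hypothetical, and the choice of NaN for groups with no members is an assumption of this sketch. It reduces over the last axis and returns one column per expected group.

```python
import numpy as np


def grouped_mean_reference(array, by, expected_groups):
    """Grouped mean over the last axis of `array` (hypothetical helper, not part of flox)."""
    array = np.asarray(array, dtype=float)
    by = np.asarray(by)
    expected_groups = np.asarray(expected_groups)
    if by.ndim != 1 or array.shape[-1] != by.shape[0]:
        raise ValueError("`by` must be 1-D and align with the last axis of `array`")

    # One output column per expected group.
    # Assumption of this sketch: groups with no members are filled with NaN.
    out = np.full(array.shape[:-1] + (expected_groups.size,), np.nan)
    for i, label in enumerate(expected_groups):
        mask = by == label
        if mask.any():
            out[..., i] = array[..., mask].mean(axis=-1)
    return out


values = np.array([[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]])
labels = np.array([0, 0, 2, 2])

# Group 1 has no members, but still gets a column because it is listed as expected.
means = grouped_mean_reference(values, labels, expected_groups=np.arange(3))
print(means)
```

A loop over labels like this is fine for checking small results by hand, but it holds the whole array in memory and scans it once per group, which is the kind of cost a chunk-aware strategy is meant to avoid.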
