Dask Histogram
dask-histogram provides parallel, out-of-core histogramming by integrating Dask with the boost-histogram library. It computes histograms on datasets that may not fit into memory, using Dask's parallel and distributed schedulers to process the data chunk by chunk. The current version is 2026.2.0; releases are frequent, often monthly or bi-monthly.
Common errors
- AttributeError: 'DaskHistogram' object has no attribute 'values'
  Cause: Accessing histogram data attributes (such as `values`, `counts`, or `sum_weights`) on a lazy Dask histogram object before calling `.compute()`.
  Fix: Call `.compute()` first to get the concrete boost-histogram instance. Example: `computed_hist = dask_hist_obj.compute(); print(computed_hist.view())`
- TypeError: cannot pickle '_thread.RLock' object
  Cause: A non-serializable object (such as a lock) ended up in the Dask graph, so the graph cannot be sent across processes in a distributed Dask setup.
  Fix: Make sure every object passed into Dask operations or stored in the Dask graph is serializable. In practice this means checking custom functions and `boost-histogram` axis definitions for captured locks, open handles, or similar state. Restarting the Dask client or environment can sometimes clear transient pickling issues.
- ValueError: Mismatched number of dimensions in fill data
  Cause: The number of data arrays passed to `fill()` does not match the number of axes defined in the underlying `boost-histogram` object.
  Fix: Compare your histogram's axis definitions with the data arrays passed to `fill()`. An N-dimensional histogram needs N one-dimensional arrays, one per axis.
- AttributeError: module 'dask_histogram' has no attribute 'factory'
  Cause: `dask_histogram.factory` is missing or incompatible in the installed combination of dask-histogram and Dask. This is most likely after upgrading to Dask 2024.12 or later.
  Fix: Update `dask-histogram` to the latest version. If `factory` is still missing or failing, consult the `dask-histogram` documentation for the currently recommended way to create histograms, as the API may have changed.
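The pickling error in the list above can be reproduced without Dask. The sketch below uses the standard-library `pickle` module as a stand-in for Dask's serialization, and a hypothetical `Accumulator` class (not part of dask-histogram) to illustrate one common remedy: leaving the lock out of the pickled state and recreating it on load.

```python
import pickle
import threading


class Accumulator:
    """Hypothetical helper (not a dask-histogram class) that guards a counter with a lock."""

    def __init__(self):
        self.total = 0
        self._lock = threading.RLock()  # locks cannot be pickled

    def add(self, n):
        with self._lock:
            self.total += n

    def __getstate__(self):
        # Drop the lock so the rest of the state can be serialized.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the plain state and build a fresh lock on the receiving side.
        self.__dict__.update(state)
        self._lock = threading.RLock()


def pickle_error(obj):
    """Return the error message raised when pickling obj, or None if it pickles."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


# A bare lock fails to pickle; this is the kind of object to keep out of the graph.
raw_lock_error = pickle_error(threading.RLock())
print(raw_lock_error)

# The wrapper survives a round trip because the lock is excluded and rebuilt.
acc = Accumulator()
acc.add(3)
restored = pickle.loads(pickle.dumps(acc))
print(restored.total)
```

The same idea applies to functions: creating a lock or client inside the function body, rather than capturing one from the enclosing scope, keeps it out of the serialized task.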
Warnings
- gotcha Dask histograms are lazy computations. They return a Dask object that needs to be explicitly computed using `.compute()` to obtain the final boost-histogram object with actual results. Failing to call `.compute()` will result in working with a Dask graph, not the histogram data itself.
- breaking Compatibility between dask-histogram and Dask is version-sensitive. For instance, `dask_histogram.factory` broke with `dask>=2024.12.0`, and the fix shipped in `dask-histogram==2024.12.0`. Check that your `dask-histogram` version is compatible with your `dask` version, especially after major Dask releases.
- gotcha The arguments to a dask-histogram `fill()` call (e.g., `x`, `y`) must be Dask collections such as Dask arrays, not raw NumPy arrays or scalar values, unlike `boost-histogram`'s own `fill()` method. This is a common mistake when migrating from `boost-histogram` to `dask-histogram`.
- gotcha Since version `2024.3.0`, `fill()` delays construction of the task graph until `.compute()` is called. This can affect code that relies on inspecting the Dask graph between `fill()` and `compute()`.
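To check the version pairing that the warnings above refer to, one option is to read the installed versions from package metadata. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and the distribution names are assumed to match the names used on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("dask", "dask-histogram", "boost-histogram")):
    """Map each distribution name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment.
            found[name] = None
    return found


versions = installed_versions()
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Printing these values before filing a bug report or after upgrading Dask makes it easier to spot a mismatched pair.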
Install
- pip install dask-histogram
Imports
- Histogram
from dask_histogram.boost import Histogram
- histogram
from dask_histogram.routines import histogram
Quickstart
import boost_histogram as bh
import dask.array as da
from dask_histogram.boost import Histogram
from dask_histogram.routines import histogram
# Create a large Dask array: 10 million samples split into 10 chunks
x = da.random.normal(0, 1, size=(10_000_000,), chunks=1_000_000)
# Method 1: NumPy-like interface
bins = 50
range_min, range_max = -5, 5
# histogram=True requests a lazy histogram object instead of the
# default NumPy-style (counts, edges) pair
dask_hist_numpy_like = histogram(x, bins=bins, range=(range_min, range_max), histogram=True)
print(f"NumPy-like Dask histogram (lazy): {dask_hist_numpy_like}")
computed_hist_numpy_like = dask_hist_numpy_like.compute()
print(f"Computed histogram (NumPy-like): {computed_hist_numpy_like.view()}")
# Method 2: boost-histogram-like interface
# Define the axes as in boost-histogram, then fill with Dask data
dask_hist_bh_like = Histogram(bh.axis.Regular(bins, range_min, range_max, metadata="x"))
dask_hist_bh_like.fill(x)
print(f"boost-histogram-like Dask histogram (lazy): {dask_hist_bh_like}")
computed_hist_bh_like = dask_hist_bh_like.compute()
print(f"Computed histogram (boost-histogram-like): {computed_hist_bh_like.view()}")