{"id":6331,"library":"coffea","title":"coffea","description":"coffea is a Python toolkit designed for columnar data analysis in High-Energy Physics (HEP), providing basic tools and wrappers for efficient manipulation of HEP event data. It integrates with modern big-data technologies like Dask, Parsl, and TaskVine to enable scaling analyses from local machines to computing clusters without code changes. The library is actively developed, currently at version 2026.4.0, with frequent, often monthly or bi-monthly, releases of its calendar-versioned major releases and backports for its 0.7.x branch.","status":"active","version":"2026.4.0","language":"en","source_language":"en","source_url":"https://github.com/coffeateam/coffea","tags":["physics","HEP","data-analysis","columnar-data","big-data","scikit-hep"],"install":[{"cmd":"pip install coffea","lang":"bash","label":"Install stable version"},{"cmd":"pip install coffea[dask]","lang":"bash","label":"Install with Dask support"},{"cmd":"pip install coffea[parsl]","lang":"bash","label":"Install with Parsl support"}],"dependencies":[{"reason":"Required Python version.","package":"python","version":">=3.10"},{"reason":"Core array manipulation.","package":"numpy","version":">=1.22"},{"reason":"Interacting with ROOT files.","package":"uproot","version":""},{"reason":"Manipulating complex-structured columnar data (jagged arrays).","package":"awkward-array","version":""},{"reason":"Just-in-time compilation of Python functions.","package":"numba","version":""},{"reason":"Statistical functions.","package":"scipy","version":""},{"reason":"Plotting backend.","package":"matplotlib","version":""},{"reason":"Distributed executor for scaling analyses.","package":"dask","optional":true},{"reason":"Distributed executor for scaling analyses.","package":"parsl","optional":true},{"reason":"Distributed executor for scaling analyses.","package":"taskvine","optional":true}],"imports":[{"note":"Main module for defining analysis processors and 
runners.","symbol":"processor","correct":"from coffea import processor"},{"note":"Standard alias for Awkward Array, fundamental for columnar data structures.","symbol":"ak","correct":"import awkward as ak"},{"note":"Used for histogramming capabilities, often with dask-histogram.","symbol":"hist","correct":"import hist"}],"quickstart":{"code":"import awkward as ak\nimport hist\nfrom coffea import processor\nfrom coffea.nanoevents import NanoEventsFactory, BaseSchema\n\n# Define a simple processor\nclass MyProcessor(processor.ProcessorABC):\n    def process(self, events):\n        # For demonstration, assume 'events' has a 'Muon' collection\n        # In a real scenario, events would be loaded from a file using NanoEventsFactory\n        if 'Muon' not in events.fields:\n            # Create dummy muons if not present, for runnable example\n            dummy_muons = ak.zip({\n                \"pt\": ak.Array([ [10, 20], [30] ]), \n                \"eta\": ak.Array([ [0.5, 1.2], [-0.8] ]), \n                \"charge\": ak.Array([ [1, -1], [1] ])\n            }, depth_limit=1)\n            events = ak.with_field(events, dummy_muons, \"Muon\")\n            \n        muons = events.Muon[events.Muon.pt > 15]\n        \n        # Select opposite-sign dimuons\n        dimuons = ak.combinations(muons, 2, fields=[\"lead\", \"trail\"])\n        dimuons = dimuons[dimuons.lead.charge != dimuons.trail.charge]\n        \n        # Calculate invariant mass (simplified for example)\n        # In a real analysis, vector-like operations would be used\n        if len(dimuons) > 0:\n             # Dummy mass calculation for illustration\n            mass = ak.flatten(dimuons.lead.pt + dimuons.trail.pt)\n        else:\n            mass = ak.Array([])\n\n        # Create a histogram and fill it\n        h_mass = hist.Hist.new.Reg(50, 0, 100, name=\"mass\", label=\"Dimuon Mass [GeV]\").Double()\n        h_mass.fill(mass=mass)\n\n        return {\"mymass_histogram\": h_mass, \"nevents\": 
len(events)}\n\n    def postprocess(self, accumulator):\n        return accumulator\n\n# Example usage: run the processor directly on a small dummy events array,\n# avoiding any file I/O. In a real analysis, NanoEventsFactory.from_root\n# would load ROOT files, and a processor.Runner with an executor (e.g. the\n# IterativeExecutor locally) would drive the processing over a fileset.\nevents = ak.Array({\"event_id\": [1, 2]})\n\n# Instantiate and run the processor\nmy_processor_instance = MyProcessor()\noutput = my_processor_instance.process(events)\n\nprint(output[\"mymass_histogram\"])","lang":"python","description":"This quickstart demonstrates how to define a simple `coffea` processor that performs a basic analysis task (selecting opposite-sign dimuons and histogramming a simplified mass variable). It uses `coffea.processor.ProcessorABC`. To keep the example runnable without ROOT files, the processor is called directly on a small dummy `events` array. In a real application, `NanoEventsFactory` would load data from ROOT files, and a `processor.Runner` with an executor (such as the `IterativeExecutor` for local tests, or a Dask-based executor for clusters) would drive the processing over a fileset."},"warnings":[{"fix":"Upgrade your Python environment to 3.10 or later (`conda install python=3.10` or similar).","message":"Python 3.9 support was dropped in `coffea` version 2025.12.0. Users on older Python versions will need to upgrade their environment to Python 3.10 or newer to use recent `coffea` releases.","severity":"breaking","affected_versions":">=2025.12.0"},{"fix":"Consult the `coffea` migration guides for detailed instructions. Ensure `dask-awkward` and `dask-histogram` are installed.
Refactor analysis logic to leverage lazy evaluation and use `.compute()` only at the very end of array/histogram construction.","message":"Major API changes and mandatory `dask-awkward` dependency occurred when migrating from `coffea 0.7.x` to the calendar-versioned releases (`202X.Y.Z`). This transition involved fundamental shifts due to `awkward-array` v1 to v2 migration, making `dask-awkward` and `dask-histogram` mandatory for delayed computation. Old code might require significant updates to adapt to the new pattern, particularly regarding explicit `.compute()` calls.","severity":"breaking","affected_versions":"All versions >=2023.0.0 (approx.)"},{"fix":"Ensure all components of your `ProcessorABC` are picklable. For shared data or configurations, pass them into the `__init__` method and ensure they are read-only or managed externally. Test serialization with `coffea.util.save(my_processor_instance, 'test.coffea')`.","message":"ProcessorABC instances are expected to be fully serializable for distributed execution. Avoid tracking mutable state within a `ProcessorABC` instance, as it's treated as a stateless bundle of methods. Issues can arise with non-picklable objects or shared state that isn't properly handled during serialization/deserialization.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Structure your analysis to delay computation as much as possible. Only call `.compute()` when you need the final, fully-evaluated result (e.g., for plotting or saving to a concrete array/histogram object). `coffea`'s runners handle `.compute()` implicitly at the end of the processing chain.","message":"Premature calls to `.compute()` on `dask-awkward` arrays or `dask-histogram` objects can severely degrade performance in `coffea` analyses. 
Explicit `.compute()` calls force immediate evaluation, breaking the efficient Dask task graph designed for lazy, distributed processing.","severity":"gotcha","affected_versions":"All versions >=2023.0.0"},{"fix":"Always explicitly specify the desired version during installation (`pip install 'coffea>=2026.0.0,<2027.0.0'` for calendar versions, or `pip install 'coffea~=0.7.0'` for the backports). Be mindful of which documentation version corresponds to your installed library.","message":"The `coffea` project transitioned from semantic versioning (e.g., `0.7.x`) to calendar versioning (e.g., `202X.Y.Z`). There is an active `0.7.x` backports branch. This dual-versioning scheme can lead to confusion and API incompatibilities if users are not careful to install and develop against the intended version series.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[{"fix":"Upgrade `coffea` to a newer version (e.g., `pip install --upgrade coffea` or `conda update coffea`) which has adapted to the `awkward-array` changes.","cause":"This error typically occurs in older `coffea` versions (e.g., 0.6.46) due to a renaming of internal modules within the `awkward-array` library, a key dependency for `coffea`.","error":"ModuleNotFoundError: No module named 'awkward.array'"},{"fix":"Ensure all necessary packages and custom files are available on the Dask workers. 
This can be done by installing them into the worker's environment or by using Dask's `client.upload_file()` or `job_extra` configuration (e.g., `transfer_input_files` for `CoffeaCasaCluster`).","cause":"When running `coffea` processors with distributed executors (like Dask), required Python packages or analysis-specific files are often installed only on the client machine and not propagated to the Dask workers, leading to `ModuleNotFoundError` or `FileNotFoundError` on the workers.","error":"ModuleNotFoundError raised on the Dask workers when attempting to run a processor"},{"fix":"Instead of `.compute()`, use `awkward.materialize()` (i.e., `ak.materialize()` under the standard `import awkward as ak`) to force the evaluation and loading of the array data. Alternatively, configure `NanoEventsFactory` to explicitly use `dask-awkward` if distributed computation is desired.","cause":"In recent `coffea` versions (e.g., v2025.7.0 and later), the default behavior of `NanoEventsFactory.from_root()` changed to use virtual arrays (lazy loading without `dask-awkward` by default), meaning `.compute()` is no longer directly applicable to materialize results.","error":"AttributeError: no field named 'compute'"},{"fix":"Ensure that any objects passed to the `coffea` executor (especially processors) are fully picklable. This often means avoiding complex closures, lambda functions, or class attributes that are `property` objects, and making sure objects are defined at the top level of a module. For processors, ensure all internal state is managed in a pickle-friendly way.","cause":"This error occurs when `coffea` attempts to serialize (pickle) an object, often a processor, that contains Python `property` objects, which are not directly picklable across processes, particularly with Dask or `concurrent.futures` executors.","error":"cannot pickle 'property' object"}]}