Apache Beam SDK for Python

2.71.0 · verified Tue May 12 · auth: no · python install: draft · quickstart: stale

Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines over both batch and streaming data. It provides language-specific SDKs, including one for Python, for building pipelines that run on distributed processing backends such as Apache Flink, Apache Spark, and Google Cloud Dataflow. The project maintains an active release cadence, with minor releases roughly every six weeks; the current version is 2.71.0.

pip install apache-beam
error ModuleNotFoundError: No module named 'apache_beam'
cause This error occurs when the Apache Beam library is not installed in the Python environment.
fix Install Apache Beam using pip: pip install apache-beam.
error ImportError: cannot import name 'coder_impl'
cause This error arises due to an internal import issue within the Apache Beam library, often related to version mismatches or installation problems.
fix Ensure that Apache Beam is correctly installed and up to date: pip install --upgrade apache-beam.
error Total size of the BoundedSource objects returned by BoundedSource.split() operation is larger than the allowable limit
cause This error occurs when attempting to process a very large number of files in a single pipeline, exceeding the system's allowable limit.
fix Reduce the number of files processed in a single pipeline, or split the processing into smaller batches.
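One way to implement the batching fix, sketched in plain Python (the helper name and batch size are illustrative): partition the file list and launch one pipeline run per chunk.

```python
def batch(items, size):
    """Yield successive fixed-size chunks of a list.

    Illustrative helper: each chunk can be fed to a separate pipeline
    run so no single run exceeds the BoundedSource size limit.
    """
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Example: five files split into batches of two.
files = ['a.txt', 'b.txt', 'c.txt', 'd.txt', 'e.txt']
batches = list(batch(files, 2))
# batches == [['a.txt', 'b.txt'], ['c.txt', 'd.txt'], ['e.txt']]
```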
error OSError: Invalid data stream
cause This error happens when the pipeline encounters a malformed or corrupted file during processing.
fix Implement error handling to skip or log bad records, such as wrapping parsing or file-reading operations in try-except blocks.
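A minimal sketch of the skip-and-log pattern, assuming JSON-lines input; `parse_record` is a hypothetical helper you would call via `beam.FlatMap`, so yielding nothing drops a malformed element instead of failing the pipeline.

```python
import json
import logging

def parse_record(line):
    """Yield a parsed record, or nothing if the line is malformed.

    Hypothetical helper for use as `lines | beam.FlatMap(parse_record)`:
    a FlatMap that yields zero results simply skips the element.
    """
    try:
        yield json.loads(line)
    except (ValueError, OSError) as e:
        logging.warning("Skipping bad record: %s", e)

good = list(parse_record('{"word": "king"}'))  # parsed normally
bad = list(parse_record('not valid json'))     # logged and skipped
```

In production you might route failures to a dead-letter output with `beam.pvalue.TaggedOutput` rather than dropping them.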
error NameError: name 'parse_into_dict' is not defined
cause This error occurs when a function or variable is referenced before it has been defined or imported.
fix Ensure that all functions and variables are defined or imported before they are used in the code.
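For instance, a helper must be defined (or imported) above the pipeline code that references it. Only the name `parse_into_dict` comes from the error message above; the body below is purely illustrative.

```python
# Defined before any pipeline code references it, avoiding the
# NameError. The parsing logic here is a made-up example.
def parse_into_dict(line):
    key, _, value = line.partition('=')
    return {key.strip(): value.strip()}

record = parse_into_dict('word = king')
# record == {'word': 'king'}
```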
breaking As of Apache Beam 2.70.0, many Python dependencies were split into 'extras'. If your pipeline previously relied on these implicitly, you may need to explicitly install them using `pip install apache-beam[gcp,interactive,yaml,redis,hadoop,tfrecord,...]` to ensure all necessary components are present.
fix Review your pipeline's dependencies and install Apache Beam with the appropriate extras, e.g., `pip install 'apache-beam[gcp,interactive]'`.
deprecated Support for Python 3.9 was removed in Apache Beam 2.70.0 after Python 3.9 reached its End-of-Life in October 2025. Ensure your development and runtime environments use Python 3.10 or newer.
fix Upgrade your Python environment to 3.10 or a later supported version (e.g., Python 3.10, 3.11, 3.12). The PyPI package requires >=3.10 for 2.71.0.
gotcha The `dill` library is no longer a required, default dependency as of Apache Beam 2.71.0. If your pipeline explicitly uses the `pickle_library=dill` pipeline option, you must manually ensure `dill==0.3.1.1` is installed in both your submission and runtime environments.
fix If using `pickle_library=dill`, add `dill==0.3.1.1` to your `requirements.txt` or ensure it's installed in your custom container.
gotcha Apache Beam's distributed nature makes global state problematic. Avoid using global variables to share data across elements. Instead, leverage Beam's side inputs for read-only shared data or stateful processing primitives (`StateSpec`, `TimerSpec`) for mutable, per-key state.
fix Refactor code to use Beam's recommended patterns for sharing data, such as side inputs or stateful `DoFn`s, which are designed for distributed execution.
gotcha Managing Python pipeline dependencies can lead to 'Diamond Dependency' problems, especially when bundling `apache-beam` with other libraries or using many optional extras. Incompatible versions of shared dependencies can cause runtime errors.
fix Define pipeline dependencies carefully using `requirements.txt` (or `setup.py` for packages). Consider using custom container images to control the exact environment and pre-install dependencies, ensuring reproducibility and avoiding conflicts.
breaking Installing Apache Beam from source requires system-level build tools (like GCC and Python development headers). This is particularly relevant in minimal environments such as Alpine Linux, where these tools are not pre-installed.
fix Ensure your environment has the necessary build tools installed. For Alpine Linux, run `apk add build-base python3-dev`.
pip install 'apache-beam[gcp]'
python  os / libc      variant      status       wheel  install  import  disk
3.9     alpine (musl)  gcp          build_error  -      0.1s     -       -
3.9     alpine (musl)  apache-beam  build_error  -      -        -       -
3.9     alpine (musl)  gcp          -            -      -        -       -
3.9     alpine (musl)  apache-beam  -            -      -        -       -
3.9     slim (glibc)   gcp          -            sdist  47.3s    4.53s   659M
3.9     slim (glibc)   apache-beam  -            sdist  22.3s    2.89s   411M
3.9     slim (glibc)   gcp          -            -      -        3.77s   657M
3.9     slim (glibc)   apache-beam  -            -      -        2.50s   410M
3.10    alpine (musl)  gcp          build_error  -      0.1s     -       -
3.10    alpine (musl)  apache-beam  build_error  -      -        -       -
3.10    alpine (musl)  gcp          -            -      -        -       -
3.10    alpine (musl)  apache-beam  -            -      -        -       -
3.10    slim (glibc)   gcp          -            sdist  39.8s    4.13s   712M
3.10    slim (glibc)   apache-beam  -            wheel  17.2s    2.25s   448M
3.10    slim (glibc)   gcp          -            -      -        3.94s   705M
3.10    slim (glibc)   apache-beam  -            -      -        1.97s   446M
3.11    alpine (musl)  gcp          build_error  -      -        -       -
3.11    alpine (musl)  apache-beam  build_error  -      -        -       -
3.11    alpine (musl)  gcp          -            -      -        -       -
3.11    alpine (musl)  apache-beam  -            -      -        -       -
3.11    slim (glibc)   gcp          -            sdist  39.5s    5.26s   789M
3.11    slim (glibc)   apache-beam  -            wheel  16.9s    3.28s   487M
3.11    slim (glibc)   gcp          -            -      -        6.51s   771M
3.11    slim (glibc)   apache-beam  -            -      -        2.85s   474M
3.12    alpine (musl)  gcp          build_error  -      -        -       -
3.12    alpine (musl)  apache-beam  build_error  -      -        -       -
3.12    alpine (musl)  gcp          -            -      -        -       -
3.12    alpine (musl)  apache-beam  -            -      -        -       -
3.12    slim (glibc)   gcp          -            sdist  35.8s    6.11s   775M
3.12    slim (glibc)   apache-beam  -            wheel  15.9s    3.60s   481M
3.12    slim (glibc)   gcp          -            -      -        9.62s   748M
3.12    slim (glibc)   apache-beam  -            -      -        3.53s   459M
3.13    alpine (musl)  gcp          build_error  -      -        -       -
3.13    alpine (musl)  apache-beam  build_error  -      -        -       -
3.13    alpine (musl)  gcp          -            -      -        -       -
3.13    alpine (musl)  apache-beam  -            -      -        -       -
3.13    slim (glibc)   gcp          -            wheel  34.6s    5.85s   766M
3.13    slim (glibc)   apache-beam  -            wheel  15.9s    3.55s   480M
3.13    slim (glibc)   gcp          -            -      -        10.50s  763M
3.13    slim (glibc)   apache-beam  -            -      -        3.55s   478M

This classic WordCount example demonstrates basic Apache Beam concepts: reading data from a source (local file or GCS), applying transformations like splitting, mapping, and combining, and writing the results to an output file. Run it locally using the DirectRunner.

import re
import argparse
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        default='gs://dataflow-samples/shakespeare/kinglear.txt',
        help='Input file to process.')
    parser.add_argument(
        '--output',
        dest='output',
        default='output.txt',
        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=pipeline_options) as p:
        # Read the text file into a PCollection
        lines = p | ReadFromText(known_args.input)

        # Count the occurrences of each word
        counts = (
            lines
            | 'Split' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
            | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
            | 'GroupAndSum' >> beam.CombinePerKey(sum)
        )

        # Format the counts into strings
        output = counts | 'Format' >> beam.Map(
            lambda word_count: '%s: %s' % (word_count[0], word_count[1]))

        # Write the output
        output | WriteToText(known_args.output)

if __name__ == '__main__':
    print("Running Beam WordCount pipeline locally...")
    main()
    print("Pipeline finished. Check the sharded output files, "
          "e.g. output.txt-00000-of-00001.")