Apache Beam SDK for Python
Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines for both batch and streaming data. It offers language-specific SDKs, including Python, to construct pipelines that can run on various distributed processing backends such as Apache Flink, Apache Spark, and Google Cloud Dataflow. The library maintains an active development pace with minor releases approximately every 6 weeks, and its current version is 2.71.0.
Common errors
- ModuleNotFoundError: No module named 'apache_beam'
  - cause: The Apache Beam library is not installed in the active Python environment.
  - fix: Install it with pip: `pip install apache-beam`.
- ImportError: cannot import name 'coder_impl'
  - cause: An internal import issue within Apache Beam, usually caused by a version mismatch or a broken installation.
  - fix: Reinstall or upgrade: `pip install --upgrade apache-beam`.
- Total size of the BoundedSource objects returned by BoundedSource.split() operation is larger than the allowable limit
  - cause: The pipeline attempts to read too many files (or source splits) at once, exceeding the runner's limit.
  - fix: Reduce the number of files processed in a single pipeline, or split the processing into smaller batches.
- OSError: Invalid data stream
  - cause: The pipeline encountered a malformed or corrupted file during processing.
  - fix: Add error handling that skips or logs bad records, for example try/except blocks around file-reading and parsing steps.
- NameError: name 'parse_into_dict' is not defined
  - cause: A function or variable is referenced before it has been defined or imported.
  - fix: Define or import every function and variable before it is used.
Warnings
- breaking: As of Apache Beam 2.70.0, many Python dependencies were split into 'extras'. If your pipeline previously relied on these implicitly, you may need to install them explicitly with `pip install apache-beam[gcp,interactive,yaml,redis,hadoop,tfrecord,...]` to ensure all necessary components are present.
- deprecated: Support for Python 3.9 was removed in Apache Beam 2.70.0 after Python 3.9 reached its End-of-Life in October 2025. Ensure your development and runtime environments use Python 3.10 or newer.
- gotcha: The `dill` library is no longer a required, default dependency as of Apache Beam 2.71.0. If your pipeline explicitly uses the `pickle_library=dill` pipeline option, you must manually ensure `dill==0.3.1.1` is installed in both your submission and runtime environments.
- gotcha: Apache Beam's distributed nature makes global state problematic. Avoid using global variables to share data across elements. Instead, use Beam's side inputs for read-only shared data, or its stateful processing primitives (`StateSpec`, `TimerSpec`) for mutable, per-key state.
- gotcha: Managing Python pipeline dependencies can lead to 'diamond dependency' problems, especially when bundling `apache-beam` with other libraries or using many optional extras. Incompatible versions of shared dependencies can cause runtime errors.
- breaking: Installing Apache Beam from source requires system-level build tools (such as GCC and the Python development headers). This is particularly relevant in minimal environments such as Alpine Linux, where these tools are not pre-installed.
Install
- `pip install apache-beam`
- `pip install 'apache-beam[gcp]'`
Imports
- beam
import apache_beam as beam
- ReadFromText
from apache_beam.io import ReadFromText
- WriteToText
from apache_beam.io import WriteToText
- PipelineOptions
from apache_beam.options.pipeline_options import PipelineOptions
Quickstart
```python
import re
import argparse

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        default='gs://dataflow-samples/shakespeare/kinglear.txt',
        help='Input file to process.')
    parser.add_argument(
        '--output',
        dest='output',
        default='output.txt',
        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=pipeline_options) as p:
        # Read the text file into a PCollection of lines.
        lines = p | ReadFromText(known_args.input)

        # Count the occurrences of each word.
        counts = (
            lines
            | 'Split' >> beam.FlatMap(lambda x: re.findall(r"[A-Za-z']+", x))
            | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
            | 'GroupAndSum' >> beam.CombinePerKey(sum)
        )

        # Format each (word, count) pair as a string.
        output = counts | 'Format' >> beam.Map(
            lambda word_count: '%s: %s' % (word_count[0], word_count[1]))

        # Write the results. WriteToText produces sharded files whose names
        # start with the given prefix (e.g. output.txt-00000-of-00001).
        output | WriteToText(known_args.output)


if __name__ == '__main__':
    print('Running Beam WordCount pipeline locally...')
    main()
    print("Pipeline finished. Check the files prefixed 'output.txt' for results.")
```