Apache Beam SDK for Python
Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines over both batch and streaming data. It offers language-specific SDKs, including Python, to construct pipelines that can run on various distributed processing backends such as Apache Flink, Apache Spark, and Google Cloud Dataflow. The project maintains an active development pace, cutting minor releases approximately every six weeks; the current version is 2.71.0.
Warnings
- breaking As of Apache Beam 2.70.0, many Python dependencies were split into 'extras'. If your pipeline previously relied on these implicitly, you may need to explicitly install them using `pip install apache-beam[gcp,interactive,yaml,redis,hadoop,tfrecord,...]` to ensure all necessary components are present.
- deprecated Support for Python 3.9 was removed in Apache Beam 2.70.0 after Python 3.9 reached its End-of-Life in October 2025. Ensure your development and runtime environments use Python 3.10 or newer.
- gotcha The `dill` library is no longer installed as a default dependency as of Apache Beam 2.71.0. If your pipeline explicitly uses the `pickle_library=dill` pipeline option, you must manually ensure `dill==0.3.1.1` is installed in both your submission and runtime environments.
- gotcha Apache Beam's distributed nature makes global state problematic. Avoid using global variables to share data across elements. Instead, leverage Beam's side inputs for read-only shared data or stateful processing primitives (`StateSpec`, `TimerSpec`) for mutable, per-key state.
- gotcha Managing Python pipeline dependencies can lead to 'Diamond Dependency' problems, especially when bundling `apache-beam` with other libraries or using many optional extras. Incompatible versions of shared dependencies can cause runtime errors.
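One common mitigation is to pin shared dependencies with a pip constraints file applied in both the submission and runtime environments, so every install resolves the same versions. A sketch (the pinned versions are illustrative, not recommendations):

```
# constraints.txt -- illustrative pins; resolve real versions for your stack
apache-beam[gcp]==2.71.0
protobuf==4.25.3
numpy==1.26.4
```

Install with `pip install -r requirements.txt -c constraints.txt` everywhere the pipeline runs.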
Install
pip install apache-beam
pip install 'apache-beam[gcp]'
Imports
- beam
import apache_beam as beam
- ReadFromText
from apache_beam.io import ReadFromText
- WriteToText
from apache_beam.io import WriteToText
- PipelineOptions
from apache_beam.options.pipeline_options import PipelineOptions
Quickstart
import re
import argparse
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        default='gs://dataflow-samples/shakespeare/kinglear.txt',
        help='Input file to process.')
    parser.add_argument(
        '--output',
        dest='output',
        default='output.txt',
        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=pipeline_options) as p:
        # Read the text file into a PCollection of lines.
        lines = p | ReadFromText(known_args.input)
        # Count the occurrences of each word.
        counts = (
            lines
            | 'Split' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
            | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
            | 'GroupAndSum' >> beam.CombinePerKey(sum)
        )
        # Format the counts into strings.
        output = counts | 'Format' >> beam.Map(
            lambda word_count: '%s: %s' % (word_count[0], word_count[1]))
        # Write the output.
        output | WriteToText(known_args.output)

if __name__ == '__main__':
    print("Running Beam WordCount pipeline locally...")
    main()
    print("Pipeline finished. Check 'output.txt' for results.")