SparkMeasure Python API

0.27.0 · active · verified Sun Apr 12

sparkMeasure is the Python API for the core Scala library of the same name, designed for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark metrics, making it suitable for interactive analysis, testing, and production monitoring, and it serves both developers and data engineers. Releases are frequent, typically on a quarterly to half-yearly cadence; the current stable version is 0.27.0.

Warnings

Install
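The body of this section is missing from the extract. Based on the project's distribution on PyPI, installation of the Python wrapper is typically a single pip command (the package name below is the published PyPI name, matching the `sparkmeasure` module imported in the quickstart):

```shell
# Install the Python wrapper; the Spark-side JAR is pulled in separately,
# e.g. via spark.jars.packages as shown in the quickstart below.
pip install sparkmeasure
```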

Imports
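This section's body is also missing. Judging from the quickstart below, the usual imports are the metric-collector classes from the `sparkmeasure` module (`TaskMetrics` is the task-granularity counterpart to `StageMetrics`):

```python
# Stage-level and task-level metric collectors from the Python wrapper
from sparkmeasure import StageMetrics, TaskMetrics
```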

Quickstart

This quickstart demonstrates how to initialize sparkMeasure and use the `runandmeasure` method to collect and report performance metrics for a Spark SQL query. It creates a local SparkSession, pulls in the spark-measure JAR via `spark.jars.packages`, and then uses `StageMetrics` to instrument a simple operation.

from pyspark.sql import SparkSession
from sparkmeasure import StageMetrics

# Configure SparkSession to include the spark-measure JAR
spark = (SparkSession.builder
         .appName("SparkMeasure Quickstart")
         .master("local[*]")
         .config("spark.jars.packages", "ch.cern.sparkmeasure:spark-measure_2.13:0.27")
         .getOrCreate())

# Initialize StageMetrics
stagemetrics = StageMetrics(spark)

# Run and measure a Spark job
print("Running a simple Spark SQL query and measuring performance...")
stagemetrics.runandmeasure(globals(), 'spark.sql("SELECT count(*) FROM range(1000) CROSS JOIN range(1000)").show()')

print("\nPrinting performance report:")
stagemetrics.print_report()

spark.stop()
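The `globals()` argument may look surprising. `runandmeasure` takes the caller's namespace plus a code string, so the measured statement can resolve names like `spark` defined in your session. A minimal pure-Python sketch of that pattern (the `SketchStageMetrics` class and its timing logic are illustrative stand-ins, not sparkMeasure's implementation, which gathers Spark stage metrics rather than wall-clock time):

```python
import time

class SketchStageMetrics:
    """Hypothetical stand-in illustrating the run-and-measure pattern:
    open a collection window, exec the supplied code string in the
    caller's namespace, close the window, and keep what was gathered."""

    def begin(self):
        self._start = time.perf_counter()

    def end(self):
        self._elapsed = time.perf_counter() - self._start

    def runandmeasure(self, env, code):
        self.begin()
        exec(code, env)  # caller passes globals() so its names resolve
        self.end()
        return self._elapsed

metrics = SketchStageMetrics()
elapsed = metrics.runandmeasure(globals(), "result = sum(range(1000))")
print(result)   # defined by the exec'd code in this module's namespace
```

This is why the quickstart passes `globals()`: the string `'spark.sql(...).show()'` is executed in your namespace, where `spark` exists.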
