{"id":4769,"library":"sparkmeasure","title":"SparkMeasure Python API","description":"sparkMeasure is the Python API for the sparkMeasure core Scala library, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark metrics, making it suitable for interactive analysis, testing, and production monitoring, and eases metric collection and analysis for both developers and data engineers. Releases follow a roughly quarterly to half-yearly cadence; the current stable version is 0.27.0.","status":"active","version":"0.27.0","language":"en","source_language":"en","source_url":"https://github.com/lucacanali/sparkMeasure","tags":["spark","pyspark","performance","metrics","monitoring","troubleshooting"],"install":[{"cmd":"pip install sparkmeasure","lang":"bash","label":"Install Python package"},{"cmd":"pyspark --packages ch.cern.sparkmeasure:spark-measure_2.13:0.27","lang":"bash","label":"Run PySpark with the SparkMeasure JAR (Spark 3/4 built with Scala 2.13)"}],"dependencies":[{"reason":"sparkmeasure is a wrapper around a Scala Spark library and requires a Spark environment (PySpark for Python users).","package":"pyspark"}],"imports":[{"symbol":"StageMetrics","correct":"from sparkmeasure import StageMetrics"},{"symbol":"TaskMetrics","correct":"from sparkmeasure import TaskMetrics"}],"quickstart":{"code":"from pyspark.sql import SparkSession\nfrom sparkmeasure import StageMetrics\n\n# Configure the SparkSession to pull in the spark-measure JAR\nspark = (SparkSession.builder\n         .appName(\"SparkMeasure Quickstart\")\n         .master(\"local[*]\")\n         .config(\"spark.jars.packages\", \"ch.cern.sparkmeasure:spark-measure_2.13:0.27\")\n         .getOrCreate())\n\n# Initialize StageMetrics, which aggregates task metrics at stage granularity\nstagemetrics = StageMetrics(spark)\n\n# Run and measure a Spark job\nprint(\"Running a simple Spark SQL query and measuring performance...\")\nstagemetrics.runandmeasure(globals(), 'spark.sql(\"SELECT count(*) FROM range(1000) CROSS JOIN range(1000)\").show()')\n\nprint(\"\\nPrinting performance report:\")\nstagemetrics.print_report()\n\n# Executor memory usage details are also available:\nstagemetrics.print_memory_report()\n\n# Alternative to runandmeasure: wrap the workload explicitly with\n# stagemetrics.begin() ... stagemetrics.end(), then call print_report().\n\nspark.stop()","lang":"python","description":"This quickstart shows how to initialize SparkMeasure and use the `runandmeasure` method to collect and report performance metrics for a Spark SQL query. It sets up a local SparkSession, loads the SparkMeasure JAR via `spark.jars.packages`, and uses `StageMetrics` to instrument a simple operation."},"warnings":[{"fix":"Update the Spark configuration to use the new prefix for Kafka Producer properties (e.g., `spark.sparkmeasure.kafka.bootstrap.servers`).","message":"The configuration properties for the Kafka sink changed in version 0.27: Kafka Producer properties must now be passed via the Spark config using the prefix `spark.sparkmeasure.kafka.<kafkaProperty>=<value>`.","severity":"breaking","affected_versions":">=0.27.0"},{"fix":"Be aware of these limitations when using sparkMeasure with Spark Connect. Consider Flight Recorder mode for application-level metrics, or alternative monitoring for client-specific details.","message":"Spark Connect integration is partial. Flight Recorder mode can capture application-wide metrics, but per-client (Spark Connect client-side) metrics are not fully reported, since full integration requires direct access to the SparkContext and its listener interface.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Monitor driver memory usage. For large-scale workloads, consider writing metrics to a file or to external sinks (Kafka, InfluxDB) using Flight Recorder mode, which writes collected metrics out directly.","message":"Metrics collected by sparkMeasure are buffered in the driver's memory. For very large workloads or extensive metric collection, this can become a bottleneck or lead to out-of-memory errors on the driver.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Prefer `StageMetrics` for a general performance overview and troubleshooting; switch to `TaskMetrics` only when detailed task-level analysis is required.","message":"Collecting metrics with `TaskMetrics` (at the granularity of each task completion) incurs more overhead than `StageMetrics` (aggregated per stage). Use `TaskMetrics` only when fine-grained data (e.g., for skew analysis) is strictly necessary.","severity":"gotcha","affected_versions":"All versions"},{"fix":"When analyzing job failures, be aware that reported metrics may not account for all resource usage leading up to the failure. Complement sparkMeasure with the Spark UI and event logs for full failure context.","message":"sparkMeasure primarily collects metrics for *successfully* executed tasks; resources consumed by failed tasks are generally not included in the reports.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}