{"id":2461,"library":"dbl-tempo","title":"Tempo - Timeseries Manipulation for Spark","description":"Tempo is a Python library that builds on PySpark to provide scalable abstractions and functions for timeseries data manipulation on Spark. It simplifies common operations such as resampling, interpolation, and as-of joins for large-scale time series datasets. The project is actively maintained as part of Databricks Labs, with frequent patch releases; the current version is 0.1.30.","status":"active","version":"0.1.30","language":"en","source_language":"en","source_url":"https://github.com/databrickslabs/tempo","tags":["spark","pyspark","timeseries","dataframe","databricks"],"install":[{"cmd":"pip install dbl-tempo","lang":"bash","label":"Install `dbl-tempo`"}],"dependencies":[{"reason":"Core dependency for Spark integration and DataFrame operations. Tempo directly extends PySpark functionality and relies on `SparkSession`.","package":"pyspark","optional":false},{"reason":"Used for internal data handling and conversions; often implicitly required by PySpark for interoperability.","package":"pandas","optional":false},{"reason":"Enables optimized data interchange between PySpark and pandas, improving performance for certain operations.","package":"pyarrow","optional":false}],"imports":[{"note":"The core timeseries DataFrame object; also importable as `from tempo.tsdf import TSDF`.","symbol":"TSDF","correct":"from tempo import TSDF"},{"note":"Required for creating the Spark session and for general Spark operations.","symbol":"SparkSession","correct":"from pyspark.sql import SparkSession"}],"quickstart":{"code":"from pyspark.sql import SparkSession\nfrom pyspark.sql.functions import to_timestamp\nfrom tempo import TSDF\n\n# Configure Spark for local mode. In a Databricks environment,\n# SparkSession is usually pre-configured and available as 'spark'.\n# NOTE: this is an API sketch for the 0.1.x line; parameter names\n# (e.g. partition_cols) may differ in other releases.\nspark = None\ntry:\n    spark = SparkSession.builder \\\n        .appName(\"TempoQuickstart\") \\\n        .master(\"local[*]\") \\\n        .config(\"spark.sql.shuffle.partitions\", \"2\") \\\n        .getOrCreate()\n\n    print(f\"SparkSession created: {spark.sparkContext.appName}\")\n\n    # 1. Create a sample PySpark DataFrame with timestamp and ID columns\n    data = [\n        (\"sensor_A\", \"2023-01-01 00:00:00\", 10.0),\n        (\"sensor_A\", \"2023-01-01 00:01:00\", 11.0),\n        (\"sensor_A\", \"2023-01-01 00:02:00\", 12.0),\n        (\"sensor_B\", \"2023-01-01 00:00:00\", 20.0),\n        (\"sensor_B\", \"2023-01-01 00:01:00\", 21.0),\n    ]\n    schema = [\"device_id\", \"timestamp_str\", \"value\"]\n    df = spark.createDataFrame(data, schema=schema).withColumn(\n        \"timestamp\", to_timestamp(\"timestamp_str\")\n    ).drop(\"timestamp_str\")\n\n    # 2. Convert to a Tempo TSDF; series identifiers go in partition_cols\n    tsdf = TSDF(df, ts_col=\"timestamp\", partition_cols=[\"device_id\"])\n    print(\"\\nOriginal TSDF:\")\n    tsdf.df.show()\n\n    # 3. Resample to 5-minute intervals, aggregating with the mean\n    resampled_tsdf = tsdf.resample(freq=\"5 minutes\", func=\"mean\")\n    print(\"\\nResampled TSDF (5-min intervals, mean agg):\")\n    resampled_tsdf.df.show()\n\n    # 4. Fill gaps with a forward fill: interpolate() resamples and then\n    # fills empty intervals; method='ffill' carries the last observed\n    # value forward within each series\n    filled_tsdf = tsdf.interpolate(freq=\"5 minutes\", func=\"mean\", method=\"ffill\")\n    print(\"\\nInterpolated TSDF (forward fill):\")\n    filled_tsdf.df.show()\n\nexcept Exception as e:\n    print(f\"An error occurred during the Tempo quickstart: {e}\")\nfinally:\n    if spark:\n        spark.stop()\n        print(\"SparkSession stopped.\")\n","lang":"python","description":"This quickstart demonstrates how to create a SparkSession, build a sample Spark DataFrame, convert it into a Tempo TSDF, and perform common timeseries operations such as resampling and forward-filling missing values via interpolation. It is designed to run in a local PySpark environment or within a Databricks notebook."},"warnings":[{"fix":"Review `asofJoin` behavior and performance in DLT environments. If the previous optimizations are critical, consider explicitly managing join strategies or table sizes, or consult Tempo's documentation for DLT best practices.","message":"The `asofJoin()` small-table optimization logic was updated in v0.1.24 to bypass certain checks when used with Delta Live Tables (DLT). This may affect performance or behavior if your DLT pipelines relied on the previous optimization strategy, particularly for join decisions based on table size.","severity":"gotcha","affected_versions":">=0.1.24"},{"fix":"If using `extractStateInterval()`, re-evaluate your logic and expected output against the new per-metric-column comparison. Adjust downstream processing or specify columns explicitly if you require the old behavior or a different grouping.","message":"The behavior of `TSDF.extractStateInterval()` was changed in v0.1.20 to perform state comparison per metric column rather than across all metric columns combined. This changes the output and how intervals are extracted from state changes.","severity":"breaking","affected_versions":">=0.1.20"},{"fix":"Pin `dbl-tempo` to an exact patch version (`0.x.y`) in production environments. Review release notes and test thoroughly when upgrading, even across patch versions, to catch unexpected changes.","message":"As a Databricks Labs project still on `0.1.x` versions, `dbl-tempo` APIs are not guaranteed stable. Minor version updates can introduce breaking changes or significant behavioral shifts without strict adherence to semantic versioning until a `1.0` release.","severity":"gotcha","affected_versions":"<1.0.0"}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}