PySpark Stubs

3.0.0.post3 · active · verified Thu Apr 16

PySpark Stubs (pyspark-stubs) provides type stubs for the Apache PySpark library. These stubs enable IDEs and static type checkers such as mypy to offer intelligent code completion, detect common programming errors, and improve code quality by enforcing type safety in PySpark applications. The current version, 3.0.0.post3, tracks the corresponding PySpark release; `post` suffixes mark stub-only refinements.
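A typical setup installs the stubs next to PySpark and runs mypy over the project. This is a sketch of the usual workflow; `my_spark_job.py` is a placeholder for your own script:

```shell
# Install PySpark and the matching stub package from PyPI.
pip install pyspark pyspark-stubs

# Type-check a script; mypy picks up installed stub packages automatically.
mypy my_spark_job.py
```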

Quickstart

This quickstart demonstrates how to use PySpark with type hints, leveraging `pyspark-stubs`. Install `pyspark-stubs` alongside your `pyspark` installation. When a type checker like mypy processes this code, it uses the installed stubs to validate types for PySpark objects such as `SparkSession` and `DataFrame` methods. The stubs themselves have no runtime effect; they only assist static analysis.

from typing import List

from pyspark.sql import DataFrame, SparkSession

# Instantiate a SparkSession (requires PySpark to be installed and configured).
spark: SparkSession = (
    SparkSession.builder
    .appName("PySparkStubsExample")
    .getOrCreate()
)

def process_data(data: List[int]) -> List[int]:
    # A plain-Python helper; the annotations here are validated statically
    # by mypy, while `pyspark-stubs` covers the Spark-specific types below.
    print(f"Processing data: {data}")
    return [x * 2 for x in data]

if __name__ == "__main__":
    sample_data: List[int] = [1, 2, 3]
    processed_result: List[int] = process_data(sample_data)
    print(f"Processed result: {processed_result}")

    # With the stubs installed, mypy knows `createDataFrame` returns a
    # DataFrame and can check the methods called on it.
    data_df: DataFrame = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "age"])
    data_df.printSchema()
    data_df.show()

    spark.stop()
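To illustrate why the stubs are static-only, this minimal stdlib-only sketch (no Spark needed) shows that Python does not enforce annotations at runtime; a checker like mypy is what reports the mismatch:

```python
from typing import List

def double_all(data: List[int]) -> List[int]:
    # Annotations are metadata; the interpreter ignores them at runtime.
    return [x * 2 for x in data]

print(double_all([1, 2, 3]))  # [2, 4, 6]

# mypy flags this call (incompatible argument type), but it still runs,
# because `*` on strings means repetition:
mistyped = double_all(["a", "b"])  # type: ignore[arg-type]
print(mistyped)  # ['aa', 'bb']
```

The same principle applies to the Spark quickstart above: deleting `pyspark-stubs` changes nothing about how the job executes, only what mypy can verify before it runs.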
