PySpark Stubs
PySpark Stubs (`pyspark-stubs`) provides type stubs for the Apache PySpark library. These stubs let IDEs and static type checkers such as MyPy offer intelligent code completion, catch common programming errors, and improve code quality by enforcing type safety in PySpark applications. The current version is 3.0.0.post3; releases typically track major PySpark versions, with `post` suffixes for stub-only refinements.
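As a minimal sketch of what the stubs enable (the `select_adults` helper is invented for this example): guarding the import with `typing.TYPE_CHECKING` lets a module carry PySpark annotations that MyPy resolves through the stubs, even in environments where `pyspark` itself is absent at runtime.

```python
# Sketch only: `select_adults` is a hypothetical helper. With pyspark-stubs
# installed, mypy resolves DataFrame statically and checks that .filter
# exists and returns a DataFrame. The TYPE_CHECKING guard keeps the import
# type-check-only, so nothing is imported from pyspark at runtime.
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from pyspark.sql import DataFrame


def select_adults(df: DataFrame) -> DataFrame:
    # DataFrame.filter accepts a Column or a SQL condition string
    return df.filter("age >= 18")
```

Because of `from __future__ import annotations`, the annotations stay as strings at runtime; only MyPy (via the stubs) evaluates them.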
Common errors
- mypy: error: Cannot find module named 'pyspark'
  cause: The `pyspark-stubs` package is either not installed, or `mypy` cannot find it in your environment's `site-packages`.
  fix: Run `pip install pyspark-stubs`. Ensure your `mypy` configuration (e.g., `mypy.ini`) points to the correct Python environment, or that `pyspark-stubs` is installed in your project's virtual environment.
- mypy: error: Module 'pyspark.sql' has no attribute 'SparkSession'
  cause: This typically indicates a version mismatch between `pyspark-stubs` and your installed `pyspark`, or that the stubs for `SparkSession` are not being picked up by `mypy`.
  fix: Verify that your `pyspark-stubs` version matches your `pyspark` version (e.g., `pyspark-stubs==3.0.*` for `pyspark==3.0.*`), and confirm that `pyspark-stubs` is installed in the environment `mypy` is checking.
- pip install pyspark-stubs fails with a dependency error related to `pyspark` (e.g., 'Requires-Dist: pyspark')
  cause: Older versions of `pyspark-stubs` listed `pyspark` as a dependency with specific version requirements, which could conflict if `pyspark` was already installed under different constraints.
  fix: Use a recent `pyspark-stubs` release (3.0.0.post1 or later). If the issue persists and `pyspark` is already installed, install the stubs with `pip install --no-deps pyspark-stubs`, then verify the versions. This workaround is usually unnecessary for modern releases.
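A quick way to confirm the version pairing described above is to compare the installed distributions programmatically. This is a minimal sketch: the helpers `versions_match` and `check_stub_alignment` are illustrative, not part of either package.

```python
# Sketch: check that pyspark and pyspark-stubs agree on major.minor.
# Only the package names are real; the helper functions are invented here.
from importlib import metadata


def versions_match(a: str, b: str) -> bool:
    """Compare the major.minor prefix of two version strings.

    Trailing components such as the '.post3' in 3.0.0.post3 are ignored,
    since only the first two components need to line up.
    """
    return a.split(".")[:2] == b.split(".")[:2]


def check_stub_alignment() -> None:
    try:
        spark_v = metadata.version("pyspark")
        stubs_v = metadata.version("pyspark-stubs")
    except metadata.PackageNotFoundError as exc:
        print(f"Not installed: {exc}")
        return
    status = "OK" if versions_match(spark_v, stubs_v) else "MISMATCH"
    print(f"pyspark={spark_v} pyspark-stubs={stubs_v} -> {status}")


if __name__ == "__main__":
    check_stub_alignment()
```

Run it inside the same environment `mypy` uses; a MISMATCH result points at the version-skew errors listed above.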
Warnings
- gotcha: The version of `pyspark-stubs` should generally match the major version of your `pyspark` installation. Mismatched versions can lead to incorrect type-checking results, including missing attributes or incompatible type signatures.
- gotcha: `pyspark-stubs` only provides type hint files (`.pyi`); it does not include or install the actual `pyspark` library. Your code will not run if `pyspark` is not installed separately.
- gotcha: Installing `pyspark-stubs` has no runtime effect on your PySpark application. Its sole purpose is to provide static type information for tools like MyPy, Pylance, or other IDEs.
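One way to keep these gotchas visible rather than silently masked: a minimal `mypy.ini` sketch that leaves `pyspark` import errors unsuppressed, so a missing or mismatched stub package surfaces as an error. The values here are illustrative defaults, not requirements of `pyspark-stubs` itself.

```ini
; Minimal mypy.ini sketch; values are illustrative, not required.
[mypy]
python_version = 3.7

; Once pyspark-stubs is installed in the environment mypy runs in, the
; pyspark annotations are resolved from the installed .pyi files.
[mypy-pyspark.*]
; Keep errors visible: do not silence missing pyspark imports.
ignore_missing_imports = False
```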
Install
- pip install pyspark-stubs
- pip install 'pyspark-stubs==3.0.*'
Imports
- SparkSession
from pyspark.sql import SparkSession
Quickstart
from typing import List

from pyspark.sql import SparkSession

# Instantiate SparkSession (requires PySpark to be installed and configured)
spark: SparkSession = (
    SparkSession.builder
    .appName("PySparkStubsExample")
    .getOrCreate()
)

# Example of using PySpark with type hints
def process_data(data: List[int]) -> List[int]:
    # In a real scenario, this would involve Spark RDDs/DataFrames.
    # This simplified example shows type hints in action; for a type
    # checker, pyspark-stubs helps validate Spark-specific types.
    print(f"Processing data: {data}")
    return [x * 2 for x in data]

if __name__ == '__main__':
    sample_data: List[int] = [1, 2, 3]
    processed_result = process_data(sample_data)
    print(f"Processed result: {processed_result}")

    # Example with a Spark DataFrame (requires a running SparkSession).
    # For type checking, pyspark-stubs ensures `spark` is typed correctly.
    data_df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "age"])
    data_df.printSchema()
    data_df.show()
    spark.stop()