PySpark
PySpark is the Python API for Apache Spark, a unified analytics engine for large-scale data processing. It allows users to leverage Spark's powerful distributed computing capabilities, including Spark SQL, DataFrames, Structured Streaming, and MLlib, using familiar Python syntax. The library is actively maintained, with the current version being 4.1.1, and follows the release cadence of the broader Apache Spark project.
Warnings
- breaking PySpark 4.0 dropped support for Python 3.8. Ensure your Python environment is 3.9 or higher.
- breaking PySpark 4.0+ raised the minimum required Pandas version to 2.0.0. If using the Pandas API on Spark, upgrade your Pandas installation accordingly.
- breaking In PySpark 4.1, the Pandas API on Spark operates under ANSI mode by default. This might change behavior for certain operations, especially concerning null handling and type conversions.
- gotcha PySpark operations are lazily evaluated. Transformations (e.g., `filter`, `select`) do not execute immediately; computation only triggers when an action (e.g., `show`, `count`, `collect`, `write`) is called.
- gotcha Calling `.collect()` on a large DataFrame can pull all distributed data to the driver node, potentially causing OutOfMemory (OOM) errors and crashing the application.
- gotcha Apache Spark (and thus PySpark) requires a Java Development Kit (JDK) to be installed and the `JAVA_HOME` environment variable to be correctly set; otherwise, PySpark applications will fail to launch.
- gotcha Ignoring partitioning strategies for DataFrames can lead to data skew and inefficient shuffles, significantly degrading performance for wide transformations like `groupBy` or `join`.
Install
- basic
pip install pyspark
- with extras (Spark SQL, Pandas API on Spark, Spark Connect)
pip install "pyspark[sql,pandas_on_spark,connect]"
Imports
- SparkSession
from pyspark.sql import SparkSession
- functions
from pyspark.sql import functions as F
- Row
from pyspark.sql import Row
- types
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
Quickstart
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# PySpark requires JAVA_HOME to be set. Ensure it points to your JDK installation.
# For example: os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'
# Create a SparkSession - the entry point to Spark functionality
spark = SparkSession.builder \
.appName("PySparkQuickstart") \
.getOrCreate()
# Create a simple DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3), ("David", 1)]
columns = ["name", "value"]
df = spark.createDataFrame(data, columns)
# Show the DataFrame schema and data
df.printSchema()
df.show()
# Perform a simple transformation (filter) and action (show)
filtered_df = df.filter(col("value") > 1)
print("Filtered DataFrame:")
filtered_df.show()
# Group by 'value' and count occurrences
grouped_df = df.groupBy("value").count()
print("Grouped DataFrame:")
grouped_df.show()
# Stop the SparkSession
spark.stop()