PySpark
PySpark is the Python API for Apache Spark, a unified analytics engine for large-scale data processing. It allows users to leverage Spark's powerful distributed computing capabilities, including Spark SQL, DataFrames, Structured Streaming, and MLlib, using familiar Python syntax. The library is actively maintained, with the current version being 4.1.1, and follows the release cadence of the broader Apache Spark project.
Warnings
- breaking PySpark 4.0 dropped support for Python 3.8. Ensure your Python environment is 3.9 or higher.
- breaking PySpark 4.0+ raised the minimum required Pandas version to 2.0.0. If using the Pandas API on Spark, upgrade your Pandas installation accordingly.
- breaking In PySpark 4.1, the Pandas API on Spark operates under ANSI mode by default. This might change behavior for certain operations, especially concerning null handling and type conversions.
- gotcha PySpark operations are lazily evaluated. Transformations (e.g., `filter`, `select`) do not execute immediately; computation only triggers when an action (e.g., `show`, `count`, `collect`, `write`) is called.
- gotcha Calling `.collect()` on a large DataFrame can pull all distributed data to the driver node, potentially causing OutOfMemory (OOM) errors and crashing the application.
- gotcha Apache Spark (and thus PySpark) requires a Java Development Kit (JDK) to be installed and the `JAVA_HOME` environment variable to be correctly set; otherwise, PySpark applications will fail to launch.
- gotcha Ignoring partitioning strategies for DataFrames can lead to data skew and inefficient shuffles, significantly degrading performance for wide transformations like `groupBy` or `join`.
Install
- basic
pip install pyspark
- with extras (Spark SQL, Pandas API on Spark, Spark Connect)
pip install "pyspark[sql,pandas_on_spark,connect]"
Imports
- SparkSession
from pyspark.sql import SparkSession
- functions
from pyspark.sql import functions as F
- Row
from pyspark.sql import Row
- types
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
Quickstart
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# PySpark requires JAVA_HOME to be set. Ensure it points to your JDK installation.
# For example: os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'
# Create a SparkSession - the entry point to Spark functionality
spark = SparkSession.builder \
.appName("PySparkQuickstart") \
.getOrCreate()
# Create a simple DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3), ("David", 1)]
columns = ["name", "value"]
df = spark.createDataFrame(data, columns)
# Show the DataFrame schema and data
df.printSchema()
df.show()
# Perform a simple transformation (filter) and action (show)
filtered_df = df.filter(col("value") > 1)
print("Filtered DataFrame:")
filtered_df.show()
# Group by 'value' and count occurrences
grouped_df = df.groupBy("value").count()
print("Grouped DataFrame:")
grouped_df.show()
# Stop the SparkSession
spark.stop()