PySpark

4.1.1 · active · verified Sat Mar 28

PySpark is the Python API for Apache Spark, a unified analytics engine for large-scale data processing. It allows users to leverage Spark's powerful distributed computing capabilities, including Spark SQL, DataFrames, Structured Streaming, and MLlib, using familiar Python syntax. The library is actively maintained, with the current version being 4.1.1, and follows the release cadence of the broader Apache Spark project.

Warnings

PySpark requires a local Java installation; set JAVA_HOME to a compatible JDK before creating a SparkSession. Spark 4.x requires Java 17 or later, and the pip package bundles Spark itself, so no separate Spark download is needed for local use.

Install

pip install pyspark

Imports

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
Quickstart

This quickstart demonstrates how to initialize a SparkSession, create a DataFrame from Python data, display its schema and content, perform a filtering transformation, and then a grouping and aggregation. It also highlights the importance of setting `JAVA_HOME`.

import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# PySpark requires JAVA_HOME to point to a compatible JDK (Java 17+ for Spark 4.x).
# For example: os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-17-openjdk-amd64'

# Create a SparkSession - the entry point to Spark functionality
spark = SparkSession.builder \
    .appName("PySparkQuickstart") \
    .getOrCreate()

# Create a simple DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3), ("David", 1)]
columns = ["name", "value"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame schema and data
df.printSchema()
df.show()

# Perform a simple transformation (filter) and action (show)
filtered_df = df.filter(col("value") > 1)
print("Filtered DataFrame:")
filtered_df.show()

# Group by 'value' and count occurrences
grouped_df = df.groupBy("value").count()
print("Grouped DataFrame:")
grouped_df.show()

# Stop the SparkSession
spark.stop()
