PySpark Connect Client

4.1.1 · active · verified Wed Apr 15

`pyspark-client` is the Python Spark Connect client for Apache Spark. It provides a decoupled client-server architecture in which the client connects remotely to a Spark cluster through the DataFrame API, using gRPC and Apache Arrow for efficient communication. The library is part of the Apache Spark project and is actively developed; releases typically align with Apache Spark's major and minor versions. The current version is 4.1.1, which supports Spark 4.1.1.

Warnings

Install

Imports

Quickstart

This quickstart connects to a Spark Connect server and performs basic DataFrame operations. It assumes a server is already running and reachable at the given URL, which defaults to `sc://localhost:15002`. The script reads the `SPARK_CONNECT_SERVER_URL` environment variable, if set, to override that connection string.

import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Ensure a Spark Connect server is running, e.g., via ./sbin/start-connect-server.sh
# The default address is sc://localhost:15002

# Connect to the Spark Connect server (URL taken from the environment, else the default)
server_url = os.environ.get("SPARK_CONNECT_SERVER_URL", "sc://localhost:15002")
spark = SparkSession.builder.remote(server_url).getOrCreate()

# Create a DataFrame
df = spark.range(10).withColumn("hello", lit("world"))

# Show the DataFrame
df.show()

# Perform a simple operation
result = df.filter(df.id > 5).count()
print(f"Count of rows with id > 5: {result}")

spark.stop()
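
The quickstart uses a bare `sc://host:port` address. Spark Connect connection strings can also carry parameters after the path as `;key=value` pairs, such as `token` or `use_ssl`. The sketch below shows one way to assemble such a string. The helper `build_connect_url` and the host `spark.example.com` are hypothetical names used only for illustration; they are not part of `pyspark-client`.

```python
from urllib.parse import quote


def build_connect_url(host="localhost", port=15002, **params):
    """Return a connection string of the form sc://host:port/;key=value;...

    Hypothetical helper for illustration; not a pyspark-client API.
    """
    url = f"sc://{host}:{port}"
    if params:
        # Parameters follow the path as ;key=value pairs; values are percent-encoded.
        pairs = ";".join(
            f"{key}={quote(str(value), safe='')}" for key, value in params.items()
        )
        url = f"{url}/;{pairs}"
    return url


# Default local server, as in the quickstart.
print(build_connect_url())

# A remote server with TLS and a bearer token (placeholder values).
print(build_connect_url("spark.example.com", 443, use_ssl="true", token="abc123"))
```

The resulting string can be passed to `SparkSession.builder.remote(...)` in place of the literal URL used above. Which parameters a given server accepts depends on its configuration, so treat the ones shown here as examples.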
