PySpark Connect Client
The `pyspark-client` is the Python Spark Connect client for Apache Spark, providing a decoupled client-server architecture that enables remote connectivity to Spark clusters using the DataFrame API. It uses gRPC and Apache Arrow for efficient communication. The library is part of the broader Apache Spark project and is actively developed, with releases typically aligning with Apache Spark's minor and major version updates. The current version is 4.1.1, matching Apache Spark 4.1.1.
Warnings
- gotcha Spark Connect operates on a decoupled client-server architecture. This means your client application does not run in the same JVM process as the Spark driver. Consequently, direct access to the underlying Java Virtual Machine (JVM) objects (e.g., `df._jdf`) via Py4J, common in traditional PySpark, is not possible.
- gotcha The `pyspark-client` is a client library only. It does not include or automatically start a Spark cluster or Spark Connect server. You must have a Spark Connect server running and accessible (e.g., via `start-connect-server.sh` from a full Spark distribution) before your client application can connect.
- breaking Starting with Spark 4.0, ANSI SQL mode is enabled by default (`spark.sql.ansi.enabled` set to `true`). This changes how SQL operations handle invalid or undefined results. Operations that previously returned `NULL` (e.g., division by zero, invalid casts) will now throw runtime exceptions.
- breaking PySpark 4.1 (and thus `pyspark-client` 4.1.1) drops support for Python 3.9. Additionally, minimum required versions for `pyarrow` and `pandas` have been raised to `pyarrow>=15.0.0` and `pandas>=2.2.0`.
- gotcha In Spark 4.1, `DataFrame['name']` on the Spark Connect Python Client no longer eagerly validates the column name. This means misspelled or non-existent column names might not raise an error until later in execution.
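If the pre-4.0 NULL-returning SQL semantics are required, ANSI mode can be disabled on the server side. A minimal sketch, assuming a full Spark 4.x distribution (the configuration is applied when launching the Spark Connect server, not by the client):

```shell
# Start a Spark Connect server with ANSI SQL mode disabled.
# This restores pre-4.0 behavior (e.g., 1/0 returns NULL instead of raising);
# not generally recommended for new applications.
./sbin/start-connect-server.sh --conf spark.sql.ansi.enabled=false
```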
Install
- `pip install pyspark-client` (lightweight, client-only package)
- `pip install pyspark[connect]` (full PySpark with Spark Connect dependencies)
Imports
- SparkSession
from pyspark.sql import SparkSession  # preferred; dispatches to Spark Connect when a remote URL is used
from pyspark.sql.connect.session import SparkSession  # explicit Spark Connect session class
Quickstart
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
# Ensure a Spark Connect server is running, e.g., via ./sbin/start-connect-server.sh
# The default address is sc://localhost:15002
# Connect to the Spark Connect server
spark = SparkSession.builder.remote(os.environ.get('SPARK_CONNECT_SERVER_URL', 'sc://localhost:15002')).getOrCreate()
# Create a DataFrame
df = spark.range(10).withColumn("hello", lit("world"))
# Show the DataFrame
df.show()
# Perform a simple operation
result = df.filter(df.id > 5).count()
print(f"Count of rows with id > 5: {result}")
spark.stop()
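Beyond `host:port`, a Spark Connect URL can carry optional parameters after a `/;` separator, such as `sc://host:15002/;use_ssl=true;token=...`. The helper below is a rough illustration of that format only; `parse_connect_url` is hypothetical and not part of PySpark:

```python
def parse_connect_url(url: str) -> tuple[str, int, dict]:
    """Hypothetical helper: split a Spark Connect URL into (host, port, params).

    Illustrates the sc://host:port/;key=value;... connection string shape;
    the real client performs its own, more complete parsing.
    """
    if not url.startswith("sc://"):
        raise ValueError("Spark Connect URLs must start with sc://")
    rest = url[len("sc://"):]
    # Parameters, if any, follow a "/;" separator after host:port.
    netloc, _, param_str = rest.partition("/;")
    host, _, port = netloc.partition(":")
    params = dict(p.split("=", 1) for p in param_str.split(";") if p)
    # 15002 is the default Spark Connect server port.
    return host, int(port) if port else 15002, params

host, port, params = parse_connect_url("sc://localhost:15002/;use_ssl=true")
```

Here `host` is `"localhost"`, `port` is `15002`, and `params` is `{"use_ssl": "true"}`.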