{"id":6818,"library":"pyspark-client","title":"PySpark Connect Client","description":"The `pyspark-client` is the Python Spark Connect client for Apache Spark, providing a decoupled client-server architecture that enables remote connectivity to Spark clusters using the DataFrame API. It uses gRPC and Apache Arrow for efficient communication. The library is part of the broader Apache Spark project and is actively developed, with releases typically aligning with Apache Spark's minor and major version updates. The current version is 4.1.1, supporting Spark 4.1.1.","status":"active","version":"4.1.1","language":"en","source_language":"en","source_url":"https://github.com/apache/spark/tree/master/python","tags":["spark","apache","data-processing","distributed-computing","client","spark-connect"],"install":[{"cmd":"pip install pyspark-client","lang":"bash","label":"Basic Install"},{"cmd":"pip install pyspark[connect]","lang":"bash","label":"Full PySpark with Connect"}],"dependencies":[{"reason":"The client is part of the PySpark ecosystem and typically used alongside a full PySpark installation, though `pyspark-client` can be installed standalone for just the client. 
`pyspark[connect]` provides the client and its dependencies.","package":"pyspark","optional":false},{"reason":"Underpins the communication protocol between the client and the Spark Connect server.","package":"grpcio","optional":true},{"reason":"Used for optimized, columnar data transfer between the Spark Connect server and the client.","package":"pyarrow","optional":true}],"imports":[{"note":"The `SparkSession` is imported from `pyspark.sql` and then configured for remote connection using `.remote()`.","wrong":"from pyspark.sql.connect.session import SparkSession","symbol":"SparkSession","correct":"from pyspark.sql import SparkSession"}],"quickstart":{"code":"import os\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.functions import lit\n\n# Ensure a Spark Connect server is running, e.g., via ./sbin/start-connect-server.sh\n# The default address is sc://localhost:15002\n\n# Connect to the Spark Connect server\nspark = SparkSession.builder.remote(os.environ.get('SPARK_CONNECT_SERVER_URL', 'sc://localhost:15002')).getOrCreate()\n\n# Create a DataFrame\ndf = spark.range(10).withColumn(\"hello\", lit(\"world\"))\n\n# Show the DataFrame\ndf.show()\n\n# Perform a simple operation\nresult = df.filter(df.id > 5).count()\nprint(f\"Count of rows with id > 5: {result}\")\n\nspark.stop()","lang":"python","description":"This quickstart demonstrates how to establish a connection to a Spark Connect server and perform basic DataFrame operations. It assumes a Spark Connect server is already running and accessible at the specified URL (defaulting to `sc://localhost:15002`). The `SPARK_CONNECT_SERVER_URL` environment variable can be used to override the connection string."},"warnings":[{"fix":"Rewrite code to use standard PySpark DataFrame API methods. Avoid direct JVM object manipulation. Focus on the DataFrame API and logical plans.","message":"Spark Connect operates on a decoupled client-server architecture. 
This means your client application does not run in the same JVM process as the Spark driver. Consequently, direct access to the underlying Java Virtual Machine (JVM) objects (e.g., `df._jdf`) via Py4J, common in traditional PySpark, is not possible.","severity":"gotcha","affected_versions":"Spark 3.4.0+ (all Spark Connect versions)"},{"fix":"Ensure a Spark Connect server is deployed and running, and configure your client's `SparkSession.builder.remote()` method with the correct connection string (e.g., `sc://localhost:15002`).","message":"The `pyspark-client` is a client library only. It does not include or automatically start a Spark cluster or Spark Connect server. You must have a Spark Connect server running and accessible (e.g., via `start-connect-server.sh` from a full Spark distribution) before your client application can connect.","severity":"gotcha","affected_versions":"All versions of `pyspark-client`"},{"fix":"Review existing SQL queries and DataFrame operations that might rely on `NULL` handling. Implement explicit error handling or adjust queries to conform to ANSI SQL standards. The legacy behavior can be restored by setting `spark.sql.ansi.enabled` to `false`.","message":"Starting with Spark 4.0, ANSI SQL mode is enabled by default (`spark.sql.ansi.enabled` set to `true`). This changes how SQL operations handle invalid or undefined results. Operations that previously returned `NULL` (e.g., division by zero, invalid casts) will now throw runtime exceptions.","severity":"breaking","affected_versions":"Spark 4.0.0+ (including `pyspark-client` 4.0.0+)"},{"fix":"Upgrade your Python environment to 3.10 or later. Update `pyarrow` and `pandas` to their respective minimum required versions or newer. If using `pyspark[connect]`, `pip` will generally handle these dependencies correctly.","message":"PySpark 4.1 (and thus `pyspark-client` 4.1.1) drops support for Python 3.9. 
Additionally, minimum required versions for `pyarrow` and `pandas` have been raised to `pyarrow>=15.0.0` and `pandas>=2.2.0`.","severity":"breaking","affected_versions":"PySpark 4.1.0+ / `pyspark-client` 4.1.0+"},{"fix":"Set the environment variable `PYSPARK_VALIDATE_COLUMN_NAME_LEGACY=1` to restore the legacy eager validation behavior if desired.","message":"In Spark 4.1, `DataFrame['name']` on the Spark Connect Python Client no longer eagerly validates the column name. This means misspelled or non-existent column names might not raise an error until later in execution.","severity":"gotcha","affected_versions":"Spark 4.1.0+ (for Spark Connect Python Client)"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}
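As a companion to the record's ANSI-mode warning, here is a minimal sketch of how a client might cope with the Spark 4.x behavior change over Spark Connect. It assumes a Spark Connect server is reachable; `SPARK_CONNECT_SERVER_URL` follows the quickstart's convention, and `remote_url`/`main` are illustrative helpers, not part of the library.

```python
import os


def remote_url() -> str:
    # Resolve the Spark Connect URL; sc://localhost:15002 is the default
    # address used by start-connect-server.sh.
    return os.environ.get("SPARK_CONNECT_SERVER_URL", "sc://localhost:15002")


def main() -> None:
    # Imported lazily so the sketch can be read without a server running.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(remote_url()).getOrCreate()
    df = spark.range(5)

    # Under ANSI mode (default since Spark 4.0), division by zero raises a
    # runtime error instead of returning NULL.
    try:
        df.selectExpr("id / 0 AS bad").collect()
    except Exception as exc:
        print(f"ANSI mode raised: {type(exc).__name__}")

    # Option 1: try_divide returns NULL on division by zero even in ANSI mode.
    df.selectExpr("try_divide(id, 0) AS safe").show()

    # Option 2: restore the legacy behavior for this session only.
    spark.conf.set("spark.sql.ansi.enabled", "false")
    df.selectExpr("id / 0 AS legacy_null").show()

    spark.stop()


# main()  # Uncomment to run against a live Spark Connect server.
```

Whether to prefer `try_*` functions or a session-wide config flip is a design choice: the former keeps ANSI's stricter checking everywhere else, while the latter is a quick escape hatch for legacy pipelines.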