{"id":567,"library":"pyspark","title":"PySpark","description":"PySpark is the Python API for Apache Spark, a unified analytics engine for large-scale data processing. It allows users to leverage Spark's powerful distributed computing capabilities, including Spark SQL, DataFrames, Structured Streaming, and MLlib, using familiar Python syntax. The library is actively maintained, with the current version being 4.1.1, and follows the release cadence of the broader Apache Spark project.","status":"active","version":"4.1.1","language":"python","source_language":"en","source_url":"https://github.com/apache/spark/tree/master/python","tags":["big-data","distributed-computing","data-processing","etl","dataframe","sql","machine-learning"],"install":[{"cmd":"pip install pyspark","lang":"bash","label":"Base installation"},{"cmd":"pip install pyspark[sql] pyspark[pandas_on_spark] pyspark[connect]","lang":"bash","label":"With optional components (SQL, Pandas API, Connect)"}],"dependencies":[{"reason":"Apache Spark, which PySpark interfaces with, requires a compatible JDK (version 8 or 11 recommended) to be installed and JAVA_HOME environment variable set.","package":"Java Development Kit (JDK)","optional":false},{"reason":"PySpark 4.0+ requires Python 3.10 or higher.","package":"Python","optional":false}],"imports":[{"note":"The primary entry point for PySpark functionality.","symbol":"SparkSession","correct":"from pyspark.sql import SparkSession"},{"note":"Commonly imported for SQL-like functions such as col, sum, when, lit, etc.","symbol":"functions","correct":"from pyspark.sql import functions as F"},{"note":"Used for creating DataFrames from lists of Row objects.","symbol":"Row","correct":"from pyspark.sql import Row"},{"note":"Used for explicitly defining DataFrame schemas.","symbol":"types","correct":"from pyspark.sql.types import StructType, StructField, StringType, IntegerType"}],"quickstart":{"code":"import os\nfrom pyspark.sql import SparkSession\nfrom 
pyspark.sql.functions import col\n\n# PySpark requires JAVA_HOME to be set. Ensure it points to your JDK installation.\n# For example: os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-17-openjdk-amd64'\n\n# Create a SparkSession - the entry point to Spark functionality\nspark = SparkSession.builder \\\n    .appName(\"PySparkQuickstart\") \\\n    .getOrCreate()\n\n# Create a simple DataFrame\ndata = [(\"Alice\", 1), (\"Bob\", 2), (\"Charlie\", 3), (\"David\", 1)]\ncolumns = [\"name\", \"value\"]\ndf = spark.createDataFrame(data, columns)\n\n# Show the DataFrame schema and data\ndf.printSchema()\ndf.show()\n\n# Perform a simple transformation (filter) and action (show)\nfiltered_df = df.filter(col(\"value\") > 1)\nprint(\"Filtered DataFrame:\")\nfiltered_df.show()\n\n# Group by 'value' and count occurrences\ngrouped_df = df.groupBy(\"value\").count()\nprint(\"Grouped DataFrame:\")\ngrouped_df.show()\n\n# Stop the SparkSession\nspark.stop()\n","lang":"python","description":"This quickstart demonstrates how to initialize a SparkSession, create a DataFrame from Python data, display its schema and content, perform a filtering transformation, and then a grouping and aggregation. It also highlights the importance of setting `JAVA_HOME`."},"warnings":[{"fix":"Upgrade your Python environment to 3.9 or newer. PySpark 4.0+ supports Python 3.9 through 3.13.","message":"PySpark 4.0 dropped support for Python 3.8. Ensure your Python environment is 3.9 or higher.","severity":"breaking","affected_versions":"4.0.0+"},{"fix":"Upgrade Pandas in your environment: `pip install 'pandas>=2.0.0'`","message":"The minimum required Pandas version for PySpark 4.0+ was raised. If using the Pandas API on Spark, ensure your Pandas installation meets the new requirement (2.0.0 or higher).","severity":"breaking","affected_versions":"4.0.0+"},{"fix":"Review your code for potential changes in behavior related to ANSI SQL mode if relying on Pandas API on Spark. 
Refer to Spark documentation for specifics on ANSI mode implications.","message":"In PySpark 4.1, the Pandas API on Spark operates under ANSI mode by default. This might change behavior for certain operations, especially concerning null handling and type conversions.","severity":"breaking","affected_versions":"4.1.0+"},{"fix":"Understand the lazy evaluation model. Use `df.explain()` to see the execution plan without triggering computation, and be aware that actions force execution.","message":"PySpark operations are lazily evaluated. Transformations (e.g., `filter`, `select`) do not execute immediately; computation only triggers when an action (e.g., `show`, `count`, `collect`, `write`) is called.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Avoid `.collect()` for large datasets. Use `df.show()`, `df.take(N)`, `df.limit(N).toPandas()`, or write to distributed storage for inspecting data or small samples.","message":"Calling `.collect()` on a large DataFrame can pull all distributed data to the driver node, potentially causing OutOfMemory (OOM) errors and crashing the application.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Install a compatible JDK (e.g., OpenJDK 17 or 21) and set the `JAVA_HOME` environment variable to point to your JDK installation directory. For minimal environments (e.g., Alpine), ensure that required shell utilities like `bash` are also installed (`apk add bash` for Alpine), as Spark's startup scripts may depend on them.","message":"Apache Spark (and thus PySpark) requires a Java Development Kit (JDK) to be installed and the `JAVA_HOME` environment variable to be correctly set. Additionally, Spark's internal startup scripts often rely on common shell utilities (like `bash`). 
Failure to meet these requirements can prevent PySpark applications from launching, often resulting in a `JAVA_GATEWAY_EXITED` error.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Analyze your data's distribution and use `df.repartition(N, *columns)` or `df.coalesce(N)` before wide transformations to optimize partitioning, especially for high-cardinality columns.","message":"Ignoring partitioning strategies for DataFrames can lead to data skew and inefficient shuffles, significantly degrading performance for wide transformations like `groupBy` or `join`.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-05-12T15:24:08.714Z","next_check":"2026-06-26T00:00:00.000Z","problems":[{"fix":"Install PySpark using pip: `pip install pyspark` or ensure your environment variables (like `PYTHONPATH`) correctly point to your PySpark installation if it's not a standard `pip` install.","cause":"PySpark is either not installed in the active Python environment or the Python interpreter being used does not have access to the PySpark installation.","error":"ModuleNotFoundError: No module named 'pyspark'"},{"fix":"Set the `SPARK_HOME` environment variable to your Spark installation path. For example, on Linux/macOS: `export SPARK_HOME=/path/to/spark` or in a Python script using `import os; os.environ['SPARK_HOME'] = '/path/to/spark'`. 
If using `findspark`, call `findspark.init('/path/to/spark')`.","cause":"The `SPARK_HOME` environment variable, which points to the Apache Spark installation directory, is not set or is not correctly accessible by the PySpark script or interactive session.","error":"KeyError: 'SPARK_HOME'"},{"fix":"Convert the DataFrame to an RDD using `.rdd` before applying RDD transformations (e.g., `df.rdd.map(...)`) or use DataFrame-specific methods like `select()`, `withColumn()`, `filter()`, `udf()` for column-wise operations, which are generally more efficient.","cause":"You are attempting to use the `map()` transformation, which is an RDD (Resilient Distributed Dataset) method, directly on a PySpark DataFrame. DataFrames have different, more optimized, higher-level APIs for transformations.","error":"AttributeError: 'DataFrame' object has no attribute 'map'"},{"fix":"Include the required JAR file(s) in Spark's classpath. When running with `spark-submit`, use the `--jars` option (e.g., `spark-submit --jars postgresql-42.7.0.jar your_script.py`). When creating a `SparkSession`, configure it with `spark.jars` or `spark.driver.extraClassPath` properties.","cause":"This error typically occurs when Spark tries to load a Java class (like a JDBC driver or a custom data source) that is not available in its classpath. This happens frequently when connecting to databases or using external libraries like Delta Lake without providing the necessary JAR files.","error":"java.lang.ClassNotFoundException: org.postgresql.Driver"},{"fix":"Access the `SparkContext` object from your `SparkSession` instance to use `parallelize()`. For example, if your `SparkSession` object is named `spark`, use `spark.sparkContext.parallelize(...)`.","cause":"`parallelize()` is a method of `SparkContext` used to create an RDD from a Python collection. 
You are trying to call this method directly on a `SparkSession` object, which does not expose this functionality directly.","error":"AttributeError: 'SparkSession' object has no attribute 'parallelize'"}],"ecosystem":"pypi","meta_description":null,"install_score":100,"install_tag":"verified","quickstart_score":0,"quickstart_tag":"stale","pypi_latest":null,"install_checks":{"last_tested":"2026-05-12","tag":"verified","tag_description":"installs cleanly on critical runtimes, fast import, recently tested","results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.67,"mem_mb":12.8,"disk_size":"505.1M"},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"sql","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.93,"mem_mb":17.8,"disk_size":"859.9M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.4,"mem_mb":12.8,"disk_size":"506M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"sql","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.69,"mem_mb":17.8,"disk_size":"835M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":1.02,"mem_mb":13.9,"disk_size":"511.0M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"sql","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":1.41,"mem_mb":19.2,"disk_size":"879.6M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim 
(glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.9,"mem_mb":13.9,"disk_size":"512M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"sql","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.8,"mem_mb":19.1,"disk_size":"854M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.91,"mem_mb":13.9,"disk_size":"500.0M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"sql","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":1.28,"mem_mb":19.2,"disk_size":"861.6M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.78,"mem_mb":13.9,"disk_size":"501M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"sql","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":1.13,"mem_mb":19.2,"disk_size":"836M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.75,"mem_mb":14.3,"disk_size":"499.2M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"sql","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.9,"mem_mb":19.1,"disk_size":"860.0M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim 
(glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.8,"mem_mb":14.3,"disk_size":"500M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"sql","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":1.23,"mem_mb":19.1,"disk_size":"834M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.57,"mem_mb":12.1,"disk_size":"483.4M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"sql","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.86,"mem_mb":17.3,"disk_size":"818.4M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.39,"mem_mb":12.1,"disk_size":"484M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"sql","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.91,"mem_mb":17.3,"disk_size":"793M"}]},"quickstart_checks":{"last_tested":"2026-04-23","tag":"stale","tag_description":"widespread failures or data too old to trust","results":[{"runtime":"python:3.10-alpine","exit_code":-1},{"runtime":"python:3.10-slim","exit_code":-1},{"runtime":"python:3.11-alpine","exit_code":-1},{"runtime":"python:3.11-slim","exit_code":-1},{"runtime":"python:3.12-alpine","exit_code":-1},{"runtime":"python:3.12-slim","exit_code":-1},{"runtime":"python:3.13-alpine","exit_code":-1},{"runtime":"python:3.13-slim","exit_code":-1},{"runtime":"python:3.9-alpine","exit_code":-1},{"runtime":"python:3.9-slim","exit_code":-1}]}}