Findspark

2.0.1 · active · verified Fri Apr 10

Findspark is a Python library that makes Apache PySpark importable in standard Python environments such as Jupyter notebooks and IDEs. It locates a Spark installation on the system (using the `SPARK_HOME` environment variable or a set of common install paths) and adds the PySpark and Py4J directories to `sys.path`. The current version, 2.0.1, was released on February 11, 2022, and focuses on stability and bug fixes. It is maintained as a small convenience utility for PySpark users.

Quickstart

This quickstart demonstrates how to initialize findspark and then import PySpark to create a SparkSession. It's crucial that `findspark.init()` is called before `import pyspark`.

import os
import findspark

# Optional: Set SPARK_HOME if it's not already an environment variable
# findspark will try to auto-detect if not set.
# Example: os.environ['SPARK_HOME'] = '/opt/spark'

findspark.init()

try:
    import pyspark
    from pyspark.sql import SparkSession
    print("PySpark is now importable.")

    spark = SparkSession.builder.appName("FindsparkTest").master("local[*]").getOrCreate()
    print(f"SparkContext version: {spark.sparkContext.version}")
    print("SparkSession created successfully.")
    spark.stop()
except ImportError as e:
    print(f"Error importing PySpark: {e}")
    print("Please ensure Spark is installed and SPARK_HOME is correctly configured.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")