{"id":2502,"library":"findspark","title":"Findspark","description":"Findspark is a Python library that simplifies making Apache PySpark importable in standard Python environments, such as Jupyter notebooks and IDEs. It automatically locates a Spark installation on the system (using `SPARK_HOME` or common default paths) and adds the necessary PySpark and Py4J directories to `sys.path`. The current version, 2.0.1, was released on February 11, 2022, and focuses on stability and bug fixes. It is maintained as a utility for PySpark users.","status":"active","version":"2.0.1","language":"en","source_language":"en","source_url":"https://github.com/minrk/findspark","tags":["spark","pyspark","environment","setup","jupyter"],"install":[{"cmd":"pip install findspark","lang":"bash","label":"Install latest version"}],"dependencies":[],"imports":[{"note":"findspark.init() must be called before pyspark is imported so that sys.path is set up correctly.","wrong":"import pyspark\nimport findspark\nfindspark.init()","symbol":"findspark","correct":"import findspark\nfindspark.init()"}],"quickstart":{"code":"import os\nimport findspark\n\n# Optional: Set SPARK_HOME if it's not already an environment variable.\n# findspark will try to auto-detect the installation if it is not set.\n# Example: os.environ['SPARK_HOME'] = '/opt/spark'\n\nfindspark.init()\n\ntry:\n    import pyspark\n    from pyspark.sql import SparkSession\n    print(\"PySpark is now importable.\")\n\n    spark = SparkSession.builder.appName(\"FindsparkTest\").master(\"local[*]\").getOrCreate()\n    print(f\"Spark version: {spark.sparkContext.version}\")\n    print(\"SparkSession created successfully.\")\n    spark.stop()\nexcept ImportError as e:\n    print(f\"Error importing PySpark: {e}\")\n    print(\"Please ensure Spark is installed and SPARK_HOME is correctly configured.\")\nexcept Exception as e:\n    print(f\"An unexpected error occurred: {e}\")\n","lang":"python","description":"This quickstart demonstrates how to initialize findspark and then import PySpark to create a SparkSession. It is crucial that `findspark.init()` is called before `import pyspark`."},"warnings":[{"fix":"Always place `import findspark; findspark.init()` at the very beginning of your script or notebook, prior to any `import pyspark` statement.","message":"`findspark.init()` must be called *before* any `import pyspark` statement. If `pyspark` is imported first, `findspark.init()` will do nothing (since `findspark` 2.0.0) or fail to configure `sys.path` correctly, leading to `ModuleNotFoundError` or similar issues when trying to use Spark functionality.","severity":"gotcha","affected_versions":">=2.0.0"},{"fix":"Ensure `SPARK_HOME` is correctly set and points to a valid Spark installation. Verify the presence of `python/lib/py4j-*.zip` and `python/lib/pyspark.zip` within your `SPARK_HOME` directory. You can manually specify the path using `findspark.init('/path/to/spark_home')`.","message":"An incorrect `SPARK_HOME` or an incomplete Spark installation can cause errors. `findspark` relies on the `SPARK_HOME` environment variable or specific default paths. If Spark is not found, or `SPARK_HOME` points to an installation missing crucial components (e.g., the Py4J libraries), `findspark.init()` may raise a `ValueError` or `IndexError`.","severity":"gotcha","affected_versions":"All"},{"fix":"Always restart your Python kernel or shell session after modifying `SPARK_HOME` or calling `findspark.init()` with a new `spark_home` if you intend to switch Spark versions or installations.","message":"Changing the `SPARK_HOME` environment variable or specifying a new `spark_home` path in `findspark.init()` within a long-running Python session (e.g., a Jupyter notebook) typically requires restarting the Python kernel for the changes to take effect, especially if `pyspark` has already been loaded.","severity":"gotcha","affected_versions":"All"},{"fix":"Prefer setting `SPARK_HOME` explicitly in your environment setup (e.g., `~/.bashrc`, `~/.profile`, or virtual environment activation scripts) rather than relying on `findspark` to persist changes. Manage your shell or IPython profiles manually if persistence is required.","message":"The `edit_rc=True` and `edit_profile=True` arguments to `findspark.init()` modify shell configuration files (e.g., `~/.bashrc`) or IPython profiles. While intended for convenience, these methods can have unintended side effects, make environments less reproducible, or cause issues if the Spark installation moves. Since `findspark` 2.0.0, the related internal methods have been made private.","severity":"deprecated","affected_versions":">=2.0.0 (internal methods deprecated), All (usage gotcha)"}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}