{"id":10245,"library":"spark-sklearn","title":"Spark-sklearn: Scikit-learn on Spark","description":"spark-sklearn provides integration tools for running scikit-learn's GridSearchCV and RandomizedSearchCV on Apache Spark clusters. It leverages Spark for distributed computation of model training, allowing users to scale hyperparameter tuning. The library is currently at version 0.3.0, with its last release in 2017, and appears to be in an abandoned state with no active development or maintenance.","status":"abandoned","version":"0.3.0","language":"en","source_language":"en","source_url":"https://github.com/databricks/spark-sklearn","tags":["spark","scikit-learn","distributed-ml","machine-learning","hyperparameter-tuning","databricks"],"install":[{"cmd":"pip install spark-sklearn pyspark","lang":"bash","label":"Install spark-sklearn with PySpark"}],"dependencies":[{"reason":"Required to interact with Apache Spark. spark-sklearn officially supports Spark 2.x.","package":"pyspark","optional":false},{"reason":"The core machine learning library spark-sklearn integrates with. Officially supports scikit-learn 0.18.x.","package":"scikit-learn","optional":false}],"imports":[{"symbol":"GridSearchCV","correct":"from spark_sklearn import GridSearchCV"},{"symbol":"RandomizedSearchCV","correct":"from spark_sklearn import RandomizedSearchCV"},{"note":"SparkContext comes from PySpark, not spark-sklearn directly.","wrong":"from spark_sklearn import SparkContext","symbol":"SparkContext","correct":"from pyspark import SparkContext"}],"quickstart":{"code":"import os\nfrom pyspark import SparkContext\nfrom spark_sklearn import GridSearchCV\nfrom sklearn.svm import SVC\nfrom sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\n\n# Initialize SparkContext\n# For local testing, 'local[*]' works. 
For a cluster, set the SPARK_MASTER env var,\n# which is read here and passed explicitly to SparkContext below.\nmaster = os.environ.get('SPARK_MASTER', 'local[*]')\n\nsc = None\ntry:\n    sc = SparkContext(master=master, appName=\"SparkSklearnExample\")\n\n    # Generate some synthetic data\n    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)\n    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\n    # Define the estimator and parameter grid\n    estimator = SVC(gamma='auto', random_state=42)\n    param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}\n\n    # Use Spark-backed GridSearchCV (note the SparkContext as first argument)\n    clf = GridSearchCV(sc, estimator, param_grid, cv=3)\n    clf.fit(X_train, y_train)\n\n    print(\"Best parameters found:\", clf.best_params_)\n    print(\"Best cross-validation score:\", clf.best_score_)\n    print(\"Test set accuracy:\", clf.score(X_test, y_test))\n\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\nfinally:\n    # Always release Spark resources, even if the search failed\n    if sc is not None:\n        sc.stop()","lang":"python","description":"This quickstart demonstrates how to use spark-sklearn's GridSearchCV to tune the hyperparameters of a scikit-learn SVC model, distributing the computation across a Spark cluster (or running it locally). It covers SparkContext initialization, data preparation, defining the estimator and parameter grid, fitting the model, and retrieving results."},"warnings":[{"fix":"Consider Spark MLlib for natively distributed machine learning, or a maintained alternative for distributing scikit-learn workloads, such as joblib-spark (the Joblib Apache Spark backend recommended in the project's own deprecation notice) or dask-ml.","message":"Project is abandoned and unmaintained. 
The last commit was in 2017, so it receives no bug fixes, security updates, or compatibility patches for newer Python, Spark, or scikit-learn versions.","severity":"breaking","affected_versions":"0.3.0 and older"},{"fix":"Pin Spark and scikit-learn to the officially supported versions, or migrate to an actively maintained distributed ML solution.","message":"Strict compatibility with older Spark and scikit-learn versions. spark-sklearn officially supports Spark 2.x and scikit-learn 0.18.x. Using it with newer versions will likely lead to runtime errors or unexpected behavior.","severity":"breaking","affected_versions":"All versions, when used with Spark newer than 2.x or scikit-learn newer than 0.18.x"},{"fix":"Monitor the Spark UI for serialization/deserialization times. For large datasets on Spark, consider using Spark MLlib, which operates natively on Spark DataFrames.","message":"Potential performance overhead due to data serialization/deserialization. Data is often converted between Spark RDDs/DataFrames and the NumPy arrays scikit-learn expects, which can incur significant overhead for very large datasets.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Treat the library as a narrow tool for hyperparameter search. For other distributed scikit-learn tasks, look into libraries like dask-ml; for full Spark integration, use Spark MLlib.","message":"Functionality is limited to GridSearchCV and RandomizedSearchCV. 
spark-sklearn does not provide broader integration with other scikit-learn functionalities or a direct bridge to Spark's native MLlib estimators.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Run `pip install spark-sklearn` to install the package.","cause":"The spark-sklearn library is not installed in your Python environment.","error":"ModuleNotFoundError: No module named 'spark_sklearn'"},{"fix":"Run `pip install pyspark` to install PySpark. Ensure your PySpark version is compatible with your Spark installation.","cause":"The PySpark library, a core dependency for spark-sklearn, is not installed.","error":"ModuleNotFoundError: No module named 'pyspark'"},{"fix":"Verify your `JAVA_HOME` environment variable points to a compatible Java Development Kit (JDK) (e.g., Java 8 for Spark 2.x). Ensure `SPARK_HOME` is set correctly and Spark binaries are accessible. Check Spark logs for more specific errors.","cause":"This typically indicates an issue with the Spark environment setup, such as an incorrect Java version, insufficient memory, or problems finding Spark binaries.","error":"Java gateway process exited before sending its port number."}]}