{"library":"spark-sklearn","title":"Spark-sklearn: Scikit-learn on Spark","type":"library","description":"spark-sklearn provides integration tools for running scikit-learn's GridSearchCV and RandomizedSearchCV on Apache Spark clusters. It uses Spark to distribute the individual fits of a hyperparameter search (parameter combinations and cross-validation folds), so tuning scales with the cluster; it does not distribute the training of a single model. The library is currently at version 0.3.0, with its last release in 2017, and appears to be abandoned, with no active development or maintenance.","language":"python","status":"abandoned","last_verified":"Fri Apr 17","install":{"commands":["pip install spark-sklearn pyspark"],"cli":null},"imports":["from spark_sklearn import GridSearchCV","from spark_sklearn import RandomizedSearchCV","from pyspark import SparkContext"],"auth":{"required":false,"env_vars":[]},"links":{"homepage":null,"github":"https://github.com/databricks/spark-sklearn","docs":null,"changelog":null,"pypi":"https://pypi.org/project/spark-sklearn/","npm":null,"openapi_spec":null,"status_page":null,"smithery":null},"quickstart":{"code":"import os\nfrom pyspark import SparkContext\nfrom spark_sklearn import GridSearchCV\nfrom sklearn.svm import SVC\nfrom sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\n\n# Initialize SparkContext.\n# Defaults to 'local[*]' for local testing; for a cluster, set the SPARK_MASTER env var to the master URL.\n# The value is passed to SparkContext explicitly, since setting the env var alone does not configure the master.\nmaster = os.environ.get('SPARK_MASTER', 'local[*]')\n\nsc = None\ntry:\n    sc = SparkContext(master=master, appName=\"SparkSklearnExample\")\n\n    # Generate some synthetic data\n    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)\n    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\n    # Define the estimator and parameter grid\n    estimator = SVC(gamma='auto', random_state=42)\n    param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}\n\n    # Use Spark-backed GridSearchCV (the SparkContext is the first argument)\n    clf = GridSearchCV(sc, estimator, param_grid, cv=3)\n    clf.fit(X_train, y_train)\n\n    print(\"Best parameters found:\", clf.best_params_)\n    print(\"Best cross-validation score:\", clf.best_score_)\n    print(\"Test set accuracy:\", clf.score(X_test, y_test))\n\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\nfinally:\n    # Always release Spark resources, even if the search failed\n    if sc is not None:\n        sc.stop()","lang":"python","description":"This quickstart uses spark-sklearn's GridSearchCV to tune the hyperparameters of a scikit-learn SVC model, distributing the search across a Spark cluster (or running it locally). It covers SparkContext initialization, data preparation, defining the estimator and parameter grid, fitting the model, and retrieving results.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":null}