{"library":"pyspark-huggingface","title":"PySpark Hugging Face Data Source","description":"pyspark-huggingface is a Spark Data Source for accessing 🤗 Hugging Face Datasets as Spark DataFrames. It can stream datasets from the Hub, apply projection and predicate filters, and save Spark DataFrames back to Hugging Face as Parquet files with fast, deduplicated uploads. It supports authentication via `huggingface-cli login` or tokens. It works with Spark 4 (where the data source is auto-imported) and, through a backport, with Spark 3.5, 3.4, and 3.3. The current version is 2.1.0 and it is actively maintained.","language":"python","status":"active","last_verified":"Sun May 17","install":{"commands":["pip install pyspark-huggingface"],"cli":null},"imports":["import pyspark_huggingface"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"from pyspark.sql import SparkSession\nimport os\n\n# Importing the package enables the data source on Spark 3.x;\n# Spark 4 auto-imports it, so the import is optional there.\nimport pyspark_huggingface\n\n# Initialize Spark Session\nspark = SparkSession.builder \\\n    .appName(\"HuggingFaceSpark\") \\\n    .getOrCreate()\n\n# Read a public dataset from Hugging Face\n# A token is only needed for private/gated datasets; set it in the HF_TOKEN environment variable\nhf_token = os.environ.get('HF_TOKEN', '')\n\nprint(\"Reading stanfordnlp/imdb dataset...\")\ndf = spark.read \\\n    .format(\"huggingface\") \\\n    .option(\"token\", hf_token) \\\n    .load(\"stanfordnlp/imdb\")\n\nprint(\"Schema:\")\ndf.printSchema()\n\nprint(\"First 5 rows:\")\ndf.show(5, truncate=False)\n\n# Example of saving a DataFrame to Hugging Face (requires write token and dataset repo name)\n# try:\n#     print(\"Saving a sample DataFrame to Hugging Face...\")\n#     sample_data = [(\"hello\", \"world\"), (\"spark\", \"huggingface\")]\n#     sample_df = spark.createDataFrame(sample_data, [\"col1\", \"col2\"])\n#     sample_df.write \\\n#         
.format(\"huggingface\") \\\n#         .option(\"token\", hf_token) \\\n#         .mode(\"overwrite\") \\\n#         .save(\"your_username/your_dataset_name\")\n#     print(\"DataFrame saved successfully.\")\n# except Exception as e:\n#     print(f\"Could not save DataFrame to Hugging Face: {e}\")\n\nspark.stop()","lang":"python","description":"This quickstart initializes a Spark session and reads a public Hugging Face dataset (stanfordnlp/imdb) into a PySpark DataFrame. It also shows where to supply a token for private or gated datasets and for write operations. On Spark 3.x, an explicit `import pyspark_huggingface` is needed to enable the data source; Spark 4 imports it automatically.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":{"tag":null,"tag_description":null,"last_tested":"2026-05-17","installed_version":"2.1.0","pypi_latest":"2.1.0","is_stale":false,"summary":{"python_range":"3.9–3.13","success_rate":100,"avg_install_s":16.5,"avg_import_s":null,"wheel_type":"wheel"},"results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"pyspark-huggingface","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"397.6M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"pyspark-huggingface","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":17.3,"import_time_s":null,"mem_mb":null,"disk_size":"368M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"pyspark-huggingface","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"418.7M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim 
(glibc)","variant":"pyspark-huggingface","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":16,"import_time_s":null,"mem_mb":null,"disk_size":"389M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"pyspark-huggingface","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"402.4M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"pyspark-huggingface","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":14.8,"import_time_s":null,"mem_mb":null,"disk_size":"372M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"pyspark-huggingface","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"401.5M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"pyspark-huggingface","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":14.6,"import_time_s":null,"mem_mb":null,"disk_size":"371M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"pyspark-huggingface","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"386.2M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"pyspark-huggingface","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":20,"import_time_s":null,"mem_mb":null,"disk_size":"366M"}]}}