{"id":7605,"library":"pyspark-huggingface","title":"PySpark Hugging Face Data Source","description":"pyspark-huggingface is a Spark Data Source for seamlessly accessing 🤗 Hugging Face Datasets as Spark DataFrames. It enables streaming datasets from the Hub, applying projection and predicate filters, and saving Spark DataFrames back to Hugging Face as Parquet files with fast, deduplicated uploads. It supports authentication via `huggingface-cli login` or tokens, and is compatible with Spark 4 (with auto-import) as well as backporting functionality for Spark 3.5, 3.4, and 3.3. The current version is 2.1.0 and it is actively maintained.","status":"active","version":"2.1.0","language":"en","source_language":"en","source_url":"https://github.com/huggingface/pyspark_huggingface","tags":["pyspark","spark","huggingface","datasets","data-source","etl","distributed-computing"],"install":[{"cmd":"pip install pyspark-huggingface","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core dependency for Apache Spark integration and DataFrame operations.","package":"pyspark"},{"reason":"Required for authentication with the Hugging Face Hub and general Hub interactions.","package":"huggingface_hub","optional":false},{"reason":"Used under the hood for efficient reading and writing of Arrow/Parquet data formats.","package":"pyarrow","optional":false}],"imports":[{"note":"For PySpark 3.x, an explicit import of `pyspark_huggingface` is required to enable the 'huggingface' data source format. For PySpark 4+, this is automatically handled upon installation. 
Interaction is primarily via `spark.read.format(\"huggingface\")` and `spark.write.format(\"huggingface\")`.","wrong":"No direct classes are typically imported for DataSource usage.","symbol":"pyspark_huggingface","correct":"import pyspark_huggingface"}],"quickstart":{"code":"from pyspark.sql import SparkSession\nimport os\n\n# Initialize Spark Session\nspark = SparkSession.builder \\\n    .appName(\"HuggingFaceSpark\") \\\n    .getOrCreate()\n\n# For Spark 3.x, explicitly import to enable the data source:\n# import pyspark_huggingface\n\n# Read a public dataset from Hugging Face.\n# Set the HF_TOKEN environment variable if accessing private/gated datasets.\nhf_token = os.environ.get('HF_TOKEN')\n\nprint(\"Reading stanfordnlp/imdb dataset...\")\nreader = spark.read.format(\"huggingface\")\nif hf_token:\n    reader = reader.option(\"token\", hf_token)\ndf = reader.load(\"stanfordnlp/imdb\")\n\nprint(\"Schema:\")\ndf.printSchema()\n\nprint(\"First 5 rows:\")\ndf.show(5, truncate=False)\n\n# Example of saving a DataFrame to Hugging Face (requires a write token and dataset repo name)\n# try:\n#     print(\"Saving a sample DataFrame to Hugging Face...\")\n#     sample_data = [(\"hello\", \"world\"), (\"spark\", \"huggingface\")]\n#     sample_df = spark.createDataFrame(sample_data, [\"col1\", \"col2\"])\n#     sample_df.write \\\n#         .format(\"huggingface\") \\\n#         .option(\"token\", hf_token) \\\n#         .mode(\"overwrite\") \\\n#         .save(\"your_username/your_dataset_name\")\n#     print(\"DataFrame saved successfully.\")\n# except Exception as e:\n#     print(f\"Could not save DataFrame to Hugging Face: {e}\")\n\nspark.stop()","lang":"python","description":"This quickstart initializes a Spark session and demonstrates how to read a public Hugging Face dataset (stanfordnlp/imdb) into a PySpark DataFrame. It also shows where to configure authentication for private datasets or write operations.
For Spark 3.x users, an explicit `import pyspark_huggingface` is required."},"warnings":[{"fix":"Add `import pyspark_huggingface` at the beginning of your Spark application when using PySpark 3.x.","message":"When using `pyspark-huggingface` with PySpark 3.x (versions 3.3, 3.4, 3.5), you *must* explicitly `import pyspark_huggingface` in your code to enable the 'huggingface' data source format. This is not needed for PySpark 4+, where it is auto-imported.","severity":"gotcha","affected_versions":"<=3.5"},{"fix":"Authenticate using `huggingface-cli login` in your environment, pass your Hugging Face token via `.option(\"token\", \"hf_xxxx\")` on the `spark.read` or `spark.write` call, or set the `HF_TOKEN` environment variable.","message":"To read private/gated Hugging Face datasets or to write Spark DataFrames to the Hugging Face Hub, you need to authenticate. Not providing a valid token will lead to access errors.","severity":"gotcha","affected_versions":"All"},{"fix":"Use `.option(\"filters\", '[(\"column_name\", \">\", value)]')` and `.option(\"columns\", '[\"col1\", \"col2\"]')` when loading the dataset. Example: `.option(\"filters\", '[(\"language_score\", \">\", 0.99)]').option(\"columns\", '[\"text\", \"language_score\"]')`.","message":"For optimal performance, especially with large Parquet datasets, apply filters and select columns during the `spark.read.format('huggingface').option(...)` stage. This leverages Parquet metadata to skip unnecessary data, reducing I/O and processing.","severity":"gotcha","affected_versions":"All"},{"fix":"Set the `HF_HUB_CACHE` environment variable to a path with sufficient available disk space *before* importing any `transformers` or `datasets` libraries.
For example: `os.environ['HF_HUB_CACHE'] = \"/path/to/large/storage\"`.","message":"Downloading large models or datasets can quickly exhaust disk space if the default Hugging Face cache directory (often in `/root/.cache/huggingface/hub`) is on a small root partition.","severity":"gotcha","affected_versions":"All"},{"fix":"Ensure your `datasets` library (a common underlying dependency) is updated to version 2.15.0 or later, as this version includes improvements to handle such rate limits. `pip install --upgrade datasets`.","message":"When uploading very large datasets with many shards to the Hugging Face Hub, you might encounter `HfHubHTTPError: 429 Client Error: Too Many Requests` due to hourly quotas.","severity":"gotcha","affected_versions":"<2.15.0 of `datasets` library"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Change the import statement to `from huggingface_hub import list_datasets`.","cause":"The `list_datasets` function, and similar Hub-interaction utilities, were moved from the `datasets` library to the `huggingface_hub` library. This is a common confusion when following older examples or tutorials.","error":"ImportError: cannot import name 'list_datasets' from 'datasets'"},{"fix":"Before importing Hugging Face libraries, set the `HF_HUB_CACHE` environment variable to a directory with ample free space. Example: `import os; os.environ['HF_HUB_CACHE'] = '/path/to/your/large_disk_storage'`.","cause":"Hugging Face libraries default to caching downloaded models and datasets in a location (often `/root/.cache`) that might have limited storage, especially in containerized or shared environments.","error":"Not enough free disk space to download the file. The expected file size is: XXXX MB. 
The target location /root/.cache/huggingface/hub only has XXX MB free disk space."},{"fix":"Ensure that your Spark cluster's `HADOOP_CONF_DIR` is correctly set, and that `libhadoop.so` (and other native libraries) are available on the `LD_LIBRARY_PATH` of the Spark executor nodes. Verify that PyArrow and Spark can connect to HDFS outside of the `pyspark-huggingface` context. Check for Kerberos authentication issues if applicable.","cause":"This error typically occurs when PySpark tasks, particularly those involving `datasets.from_spark()` or similar operations that interact with distributed file systems like HDFS (via PyArrow), cannot properly access native Hadoop libraries or configured storage in a YARN or similar cluster environment.","error":"org.apache.spark.api.python.PythonException: Traceback (most recent call last): ... WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable ... WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. ... Lost task 0.0 in stage ... (TID ...) ..."}]}