PySpark Hugging Face Data Source
pyspark-huggingface is a Spark Data Source for seamlessly accessing 🤗 Hugging Face Datasets as Spark DataFrames. It enables streaming datasets from the Hub, applying projection and predicate filters, and saving Spark DataFrames back to Hugging Face as Parquet files with fast, deduplicated uploads. It supports authentication via `huggingface-cli login` or tokens, works with Spark 4 (with auto-import), and backports the same functionality to Spark 3.5, 3.4, and 3.3. The current version is 2.1.0 and the project is actively maintained.
Common errors
- ImportError: cannot import name 'list_datasets' from 'datasets'
  Cause: The `list_datasets` function, along with similar Hub-interaction utilities, was moved from the `datasets` library to the `huggingface_hub` library. This is a common point of confusion when following older examples or tutorials.
  Fix: Change the import statement to `from huggingface_hub import list_datasets`.
- Not enough free disk space to download the file. The expected file size is: XXXX MB. The target location /root/.cache/huggingface/hub only has XXX MB free disk space.
  Cause: Hugging Face libraries default to caching downloaded models and datasets in a location (often `/root/.cache`) that may have limited storage, especially in containerized or shared environments.
  Fix: Before importing Hugging Face libraries, set the `HF_HUB_CACHE` environment variable to a directory with ample free space. Example: `import os; os.environ['HF_HUB_CACHE'] = '/path/to/your/large_disk_storage'`.
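The cache relocation above can be sketched as follows; the target path is an assumption, point it at whichever disk has room:

```python
import os
import shutil

# Assumed cache directory with plenty of free space; adjust for your machine.
cache_dir = "/tmp/hf_cache"
os.makedirs(cache_dir, exist_ok=True)

# Must be set *before* any Hugging Face library is imported, otherwise the
# default ~/.cache/huggingface/hub location is used for this process.
os.environ["HF_HUB_CACHE"] = cache_dir

# Optional sanity check: report free space at the chosen location.
free_gb = shutil.disk_usage(cache_dir).free / 1e9
print(f"{cache_dir}: {free_gb:.1f} GB free")
```

Setting the variable in the shell (`export HF_HUB_CACHE=...`) before launching Python works equally well and avoids any import-order concerns.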
- org.apache.spark.api.python.PythonException: Traceback (most recent call last): ... WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable ... WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. ... Lost task 0.0 in stage ... (TID ...) ...
  Cause: This typically occurs when PySpark tasks, particularly those using `datasets.from_spark()` or other operations that read distributed file systems such as HDFS (via PyArrow), cannot access the native Hadoop libraries or the configured storage in a YARN or similar cluster environment.
  Fix: Ensure the cluster's `HADOOP_CONF_DIR` is set correctly and that `libhadoop.so` (and other native libraries) are on the `LD_LIBRARY_PATH` of the Spark executor nodes. Verify that PyArrow and Spark can reach HDFS outside of the `pyspark-huggingface` context, and check for Kerberos authentication issues if applicable.
Warnings
- gotcha When using `pyspark-huggingface` with PySpark 3.x (versions 3.3, 3.4, 3.5), you *must* explicitly `import pyspark_huggingface` in your code to enable the 'huggingface' data source format. This is not needed for PySpark 4+ as it's auto-imported.
- gotcha To read private/gated Hugging Face datasets or to write Spark DataFrames to the Hugging Face Hub, you need to authenticate, either via `huggingface-cli login` or by passing a token. Not providing a valid token will lead to access errors.
- gotcha For optimal performance, especially with large Parquet datasets, apply filters and select columns during the `spark.read.format('huggingface').option(...)` stage. This leverages Parquet metadata to skip unnecessary data, reducing I/O and processing.
- gotcha Downloading large models or datasets can quickly exhaust disk space if the default Hugging Face cache directory (often in `/root/.cache/huggingface/hub`) is on a small root partition.
- gotcha When uploading very large datasets with many shards to the Hugging Face Hub, you might encounter `HfHubHTTPError: 429 Client Error: Too Many Requests` due to hourly quotas.
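The rate-limit gotcha above can be mitigated with client-side backoff between upload attempts. A minimal sketch in plain Python; `RuntimeError` stands in for `HfHubHTTPError`, and the delays are illustrative:

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:  # stand-in for HfHubHTTPError (429) in real code
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Demo: a callable that fails twice with a fake 429, then succeeds.
calls = {"n": 0}
def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Client Error: Too Many Requests")
    return "uploaded"

result = with_retries(flaky_upload, base_delay=0.01)
print(result, "after", calls["n"], "attempts")
```

For sustained large uploads, reducing the number of output shards (e.g. via `repartition`) also lowers the request count against the hourly quota.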
Install
pip install pyspark-huggingface
Imports
- pyspark_huggingface: no classes are typically imported directly for DataSource usage; importing the package registers the 'huggingface' data source.
  import pyspark_huggingface
Quickstart
from pyspark.sql import SparkSession
import os
# Initialize Spark Session
spark = SparkSession.builder \
.appName("HuggingFaceSpark") \
.getOrCreate()
# For Spark 3.x, explicitly import to enable the data source:
# import pyspark_huggingface
# Read a public dataset from Hugging Face
# Set the HF_TOKEN environment variable if accessing private/gated datasets
hf_token = os.environ.get('HF_TOKEN', '')  # empty string is fine for public datasets
print("Reading stanfordnlp/imdb dataset...")
df = spark.read \
.format("huggingface") \
.option("token", hf_token) \
.load("stanfordnlp/imdb")
print("Schema:")
df.printSchema()
print("First 5 rows:")
df.show(5, truncate=False)
# Example of saving a DataFrame to Hugging Face (requires write token and dataset repo name)
# try:
# print("Saving a sample DataFrame to Hugging Face...")
# sample_data = [("hello", "world"), ("spark", "huggingface")]
# sample_df = spark.createDataFrame(sample_data, ["col1", "col2"])
# sample_df.write \
# .format("huggingface") \
# .option("token", hf_token) \
# .mode("overwrite") \
# .save("your_username/your_dataset_name")
# print("DataFrame saved successfully.")
# except Exception as e:
# print(f"Could not save DataFrame to Hugging Face: {e}")
spark.stop()