PySpark Hugging Face Data Source

2.1.0 · active · verified Thu Apr 16

pyspark-huggingface is a Spark Data Source for seamlessly accessing 🤗 Hugging Face Datasets as Spark DataFrames. It enables streaming datasets from the Hub, applying projection and predicate filters, and saving Spark DataFrames back to Hugging Face as Parquet files with fast, deduplicated uploads. It supports authentication via `huggingface-cli login` or tokens, and is compatible with Spark 4 (with auto-import) as well as Spark 3.5, 3.4, and 3.3 via a backport of the data source API. The current version is 2.1.0 and it is actively maintained.
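Authentication can be set up ahead of time using the `huggingface-cli login` flow mentioned above. A minimal sketch (the `HF_TOKEN` environment variable follows Hugging Face's standard convention; the placeholder token must be replaced with a real one from your account settings):

```shell
# One-time interactive login; the token is cached locally for later sessions
huggingface-cli login

# Alternatively, export a token for the current shell session
# (replace the placeholder with a token from https://huggingface.co/settings/tokens)
export HF_TOKEN=your_token_here
```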

Install

pip install pyspark_huggingface

Imports

import pyspark_huggingface  # needed on Spark 3.x only; Spark 4 auto-imports the data source
Quickstart

This quickstart initializes a Spark session and reads a public Hugging Face dataset (stanfordnlp/imdb) into a PySpark DataFrame. It also shows where to configure authentication for private datasets or write operations. On Spark 3.x, an explicit `import pyspark_huggingface` is required to register the data source; Spark 4 auto-imports it.

from pyspark.sql import SparkSession
import os

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("HuggingFaceSpark") \
    .getOrCreate()

# For Spark 3.x, explicitly import to enable the data source:
# import pyspark_huggingface 

# Read a public dataset from Hugging Face.
# For private or gated datasets, set the HF_TOKEN environment variable
# (or log in once with `huggingface-cli login`).
hf_token = os.environ.get("HF_TOKEN", "")

print("Reading stanfordnlp/imdb dataset...")
reader = spark.read.format("huggingface")
if hf_token:
    # Only pass the token option when one is actually set
    reader = reader.option("token", hf_token)
df = reader.load("stanfordnlp/imdb")

print("Schema:")
df.printSchema()

print("First 5 rows:")
df.show(5, truncate=False)

# Example of saving a DataFrame to Hugging Face (requires write token and dataset repo name)
# try:
#     print("Saving a sample DataFrame to Hugging Face...")
#     sample_data = [("hello", "world"), ("spark", "huggingface")]
#     sample_df = spark.createDataFrame(sample_data, ["col1", "col2"])
#     sample_df.write \
#         .format("huggingface") \
#         .option("token", hf_token) \
#         .mode("overwrite") \
#         .save("your_username/your_dataset_name")
#     print("DataFrame saved successfully.")
# except Exception as e:
#     print(f"Could not save DataFrame to Hugging Face: {e}")

spark.stop()
