PySpark Hugging Face Data Source
pyspark-huggingface is a Spark Data Source for seamlessly accessing 🤗 Hugging Face Datasets as Spark DataFrames. It enables streaming datasets from the Hub, applying projection and predicate filters, and saving Spark DataFrames back to Hugging Face as Parquet files with fast, deduplicated uploads. It supports authentication via `huggingface-cli login` or tokens, works with Spark 4 (with auto-import), and backports the same functionality to Spark 3.5, 3.4, and 3.3. The current version is 2.1.0 and the project is actively maintained.
Common errors
- ImportError: cannot import name 'list_datasets' from 'datasets'
  Cause: The `list_datasets` function, along with similar Hub-interaction utilities, was moved from the `datasets` library to the `huggingface_hub` library. This is a common point of confusion when following older examples or tutorials.
  Fix: Change the import statement to `from huggingface_hub import list_datasets`.
- Not enough free disk space to download the file. The expected file size is: XXXX MB. The target location /root/.cache/huggingface/hub only has XXX MB free disk space.
  Cause: Hugging Face libraries default to caching downloaded models and datasets in a location (often `/root/.cache`) that may have limited storage, especially in containerized or shared environments.
  Fix: Before importing Hugging Face libraries, set the `HF_HUB_CACHE` environment variable to a directory with ample free space. Example: `import os; os.environ['HF_HUB_CACHE'] = '/path/to/your/large_disk_storage'`.
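The cache relocation above can be sketched as follows; the target path is an assumption, point it at whichever disk has room:

```python
import os
import shutil

# Assumed cache directory with plenty of free space; adjust for your machine.
cache_dir = "/tmp/hf_cache"
os.makedirs(cache_dir, exist_ok=True)

# Must be set *before* any Hugging Face library is imported, otherwise the
# default ~/.cache/huggingface/hub location is used for this process.
os.environ["HF_HUB_CACHE"] = cache_dir

# Optional sanity check: report free space at the chosen location.
free_gb = shutil.disk_usage(cache_dir).free / 1e9
print(f"{cache_dir}: {free_gb:.1f} GB free")
```

Setting the variable in the shell (`export HF_HUB_CACHE=...`) before launching Python works equally well and avoids any import-order concerns.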
- org.apache.spark.api.python.PythonException: Traceback (most recent call last): ... WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable ... WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. ... Lost task 0.0 in stage ... (TID ...) ...
  Cause: This typically occurs when PySpark tasks, particularly those using `datasets.from_spark()` or other operations that read distributed file systems such as HDFS (via PyArrow), cannot access the native Hadoop libraries or the configured storage in a YARN or similar cluster environment.
  Fix: Ensure the cluster's `HADOOP_CONF_DIR` is set correctly and that `libhadoop.so` (and other native libraries) are on the `LD_LIBRARY_PATH` of the Spark executor nodes. Verify that PyArrow and Spark can reach HDFS outside of the `pyspark-huggingface` context, and check for Kerberos authentication issues if applicable.
Warnings
- gotcha When using `pyspark-huggingface` with PySpark 3.x (versions 3.3, 3.4, 3.5), you *must* explicitly `import pyspark_huggingface` in your code to enable the 'huggingface' data source format. This is not needed for PySpark 4+ as it's auto-imported.
- gotcha To read private/gated Hugging Face datasets or to write Spark DataFrames to the Hugging Face Hub, you need to authenticate, either via `huggingface-cli login` or by passing a token. Not providing a valid token will lead to access errors.
- gotcha For optimal performance, especially with large Parquet datasets, apply filters and select columns during the `spark.read.format('huggingface').option(...)` stage. This leverages Parquet metadata to skip unnecessary data, reducing I/O and processing.
- gotcha Downloading large models or datasets can quickly exhaust disk space if the default Hugging Face cache directory (often in `/root/.cache/huggingface/hub`) is on a small root partition.
- gotcha When uploading very large datasets with many shards to the Hugging Face Hub, you might encounter `HfHubHTTPError: 429 Client Error: Too Many Requests` due to hourly quotas.
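The rate-limit gotcha above can be mitigated with client-side backoff between upload attempts. A minimal sketch in plain Python; `RuntimeError` stands in for `HfHubHTTPError`, and the delays are illustrative:

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:  # stand-in for HfHubHTTPError (429) in real code
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Demo: a callable that fails twice with a fake 429, then succeeds.
calls = {"n": 0}
def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Client Error: Too Many Requests")
    return "uploaded"

result = with_retries(flaky_upload, base_delay=0.01)
print(result, "after", calls["n"], "attempts")
```

For sustained large uploads, reducing the number of output shards (e.g. via `repartition`) also lowers the request count against the hourly quota.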
Install
pip install pyspark-huggingface
Imports
- pyspark_huggingface: no classes are typically imported directly for DataSource usage; importing the package registers the 'huggingface' data source.
  import pyspark_huggingface
Quickstart
from pyspark.sql import SparkSession
import os
# Initialize Spark Session
spark = SparkSession.builder \
.appName("HuggingFaceSpark") \
.getOrCreate()
# For Spark 3.x, explicitly import to enable the data source:
# import pyspark_huggingface
# Read a public dataset from Hugging Face
# Set the HF_TOKEN environment variable if accessing private/gated datasets
hf_token = os.environ.get('HF_TOKEN', '')  # empty string is fine for public datasets
print("Reading stanfordnlp/imdb dataset...")
df = spark.read \
.format("huggingface") \
.option("token", hf_token) \
.load("stanfordnlp/imdb")
print("Schema:")
df.printSchema()
print("First 5 rows:")
df.show(5, truncate=False)
# Example of saving a DataFrame to Hugging Face (requires write token and dataset repo name)
# try:
# print("Saving a sample DataFrame to Hugging Face...")
# sample_data = [("hello", "world"), ("spark", "huggingface")]
# sample_df = spark.createDataFrame(sample_data, ["col1", "col2"])
# sample_df.write \
# .format("huggingface") \
# .option("token", hf_token) \
# .mode("overwrite") \
# .save("your_username/your_dataset_name")
# print("DataFrame saved successfully.")
# except Exception as e:
# print(f"Could not save DataFrame to Hugging Face: {e}")
spark.stop()