PySpark Hugging Face Data Source
pyspark-huggingface is a Spark Data Source for accessing 🤗 Hugging Face Datasets as Spark DataFrames. It can stream datasets from the Hub, apply projection and predicate filters, and save Spark DataFrames back to Hugging Face as Parquet files with fast, deduplicated uploads. It supports authentication via `huggingface-cli login` or tokens, and is compatible with Spark 4 (with auto-import), with backported functionality for Spark 3.5, 3.4, and 3.3. The current version is 2.1.0 and it is actively maintained.
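A minimal sketch of the read and write paths described above. It assumes the data source is registered under the format name `huggingface` and accepts `columns` and `filters` as string-encoded options; these option names, the helper functions, and the repository IDs are illustrative assumptions, so check them against the package documentation before relying on them.

```python
# Hypothetical helpers; the "huggingface" format name and the "columns" /
# "filters" option names are assumptions about the package's options.

def read_hf_dataset(spark, repo_id, columns=None, filters=None):
    """Load a Hub dataset as a Spark DataFrame.

    columns: string-encoded list of column names (projection),
             e.g. '["text", "label"]'
    filters: string-encoded list of (column, op, value) tuples (predicate),
             e.g. '[("label", "==", 0)]'
    """
    reader = spark.read.format("huggingface")
    if columns is not None:
        reader = reader.option("columns", columns)
    if filters is not None:
        reader = reader.option("filters", filters)
    return reader.load(repo_id)


def write_hf_dataset(df, repo_id):
    """Save a Spark DataFrame to a Hub dataset repository as Parquet."""
    df.write.format("huggingface").save(repo_id)


# Example usage (needs a running Spark session, network access, and, for
# writing, prior authentication such as `huggingface-cli login`):
#
#   from pyspark.sql import SparkSession
#   import pyspark_huggingface  # explicit import for Spark 3.x; Spark 4 auto-imports
#
#   spark = SparkSession.builder.appName("hf-demo").getOrCreate()
#   df = read_hf_dataset(
#       spark,
#       "stanfordnlp/imdb",               # example dataset ID
#       columns='["text", "label"]',
#       filters='[("label", "==", 0)]',
#   )
#   write_hf_dataset(df, "username/my_dataset")  # placeholder repository ID
```

Keeping the Spark session as a parameter leaves session creation and authentication to the caller, which is where the Spark 3.x versus Spark 4 import difference would be handled.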
Resources
API endpoints
full doc /v1/registry/pyspark-huggingface
compatibility /v1/registry/pyspark-huggingface/compatibility