Hugging Face Datasets
Hugging Face library for loading, processing, and sharing datasets for ML. Provides load_dataset() for one-line access to 100k+ public datasets on the Hub, plus local file loading (CSV, JSON, Parquet, Arrow, audio, image, etc.). Built on Apache Arrow for memory-efficient, zero-copy data access. Package name on PyPI is 'datasets' (not 'huggingface-datasets'). Import name is also 'datasets'. CRITICAL: datasets 4.0 (July 2025) removed dataset loading scripts and trust_remote_code entirely. Many older community datasets relying on .py loading scripts now fail with datasets>=4.
Warnings
- breaking datasets 4.0 (July 2025) removed all support for dataset loading scripts (.py files) and the trust_remote_code parameter. Datasets that relied on custom .py loaders now raise: 'RuntimeError: Dataset scripts are no longer supported, but found X.py'. This breaks many community datasets (hotpotqa, common_voice, superb, gaia-benchmark, etc.).
- breaking trust_remote_code parameter is entirely removed in datasets 4.0. Passing it raises an error rather than being silently ignored. Code with trust_remote_code=True will break on import or call.
- breaking pyarrow version constraints are strict. datasets pins to specific pyarrow ranges. In environments with multiple packages requiring pyarrow, version conflicts cause ImportError or silent data corruption. datasets and pyarrow must be upgraded together.
- gotcha map() with num_proc>1 (multiprocessing) uses dill for serialization. Lambda functions and closures that reference non-serializable objects (open file handles, locks, etc.) will silently fail or hang. No clear error is raised.
- gotcha load_dataset() caches datasets to disk by default in HF_HOME. Re-running always returns the cached version. In CI or when dataset content changes on the Hub, stale cached versions are silently returned.
- gotcha Gated datasets (some Common Voice, medical, legal datasets) require authentication. load_dataset() raises a 401 or confusing FileNotFoundError if HF_TOKEN is not set or the license has not been accepted.
- gotcha Package name on PyPI is 'datasets' — not 'huggingface-datasets'. pip install huggingface-datasets installs an old, unrelated stub package. Import is also 'from datasets import ...' not 'from huggingface_datasets import ...'.
Install
- pip install datasets
- pip install 'datasets[audio]'
- pip install 'datasets[vision]'
- pip install 'datasets[torch]'
- pip install 'datasets<4' (pin below 4.0 if you still need loading scripts / trust_remote_code)
Imports
- load_dataset
from datasets import load_dataset
- Dataset / DatasetDict
from datasets import Dataset, DatasetDict
- Audio / Image features
from datasets import Audio, Image
Quickstart
from datasets import load_dataset
# Load public dataset from Hub
ds = load_dataset("rajpurkar/squad") # returns DatasetDict
print(ds) # DatasetDict with 'train' and 'validation' splits
print(ds['train'][0]) # first example
# Streaming mode (no full download)
streaming_ds = load_dataset("rajpurkar/squad", split="train", streaming=True)
for example in streaming_ds.take(3):
    print(example['question'])
# Load local files
local_ds = load_dataset("json", data_files="./my_data.jsonl", split="train")
# map() for preprocessing (SQuAD rows have 'question'/'context'/'answers', not 'text')
def tokenize(example):
    return {"tokens": example["question"].split()}
processed = ds['train'].map(tokenize, batched=False)
# Convert columns to PyTorch tensors; the named columns must already exist,
# e.g. produced by a transformers tokenizer in a prior map() step:
# ds['train'].set_format(type='torch', columns=['input_ids', 'attention_mask'])
# Create dataset from dict
from datasets import Dataset
custom_ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})