Hugging Face Datasets

4.6.0 · active · verified Sat Feb 28

Hugging Face library for loading, processing, and sharing datasets for ML. Provides load_dataset() for one-line access to 100k+ public datasets on the Hub, plus local file loading (CSV, JSON, Parquet, Arrow, audio, image, etc.). Built on Apache Arrow for memory-efficient, zero-copy data access. Package name on PyPI is 'datasets' (not 'huggingface-datasets'). Import name is also 'datasets'. CRITICAL: datasets 4.0 (July 2025) removed dataset loading scripts and trust_remote_code entirely. Many older community datasets that rely on .py loading scripts now fail with datasets>=4.

Warnings

datasets>=4.0 no longer supports dataset loading scripts or trust_remote_code. Datasets that depend on a .py loading script must be converted to a standard format (e.g. Parquet) or used with datasets<4.

Install

pip install datasets

Imports

from datasets import load_dataset, Dataset

Quickstart

load_dataset() downloads and caches to HF_HOME. Streaming mode avoids full download for large datasets. map() with batched=True is significantly faster for large datasets. set_format() enables framework-specific tensor output without copying data.

from datasets import load_dataset

# Load public dataset from Hub
ds = load_dataset("rajpurkar/squad")  # returns DatasetDict
print(ds)  # DatasetDict with 'train' and 'validation' splits
print(ds['train'][0])  # first example

# Streaming mode (no full download)
streaming_ds = load_dataset("rajpurkar/squad", split="train", streaming=True)
for example in streaming_ds.take(3):
    print(example['question'])

# Load local files
local_ds = load_dataset("json", data_files="./my_data.jsonl", split="train")

# map() for preprocessing
def tokenize(example):
    return {"tokens": example["text"].split()}

processed = ds['train'].map(tokenize, batched=False)

# Convert columns to PyTorch tensors (assumes the dataset already has
# tokenized numeric columns like 'input_ids'/'attention_mask'; raw SQuAD does not)
ds['train'].set_format(type='torch', columns=['input_ids', 'attention_mask'])

# Create dataset from dict
from datasets import Dataset
custom_ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})
