Hugging Face Datasets

4.6.0 | verified Mon May 11 | auth: no | python install: stale | quickstart: stale

Hugging Face library for loading, processing, and sharing datasets for ML. Provides load_dataset() for one-line access to 100k+ public datasets on the Hub, plus local file loading (CSV, JSON, Parquet, Arrow, audio, image, etc.). Built on Apache Arrow for memory-efficient, zero-copy data access. Package name on PyPI is 'datasets' (not 'huggingface-datasets'); the import name is also 'datasets'. CRITICAL: datasets 4.0 (July 2025) removed dataset loading scripts and trust_remote_code entirely, so many older community datasets that rely on .py loading scripts fail with datasets>=4.

pip install datasets
error Exception occurred: Dataset scripts are no longer supported.
cause The `datasets` library version 4.0.0 and above removed support for loading datasets via Python scripts and the `trust_remote_code` argument, which many older community datasets relied on.
fix
Downgrade datasets to a version older than 4.0.0 (pip install 'datasets<4.0.0') or ask the dataset author to convert the dataset to a standard format like Parquet.
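If a Parquet copy of the dataset already exists on the Hub, it can be loaded directly without a script. A minimal sketch; the repo id and file pattern are placeholders and depend on how the dataset was converted:

from datasets import load_dataset

# Hypothetical repo that ships Parquet files; adjust the hf:// path/glob to the real layout
ds = load_dataset(
    "parquet",
    data_files={"train": "hf://datasets/some-user/some-dataset/data/train-*.parquet"},
    split="train",
)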
error FileNotFoundError: Couldn't find a dataset script at [path] or any data file in the same directory. Couldn't find '[dataset_name]' on the Hugging Face Hub either.
cause The `load_dataset` function cannot locate the specified dataset, either because the path to local files is incorrect, the dataset name on the Hugging Face Hub is misspelled, the dataset is private and the user is not logged in, or there's an issue with the cache.
fix
Double-check the dataset name/path, ensure local files exist at the specified location, log in to Hugging Face (huggingface-cli login) if accessing private datasets, or clear the datasets cache and try again. For local files, explicitly specify the format (e.g., load_dataset('csv', data_files='my_data.csv')).
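For illustration, an explicit-builder call for local CSV and a token-authenticated call for a private dataset; the file path and repo id are placeholders:

from datasets import load_dataset

# Explicit builder skips name/script lookup entirely for local files
local_ds = load_dataset("csv", data_files={"train": "data/train.csv"}, split="train")

# Private dataset: log in first (huggingface-cli login) or pass a token
private_ds = load_dataset("your-org/your-private-dataset", token=True)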
error ValueError: Couldn't cast [schema details] because column names don't match.
cause The schema inferred by `load_dataset` from the data files does not match an expected schema, or when loading CSV/JSON files, the library struggles to automatically determine column structure, especially with complex delimiters or malformed files. An 'Invalid pattern' `ValueError` can also occur with incorrect glob patterns for `data_files`.
fix
For casting errors, explicitly define the features argument in load_dataset with the correct schema, or for CSV/JSON, manually provide column_names to load_dataset. For 'Invalid pattern', correct the glob pattern or use simpler file paths.
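A sketch of the explicit-schema route, assuming a headerless two-column CSV; the file name and label names are invented for the example:

from datasets import load_dataset, Features, Value, ClassLabel

features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["neg", "pos"]),
})
ds = load_dataset(
    "csv",
    data_files="reviews.csv",        # placeholder path
    column_names=["text", "label"],  # only pass this if the file has no header row
    features=features,
    split="train",
)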
error ModuleNotFoundError: No module named 'datasets'
cause The `datasets` library is not installed in the current Python environment, or there is a naming conflict (e.g., a local file named `datasets.py` shadows the installed library).
fix
Install the library using pip install datasets. If already installed, check for shadowing files in your project directory and rename them. Ensure your virtual environment is activated if applicable.
error ModuleNotFoundError: No module named 'datasets.tasks'
cause Older versions of `datasets` exposed a `datasets.tasks` submodule (task templates such as TextClassification), which has since been removed. The error typically appears when custom scripts or older examples import from this now non-existent submodule.
fix
Remove or update the problematic import statement (e.g., from datasets.tasks import TextClassification). The datasets.tasks module is generally not meant for direct user import in recent versions; task-related information is usually handled differently within the library's features. Upgrade datasets and huggingface_hub to their latest versions.
breaking datasets 4.0 (July 2025) removed all support for dataset loading scripts (.py files) and the trust_remote_code parameter. Datasets that relied on custom .py loaders now raise: 'RuntimeError: Dataset scripts are no longer supported, but found X.py'. This breaks many community datasets (hotpotqa, common_voice, superb, gaia-benchmark, etc.).
fix Either: (1) pin datasets<4 as a workaround, or (2) find a Parquet-backed version of the dataset on the Hub, or (3) ask the dataset author to migrate to standard Parquet format. Passing trust_remote_code=True no longer silences the error — it raises a separate error.
breaking trust_remote_code parameter is entirely removed in datasets 4.0. Passing it raises an error rather than being silently ignored. Code with trust_remote_code=True will break on import or call.
fix Remove trust_remote_code from all load_dataset() calls when using datasets>=4.
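Before/after sketch (the repo id is a placeholder; on datasets>=4 the call succeeds only if the repo ships standard data files rather than a loading script):

from datasets import load_dataset

# old call, now fails: load_dataset("org/script-based-dataset", trust_remote_code=True)
ds = load_dataset("org/script-based-dataset")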
breaking pyarrow version constraints are strict. datasets pins to specific pyarrow ranges. In environments with multiple packages requiring pyarrow, version conflicts cause ImportError or silent data corruption. datasets and pyarrow must be upgraded together.
fix Always upgrade together: pip install -U datasets pyarrow. Check compatibility in the datasets changelog for your target version.
gotcha map() with num_proc>1 (multiprocessing) uses dill for serialization. Lambda functions and closures that reference non-serializable objects (open file handles, locks, etc.) will silently fail or hang. No clear error is raised.
fix Use named functions instead of lambdas for map(). Avoid referencing non-serializable objects inside map functions. Use batched=True for large datasets to avoid per-example overhead.
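A sketch of the recommended pattern: a named, module-level function with batched=True and num_proc (the dataset and the derived column are only examples):

from datasets import load_dataset

def add_question_length(batch):
    # module-level named function: serializable for num_proc > 1
    return {"question_len": [len(q.split()) for q in batch["question"]]}

ds = load_dataset("rajpurkar/squad", split="train")
ds = ds.map(add_question_length, batched=True, num_proc=4)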
gotcha load_dataset() caches datasets to disk by default in HF_HOME. Re-running always returns the cached version. In CI or when dataset content changes on the Hub, stale cached versions are silently returned.
fix Pass download_mode='force_redownload' to bypass cache. Or delete the cached dataset from ~/.cache/huggingface/datasets/.
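For example, forcing a fresh download in CI (the dataset name is illustrative):

from datasets import load_dataset

ds = load_dataset("rajpurkar/squad", download_mode="force_redownload")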
gotcha Gated datasets (some Common Voice, medical, legal datasets) require authentication. load_dataset() raises a 401 or confusing FileNotFoundError if HF_TOKEN is not set or the license has not been accepted.
fix Set HF_TOKEN env var and accept the dataset license on huggingface.co before calling load_dataset().
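A minimal sketch, assuming the license has already been accepted on huggingface.co and using a placeholder repo id for the gated dataset:

import os
from datasets import load_dataset

ds = load_dataset(
    "your-org/gated-dataset",          # placeholder; use the real gated repo id
    split="train",
    token=os.environ.get("HF_TOKEN"),  # or run `huggingface-cli login` beforehand
)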
gotcha Package name on PyPI is 'datasets' — not 'huggingface-datasets'. pip install huggingface-datasets installs an old, unrelated stub package. Import is also 'from datasets import ...' not 'from huggingface_datasets import ...'.
fix pip install datasets. from datasets import load_dataset.
breaking Installing `datasets` with certain extras (e.g. `[audio]`, `[vision]`, `[text]`) can pull in dependencies (such as `scikit-learn`, `numba`, `soxr`, and the C libraries behind `soundfile`) whose builds require a C/C++ compiler (e.g. gcc) and Python development headers. In minimal environments such as Alpine Linux or slim Docker base images without build tools, this surfaces as build failures like 'Unknown compiler(s)' or 'subprocess-exited-with-error'.
fix Ensure C/C++ build tools and Python development headers are installed in your environment before installing `datasets` with extras. For Alpine Linux, run `apk add build-base python3-dev`. For Debian/Ubuntu-based images, run `apt-get update && apt-get install -y build-essential python3-dev`.
pip install datasets[audio]
pip install datasets[vision]
pip install datasets[torch]
pip install 'datasets<4'
python | os / libc | variant | status | wheel | install | import | disk
3.10 alpine (musl) 'datasets<4' - - 3.46s 390.8M
3.10 alpine (musl) datasets - - 3.45s 391.3M
3.10 alpine (musl) audio - - - -
3.10 alpine (musl) torch - - - -
3.10 alpine (musl) vision - - 3.49s 410.8M
3.10 slim (glibc) 'datasets<4' - - 9.10s 363M
3.10 slim (glibc) datasets - - 9.27s 363M
3.10 slim (glibc) audio - - - -
3.10 slim (glibc) torch - - - -
3.10 slim (glibc) vision - - 7.99s 383M
3.11 alpine (musl) 'datasets<4' - - 4.69s 411.6M
3.11 alpine (musl) datasets - - 4.79s 412.2M
3.11 alpine (musl) audio - - - -
3.11 alpine (musl) torch - - - -
3.11 alpine (musl) vision - - 4.76s 432.3M
3.11 slim (glibc) 'datasets<4' - - 12.85s 383M
3.11 slim (glibc) datasets - - 12.96s 384M
3.11 slim (glibc) audio - - - -
3.11 slim (glibc) torch - - - -
3.11 slim (glibc) vision - - 10.34s 404M
3.12 alpine (musl) 'datasets<4' - - 4.31s 395.4M
3.12 alpine (musl) datasets - - 4.52s 396.0M
3.12 alpine (musl) audio - - - -
3.12 alpine (musl) torch - - - -
3.12 alpine (musl) vision - - 4.38s 416.0M
3.12 slim (glibc) 'datasets<4' - - 14.07s 367M
3.12 slim (glibc) datasets - - 13.79s 367M
3.12 slim (glibc) audio - - - -
3.12 slim (glibc) torch - - - -
3.12 slim (glibc) vision - - 11.13s 388M
3.13 alpine (musl) 'datasets<4' - - 4.16s 394.3M
3.13 alpine (musl) datasets - - 4.07s 394.9M
3.13 alpine (musl) audio - - - -
3.13 alpine (musl) torch - - - -
3.13 alpine (musl) vision - - 8.91s 414.9M
3.13 slim (glibc) 'datasets<4' - - 12.79s 365M
3.13 slim (glibc) datasets - - 13.39s 366M
3.13 slim (glibc) audio - - - -
3.13 slim (glibc) torch - - - -
3.13 slim (glibc) vision - - 10.14s 387M
3.9 alpine (musl) 'datasets<4' - - 3.51s 384.9M
3.9 alpine (musl) datasets - - 3.36s 385.3M
3.9 alpine (musl) audio - - - -
3.9 alpine (musl) torch - - - -
3.9 alpine (musl) vision - - 3.38s 402.7M
3.9 slim (glibc) 'datasets<4' - - 8.12s 365M
3.9 slim (glibc) datasets - - 2.89s 365M
3.9 slim (glibc) audio - - - -
3.9 slim (glibc) torch - - - -
3.9 slim (glibc) vision - - 3.03s 383M

load_dataset() downloads and caches to HF_HOME. Streaming mode avoids full download for large datasets. map() with batched=True is significantly faster for large datasets. set_format() enables framework-specific tensor output without copying data.

from datasets import load_dataset

# Load public dataset from Hub
ds = load_dataset("rajpurkar/squad")  # returns DatasetDict
print(ds)  # DatasetDict with 'train' and 'validation' splits
print(ds['train'][0])  # first example

# Streaming mode (no full download)
streaming_ds = load_dataset("rajpurkar/squad", split="train", streaming=True)
for example in streaming_ds.take(3):
    print(example['question'])

# Load local files
local_ds = load_dataset("json", data_files="./my_data.jsonl", split="train")

# map() for per-example preprocessing (SQuAD has a 'question' column, not 'text')
def tokenize(example):
    return {"tokens": example["question"].split()}

processed = ds['train'].map(tokenize, batched=False)

# set_format() returns framework tensors without copying data.
# The named columns must already exist, e.g. after tokenization:
# tokenized_ds.set_format(type='torch', columns=['input_ids', 'attention_mask'])

# Create dataset from dict
from datasets import Dataset
custom_ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})
custom_ds.set_format(type='torch', columns=['label'])
print(custom_ds[0])  # {'label': tensor(0)}