{"id":15,"library":"datasets","title":"Hugging Face Datasets","description":"HuggingFace library for loading, processing, and sharing datasets for ML. Provides load_dataset() for one-line access to 100k+ public datasets on the Hub, plus local file loading (CSV, JSON, Parquet, Arrow, audio, image, etc.). Built on Apache Arrow for memory-efficient, zero-copy data access. Package name on PyPI is 'datasets' (not 'huggingface-datasets'). Import name is also 'datasets'. CRITICAL: datasets 4.0 (July 2025) removed dataset loading scripts and trust_remote_code entirely. Many older community datasets relying on .py loading scripts now fail with datasets>=4.","status":"active","version":"4.6.0","language":"python","source_language":"en","source_url":"https://github.com/huggingface/datasets","tags":["datasets","huggingface","load-dataset","arrow","parquet","nlp","data-loading","streaming","preprocessing","rag"],"install":[{"cmd":"pip install datasets","lang":"bash","label":"Core install"},{"cmd":"pip install datasets[audio]","lang":"bash","label":"With audio support (librosa, soundfile)"},{"cmd":"pip install datasets[vision]","lang":"bash","label":"With image support (Pillow)"},{"cmd":"pip install datasets[torch]","lang":"bash","label":"With PyTorch tensor format support"},{"cmd":"pip install 'datasets<4'","lang":"bash","label":"Pin to v3.x if you need trust_remote_code or .py loading scripts"}],"dependencies":[{"reason":"Required. All datasets are backed by Arrow tables. Version constraints are strict — pyarrow version mismatches cause import errors.","package":"pyarrow","optional":false},{"reason":"Required. Used for dataset discovery, download, and Hub interactions.","package":"huggingface-hub","optional":false},{"reason":"Required. Used for serializing map() functions for multiprocessing.","package":"dill","optional":false},{"reason":"Required. 
Used for parallel map() processing.","package":"multiprocess","optional":false}],"imports":[{"note":"Package name is 'datasets' on PyPI, not 'huggingface-datasets'. pip install datasets. from datasets import load_dataset.","wrong":"import huggingface-datasets","symbol":"load_dataset","correct":"from datasets import load_dataset"},{"note":"load_dataset() returns DatasetDict (multiple splits) or Dataset (single split). Access splits: ds['train'], ds['test'].","symbol":"Dataset / DatasetDict","correct":"from datasets import Dataset, DatasetDict"},{"note":"Cast columns to Audio() or Image() features to enable automatic decoding. Requires datasets[audio] or datasets[vision] extras.","symbol":"Audio / Image features","correct":"from datasets import Audio, Image"}],"quickstart":{"code":"from datasets import Dataset, load_dataset\n\n# Load public dataset from Hub\nds = load_dataset(\"rajpurkar/squad\")  # returns DatasetDict\nprint(ds)  # DatasetDict with 'train' and 'validation' splits\nprint(ds['train'][0])  # first example\n\n# Streaming mode (no full download)\nstreaming_ds = load_dataset(\"rajpurkar/squad\", split=\"train\", streaming=True)\nfor example in streaming_ds.take(3):\n    print(example['question'])\n\n# Create dataset from dict\ncustom_ds = Dataset.from_dict({\"text\": [\"hello\", \"world\"], \"label\": [0, 1]})\n\n# Load local files (written here first so the example is self-contained)\ncustom_ds.to_json(\"./my_data.jsonl\")\nlocal_ds = load_dataset(\"json\", data_files=\"./my_data.jsonl\", split=\"train\")\n\n# map() for preprocessing; SQuAD has no 'text' column, so split 'question'\ndef tokenize(example):\n    tokens = example[\"question\"].split()\n    return {\"tokens\": tokens, \"num_tokens\": len(tokens)}\n\nprocessed = ds['train'].map(tokenize, batched=False)\n\n# Convert to PyTorch tensors (requires torch: pip install datasets[torch])\nprocessed.set_format(type='torch', columns=['num_tokens'])","lang":"python","description":"load_dataset() downloads and caches to HF_HOME. Streaming mode avoids full download for large datasets. map() with batched=True is significantly faster for large datasets. 
set_format() enables framework-specific tensor output without copying data."},"warnings":[{"fix":"Either: (1) pin datasets<4 as a workaround, or (2) find a Parquet-backed version of the dataset on the Hub, or (3) ask the dataset author to migrate to standard Parquet format. Passing trust_remote_code=True no longer silences the error; it raises a separate error.","message":"datasets 4.0 (July 2025) removed all support for dataset loading scripts (.py files) and the trust_remote_code parameter. Datasets that relied on custom .py loaders now raise: 'RuntimeError: Dataset scripts are no longer supported, but found X.py'. This breaks many community datasets (hotpotqa, common_voice, superb, gaia-benchmark, etc.).","severity":"breaking","affected_versions":">=4.0.0"},{"fix":"Remove trust_remote_code from all load_dataset() calls when using datasets>=4.","message":"The trust_remote_code parameter was removed entirely in datasets 4.0. Passing it raises an error rather than being silently ignored. Code that passes trust_remote_code=True fails when load_dataset() is called.","severity":"breaking","affected_versions":">=4.0.0"},{"fix":"Always upgrade together: pip install -U datasets pyarrow. Check compatibility in the datasets changelog for your target version.","message":"pyarrow version constraints are strict. datasets pins to specific pyarrow ranges. In environments with multiple packages requiring pyarrow, version conflicts can cause ImportError or silent data corruption. datasets and pyarrow must be upgraded together.","severity":"breaking","affected_versions":"all"},{"fix":"Use named functions instead of lambdas for map(). Avoid referencing non-serializable objects inside map functions. Use batched=True for large datasets to avoid per-example overhead.","message":"map() with num_proc>1 (multiprocessing) uses dill for serialization. Lambda functions and closures that reference non-serializable objects (open file handles, locks, etc.) can fail silently or hang. 
No clear error is raised.","severity":"gotcha","affected_versions":"all"},{"fix":"Pass download_mode='force_redownload' to bypass cache. Or delete the cached dataset from ~/.cache/huggingface/datasets/.","message":"load_dataset() caches datasets to disk by default in HF_HOME. Re-running always returns the cached version. In CI or when dataset content changes on the Hub, stale cached versions are silently returned.","severity":"gotcha","affected_versions":"all"},{"fix":"Set HF_TOKEN env var and accept the dataset license on huggingface.co before calling load_dataset().","message":"Gated datasets (some Common Voice, medical, legal datasets) require authentication. load_dataset() raises a 401 or confusing FileNotFoundError if HF_TOKEN is not set or the license has not been accepted.","severity":"gotcha","affected_versions":"all"},{"fix":"pip install datasets. from datasets import load_dataset.","message":"Package name on PyPI is 'datasets' — not 'huggingface-datasets'. pip install huggingface-datasets installs an old, unrelated stub package. Import is also 'from datasets import ...' not 'from huggingface_datasets import ...'.","severity":"gotcha","affected_versions":"all"},{"fix":"Ensure C/C++ build tools and Python development headers are installed in your environment before installing `datasets` with extras. For Alpine Linux, run `apk add build-base python3-dev`. For Debian/Ubuntu-based images, run `apt-get update && apt-get install -y build-essential python3-dev`.","message":"Installing `datasets` with certain extras (e.g., `[audio]`, `[vision]`, `[text]`) can pull in dependencies (like `scikit-learn`, `numba`, `soxr`, `soundfile`'s underlying C libraries) that require C/C++ compilers (e.g., gcc) and Python development headers to be present on the system for successful installation. 
This often leads to build failures (e.g., 'Unknown compiler(s)', 'subprocess-exited-with-error') in minimal environments like Alpine Linux or when using a base Docker image without development tools.","severity":"breaking","affected_versions":"all"}],"env_vars":null,"last_verified":"2026-05-11T18:50:03.952Z","next_check":"2026-05-28T00:00:00.000Z","problems":[{"fix":"Downgrade `datasets` to a version older than 4.0.0 (`pip install 'datasets<4.0.0'`) or ask the dataset author to convert the dataset to a standard format like Parquet.","cause":"The `datasets` library version 4.0.0 and above removed support for loading datasets via Python scripts and the `trust_remote_code` argument, which many older community datasets relied on.","error":"RuntimeError: Dataset scripts are no longer supported, but found [script].py"},{"fix":"Double-check the dataset name/path, ensure local files exist at the specified location, log in to Hugging Face (`huggingface-cli login`) if accessing private datasets, or clear the `datasets` cache and try again. For local files, explicitly specify the format (e.g., `load_dataset('csv', data_files='my_data.csv')`).","cause":"The `load_dataset` function cannot locate the specified dataset, either because the path to local files is incorrect, the dataset name on the Hugging Face Hub is misspelled, the dataset is private and the user is not logged in, or there's an issue with the cache.","error":"FileNotFoundError: Couldn't find a dataset script at [path] or any data file in the same directory. Couldn't find '[dataset_name]' on the Hugging Face Hub either."},{"fix":"For casting errors, explicitly define the `features` argument in `load_dataset` with the correct schema, or for CSV/JSON, manually provide `column_names` to `load_dataset`. 
For 'Invalid pattern', correct the glob pattern or use simpler file paths.","cause":"The schema inferred by `load_dataset` from the data files does not match an expected schema, or when loading CSV/JSON files, the library struggles to automatically determine column structure, especially with complex delimiters or malformed files. An 'Invalid pattern' `ValueError` can also occur with incorrect glob patterns for `data_files`.","error":"ValueError: Couldn't cast [schema details] because column names don't match."},{"fix":"Install the library using `pip install datasets`. If already installed, check for shadowing files in your project directory and rename them. Ensure your virtual environment is activated if applicable.","cause":"The `datasets` library is not installed in the current Python environment, or there is a naming conflict (e.g., a local file named `datasets.py` shadows the installed library).","error":"ModuleNotFoundError: No module named 'datasets'"},{"fix":"Remove or update the problematic import statement (e.g., `from datasets.tasks import TextClassification`). The `datasets.tasks` module is generally not meant for direct user import in recent versions; task-related information is usually handled differently within the library's features. Upgrade `datasets` and `huggingface_hub` to their latest versions.","cause":"An older version of `datasets` might have used `datasets.tasks` for certain functionalities, which has since been removed or refactored. 
This often happens when custom scripts or older examples try to import from this specific (now non-existent) submodule.","error":"ModuleNotFoundError: No module named 'datasets.tasks'"}],"ecosystem":"pypi","meta_description":null,"install_score":35,"install_tag":"stale","quickstart_score":0,"quickstart_tag":"stale","pypi_latest":null,"install_checks":{"last_tested":"2026-05-11","tag":"stale","tag_description":"widespread failures or data too old to trust","results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":3.46,"mem_mb":64,"disk_size":"390.8M"},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":3.45,"mem_mb":64.7,"disk_size":"391.3M"},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"audio","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"torch","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"vision","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":3.49,"mem_mb":64.7,"disk_size":"410.8M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":9.1,"mem_mb":63.4,"disk_size":"363M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim 
(glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":9.27,"mem_mb":64,"disk_size":"363M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"audio","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"torch","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"vision","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":7.99,"mem_mb":64,"disk_size":"383M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":4.69,"mem_mb":71.8,"disk_size":"411.6M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":4.79,"mem_mb":72.4,"disk_size":"412.2M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"audio","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"torch","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine 
(musl)","variant":"vision","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":4.76,"mem_mb":72.4,"disk_size":"432.3M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":12.85,"mem_mb":71.2,"disk_size":"383M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":12.96,"mem_mb":71.8,"disk_size":"384M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"audio","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"torch","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"vision","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":10.34,"mem_mb":71.8,"disk_size":"404M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":4.31,"mem_mb":70.4,"disk_size":"395.4M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":4.52,"mem_mb":71.2,"disk_size":"396.0M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine 
(musl)","variant":"audio","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"torch","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"vision","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":4.38,"mem_mb":71.2,"disk_size":"416.0M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":14.07,"mem_mb":69.8,"disk_size":"367M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":13.79,"mem_mb":70.6,"disk_size":"367M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"audio","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"torch","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"vision","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":11.13,"mem_mb":70.6,"disk_size":"388M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine 
(musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":4.16,"mem_mb":70.5,"disk_size":"394.3M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":4.07,"mem_mb":71.3,"disk_size":"394.9M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"audio","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"torch","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"vision","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":8.91,"mem_mb":70.7,"disk_size":"414.9M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":12.79,"mem_mb":70,"disk_size":"365M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":13.39,"mem_mb":70.7,"disk_size":"366M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"audio","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim 
(glibc)","variant":"torch","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"vision","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":10.14,"mem_mb":70.7,"disk_size":"387M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":3.51,"mem_mb":64.2,"disk_size":"384.9M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":3.36,"mem_mb":64.3,"disk_size":"385.3M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"audio","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"torch","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"vision","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":3.38,"mem_mb":64.3,"disk_size":"402.7M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":8.12,"mem_mb":63,"disk_size":"365M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim 
(glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":2.89,"mem_mb":64.3,"disk_size":"365M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"audio","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"torch","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"vision","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":3.03,"mem_mb":64.3,"disk_size":"383M"}]},"quickstart_checks":{"last_tested":"2026-05-11","tag":"stale","tag_description":"widespread failures or data too old to trust","results":[{"runtime":"python:3.10-alpine","exit_code":-1},{"runtime":"python:3.10-slim","exit_code":1},{"runtime":"python:3.11-alpine","exit_code":-1},{"runtime":"python:3.11-slim","exit_code":1},{"runtime":"python:3.12-alpine","exit_code":-1},{"runtime":"python:3.12-slim","exit_code":1},{"runtime":"python:3.13-alpine","exit_code":1},{"runtime":"python:3.13-slim","exit_code":1},{"runtime":"python:3.9-alpine","exit_code":-1},{"runtime":"python:3.9-slim","exit_code":1}]}}