img2dataset

1.47.0 verified Mon Apr 27 auth: no python

img2dataset is a high-performance library that downloads a list of image URLs and converts it into an image dataset. It supports several output formats, including parquet, webdataset, and individual files in a local folder. The current version is 1.47.0, and new releases appear every few weeks.

pip install img2dataset
error img2dataset.errors.DatasetDownloaderError: Failed to download ...
cause The URL is malformed, the server returns a non-200 status, or the connection times out.
fix
Check that the URL is valid and reachable. Increase the `timeout` parameter, or set `retries` so that failed downloads are attempted again.
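Malformed URLs can be filtered out before the list is handed to the downloader. The following is a minimal sketch using only the standard library; `is_well_formed_url` is a hypothetical helper name, not part of img2dataset, and it checks syntax only, not whether the server actually responds.

```python
from urllib.parse import urlparse


def is_well_formed_url(url: str) -> bool:
    """Return True if url has an http(s) scheme and a host (syntax check only)."""
    try:
        parts = urlparse(url.strip())
    except ValueError:
        # urlparse raises ValueError on some invalid inputs (e.g. bad IPv6 hosts).
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Example inputs (all hypothetical).
candidates = [
    "https://example.com/cat.jpg",
    "example.com/no-scheme.jpg",
    "ftp://example.com/dog.png",
    "",
]
valid_urls = [u for u in candidates if is_well_formed_url(u)]
print(valid_urls)
```

Only the first candidate passes; the others lack an http(s) scheme or a host.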
error FileNotFoundError: [Errno 2] No such file or directory: 'urls.csv'
cause The path to the URL list file is incorrect or the file does not exist.
fix
Provide an absolute path or ensure the file exists relative to the current working directory.
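One way to catch this early is to resolve the path and check it before starting the download. This is a sketch under the assumption that the URL list is a local file; `resolve_url_list` is a hypothetical helper, not an img2dataset function.

```python
import tempfile
from pathlib import Path


def resolve_url_list(path_str: str) -> str:
    """Return an absolute path to the URL list, or raise a descriptive error."""
    path = Path(path_str).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(
            f"URL list not found: {path} (current directory: {Path.cwd()})"
        )
    return str(path)


# Demonstration with a throwaway file; in practice pass your own path.
demo_dir = Path(tempfile.mkdtemp())
demo_file = demo_dir / "urls.csv"
demo_file.write_text("url\nhttps://example.com/cat.jpg\n", encoding="utf-8")

resolved = resolve_url_list(str(demo_file))
print(resolved)
```

The returned string can then be passed as `url_list`, so the result no longer depends on the working directory of the process.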
gotcha If you use `output_format='parquet'` with a large URL list, make sure there is enough disk space for intermediate files. The library may write temporary files to the system temp directory, which can fill up /tmp.
fix Set the environment variable `TMPDIR` to a directory with sufficient space.
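The variable can also be set from Python before the download starts. The sketch below assumes the intermediate files go through Python's standard temp-directory lookup, as the note above suggests; the scratch directory is a stand-in created for illustration.

```python
import os
import shutil
import tempfile

# Stand-in scratch directory; in practice use a directory on a large volume.
scratch_dir = tempfile.mkdtemp(prefix="img2dataset_tmp_")

# Set TMPDIR before any worker processes start so that they inherit it.
os.environ["TMPDIR"] = scratch_dir
# tempfile caches its choice; clear the cache so this process re-reads TMPDIR.
tempfile.tempdir = None

free_gib = shutil.disk_usage(scratch_dir).free / 2**30
print(f"temp dir: {tempfile.gettempdir()} ({free_gib:.1f} GiB free)")
```

Checking the free space up front makes it easier to spot an undersized volume before a long job begins.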
breaking In version 1.40.0, the `image_size` parameter was replaced by `resize_mode` and `resize_size`. Existing code using `image_size` will break.
fix Replace `image_size=256` with `resize_mode='center_crop'` and `resize_size=256`.
gotcha When using `output_format='webdataset'`, the output consists of tar files with a fixed number of samples per tar (default 10000). If you have a small dataset, you might get only one tar file. This can cause issues with some PyTorch DataLoaders expecting shards.
fix Use `output_format='files'` for small datasets, or lower `number_sample_per_shard` so that the data is split across more tar files.
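To judge whether a dataset is "small" in this sense, the number of tar files can be estimated from the URL count and the samples-per-shard setting. This is a rough sketch: `expected_shard_count` is a hypothetical helper, and it assumes shards are cut from the input list in fixed-size chunks, so failed downloads shrink individual shards rather than reduce their number.

```python
import math


def expected_shard_count(num_urls: int, samples_per_shard: int = 10000) -> int:
    """Estimate how many tar shards a URL list of the given size produces."""
    if samples_per_shard <= 0:
        raise ValueError("samples_per_shard must be positive")
    return math.ceil(num_urls / samples_per_shard)


print(expected_shard_count(5000))        # a single shard at the default size
print(expected_shard_count(5000, 1000))  # smaller shards give several tar files
```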
deprecated The `input_format` value 'list' has been deprecated since version 1.35.0. Use 'csv' or 'txt' instead.
fix Change `input_format='list'` to `input_format='csv'` and adjust input file format accordingly.

Download images listed in a CSV file that has a 'url' column (and an optional 'caption' column), and save them as individual files in `output_folder`.

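from img2dataset import download

download(
    url_list="https://example.com/image_list.csv",  # CSV file listing the image URLs
    output_folder="/path/to/output",  # destination directory
    thread_count=4,  # download threads
    output_format="files",  # one file per image
    input_format="csv",
    url_col="url",  # name of the CSV column holding the URLs
    caption_col=None,  # set to "caption" to store captions as well
    processes_count=1,  # worker processes
)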
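For reference, an input file with the 'url' and 'caption' columns described above could be produced with the standard `csv` module. This is only an illustrative sketch: the URLs and captions are placeholders, and the file is written to a throwaway directory.

```python
import csv
import tempfile
from pathlib import Path

# Placeholder rows; replace with real image URLs and captions.
rows = [
    {"url": "https://example.com/images/cat.jpg", "caption": "a cat"},
    {"url": "https://example.com/images/dog.jpg", "caption": "a dog"},
]

# Throwaway location for the example; any writable path works.
csv_path = Path(tempfile.mkdtemp()) / "image_list.csv"

with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "caption"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the header and rows round-trip.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(csv_path, len(loaded))
```

The resulting path can be passed as `url_list`, with `caption_col="caption"` if the captions should be kept.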