{"id":21455,"library":"img2dataset","title":"img2dataset","description":"img2dataset is a high-performance library that downloads a set of image URLs and converts them into an image dataset. It supports output formats such as parquet, webdataset, and local files. Current version is 1.47.0, with new releases every few weeks.","status":"active","version":"1.47.0","language":"python","source_language":"en","source_url":"https://github.com/rom1504/img2dataset","tags":["dataset","image-download","webdataset","parquet","deep-learning"],"install":[{"cmd":"pip install img2dataset","lang":"bash","label":"Install from PyPI"}],"dependencies":[],"imports":[{"note":"Main entry point function.","wrong":"","symbol":"download","correct":"from img2dataset import download"}],"quickstart":{"code":"from img2dataset import download\n\n# Download every URL listed in the CSV and save each image as an individual file.\ndownload(\n    url_list=\"https://example.com/image_list.csv\",\n    output_folder=\"/path/to/output\",\n    thread_count=4,\n    output_format=\"files\",\n    input_format=\"csv\",\n    url_col=\"url\",\n    caption_col=None,\n    processes_count=1,\n)\n","lang":"python","description":"Downloads the images listed in a CSV file with a 'url' column (and an optional 'caption' column) and saves them as individual files in output_folder."},"warnings":[{"fix":"Set the environment variable `TMPDIR` to a directory with sufficient space.","message":"If you use `output_format='parquet'` with a large number of URLs, make sure you have enough disk space for intermediate files. The library may write temporary files to the system temp directory, which could fill up /tmp.","severity":"gotcha","affected_versions":"all"},{"fix":"Replace `image_size=256` with `resize_mode='center_crop'` and `resize_size=256`.","message":"In version 1.40.0, the `image_size` parameter was replaced by `resize_mode` and `resize_size`. Existing code using `image_size` will break.","severity":"breaking","affected_versions":">=1.40.0"},{"fix":"Use `output_format='files'` for small datasets, or lower `number_sample_per_shard`.","message":"With `output_format='webdataset'`, the output is a set of tar shards, each holding a fixed number of samples (default 10000). A small dataset may therefore produce only one tar file, which can cause issues with some PyTorch DataLoaders that expect multiple shards.","severity":"gotcha","affected_versions":"all"},{"fix":"Change `input_format='list'` to `input_format='csv'` and adjust the input file format accordingly.","message":"The value 'list' for the `input_format` parameter is deprecated since version 1.35.0. Use 'csv' or 'txt' instead.","severity":"deprecated","affected_versions":">=1.35.0"}],"env_vars":null,"last_verified":"2026-04-27T00:00:00.000Z","next_check":"2026-07-26T00:00:00.000Z","problems":[{"fix":"Check that the URL is valid and accessible. Increase the `timeout` parameter, or set `retries` to retry failed downloads.","cause":"The URL is malformed, the server returns a non-200 status, or the connection times out.","error":"img2dataset.errors.DatasetDownloaderError: Failed to download ..."},{"fix":"Provide an absolute path, or make sure the file exists relative to the current working directory.","cause":"The path to the URL list file is incorrect or the file does not exist.","error":"FileNotFoundError: [Errno 2] No such file or directory: 'urls.csv'"}],"ecosystem":"pypi","meta_description":null,"install_score":null,"install_tag":null,"quickstart_score":null,"quickstart_tag":null}