Pooch: A friend to fetch your data files
Pooch is a Python library for managing and fetching data files. It downloads files from remote servers (HTTP, FTP, and DOI-based repositories such as Zenodo and Figshare) only when they are needed, stores them in a local cache, and verifies their integrity with SHA256 hash checks. This makes it well suited to Python libraries that distribute sample datasets and to scientists managing research data. The current version is 1.9.0; minor versions are typically released several months to a year apart.
Common errors
- ImportError: Missing optional dependency 'pooch' required for scipy.datasets module. Please use pip or conda to install 'pooch'.
  cause: A library that relies on Pooch, such as SciPy, tries to use its data-fetching capabilities, but Pooch is not installed in the Python environment.
  fix: Install Pooch using pip or conda: `pip install pooch` or `conda install pooch`.
- ImportError: cannot import name 'file_hash' from 'pooch.utils'
  cause: The `file_hash` function, previously accessible via `pooch.utils`, was moved to the top-level `pooch` namespace in version 1.5, and the old import path no longer works.
  fix: Update your import statement from `from pooch.utils import file_hash` to `from pooch import file_hash`.
- ValueError: File 'FILENAME.zip' not found in data archive https://zenodo.org/record/RECORD_ID (doi:DOI_ID).
  cause: Pooch tries to download a file from a Zenodo (or similar DOI-based) repository, but the `filename` specified in the Pooch registry does not exactly match the file name or path within the remote archive. This happens particularly with files located in subdirectories of the Zenodo deposit.
  fix: Make sure the filename in your Pooch registry (or `fetch` call) matches the full path to the file *within* the Zenodo archive, including any subdirectories (e.g., `subdirectory/filename.zip`). Check the Zenodo record to verify the exact file path. If a URL such as `doi:10.5281/zenodo.7347607/Wild-Minds` is being misread, make sure the DOI part identifies the record and the remainder is the correct path to the file.
- ValueError: Missing package 'tqdm' required for progress bars.
  cause: A Pooch downloader (like `HTTPDownloader` or `FTPDownloader`) is configured to display a progress bar (e.g., `progressbar=True`), but the optional `tqdm` library that provides the progress bar is not installed.
  fix: Install the `tqdm` library: `pip install tqdm` or `conda install tqdm`.
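The `tqdm` error can also be avoided in code by enabling the progress bar only when `tqdm` is importable. The following is a minimal sketch of that idea, not something Pooch provides: the helper names `has_module` and `make_http_downloader` are hypothetical, and Pooch is imported lazily so that the check itself needs only the standard library.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the named top-level module can be imported."""
    return importlib.util.find_spec(name) is not None


def make_http_downloader():
    """Build an HTTPDownloader that shows a progress bar only if tqdm exists."""
    # Imported here (hypothetical design choice) so has_module() stays usable
    # even in environments where Pooch itself is not installed.
    import pooch

    return pooch.HTTPDownloader(progressbar=has_module("tqdm"))


print(has_module("json"))  # a standard-library module, so this prints True
```

The resulting downloader can then be passed to a fetch call, for example `my_pooch.fetch("tiny-data.txt", downloader=make_http_downloader())`.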
Warnings
- gotcha An incorrect SHA256 hash in the registry causes needless re-downloads and hash-mismatch errors. Always make sure the hash in your registry exactly matches the file's content.
- gotcha When using `base_url` with versioning, ensure your remote data repository structure reflects the version string provided to `pooch.create()`.
- gotcha If Pooch downloads a file whose hash doesn't match the registry, it raises an exception (usually `ValueError`), which indicates either a corrupted download or an outdated hash in the registry.
- gotcha Using an environment variable to override the cache `path` (e.g., `MYPACKAGE_DATA_DIR`) can lead to unexpected behavior if not managed carefully, as it might point to a non-existent or inaccessible directory.
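To make the hash-related warnings concrete, here is a rough sketch of what a SHA256 integrity check amounts to, written with the standard library only. It is not Pooch's own implementation, and the helpers `sha256_of` and `matches_registry` are hypothetical names that handle SHA256 entries only. For real registries, `pooch.file_hash` computes the same kind of digest.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_registry(path, registry_hash):
    """Compare a file against an entry of the form 'sha256:<hex>' or '<hex>'."""
    expected = registry_hash.split(":", 1)[-1].lower()
    return sha256_of(path) == expected


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b"")  # an empty file has a well-known SHA256 digest
    empty_hash = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    ok = matches_registry(sample, "sha256:" + empty_hash)
    sample.write_bytes(b"changed")  # any change to the content breaks the match
    still_ok = matches_registry(sample, "sha256:" + empty_hash)

print(ok, still_ok)  # True False
```

A single changed byte is enough to fail the comparison, which is why a stale registry entry and a corrupted download look the same from the caller's side.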
Install
- pip install pooch
- conda install conda-forge::pooch
Imports
- create
  import pooch
  my_pooch = pooch.create(...)
- os_cache
  import pooch
  cache_path = pooch.os_cache('my_project_name')
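When the cache location can be overridden through an environment variable, it may help to check that the override is usable before handing it to Pooch. The sketch below shows one possible check using only the standard library. The helper `resolve_cache_dir` is hypothetical, and `MYPACKAGE_DATA_DIR` is just an example variable name.

```python
import os
import tempfile
from pathlib import Path


def resolve_cache_dir(env_var, default):
    """Pick the cache directory, preferring a usable environment override."""
    override = os.environ.get(env_var)
    if override:
        candidate = Path(override).expanduser()
        try:
            # Create the directory if it is missing; this fails with OSError
            # when the path is a regular file or cannot be created.
            candidate.mkdir(parents=True, exist_ok=True)
        except OSError:
            candidate = None
        if candidate is not None and os.access(candidate, os.W_OK):
            return candidate
    return Path(default).expanduser()


with tempfile.TemporaryDirectory() as tmp:
    fallback = Path(tmp) / "default-cache"
    # A usable override is honoured (and created if missing).
    os.environ["MYPACKAGE_DATA_DIR"] = str(Path(tmp) / "override")
    used_override = resolve_cache_dir("MYPACKAGE_DATA_DIR", fallback)
    # An override that points at a regular file cannot serve as a cache.
    blocker = Path(tmp) / "not-a-dir"
    blocker.write_text("x")
    os.environ["MYPACKAGE_DATA_DIR"] = str(blocker)
    used_fallback = resolve_cache_dir("MYPACKAGE_DATA_DIR", fallback)
    del os.environ["MYPACKAGE_DATA_DIR"]

print(used_override.name, used_fallback.name)  # override default-cache
```

The returned path can be passed as `path=` to `pooch.create()`, with `pooch.os_cache('my_project_name')` as the default. `pooch.create()` also accepts an `env` argument naming the variable directly, which applies the override without this extra validation.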
Quickstart
import os
import tempfile

import pooch

# A small file from the fatiando/pooch repository keeps this example runnable.
# In a real project, base_url would point to your own hosted data.
# The hash must match the remote file exactly; if fetch() raises ValueError,
# recompute it from a trusted copy of the file with pooch.file_hash().
data_hash = "sha256:d48d4841b5d197607a9b0c7a522533c095311e3895e5330a9e25d2c510800b50"
registry = {"tiny-data.txt": data_hash}

# Use the directory named by POOCH_TEST_CACHE if it is set; otherwise fall back
# to a temporary directory so the example does not clutter the real cache.
# In a real library, use pooch.os_cache("your_app_name") instead.
cache_dir = os.environ.get("POOCH_TEST_CACHE")
if cache_dir:
    temp_dir = None  # directory supplied by the caller: nothing to clean up
else:
    temp_dir = tempfile.TemporaryDirectory()
    cache_dir = temp_dir.name

# Configure a new Pooch instance; {version} in base_url is filled in from version.
my_pooch = pooch.create(
    path=cache_dir,
    base_url="https://github.com/fatiando/pooch/raw/{version}/pooch/tests/data/",
    version="v1.9.0",  # Match the version of the data you want to fetch
    registry=registry,
)

# Fetch the data file (downloaded on the first call, served from the cache afterwards)
file_path = my_pooch.fetch("tiny-data.txt")
print(f"Data file downloaded to: {file_path}")

with open(file_path, "r") as f:
    content = f.read()
print(f"Content of the file: {content.strip()}")

# Clean up the temporary directory if one was created
if temp_dir is not None:
    temp_dir.cleanup()