Pooch: A friend to fetch your data files
Pooch is a Python library that simplifies managing and fetching data files. It downloads files from remote servers (HTTP, FTP, Zenodo, and Figshare are supported) only when they are needed, stores them in a local cache, and verifies their integrity with SHA256 hash checks. This makes it well suited to Python libraries that distribute sample datasets and to scientists managing research data. The current version is 1.9.0; minor versions are typically released several months to a year apart.
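The download-only-when-needed behavior can be sketched with the standard library alone. This is a simplified illustration of the idea, not Pooch's implementation: the function `fetch_cached` and its parameters are hypothetical, and it handles a single URL with a SHA256 check.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA256 hex digest of a file's content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def fetch_cached(url: str, fname: str, known_hash: str, cache_dir: Path) -> Path:
    """Download `url` into `cache_dir` unless a valid copy is already cached.

    Hypothetical helper that mimics the idea behind Pooch, not its API.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / fname

    # Reuse the cached copy when it exists and its hash matches.
    if target.exists() and sha256_of(target) == known_hash:
        return target

    # Otherwise download under a temporary name, verify, then move into place.
    partial = cache_dir / (fname + ".part")
    with urllib.request.urlopen(url) as response:
        partial.write_bytes(response.read())
    if sha256_of(partial) != known_hash:
        partial.unlink()
        raise ValueError(
            f"SHA256 mismatch for {fname}: the file may be corrupted "
            "or the known hash may be outdated"
        )
    partial.replace(target)
    return target
```

A second call with the same file name and hash returns the cached path without contacting the server, which is the property that makes repeated use cheap.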
Warnings
- gotcha An incorrect SHA256 hash triggers re-downloads or errors. Always ensure the hash in your registry exactly matches the file's content.
- gotcha When using `base_url` with versioning, ensure your remote data repository structure reflects the version string provided to `pooch.create()`.
- gotcha If `Pooch` downloads a file but its hash doesn't match the registry, it will raise an exception (usually `ValueError`), indicating possible data corruption or an outdated hash.
- gotcha Overriding the cache `path` with an environment variable (e.g., `MYPACKAGE_DATA_DIR`) can lead to unexpected behavior if the variable points to a non-existent or inaccessible directory.
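One way to avoid hash mismatches is to compute the hash from the exact file you publish. The sketch below uses only `hashlib`; the helper names `sha256_file` and `registry_entry` are hypothetical and not part of Pooch. Pooch also ships `pooch.file_hash` for the same purpose.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def registry_entry(path):
    """Build a {file name: "sha256:<hash>"} mapping (hypothetical helper)."""
    path = Path(path)
    return {path.name: f"sha256:{sha256_file(path)}"}
```

The resulting dictionary has the same shape as a `registry` passed to `pooch.create()`. Recompute it whenever the hosted file changes.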
Install
- pip install pooch
- conda install conda-forge::pooch
Imports
- create
  import pooch
  my_pooch = pooch.create(...)
- os_cache
  import pooch
  cache_path = pooch.os_cache('my_project_name')
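A package can let users override the cache location through an environment variable while defaulting to the OS cache. The sketch below is a hypothetical helper (`resolve_cache_dir` is not part of Pooch) that checks the chosen directory is usable before it is handed to `pooch.create()`. The variable name `MYPACKAGE_DATA_DIR` is only an example.

```python
import os
from pathlib import Path


def resolve_cache_dir(env_name, default):
    """Return the cache directory, preferring the environment override.

    Raises an OSError subclass if the directory cannot be created
    or is not writable.
    """
    # An unset or empty variable falls back to the default location.
    chosen = Path(os.environ.get(env_name) or default).expanduser()
    chosen.mkdir(parents=True, exist_ok=True)
    if not os.access(chosen, os.W_OK):
        raise PermissionError(f"Cache directory is not writable: {chosen}")
    return chosen
```

With Pooch installed, the default would typically be `pooch.os_cache('my_project_name')`. `pooch.create()` also accepts an `env` argument naming such a variable, so an explicit check like this is optional.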
Quickstart
import os
import tempfile

import pooch

# Full URL of a small test file in the fatiando/pooch repository, shown for reference.
# Pooch builds this URL itself from base_url, version, and the file name.
data_url = "https://github.com/fatiando/pooch/raw/v1.9.0/pooch/tests/data/tiny-data.txt"
# The hash must match the file's content exactly, or fetch() raises an error.
data_hash = "sha256:d48d4841b5d197607a9b0c7a522533c095311e3895e5330a9e25d2c510800b50"
registry = {"tiny-data.txt": data_hash}

# Use a temporary cache directory so the example is self-contained.
# In a real library, use pooch.os_cache("your_app_name") instead.
cache_dir = os.environ.get("POOCH_TEST_CACHE")
if cache_dir:
    temp_dir = None  # Directory supplied by the environment; nothing to clean up
else:
    temp_dir = tempfile.TemporaryDirectory()
    cache_dir = temp_dir.name

# Configure a new Pooch instance
my_pooch = pooch.create(
    path=cache_dir,
    base_url="https://github.com/fatiando/pooch/raw/{version}/pooch/tests/data/",
    version="v1.9.0",  # Match the version of the data you want to fetch
    registry=registry,
)

# Fetch the data file (downloaded only if it is not already in the cache)
file_path = my_pooch.fetch("tiny-data.txt")
print(f"Data file downloaded to: {file_path}")
with open(file_path, "r") as f:
    content = f.read()
print(f"Content of the file: {content.strip()}")

# Clean up the temporary directory if one was created
if temp_dir:
    temp_dir.cleanup()