Pooch: A friend to fetch your data files

1.9.0 · active · verified Sun Apr 05

Pooch is a Python library that simplifies managing and fetching data files. It downloads files from remote servers (HTTP/HTTPS, FTP, SFTP, and DOI-based repositories such as Zenodo and Figshare) only when they are needed, stores them in a local cache, and verifies their integrity with hash checks (SHA256 by default). This makes it well suited to Python libraries that distribute sample datasets and to scientists managing research data. The current version is 1.9.0; minor releases are typically several months to a year apart.
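The download-only-when-needed behaviour described above can be illustrated with a small standard-library sketch. This is a conceptual model only, not Pooch's actual implementation: `sha256_of`, `fetch_if_needed`, and `fake_download` are hypothetical names invented for this illustration, and the "download" simply writes bytes to disk so the example runs offline.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_if_needed(name, known_hash, cache_dir, download):
    """Return the cached path, calling download(name, target) only when needed."""
    target = Path(cache_dir) / name
    # Download when the file is missing or the cached copy fails the hash check
    if not target.exists() or sha256_of(target) != known_hash:
        download(name, target)
        if sha256_of(target) != known_hash:
            raise ValueError(f"Hash mismatch for {name}")
    return target


# Hypothetical stand-in for a real HTTP/FTP download; records each call
calls = []


def fake_download(name, target):
    calls.append(name)
    target.write_bytes(b"example data\n")


expected = hashlib.sha256(b"example data\n").hexdigest()
with tempfile.TemporaryDirectory() as cache:
    first = fetch_if_needed("sample.txt", expected, cache, fake_download)
    second = fetch_if_needed("sample.txt", expected, cache, fake_download)
    print(f"Downloads performed: {len(calls)}")
```

The second call finds a cached copy whose hash matches, so the download function runs only once. In Pooch itself this role is played by `Pooch.fetch`.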

Warnings

Install

Imports

Quickstart

This quickstart shows how to create a `Pooch` instance, register a data file with its SHA256 hash under a base URL, and fetch it. `fetch` downloads the file if it is missing from the cache or if the cached copy's hash does not match the registry; otherwise it reuses the cached copy. Either way, it returns the path to the local file. The example caches to a temporary directory for demonstration purposes; in a production library, `pooch.os_cache('your_library_name')` is recommended for persistent caching.

import os
import tempfile

import pooch

# A small test file from the fatiando/pooch repository stands in for your own hosted data.
# The registry maps each file name to its expected hash, here prefixed with the algorithm.
data_hash = "sha256:d48d4841b5d197607a9b0c7a522533c095311e3895e5330a9e25d2c510800b50"

# Use the directory named by POOCH_TEST_CACHE if it is set. Otherwise fall back to a
# temporary directory so the example is self-contained and leaves the real cache untouched.
# In a real library, use pooch.os_cache("your_library_name") instead.
cache_dir = os.environ.get("POOCH_TEST_CACHE")
if cache_dir:
    temp_dir = None  # Caller-provided directory: nothing to clean up afterwards
else:
    temp_dir = tempfile.TemporaryDirectory()
    cache_dir = temp_dir.name


registry = {"tiny-data.txt": data_hash}

my_pooch = pooch.create(
    path=cache_dir,
    base_url="https://github.com/fatiando/pooch/raw/{version}/pooch/tests/data/",
    version="v1.9.0",  # Fills {version} in base_url and names a versioned cache subfolder
    registry=registry
)

# Fetch the data file
file_path = my_pooch.fetch("tiny-data.txt")

# The path is the same whether the file was just downloaded or already cached
print(f"Data file available at: {file_path}")

with open(file_path, "r", encoding="utf-8") as f:
    content = f.read()

print(f"Content of the file: {content.strip()}")

# Clean up the temporary directory if it was created
if temp_dir:
    temp_dir.cleanup()
