Intake
Intake is a lightweight Python package for finding, investigating, loading, and distributing data. It provides a common API for loading data from a wide variety of sources (e.g., CSV, NetCDF, SQL, HDF5, Parquet, Zarr) and supports creating and managing data catalogs. The current version is 2.0.9. The 2.x series is in a stable maintenance phase: releases are infrequent, but individual releases can still carry significant changes.
Warnings
- breaking Major API changes occurred between Intake 1.x and 2.x, particularly concerning how drivers are accessed and catalog specifications are defined. Directly using `intake.source.<driver>.SourceClass` is deprecated in favor of `intake.open_<format>(...)` functions.
- gotcha Intake relies heavily on a plugin system for specific data formats and remote storage. If you try to open a file type (e.g., Parquet, SQL) or access a remote system (e.g., S3) without the corresponding `intake-<plugin_name>` package installed, you will encounter errors.
- gotcha It is easy to confuse `intake.open_catalog()` with direct source-opening functions such as `intake.open_csv()`. `open_catalog` loads a YAML catalog file, which can describe multiple sources, whereas `open_csv` (and similar functions) opens a single data file directly, with no catalog involved.
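One way to catch the missing-plugin problem early is to check that the relevant modules are importable before opening any data. The following is a minimal sketch using only the standard library. The helper `missing_modules` is hypothetical, and the module names in `required` (`intake_parquet`, `s3fs`) are assumptions chosen for illustration, so substitute the import names of the plugins your own sources need.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Assumed requirements for this example only; adjust to your own sources.
required = ["intake", "intake_parquet", "s3fs"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

The check only confirms that a module can be located. It does not verify that the installed version is compatible with Intake 2.x.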
Install
- pip install intake
- pip install intake[parquet,s3,sql]
Imports
- open_catalog
  import intake
  catalog = intake.open_catalog('my_catalog.yaml')
- open_csv
  import intake
  df = intake.open_csv('data.csv').read()
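Because the two entry points expect different kinds of input, a small dispatch helper can make the choice explicit. This is a sketch only: `opener_name` and `open_with_intake` are hypothetical helpers, not part of Intake, and choosing by file suffix is an assumption that breaks down when a catalog or data file uses an unusual extension.

```python
from pathlib import PurePosixPath

# Assumed convention: YAML files are catalogs, .csv files are single sources.
CATALOG_SUFFIXES = {".yaml", ".yml"}


def opener_name(path):
    """Return the name of the intake function suited to this path."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in CATALOG_SUFFIXES:
        return "open_catalog"
    if suffix == ".csv":
        return "open_csv"
    raise ValueError(f"No opener known for suffix {suffix!r}")


def open_with_intake(path):
    """Open a catalog or a CSV source using the matching intake function."""
    import intake  # imported lazily so opener_name works without intake

    return getattr(intake, opener_name(path))(path)


print(opener_name("my_catalog.yaml"))
print(opener_name("data.csv"))
```

Note the difference in what comes back: `open_with_intake('my_catalog.yaml')` would return a catalog to pick sources from, while `open_with_intake('data.csv')` would return a single source on which `.read()` is then called.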
Quickstart
import intake
# Open a public example catalog
catalog = intake.open_catalog("https://raw.githubusercontent.com/intake/intake-examples/master/catalogs/us_states.yml")
# Access a data source from the catalog
df = catalog.states.read()
print(df.head())