Data Package
The `datapackage-py` library is a Python implementation of the Data Package standard, providing utilities to create, read, and validate data packages as defined by the frictionlessdata.io specifications. It offers a simple API for working with `datapackage.json` files and their associated resources. The current version is 1.15.4; releases follow a minor-version cadence as needed for bug fixes and small enhancements.
Common errors
- jsonschema.exceptions.ValidationError: 'resources' is a required property
  cause: The `datapackage.json` descriptor is missing the mandatory `resources` key, or it is malformed.
  fix: Ensure your `datapackage.json` adheres to the Data Package specification, including a valid `resources` array. Check for typos or structural errors in the JSON.
- FileNotFoundError: [Errno 2] No such file or directory: 'datapackage.json'
  cause: The `Package()` constructor could not find the `datapackage.json` file at the specified path.
  fix: Verify that the path passed to `Package()` is correct and that the `datapackage.json` file exists at that location. For relative paths, check that your script's current working directory is what you expect.
- UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte
  cause: The resource file (e.g., a CSV) is not UTF-8 encoded, but `datapackage` attempts to read it as UTF-8 by default.
  fix: Specify the correct encoding in the resource descriptor in `datapackage.json` (e.g., `'encoding': 'ISO-8859-1'`) or explicitly when loading the data, if the particular data source supports it.
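As a sketch of the encoding fix above, a resource descriptor can declare a non-UTF-8 encoding directly. The `cities-latin1.csv` filename and the descriptor values here are hypothetical, used only to illustrate where the `encoding` key goes:

```python
import json

# Hypothetical descriptor illustrating the encoding fix: the resource
# declares its own encoding so readers do not assume UTF-8.
descriptor = {
    "name": "legacy-data",
    "resources": [
        {
            "name": "cities",
            "path": "cities-latin1.csv",
            "encoding": "iso-8859-1",  # the file is Latin-1, not UTF-8
        }
    ],
}

# Serialize in the same shape a datapackage.json file would have
text = json.dumps(descriptor, indent=2)
print(text)
```

With the `encoding` key present, a conforming reader decodes the file as Latin-1 instead of failing with a `UnicodeDecodeError`.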
Warnings
- breaking Major API changes were introduced in version 1.0.0, which affect how package and resource metadata/data are accessed. Methods like `get_resources()`, `get_metadata()`, and `get_data()` were replaced by properties or renamed methods.
- gotcha Users often confuse `datapackage-py` with the broader `frictionless` framework. `datapackage-py` strictly implements the Data Package standard, while `frictionless` (v4+) is a much larger data framework that includes data package capabilities but offers a different API and more extensive features (e.g., pipelines, schemas, validation for various data sources).
- gotcha When loading remote resources, `datapackage` relies on `requests`. If you encounter SSL errors (e.g., `SSLCertVerificationError`), it's often an environment-specific issue rather than a library bug.
Install
-
pip install datapackage
Imports
- Package
from datapackage import Package
- Resource
from datapackage import Resource
- ValidationError
from datapackage.exceptions import ValidationError
from datapackage import ValidationError
Quickstart
from datapackage import Package
import json
import os

# Create a simple data package descriptor in memory
descriptor = {
    'name': 'my-data-package',
    'resources': [
        {
            'name': 'cities',
            'path': 'cities.csv',
            'profile': 'tabular-data-resource',
            'schema': {
                'fields': [
                    {'name': 'id', 'type': 'integer'},
                    {'name': 'name', 'type': 'string'}
                ]
            }
        }
    ]
}

# Write the resource file and descriptor to disk
csv_data = "id,name\n1,London\n2,Paris"
with open('cities.csv', 'w') as f:
    f.write(csv_data)
with open('datapackage.json', 'w') as f:
    json.dump(descriptor, f, indent=2)

# Load the data package
package = Package('datapackage.json')

# Validate the package
if package.valid:
    print(f"Package '{package.name}' is valid!")
else:
    print("Package validation errors:")
    for error in package.errors:
        print(f"- {error}")

# Get a resource
cities_resource = package.get_resource('cities')

# Read data from the resource
print("\nCities data (keyed=True):")
for row in cities_resource.read(keyed=True):
    print(row)

# Clean up created files
os.remove('cities.csv')
os.remove('datapackage.json')
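Before handing a descriptor to `Package`, a lightweight pre-flight check can catch the most common `ValidationError` listed above (a missing `resources` key). This is a minimal stdlib sketch with hypothetical rules of thumb, not the library's actual JSON Schema validation, which is far stricter:

```python
def preflight(descriptor: dict) -> list:
    """Return a list of problems found in a Data Package descriptor.

    Only checks the basics (a non-empty 'resources' array whose entries
    have a 'path' or inline 'data'); datapackage-py's real validation is
    JSON Schema based and covers much more.
    """
    problems = []
    resources = descriptor.get("resources")
    if resources is None:
        problems.append("'resources' is a required property")
    elif not isinstance(resources, list) or not resources:
        problems.append("'resources' must be a non-empty array")
    else:
        for i, res in enumerate(resources):
            if not isinstance(res, dict) or not ("path" in res or "data" in res):
                problems.append(f"resource {i} needs a 'path' or inline 'data'")
    return problems

print(preflight({"name": "broken-package"}))
# A well-formed descriptor yields an empty list:
print(preflight({"resources": [{"name": "cities", "path": "cities.csv"}]}))
```

Running the check before constructing `Package` gives a clearer, earlier error message than the raw `jsonschema` traceback.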