Maggma Data Pipeline Framework
Maggma is a framework for building scientific data processing pipelines that take data from diverse sources (databases, Azure Blobs, local files) all the way to a REST API. It provides two core abstractions, `Store` and `Builder`, for modular ETL-like operations. The `Store` interface largely mimics PyMongo syntax, enabling consistent data access across different backends. Actively developed by the Materials Project, it is currently at version 0.72.1 and requires Python 3.9+.
Warnings
- breaking The `maggma.api` module has been deprecated and will be migrated. This could significantly impact projects relying on Maggma's built-in API functionalities.
- gotcha Maggma's `Store` classes provide a unified interface that resembles PyMongo. However, not all `Store` implementations (e.g., FileStore, S3Store) support the full breadth of PyMongo's query capabilities or advanced features like aggregation pipelines. Over-reliance on PyMongo-specific syntax with non-Mongo backends can lead to unexpected behavior or unsupported operations.
- gotcha `MemoryStore` is suitable for testing and quick examples, but it is not persistent. Any data added to a `MemoryStore` is lost when the Python interpreter exits or the `Store` object is garbage collected.
- gotcha Each document added to a `Store` needs a unique identifier in the field named by the `key` argument at `Store` initialization (default: `task_id`). `Store.update()` matches documents on that key, so in the Mongo-style stores, writing a document whose key already exists replaces the stored document rather than adding a second one. `update()` accepts an optional `key` argument to override the match field; it does not take an `upsert` flag.
- breaking Compatibility issues with `numpy` version 2.0 have been reported for Maggma, particularly for components like `OpenDataStore`. These can lead to unexpected errors or broken functionality.
Install
pip install maggma
Imports
- MemoryStore
from maggma.stores import MemoryStore
- MongoStore
from maggma.stores import MongoStore
- Builder
from maggma.core import Builder
- Store
from maggma.core import Store
Quickstart
import os
from maggma.stores import MemoryStore
# Sample data
turtles = [
{"name": "Leonardo", "color": "blue", "tool": "sword"},
{"name": "Donatello", "color": "purple", "tool": "staff"},
{"name": "Michelangelo", "color": "orange", "tool": "nunchuks"},
{"name": "Raphael", "color": "red", "tool": "sai"}
]
# Create a MemoryStore (in-memory, data not persistent)
# 'key' argument specifies the unique identifier for documents
store = MemoryStore(key="name")
# Connect to the store (for MemoryStore, this just initializes it)
store.connect()
# Add data to the store using update
# Documents are matched on the key field: inserted if not found, replaced if found
store.update(turtles, key="name")
# Query the store
print(f"Total documents: {store.count()}")
print(f"Blue turtle: {store.query_one(criteria={'color': 'blue'})}")
# Find distinct values
print(f"Distinct colors: {list(store.distinct(field='color'))}")
# Close the store connection (important for persistent stores)
store.close()
# Example of using a persistent store (e.g., MongoStore)
# Requires a running MongoDB instance and pymongo installed.
# uri = os.environ.get('MONGO_URI', 'mongodb://localhost:27017/test_db')
# from maggma.stores import MongoStore
# mongo_store = MongoStore(database='test_db', collection_name='my_collection', host=uri, key='name')
# try:
#     mongo_store.connect()
#     mongo_store.update(turtles, key='name')
#     print(f"MongoStore count: {mongo_store.count()}")
# finally:
#     mongo_store.close()