Maggma Data Pipeline Framework

0.72.1 · active · verified Wed Apr 15

Maggma is a framework for building scientific data processing pipelines from data stored in a variety of formats: databases, Azure Blobs, files on disk, and REST APIs. Its two core abstractions, `Store` and `Builder`, support modular ETL-style operations. The `Store` interface largely follows PyMongo syntax, so data access looks the same across different backends. Maggma is actively developed by the Materials Project; the current version is 0.72.1 and it requires Python 3.9+.
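
To make the `Store`/`Builder` split concrete, here is a minimal, dependency-free sketch of the pattern. It does not import Maggma: `ToyStore` and `ToyBuilder` are hypothetical stand-ins written for this illustration, and only the method names `get_items`, `process_item`, and `update_targets` are chosen to echo the Builder interface. Treat it as a picture of the data flow under those assumptions, not as the library's implementation.

```python
from typing import Any, Dict, Iterable, Iterator, List


class ToyStore:
    """Hypothetical stand-in for a Store: documents indexed by a key field."""

    def __init__(self, key: str) -> None:
        self.key = key
        self._docs: Dict[Any, Dict[str, Any]] = {}

    def update(self, docs: Iterable[Dict[str, Any]]) -> None:
        # Insert new documents; replace existing ones that share a key value.
        for doc in docs:
            self._docs[doc[self.key]] = dict(doc)

    def query(self) -> Iterator[Dict[str, Any]]:
        # Yield copies so callers cannot mutate stored documents by accident.
        for doc in self._docs.values():
            yield dict(doc)

    def count(self) -> int:
        return len(self._docs)


class ToyBuilder:
    """Hypothetical stand-in for a Builder: extract, transform, load."""

    def __init__(self, source: ToyStore, target: ToyStore) -> None:
        self.source = source
        self.target = target

    def get_items(self) -> Iterator[Dict[str, Any]]:
        # Extract: pull every document from the source store.
        return self.source.query()

    def process_item(self, item: Dict[str, Any]) -> Dict[str, Any]:
        # Transform: derive a new document from one source document.
        return {"name": item["name"], "summary": f"{item['color']} / {item['tool']}"}

    def update_targets(self, items: List[Dict[str, Any]]) -> None:
        # Load: write the processed documents into the target store.
        self.target.update(items)

    def run(self) -> None:
        processed = [self.process_item(item) for item in self.get_items()]
        self.update_targets(processed)


source = ToyStore(key="name")
source.update([
    {"name": "Leonardo", "color": "blue", "tool": "sword"},
    {"name": "Raphael", "color": "red", "tool": "sai"},
])
target = ToyStore(key="name")

builder = ToyBuilder(source, target)
builder.run()
builder.run()  # a second run overwrites by key instead of duplicating

results = sorted(target.query(), key=lambda d: d["name"])
print(results)
```

Because the target is keyed on `name`, rerunning the builder leaves two documents rather than four, which is the property that makes key-based stores convenient for repeatable pipelines.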

Quickstart

This quickstart demonstrates the core concepts of Maggma: defining data as a list of dictionaries, creating a `Store` (using `MemoryStore` for simplicity), connecting to it, adding data using the `update` method, and querying data. It highlights the use of a `key` field for unique document identification. A commented-out example for `MongoStore` is included to illustrate persistent storage.

import os
from maggma.stores import MemoryStore

# Sample data
turtles = [
    {"name": "Leonardo", "color": "blue", "tool": "sword"},
    {"name": "Donatello", "color": "purple", "tool": "staff"},
    {"name": "Michelangelo", "color": "orange", "tool": "nunchuks"},
    {"name": "Raphael", "color": "red", "tool": "sai"}
]

# Create a MemoryStore (in-memory, data not persistent)
# 'key' argument specifies the unique identifier for documents
store = MemoryStore(key="name")

# Connect to the store (for MemoryStore, this just initializes it)
store.connect()

# Add data to the store using update
# Documents are matched on the store's key ("name" here): a document whose
# key already exists is replaced, otherwise it is inserted
store.update(turtles)

# Query the store
# query() returns an iterator of documents; query_one() returns a single document
print(f"Total documents: {store.count()}")
print(f"Blue turtle: {store.query_one(criteria={'color': 'blue'})}")

# Find distinct values
print(f"Distinct colors: {list(store.distinct(field='color'))}")

# Close the store connection (important for persistent stores)
store.close()

# Example of using a persistent store (e.g., MongoStore)
# Requires a MongoDB instance running and pymongo installed.
# uri = os.environ.get('MONGO_URI', 'mongodb://localhost:27017/test_db')
# from maggma.stores import MongoStore
# mongo_store = MongoStore(database='test_db', collection_name='my_collection', host=uri, key='name')
# try:
#     mongo_store.connect()
#     mongo_store.update(turtles)
#     print(f"MongoStore count: {mongo_store.count()}")
# finally:
#     mongo_store.close()
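
The `criteria` argument used in the queries above is a PyMongo-style filter dictionary. As a rough, dependency-free sketch of what such a filter means, the hypothetical `matches` helper below evaluates plain equality plus the `$in` and `$ne` operators against a document. It is an illustration of the filter semantics only, covers just that small subset of MongoDB operators, and is not how Maggma or MongoDB evaluate queries internally.

```python
from typing import Any, Dict, List


def matches(doc: Dict[str, Any], criteria: Dict[str, Any]) -> bool:
    """Hypothetical helper: True if doc satisfies a small subset of Mongo-style criteria."""
    for field, condition in criteria.items():
        value = doc.get(field)
        if isinstance(condition, dict):
            # Operator form, e.g. {"tool": {"$in": ["sai", "staff"]}}
            for op, arg in condition.items():
                if op == "$in":
                    if value not in arg:
                        return False
                elif op == "$ne":
                    if value == arg:
                        return False
                else:
                    raise ValueError(f"Operator not supported in this sketch: {op}")
        elif value != condition:
            # Plain form, e.g. {"color": "blue"} means equality
            return False
    return True


turtles: List[Dict[str, Any]] = [
    {"name": "Leonardo", "color": "blue", "tool": "sword"},
    {"name": "Donatello", "color": "purple", "tool": "staff"},
    {"name": "Michelangelo", "color": "orange", "tool": "nunchuks"},
    {"name": "Raphael", "color": "red", "tool": "sai"},
]

blue = [t["name"] for t in turtles if matches(t, {"color": "blue"})]
pole_arms = [t["name"] for t in turtles if matches(t, {"tool": {"$in": ["sai", "staff"]}})]
not_red = [t["name"] for t in turtles if matches(t, {"color": {"$ne": "red"}})]
everything = [t["name"] for t in turtles if matches(t, {})]

print(blue, pole_arms, not_red, everything)
```

An empty criteria dictionary matches every document, which mirrors the convention that a query with no filter selects the whole collection.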
