DocArray

0.41.0 · active · verified Fri Apr 17

DocArray is a Python library that provides data structures for multimodal data. It is designed to work efficiently with unstructured data such as text, images, and audio, and is often used in machine learning and vector database contexts. The current version is 0.41.0; patch and minor releases are frequent, typically on a monthly to bi-monthly cadence.

Quickstart

Define custom document schemas by subclassing `BaseDoc` and annotating fields with type hints (including `NdArray` for numerical arrays and embeddings), then instantiate single documents directly or group them into typed collections with `DocList`.

from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np

# 1. Define your custom document schema using BaseDoc
class MyDocument(BaseDoc):
    text: str
    image_embedding: NdArray[128] # Define an embedding field with fixed dimensions

# 2. Create a single document instance
doc = MyDocument(text='hello world', image_embedding=np.random.rand(128))
print(f"Created document with text: {doc.text}")

# 3. Create a collection of documents using DocList
docs = DocList[MyDocument]([
    MyDocument(text='document one', image_embedding=np.random.rand(128)),
    MyDocument(text='document two', image_embedding=np.random.rand(128)),
])
print(f"DocList contains {len(docs)} documents.")

# 4. Access individual documents and their fields
print(f"First document's text: {docs[0].text}")
print(f"Second document's embedding shape: {docs[1].image_embedding.shape}")
