DocArray
DocArray is a Python library that provides a data structure for multimodal data. It is designed to work efficiently with unstructured data such as text, images, and audio, and is commonly used in machine learning pipelines and alongside vector databases. The current version is 0.41.0; patch and minor releases are frequent, typically every one to two months.
Common errors
- ImportError: cannot import name 'Document' from 'docarray'
  cause: You are importing the legacy `Document` class, which is no longer part of the primary `docarray` namespace for new projects. It has been superseded by `BaseDoc`.
  fix: Replace `from docarray import Document` with `from docarray import BaseDoc` when defining your document schemas.
- AttributeError: 'BaseDoc' object has no attribute 'tags'
  cause: Features like `.tags` and `.chunks` were specific to the legacy `Document` class. `BaseDoc` objects are Pydantic models, so every field must be declared on the schema.
  fix: If you need a `tags` field, define it explicitly in your `BaseDoc` schema: `class MyDoc(BaseDoc): tags: List[str]` (with `from typing import List`).
- TypeError: Object of type DocList is not JSON serializable
  cause: A `DocList` instance was passed directly to `json.dumps()`, which does not know how to serialize it.
  fix: Use the built-in `to_json()` method of `DocList` to get a JSON string, then process it. Example: `json_string = my_doclist.to_json()`.
- pydantic.error_wrappers.ValidationError: 1 validation error for MyDocument
  cause: Validation of your `BaseDoc` model failed, usually because a field received a value of the wrong type or shape, e.g., a value that cannot be converted to the declared `NdArray` shape.
  fix: Check the detailed error message for the field that failed validation. Ensure data types and shapes match your `BaseDoc` schema definitions (e.g., `NdArray[128]` expects a NumPy array of shape (128,)).
Warnings
- breaking The `docarray.Document` class is a legacy API that has been deprecated since v0.30.0. Using it with newer DocArray features or for new projects will lead to missing functionality or errors. The current API uses `docarray.BaseDoc` for document definitions and `docarray.DocList` for collections.
- breaking The `to_json()` method for `DocList` and `DocVec` changed its return type from `bytes` to a JSON-formatted string (`str`) to ensure consistency across serialization methods. Code that relied on the previous return type must be updated.
- gotcha DocArray supports both Pydantic v1 and v2. However, if you upgrade your project's Pydantic dependency to v2, you may need to adapt your `BaseDoc` definitions to align with Pydantic v2's API changes (e.g., for `Field` usage, `default_factory`).
- gotcha A bug in `from_dataframe` when used with `numpy>=1.26.1` caused issues due to changes in NumPy's versioning semantics. This was patched in a subsequent release.
Install
- pip install docarray
- pip install 'docarray[full]'
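After installing, the installed version can be checked from Python using only the standard library. This is a small sketch; `installed_version` is a helper name made up for the example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


docarray_version = installed_version('docarray')
if docarray_version is None:
    print('docarray is not installed; run: pip install docarray')
else:
    print(f'docarray {docarray_version} is installed')
```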
Imports
- BaseDoc
wrong: from docarray import Document
right: from docarray import BaseDoc
- DocList
from docarray import DocList
- DocVec
from docarray import DocVec
- NdArray
from docarray.typing import NdArray
Quickstart
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np

# 1. Define your custom document schema using BaseDoc
class MyDocument(BaseDoc):
    text: str
    image_embedding: NdArray[128]  # Define an embedding field with fixed dimensions

# 2. Create a single document instance
doc = MyDocument(text='hello world', image_embedding=np.random.rand(128))
print(f"Created document with text: {doc.text}")

# 3. Create a collection of documents using DocList
docs = DocList[MyDocument]([
    MyDocument(text='document one', image_embedding=np.random.rand(128)),
    MyDocument(text='document two', image_embedding=np.random.rand(128)),
])
print(f"DocList contains {len(docs)} documents.")

# 4. Access individual documents and their fields
print(f"First document's text: {docs[0].text}")
print(f"Second document's embedding shape: {docs[1].image_embedding.shape}")