bioc - Processing BioC, Brat, and PubTator with Python
bioc is a Python library for processing and manipulating data in BioC XML/JSON, Brat standoff, and PubTator formats. It provides an API for reading, writing, and working with these common bioinformatics text-mining annotation formats. As of version 2.1, releases focus on supporting current Python versions and format specifications.
Common errors
- AttributeError: module 'bioc' has no attribute 'dump'
  cause: Attempting to use `bioc.dump` or `bioc.load` directly in `bioc` versions 2.x or later.
  fix: For XML operations, use `from bioc import biocxml` and then call `biocxml.dump()` or `biocxml.load()`.
- ModuleNotFoundError: No module named 'biocxml'
  cause: Trying to import `biocxml` directly as a top-level package or without `from bioc`.
  fix: The `biocxml` module is part of the `bioc` package. Use `from bioc import biocxml`.
- SyntaxError: invalid syntax (when running on Python 2.x)
  cause: Running `bioc` code (especially versions 1.2.1+) with a Python 2 interpreter.
  fix: Upgrade your Python environment to Python 3.6 or higher. `bioc` no longer supports Python 2.
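The fix for the first error can be sketched as a small file round trip. This is a minimal sketch, not the library's documented recipe: `save_and_reload` is a hypothetical helper, and it assumes `biocxml.dump(collection, fp)` and `biocxml.load(fp)` accept a text-mode file object.

```python
def save_and_reload(collection, path):
    """Write a collection to `path` as BioC XML, then read it back.

    Hypothetical helper for illustration; assumes bioc 2.x is installed.
    """
    # Imported inside the function so that defining it does not require bioc.
    from bioc import biocxml

    # Assumption: dump() writes text, so the file is opened in text mode.
    with open(path, 'w', encoding='utf-8') as fp:
        biocxml.dump(collection, fp)

    # Assumption: load() reads from a text-mode file object.
    with open(path, 'r', encoding='utf-8') as fp:
        return biocxml.load(fp)
```

A call such as `save_and_reload(collection, 'example.xml')`, where `example.xml` is a placeholder path, would replace code that previously called `bioc.dump` and `bioc.load`.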
Warnings
- breaking Direct top-level import of BioC XML functions (e.g., `bioc.dump`, `bioc.load`, `bioc.dumps`, `bioc.loads`) was removed in version 2.0. These functions are now part of the `biocxml` submodule.
- breaking Python 2.x is no longer supported. Version 1.2.1 removed support for Python 2, and subsequent versions are Python 3.6+ only.
- gotcha The PyPI project metadata still lists the 'Development Status' as '1 - Planning' (as of v2.1). This is misleading as the library has undergone multiple releases and is actively maintained for production use.
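Because the 2.0 change alters which calls are available, it can help to check the installed version before choosing a code path. The following is a rough sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper names `uses_submodule_api` and `installed_bioc_version` are hypothetical and not part of `bioc`.

```python
from importlib import metadata  # standard library, Python 3.8+


def uses_submodule_api(version):
    """Return True when a bioc version string is 2.0 or later."""
    major = int(version.split('.')[0])
    return major >= 2


def installed_bioc_version(dist_name='bioc'):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_bioc_version()
if version is None:
    print('bioc is not installed')
elif uses_submodule_api(version):
    print(f'bioc {version}: use biocxml.dump / biocxml.load')
else:
    print(f'bioc {version}: predates the 2.0 API change')
```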
Install
- pip install bioc
Imports
- biocxml
import bioc
from bioc import biocxml
- brat
from bioc import brat
- pubtator
from bioc import pubtator
- BioCCollection
from bioc import BioCCollection
from bioc.bioc import BioCCollection
Quickstart
import bioc
from bioc import biocxml
# Create a simple BioC Collection
collection = bioc.BioCCollection()
collection.date = '2023-01-01'
collection.source = 'Example'
document = bioc.BioCDocument()
document.id = '123'
passage = bioc.BioCPassage()
passage.offset = 0
passage.text = 'This is a test sentence.'
annotation = bioc.BioCAnnotation()
annotation.id = 'T1'
annotation.text = 'test sentence'
annotation.add_location(bioc.BioCLocation(offset=10, length=13))
passage.add_annotation(annotation)
document.add_passage(passage)
collection.add_document(document)
# Serialize to a BioC XML string
xml_string = biocxml.dumps(collection, pretty_print=True)
print('--- BioC XML ---')
print(xml_string)
# Deserialize from a BioC XML string
loaded_collection = biocxml.loads(xml_string)
print('\n--- Loaded Document IDs ---')
for doc in loaded_collection.documents:
    print(doc.id)
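The `offset=10, length=13` pair in the Quickstart is hard-coded. As a minimal sketch in plain Python (no `bioc` calls), such a pair could be derived from the passage text and checked before building the location. `locate` is a hypothetical helper, and it assumes offsets count Python string characters and that a location offset is the passage offset plus the position within the passage.

```python
def locate(passage_text, mention, passage_offset=0):
    """Return (offset, length) for the first occurrence of `mention`.

    Hypothetical helper: assumes character-based offsets measured from
    the same origin as the passage offset.
    """
    start = passage_text.find(mention)
    if start < 0:
        raise ValueError(f'{mention!r} not found in passage text')
    return passage_offset + start, len(mention)


passage_text = 'This is a test sentence.'
offset, length = locate(passage_text, 'test sentence')

# The slice recovered from the span should equal the annotation text.
assert passage_text[offset:offset + length] == 'test sentence'
print(offset, length)  # prints: 10 13
```

Checking the slice against the annotation text catches off-by-one mistakes before the values are passed to `bioc.BioCLocation`.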