bioc - Processing BioC, Brat, and PubTator with Python

2.1 · active · verified Thu Apr 16

bioc is a Python library for reading, writing, and manipulating data in BioC XML/JSON, Brat standoff, and PubTator formats, which are common annotation formats in biomedical text mining. The current version is 2.1, and releases focus on supporting recent Python versions and format specifications.

Quickstart

This quickstart demonstrates how to create a basic BioC collection programmatically, add a document with a passage and an annotation, and then serialize it to a BioC XML string using `biocxml.dumps`. It also shows how to deserialize an XML string back into a BioC collection using `biocxml.loads`.

import bioc
from bioc import biocxml

# Create a simple BioC Collection
collection = bioc.BioCCollection()
collection.date = '2023-01-01'
collection.source = 'Example'

document = bioc.BioCDocument()
document.id = '123'

passage = bioc.BioCPassage()
passage.offset = 0
passage.text = 'This is a test sentence.'

# Annotate "test sentence": it starts at character 10 and is 13 characters long
annotation = bioc.BioCAnnotation()
annotation.id = 'T1'
annotation.text = 'test sentence'
annotation.add_location(bioc.BioCLocation(offset=10, length=13))
passage.add_annotation(annotation)

document.add_passage(passage)
collection.add_document(document)

# Serialize to a BioC XML string
xml_string = biocxml.dumps(collection, pretty_print=True)
print('--- BioC XML ---')
print(xml_string)

# Deserialize from a BioC XML string
loaded_collection = biocxml.loads(xml_string)
print('\n--- Loaded document IDs ---')
for doc in loaded_collection.documents:
    print(doc.id)
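
The annotation in the quickstart only makes sense if its offset and length really select the annotated text inside the passage. The sketch below checks that with plain Python values. `annotation_span_matches` is a hypothetical helper, not part of bioc, and it assumes the usual BioC convention that a location offset is counted from the start of the document, so the passage's own offset is subtracted before slicing.

```python
def annotation_span_matches(passage_offset: int, passage_text: str,
                            ann_offset: int, ann_length: int,
                            ann_text: str) -> bool:
    """Return True if the span selected by (ann_offset, ann_length) equals ann_text.

    Assumption: ann_offset is relative to the start of the document, while
    passage_text starts at passage_offset within that document.
    """
    start = ann_offset - passage_offset
    end = start + ann_length
    # Reject spans that fall outside the passage text
    if start < 0 or end > len(passage_text):
        return False
    return passage_text[start:end] == ann_text


text = 'This is a test sentence.'

# Same values as the quickstart: passage at offset 0, annotation at 10 with length 13
ok = annotation_span_matches(0, text, 10, 13, 'test sentence')

# A passage that starts at document offset 100 shifts the annotation offset too
shifted = annotation_span_matches(100, text, 110, 13, 'test sentence')

# An off-by-one offset no longer selects the annotated text
wrong = annotation_span_matches(0, text, 9, 13, 'test sentence')

print(ok, shifted, wrong)
```

With bioc objects, the same check could be fed from `passage.offset`, `passage.text`, `annotation.text`, and the offset and length of each location on the annotation.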
