CoNLL-U Parser
The `conllu` library (version 6.0.0) is a Python parser for the CoNLL-U format. It converts CoNLL-U formatted strings into nested Python data: a list of sentences, each made up of dictionary-like tokens. CoNLL-U is frequently used as an output format for natural language processing tasks. The library is actively maintained with a moderate release cadence and has no external dependencies.
Warnings
- breaking Version 5.0 (and newer, including 6.0.0) requires Python 3.8 or higher. Projects running on Python 3.6 or 3.7 must upgrade their Python version or pin `conllu` to a version older than 5.0.
- gotcha In version 3.0, the field names `xpostag` and `upostag` were changed to `xpos` and `upos` respectively, to align with Universal Dependencies 2.0. While `conllu` provides aliasing for backward compatibility, it's recommended to update code to use `xpos` and `upos` for clarity and future compatibility.
- breaking Updating from very old versions (e.g., 0.1 to 1.0) involved significant breaking changes to the API. Users migrating from such old versions should consult the release notes for a comprehensive upgrade guide.
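The version and naming rules in the warnings above can be sketched as two small helpers. Both `conllu_requirement` and `modernize_field_names` are hypothetical names used for illustration and are not part of `conllu`; they encode only what the warnings state (5.0 and newer need Python 3.8+, and `upostag`/`xpostag` became `upos`/`xpos`).

```python
import sys

# Hypothetical helpers for illustration; neither is part of conllu.

# Pre-3.0 field names mapped to their current equivalents.
LEGACY_FIELD_NAMES = {"upostag": "upos", "xpostag": "xpos"}


def conllu_requirement(python_version=None):
    """Return a pip requirement string for a (major, minor) Python version."""
    if python_version is None:
        python_version = sys.version_info[:2]
    # Per the first warning: conllu 5.0 and newer need Python 3.8 or higher.
    if tuple(python_version) >= (3, 8):
        return "conllu"
    return "conllu<5.0"


def modernize_field_names(fields):
    """Replace legacy field names with current ones, keeping the order."""
    return [LEGACY_FIELD_NAMES.get(name, name) for name in fields]


print(conllu_requirement((3, 7)))                        # conllu<5.0
print(modernize_field_names(["id", "form", "upostag"]))  # ['id', 'form', 'upos']
```

A helper like the second one may be useful when your own code keeps lists of field names or dictionary keys written against an older release.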
Install
pip install conllu
Imports
- parse
from conllu import parse
- parse_tree
from conllu import parse_tree
Quickstart
from conllu import parse

# CoNLL-U columns are tab-separated; the \t escapes below make the tabs explicit.
data = """
# text = The quick brown fox jumps over the lazy dog.
1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t4\tdet\t_\t_
2\tquick\tquick\tADJ\tJJ\tDegree=Pos\t4\tamod\t_\t_
3\tbrown\tbrown\tADJ\tJJ\tDegree=Pos\t4\tamod\t_\t_
4\tfox\tfox\tNOUN\tNN\tNumber=Sing\t5\tnsubj\t_\t_
5\tjumps\tjump\tVERB\tVBZ\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_
6\tover\tover\tADP\tIN\t_\t9\tcase\t_\t_
7\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t9\tdet\t_\t_
8\tlazy\tlazy\tADJ\tJJ\tDegree=Pos\t9\tamod\t_\t_
9\tdog\tdog\tNOUN\tNN\tNumber=Sing\t5\tnmod\t_\tSpaceAfter=No
10\t.\t.\tPUNCT\t.\t_\t5\tpunct\t_\t_
"""

sentences = parse(data)

# Accessing tokens and metadata
sentence = sentences[0]
print(f"Sentence text: {sentence.metadata.get('text')}")
for token in sentence:
    print(f"ID: {token['id']}, Form: {token['form']}, UPos: {token['upos']}")