Python tools for Universal Dependencies
udtools (version 0.2.7) provides a suite of Python tools for working with Universal Dependencies (UD) data. It can read, write, query, and transform CoNLL-U files, and it integrates with UDPipe. The library is actively maintained on an irregular release schedule and is aimed at linguistic research and the processing of dependency-parsed text.
Common errors
- ModuleNotFoundError: No module named 'udtools.CoNLLUDocument'
  cause: Importing a class as though it were a module of the top-level `udtools` package when it actually resides in a submodule (e.g., `udtools.conllu`).
  fix: Import the class from its submodule, e.g., `from udtools.conllu import CoNLLUDocument`.
- FileNotFoundError: [Errno 2] No such file or directory: 'path/to/my_file.conllu'
  cause: The path passed to `CoNLLUDocument.from_file()` does not point to an existing CoNLL-U file.
  fix: Verify that the path is correct and that the file exists. Relative paths are resolved against the process's current working directory; a file that exists but cannot be read raises `PermissionError` instead.
- udtools.conllu.CoNLLUError: Invalid CoNLL-U format: Line X does not conform to the specification.
  cause: The CoNLL-U file being parsed has a syntax error, a missing field, or incorrect formatting on the reported line, in violation of the CoNLL-U standard.
  fix: Examine line X (and the surrounding lines) in the offending file. Correct any format errors, such as a wrong number of tab-separated fields, invalid character encoding, or malformed ID or column values.
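When tracking down a `CoNLLUError`, it can help to scan the file for the most common format problem before parsing it. The following is a minimal sketch that uses only the standard library, not `udtools`; the helper name `find_malformed_lines` is hypothetical. It checks only that each token line has exactly 10 non-empty tab-separated fields, which is one requirement of the CoNLL-U format and not a full validation.

```python
def find_malformed_lines(text):
    """Return (line_number, reason) pairs for token lines with a bad field layout."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Blank lines separate sentences; lines starting with '#' are comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            problems.append((number, f"expected 10 fields, found {len(fields)}"))
        elif any(field == "" for field in fields):
            problems.append((number, "empty field (use '_' for unspecified values)"))
    return problems


sample = (
    "# text = Hi.\n"
    "1\tHi\thi\tINTJ\tUH\t_\t0\troot\t_\tSpaceAfter=No\n"
    "2 . . PUNCT . _ 1 punct _ _\n"  # spaces instead of tabs
    "\n"
)
print(find_malformed_lines(sample))  # [(3, 'expected 10 fields, found 1')]
```

Space-separated columns, as on the third line of the sample, are a frequent result of copying CoNLL-U text through an editor or web page that converts tabs.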
Warnings
- gotcha As a pre-1.0 library (version 0.x.x), `udtools` may not strictly follow semantic versioning: minor releases can introduce breaking changes or significantly alter existing functionality.
- gotcha Processing malformed CoNLL-U files can lead to `udtools.conllu.CoNLLUError` exceptions or silent data corruption. The library expects strict adherence to the CoNLL-U format.
- gotcha Loading very large CoNLL-U documents entirely into memory using `CoNLLUDocument.from_file()` or `CoNLLUDocument.from_string()` can consume significant system RAM, potentially leading to `MemoryError`.
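One way to limit memory use with very large files is to read one sentence at a time instead of loading the whole document. The sketch below uses only the standard library and relies on the CoNLL-U convention that sentences are separated by blank lines; `iter_sentence_blocks` is a hypothetical helper, not part of `udtools`.

```python
import io


def iter_sentence_blocks(lines):
    """Yield one sentence at a time as a list of raw lines (comments and token lines)."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            # A blank line closes the current sentence.
            yield block
            block = []
    if block:  # input did not end with a blank line
        yield block


# Any iterable of lines works. For a real file, pass the open handle:
#     with open("path/to/my_file.conllu", encoding="utf-8") as handle:
#         for block in iter_sentence_blocks(handle): ...
demo = io.StringIO(
    "# sent_id = 1\n"
    "1\tHi\thi\tINTJ\tUH\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tGo\tgo\tVERB\tVB\t_\t0\troot\t_\tSpaceAfter=No\n"
    "2\t.\t.\tPUNCT\t.\t_\t1\tpunct\t_\t_\n"
    "\n"
)
sizes = [len(block) for block in iter_sentence_blocks(demo)]
print(sizes)  # [2, 3]
```

Each block could then be joined with newlines and handed to `CoNLLUDocument.from_string()`, assuming that method accepts a single-sentence string; only one sentence is held in memory at a time.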
Install
- pip install udtools
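Because the library is pre-1.0, it can be useful to confirm which version is actually installed before relying on a particular API. This is a minimal sketch using the standard library's `importlib.metadata`; the helper names `installed_version` and `is_pre_1_0` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_pre_1_0(version_string):
    """Return True when the major version component is 0 (e.g. '0.2.7')."""
    return version_string.split(".")[0] == "0"


found = installed_version("udtools")
if found is None:
    print("udtools is not installed")
elif is_pre_1_0(found):
    print(f"udtools {found} is pre-1.0; consider pinning this exact version")
else:
    print(f"udtools {found}")
```

Pinning the exact version in a requirements file (for example `udtools==0.2.7`) keeps an environment reproducible across minor releases.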
Imports
- CoNLLUDocument
  wrong: import udtools.CoNLLUDocument
  correct: from udtools.conllu import CoNLLUDocument
- collapse_compounds
  wrong: from udtools import collapse_compounds
  correct: from udtools.transform import collapse_compounds
- Sentence
  wrong: from udtools import Sentence
  correct: from udtools.conllu import Sentence
Quickstart
from udtools.conllu import CoNLLUDocument, Sentence, Token
from udtools.transform import collapse_compounds
# Create a sample CoNLL-U document from a string
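# CoNLL-U columns must be tab-separated, so tabs are written as \t escapes.
# The sentence ends with a blank line, as the format requires.
conllu_string = """\
# sent_id = 1
# text = This is an example.
1\tThis\tthis\tPRON\tDT\tNumber=Sing|PronType=Dem\t4\tnsubj\t_\t_
2\tis\tbe\tAUX\tVBZ\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t4\tcop\t_\t_
3\tan\ta\tDET\tDT\tDefinite=Ind|PronType=Art\t4\tdet\t_\t_
4\texample\texample\tNOUN\tNN\tNumber=Sing\t0\troot\t_\tSpaceAfter=No
5\t.\t.\tPUNCT\t.\t_\t4\tpunct\t_\t_

"""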
doc = CoNLLUDocument.from_string(conllu_string)
print("Original document:")
print(doc.to_string())
# Example transformation (collapse_compounds might not change this simple example)
collapsed_doc = collapse_compounds(doc)
print("\nDocument after collapse_compounds (no change for this simple example):")
print(collapsed_doc.to_string())
# Demonstrate adding a new sentence
new_sentence = Sentence()
new_sentence.tokens.append(Token(id="1", form="Hello", lemma="hello", upos="INTJ"))
new_sentence.tokens.append(Token(id="2", form=".", lemma=".", upos="PUNCT"))
doc.sentences.append(new_sentence)
print("\nDocument with a new sentence added:")
print(doc.to_string())