PyVi

0.1.1 verified Mon Apr 27 auth: no python maintenance

PyVi is a Python toolkit for Vietnamese language processing that provides tokenization (word segmentation) and part-of-speech tagging. The current version is 0.1.1, and the project appears to be in maintenance mode, with no recent releases.

pip install pyvi
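
To confirm that the install landed in the interpreter you are actually running, a minimal check using only the standard library might look like the sketch below. The helper name is_importable is illustrative and not part of PyVi.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the current interpreter can locate the top-level module."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # False here usually means pip installed into a different environment.
    print("pyvi importable:", is_importable("pyvi"))
```

If this prints False right after a successful pip install, running "python -m pip install pyvi" with the same interpreter is a common way to rule out an environment mismatch.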
error ModuleNotFoundError: No module named 'pyvi'
cause PyVi is not installed in the active Python environment.
fix
Run: pip install pyvi
error AttributeError: module 'pyvi' has no attribute 'ViTokenizer'
cause Incorrect import statement, typically a bare 'import pyvi', which does not load the ViTokenizer submodule.
fix
Use: from pyvi import ViTokenizer
error FileNotFoundError: [Errno 2] No such file or directory: 'pyvi/ViTokenizer/data/tokenized_data.pkl'
cause Package data files missing; likely a broken installation.
fix
Reinstall pyvi: pip install --force-reinstall pyvi
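gotcha The tokenizer returns a single string in which spaces separate words and underscores join the syllables of a multi-syllable word (e.g., 'Học_sinh'). Downstream tools that expect plain text will treat 'Học_sinh' as one unfamiliar token unless the underscores are handled.
fix Pass the tokenizer output on unchanged when the next step expects segmented text; otherwise replace underscores with spaces first.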
deprecated The package has not been updated since 2020. Models may be outdated compared to newer approaches.
fix Consider evaluating alternatives such as underthesea or VnCoreNLP if accuracy is critical.
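
The underscore handling described in the gotcha above can be sketched in plain Python. The helper names split_words and to_plain_text are hypothetical, and the sample string is only an illustration of the tokenizer's output format, not a verified PyVi result.

```python
from typing import List


def split_words(tokenized: str) -> List[str]:
    """Split tokenizer-style output into words, restoring spaces inside each word."""
    # Spaces separate words; underscores join the syllables of one word.
    return [word.replace("_", " ") for word in tokenized.split()]


def to_plain_text(tokenized: str) -> str:
    """Undo the segmentation markup by turning every underscore back into a space."""
    return tokenized.replace("_", " ")


sample = "Học_sinh học_sinh học"  # illustrative string in the tokenizer's output format
print(split_words(sample))    # ['Học sinh', 'học sinh', 'học']
print(to_plain_text(sample))  # Học sinh học sinh học
```

Note that this sketch cannot tell a syllable-joining underscore from one that was already in the input text, so inputs containing literal underscores (identifiers, usernames) need separate handling.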

Tokenize Vietnamese text and perform POS tagging.

from pyvi import ViTokenizer, ViPosTagger

text = 'Học sinh học sinh học'
tokens = ViTokenizer.tokenize(text)
print(tokens)
# Illustrative output (the exact segmentation depends on the bundled model): Học_sinh học_sinh học

# POS tagging: postagging() expects the tokenized string and splits it on spaces itself
tags = ViPosTagger.postagging(tokens)
print(tags)
# Illustrative output, a (words, tags) pair: (['Học_sinh', 'học_sinh', 'học'], ['N', 'N', 'V'])
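
The tagger result shown above is a pair of parallel lists, which is awkward to iterate over directly. A minimal sketch of turning it into (word, tag) pairs follows; it assumes the (words, tags) shape shown in the output above, and pair_tags is a hypothetical helper, not a PyVi function.

```python
from typing import List, Sequence, Tuple


def pair_tags(tagged: Tuple[Sequence[str], Sequence[str]]) -> List[Tuple[str, str]]:
    """Zip a (words, tags) result into a list of (word, tag) pairs."""
    words, tags = tagged
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    return list(zip(words, tags))


# Illustrative data in the shape shown above, not a verified PyVi result.
example = (["Học_sinh", "học_sinh", "học"], ["N", "N", "V"])
for word, tag in pair_tags(example):
    print(f"{word}\t{tag}")
```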