PyVi
0.1.1 verified Mon Apr 27 auth: no python maintenance
PyVi is a Python toolkit for Vietnamese language processing, providing tokenization (word segmentation), part-of-speech tagging, and named entity recognition. Currently at version 0.1.1, the project appears to be in maintenance mode with no recent updates.
pip install pyvi
Common errors
error ModuleNotFoundError: No module named 'pyvi'
cause PyVi is not installed.
fix Run: pip install pyvi
error AttributeError: module 'pyvi' has no attribute 'ViTokenizer'
cause Incorrect import statement.
fix Use: from pyvi import ViTokenizer
error FileNotFoundError: [Errno 2] No such file or directory: 'pyvi/ViTokenizer/data/tokenized_data.pkl'
cause Package data files are missing, most likely because of a broken installation.
fix Reinstall pyvi: pip install --force-reinstall pyvi
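The first and third errors both depend on whether Python can locate the package at all. A minimal sketch of a pre-flight check, using only the standard library; `is_installed` is a hypothetical helper name, not part of PyVi. It only confirms that the top-level package can be found, so it will not detect missing data files inside an otherwise installed package.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located, without importing it."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(package_name) is not None


if is_installed("pyvi"):
    print("pyvi found")
else:
    print("pyvi not found; run: pip install pyvi")
```

If the check passes but the FileNotFoundError still appears, the forced reinstall above is the next step.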
Warnings
gotcha The tokenizer joins the syllables of a multi-syllable word with underscores (e.g., 'Học_sinh'). Splitting the output on whitespace keeps each word intact, but downstream tools that do not expect underscores need them handled first.
fix Use the tokenizer output directly, or replace underscores with spaces where the downstream task requires plain text.
deprecated The package has not been updated since 2020. Its models may be outdated compared to newer approaches.
fix Consider alternatives such as underthesea or VnCoreNLP, which may offer better accuracy.
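The underscore handling can be sketched with plain string operations. The helper names below are hypothetical and not part of PyVi, and the sample string is hard-coded from the Quickstart rather than produced by the tokenizer, so the snippet runs without the package installed.

```python
from typing import List


def split_words(tokenized: str) -> List[str]:
    """Split tokenizer output into words, keeping multi-syllable words intact."""
    return tokenized.split()


def syllables(word: str) -> List[str]:
    """Break one underscore-joined word back into its syllables."""
    return word.split("_")


def to_plain_text(tokenized: str) -> str:
    """Undo the segmentation markers by turning underscores into spaces.

    Caveat: this also rewrites any underscore that was present in the
    original input text.
    """
    return tokenized.replace("_", " ")


sample = "Học_sinh học_sinh học"  # sample tokenizer output, hard-coded
print(split_words(sample))
print(syllables("Học_sinh"))
print(to_plain_text(sample))
```

Keeping the underscores is usually preferable when the word boundaries matter downstream; `to_plain_text` is only for consumers that expect unsegmented text.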
Imports
- Tokenizer
wrong from pyvi import tokenize
correct from pyvi import ViTokenizer
- POSTagger
wrong from pyvi import pos_tag
correct from pyvi import ViPosTagger
Quickstart
from pyvi import ViTokenizer, ViPosTagger

text = 'Học sinh học sinh học'

# Word segmentation: syllables of one word are joined with underscores
tokens = ViTokenizer.tokenize(text)
print(tokens)
# Example output: Học_sinh học_sinh học

# POS tagging: postagging() takes the tokenized string, not a list of words
tags = ViPosTagger.postagging(tokens)
print(tags)
# Example output: (['Học_sinh', 'học_sinh', 'học'], ['N', 'N', 'V'])
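The tagging result shown in the Quickstart is a pair of parallel lists, words first and tags second. A minimal sketch of turning that into (word, tag) pairs follows; `pair_words_with_tags` is a hypothetical helper, and the sample tuple is copied from the example output above rather than computed, so it runs without PyVi installed.

```python
from typing import List, Sequence, Tuple


def pair_words_with_tags(
    result: Tuple[Sequence[str], Sequence[str]]
) -> List[Tuple[str, str]]:
    """Zip a (words, tags) result into a list of (word, tag) pairs."""
    words, tags = result
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    return list(zip(words, tags))


# Sample result, hard-coded from the Quickstart's example output.
sample_result = (['Học_sinh', 'học_sinh', 'học'], ['N', 'N', 'V'])

pairs = pair_words_with_tags(sample_result)
print(pairs)

# Filter by tag, e.g. keep only the words tagged 'N' in the sample.
nouns = [word for word, tag in pairs if tag == 'N']
print(nouns)
```

Pairs are often easier to filter or pass to downstream code than two separate lists, which is the only reason for the conversion here.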