soynlp
An unsupervised Korean natural language processing toolkit for tokenization, stemming, part-of-speech tagging, and noun extraction. The current PyPI version is 0.0.493. Development has stalled since 2020; the repository is archived and no longer maintained as of version 0.1.1 (which was never published to PyPI).
pip install soynlp

Common errors
error ModuleNotFoundError: No module named 'soynlp.words' ↓
cause Incorrect import path; the correct module is 'soynlp.word' (singular).
fix
Use 'from soynlp.word import WordExtractor' instead.
error TypeError: 'WordExtractor' object is not iterable ↓
cause Passing the WordExtractor object directly as a tokenizer score dictionary; expected a dict.
fix
Extract scores: scores = {word:score.cohesion_forward for word, score in word_extractor.extract().items()}
error ValueError: No corpus was trained ↓
cause Calling extract() without calling train() first on a corpus.
fix
Call word_extractor.train(corpus) before word_extractor.extract()
Warnings
deprecated The repository is archived on GitHub (last release 0.1.1, not on PyPI). PyPI version 0.0.493 is several years old and will not receive updates. ↓
fix Consider migrating to modern Korean NLP libraries such as Kiwi (kiwipiepy), KoNLPy, or Hugging Face tokenizers.
breaking In some versions, import paths changed. Using 'soynlp.words' (with an 's') will fail; use 'soynlp.word' instead. ↓
fix Replace 'from soynlp.words import WordExtractor' with 'from soynlp.word import WordExtractor'.
gotcha The LTokenizer score dictionary must be a dict mapping each word to a float score (e.g., cohesion_forward). Passing raw WordExtractor output will cause a TypeError. ↓
fix Extract scores as shown in the quickstart: scores = {word:score.cohesion_forward for word, score in scores.items()}.
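The score-dictionary pattern can be checked without a trained corpus by standing in for the per-word score objects that WordExtractor.extract() returns (the cohesion_forward attribute is real; the sample words and values below are invented):

```python
from collections import namedtuple

# Stand-in for the per-word score objects returned by extract();
# the real objects also expose cohesion_forward as an attribute.
Scores = namedtuple("Scores", ["cohesion_forward"])

# Hypothetical extract() output: word -> score object (values invented).
raw = {"한국어": Scores(0.73), "자연어": Scores(0.61)}

# LTokenizer wants a plain dict of word -> float, not the score objects.
scores = {word: score.cohesion_forward for word, score in raw.items()}
print(scores)  # {'한국어': 0.73, '자연어': 0.61}
```

The same comprehension works on the real extract() output, since only the cohesion_forward attribute is read.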
gotcha DoublespaceLineCorpus expects one document per line, with sentences separated by double spaces, not arbitrary raw text. Feeding unformatted text lines will produce garbage corpus sentences. ↓
fix Ensure each line of the input file holds one document whose sentences are separated by two spaces, or preprocess accordingly.
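A minimal, library-free preprocessing sketch that produces the expected layout (one document per line, sentences joined by two spaces); the sample documents are placeholders:

```python
# Each document is a list of sentences; join sentences with TWO spaces,
# one document per line, matching the DoublespaceLineCorpus format.
documents = [
    ["첫 번째 문장입니다.", "두 번째 문장입니다."],
    ["다른 문서의 문장입니다."],
]

lines = ["  ".join(sentences) for sentences in documents]
corpus_text = "\n".join(lines)
print(corpus_text)
```

Write corpus_text to a file (e.g., the dataset.txt used in the quickstart) before constructing the corpus.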
Imports
- soynlp.noun.LRNounExtractor_v2
from soynlp.noun import LRNounExtractor_v2
- soynlp.tokenizer.MaxScoreTokenizer
from soynlp.tokenizer import MaxScoreTokenizer
- soynlp.normalizer
from soynlp.normalizer import repeat_normalize
- soynlp.word.WordExtractor
from soynlp.word import WordExtractor
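soynlp's normalization helpers (listed in Imports above) are typically used to collapse runs of repeated characters in noisy text. A stdlib regex approximation of that idea, not soynlp's actual implementation:

```python
import re

def collapse_repeats(text: str, num_repeats: int = 2) -> str:
    # Collapse any character repeated more than num_repeats times down
    # to num_repeats occurrences (a rough stand-in for soynlp-style
    # repeat normalization of strings like 'ㅋㅋㅋㅋㅋ').
    pattern = re.compile(r"(.)\1{" + str(num_repeats) + r",}")
    return pattern.sub(lambda m: m.group(1) * num_repeats, text)

print(collapse_repeats("ㅋㅋㅋㅋㅋㅋ 너무 웃겨요!!!!!"))  # ㅋㅋ 너무 웃겨요!!
```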
Quickstart
from soynlp import DoublespaceLineCorpus
from soynlp.word import WordExtractor
from soynlp.tokenizer import LTokenizer
corpus = DoublespaceLineCorpus('dataset.txt', iter_sent=True)
word_extractor = WordExtractor(min_frequency=5)
word_extractor.train(corpus)
scores = word_extractor.extract()
scores = {word:score.cohesion_forward for word, score in scores.items()}
tokenizer = LTokenizer(scores=scores)
text = '한국어 자연어 처리'
print(tokenizer.tokenize(text))
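The LTokenizer used above picks, for each space-separated unit, the left substring (L-part) with the highest score and treats the remainder as the R-part. A stdlib sketch of that idea, with invented toy scores (not soynlp's actual code, which also handles ties, thresholds, and unflattened [L, R] output):

```python
def l_tokenize(text, scores):
    # For each whitespace-separated unit, choose the prefix (L-part)
    # with the highest score; the rest of the unit becomes the R-part.
    tokens = []
    for word in text.split():
        best_l, best_score = word, scores.get(word, 0.0)
        for i in range(1, len(word)):
            prefix = word[:i]
            if scores.get(prefix, 0.0) > best_score:
                best_l, best_score = prefix, scores.get(prefix, 0.0)
        tokens.append(best_l)
        if len(best_l) < len(word):
            tokens.append(word[len(best_l):])
    return tokens

# Toy scores (invented): '한국' outranks the longer '한국어'.
toy = {"한국": 0.9, "자연어": 0.7}
print(l_tokenize("한국어 자연어처리", toy))  # ['한국', '어', '자연어', '처리']
```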