soynlp

0.0.493 verified Mon Apr 27 auth: no python deprecated

An unsupervised Korean natural language processing toolkit for tokenization, stemming, part-of-speech tagging, and noun extraction. The current version on PyPI is 0.0.493. Development has stalled since 2020: the GitHub repository is archived and no longer maintained, and its last release, 0.1.1, is not on PyPI.

pip install soynlp
error ModuleNotFoundError: No module named 'soynlp.words'
cause Incorrect import path; the correct module is 'soynlp.word' (singular).
fix
Use 'from soynlp.word import WordExtractor' instead.
error TypeError: 'WordExtractor' object is not iterable
cause Passing the WordExtractor object directly as a tokenizer score dictionary; expected a dict.
fix
Extract scores: scores = {word:score.cohesion_forward for word, score in word_extractor.extract().items()}
error ValueError: No corpus was trained
cause Calling extract() without calling train() first on a corpus.
fix
Call word_extractor.train(corpus) before word_extractor.extract()
deprecated The repository is archived on GitHub (last release 0.1.1, not on PyPI). PyPI version 0.0.493 is several years old and will not receive updates.
fix Consider migrating to modern Korean NLP libraries such as Kiwi (kiwipiepy), KoNLPy, or Hugging Face tokenizers.
breaking In some versions, import paths changed. Using 'soynlp.words' (with an 's') will fail; use 'soynlp.word' instead.
fix Replace 'from soynlp.words import WordExtractor' with 'from soynlp.word import WordExtractor'.
gotcha The LTokenizer score dictionary must be a dict mapping each word to a float score (e.g., cohesion_forward). Passing the raw WordExtractor output instead will cause a TypeError.
fix Extract scores as shown in quickstart: scores = {word:score.cohesion_forward for word, score in scores.items()}.
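gotcha DoublespaceLineCorpus expects one document per line, with the sentences inside a document separated by two spaces; it does not split arbitrary text for you. With iter_sent=True, a raw text line that contains no double spaces is yielded as a single long sentence, and stray double spaces are treated as sentence boundaries.
fix Ensure each line of the input file holds one document whose sentences are separated by two spaces, or preprocess the text accordingly.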
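
A corpus file in that layout can be produced from already-split sentences with plain Python. This is a minimal sketch, assuming the layout described in soynlp's README for DoublespaceLineCorpus (one document per line, sentences inside a document joined by two spaces). The helper name write_doublespace_corpus and the sample documents are hypothetical and not part of soynlp.

```python
import os
import tempfile


def write_doublespace_corpus(documents, path):
    """Write one document per line, joining its sentences with two spaces.

    Hypothetical helper for illustration; not part of soynlp.
    `documents` is an iterable of lists of sentence strings.
    """
    with open(path, "w", encoding="utf-8") as f:
        for sentences in documents:
            # Collapse whitespace runs inside each sentence so that a stray
            # double space is not later read as a sentence boundary.
            cleaned = [" ".join(s.split()) for s in sentences]
            cleaned = [s for s in cleaned if s]  # drop empty sentences
            if cleaned:
                f.write("  ".join(cleaned) + "\n")


# Made-up sample data: two documents, the first with two sentences.
documents = [
    ["한국어 자연어 처리", "비지도  학습 기반 토크나이저"],
    ["두 번째 문서의 첫 문장"],
]
corpus_path = os.path.join(tempfile.mkdtemp(), "dataset.txt")
write_doublespace_corpus(documents, corpus_path)

with open(corpus_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

The resulting file path can then be passed to DoublespaceLineCorpus in place of 'dataset.txt'.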

Basic unsupervised tokenization using word extraction scores.

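from soynlp import DoublespaceLineCorpus
from soynlp.word import WordExtractor
from soynlp.tokenizer import LTokenizer

# One document per line; iter_sent=True yields each double-space-separated sentence.
corpus = DoublespaceLineCorpus('dataset.txt', iter_sent=True)

# Learn substring statistics from the corpus, then extract per-word scores.
# min_frequency drops substrings seen fewer than this many times.
word_extractor = WordExtractor(min_frequency=5)
word_extractor.train(corpus)
scores = word_extractor.extract()

# LTokenizer needs a plain {word: float} dict, so keep only the forward cohesion score.
scores = {word:score.cohesion_forward for word, score in scores.items()}
tokenizer = LTokenizer(scores=scores)

text = '한국어 자연어 처리'
print(tokenizer.tokenize(text))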
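
To make the role of the score dictionary concrete, here is a simplified, standalone sketch of the L-part selection idea behind LTokenizer: each space-delimited word is split into a left part (L) and a remainder (R) by picking the prefix with the highest score. It is an illustration only, not soynlp's implementation, and the real class may behave differently in edge cases. The function name l_tokenize_sketch and the score values are made up.

```python
def l_tokenize_sketch(sentence, scores):
    """Split each word into its best-scoring prefix (L) and the remainder (R).

    Simplified illustration; not soynlp's LTokenizer.
    `scores` maps candidate words to float scores.
    """
    tokens = []
    for word in sentence.split():
        # Every non-empty prefix is a candidate L part; unseen prefixes score 0.
        candidates = [
            (scores.get(word[:end], 0.0), end)
            for end in range(1, len(word) + 1)
        ]
        # Highest score wins; ties go to the longer prefix, so a word with
        # no scored prefix is kept whole.
        _, best_end = max(candidates)
        tokens.append(word[:best_end])
        if best_end < len(word):
            tokens.append(word[best_end:])
    return tokens


# Made-up scores standing in for cohesion_forward values.
toy_scores = {"자연": 0.3, "자연어": 0.8, "처리": 0.7}
result = l_tokenize_sketch("자연어를 처리한다", toy_scores)
print(result)
```

Because the split depends entirely on the scores, a dictionary built from a small or unrepresentative corpus leaves most words unsplit.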