soynlp

0.0.493 verified Mon Apr 27 auth: no python deprecated

soynlp is an unsupervised Korean natural language processing toolkit for tokenization, stemming, part-of-speech tagging, and noun extraction. The latest version on PyPI is 0.0.493. Development has stalled since 2020: the GitHub repository is archived, and its last release, 0.1.1, is not on PyPI.

pip install soynlp

Common errors

error ModuleNotFoundError: No module named 'soynlp.words'

cause Incorrect import path; the correct module is 'soynlp.word' (singular).

fix

Use 'from soynlp.word import WordExtractor' instead.

error TypeError: 'WordExtractor' object is not iterable

cause Passing the WordExtractor object directly as a tokenizer score dictionary; expected a dict.

fix

Extract scores: scores = {word:score.cohesion_forward for word, score in word_extractor.extract().items()}
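
The conversion above can be wrapped in a small helper. This is a minimal sketch, not part of soynlp: `to_score_dict` is a hypothetical name, and `Scores` is a stand-in namedtuple that models only the `cohesion_forward` field used in this document, so the example runs without a trained extractor. The numbers are made up for illustration.

```python
from collections import namedtuple

# Stand-in for the per-word score object returned by word_extractor.extract().
# Only the cohesion_forward field referenced in this document is modelled.
Scores = namedtuple("Scores", ["cohesion_forward"])


def to_score_dict(word_scores, attr="cohesion_forward"):
    """Map each word to one float taken from its score object."""
    return {word: float(getattr(score, attr)) for word, score in word_scores.items()}


# Hypothetical extractor output; real values come from a trained WordExtractor.
raw = {"한국어": Scores(0.42), "자연어": Scores(0.31)}
scores = to_score_dict(raw)
print(scores)
```

The resulting plain dict of word to float is the shape the tokenizer's `scores` argument expects, as opposed to the extractor object itself.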

error ValueError: No corpus was trained

cause Calling extract() without calling train() first on a corpus.

fix

Call word_extractor.train(corpus) before word_extractor.extract()
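
One way to make the ordering hard to get wrong is to keep both calls in a single function. The sketch below is an illustration only: `train_and_extract` is a hypothetical helper, and it relies on nothing beyond the `train()` and `extract()` methods named above. It also rejects an empty corpus up front, on the assumption that training on nothing is never intended.

```python
def train_and_extract(extractor, sentences):
    """Train the extractor on the sentences, then return its extracted scores."""
    # Materialise the corpus so emptiness can be checked before training.
    # For very large corpora this costs memory; drop the list() call if needed.
    sentences = list(sentences)
    if not sentences:
        raise ValueError("empty corpus: nothing to train on")
    extractor.train(sentences)   # must run before extract()
    return extractor.extract()
```

With soynlp installed, a call would look like `train_and_extract(WordExtractor(), corpus)`.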

Warnings

deprecated The repository is archived on GitHub (last release 0.1.1, not on PyPI). PyPI version 0.0.493 is several years old and will not receive updates.

fix Consider migrating to modern Korean NLP libraries such as Kiwi (kiwipiepy), KoNLPy, or Hugging Face tokenizers.

breaking In some versions, import paths changed. Using 'soynlp.words' (with an 's') will fail; use 'soynlp.word' instead.

fix Replace 'from soynlp.words import WordExtractor' with 'from soynlp.word import WordExtractor'.

gotcha The LTokenizer score dictionary must be a dict mapping each word to a float score (e.g., cohesion_forward). Passing raw WordExtractor output will cause a TypeError.

fix Extract scores as shown in quickstart: scores = {word:score.cohesion_forward for word, score in scores.items()}.

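gotcha DoublespaceLineCorpus expects one document per line, with the sentences inside a document separated by two spaces. It does not split arbitrary text into sentences, so with iter_sent=True a raw text line without double spaces comes back as a single long sentence.

fix Ensure each line of the input file holds one document whose sentences are separated by two spaces, or preprocess the text into that layout.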

Imports

soynlp.noun.LRNounExtractor_v2
```
from soynlp.noun import LRNounExtractor_v2
```
Correct import for noun extraction
soynlp.tokenizer.MaxScoreTokenizer
```
from soynlp.tokenizer import MaxScoreTokenizer
```
Correct import for tokenization
soynlp.normalize
```
from soynlp import normalize
```
Correct import for text normalization (repeat char, emoticon, etc.)
soynlp.word.WordExtractor
```
from soynlp.word import WordExtractor
```
Correct import for word extraction (note: 'word' not 'words')

Quickstart

Basic unsupervised tokenization using word extraction scores.

from soynlp import DoublespaceLineCorpus
from soynlp.word import WordExtractor
from soynlp.tokenizer import LTokenizer

corpus = DoublespaceLineCorpus('dataset.txt', iter_sent=True)
word_extractor = WordExtractor(min_frequency=5)
word_extractor.train(corpus)
scores = word_extractor.extract()
scores = {word:score.cohesion_forward for word, score in scores.items()}
tokenizer = LTokenizer(scores=scores)
text = '한국어 자연어 처리'
print(tokenizer.tokenize(text))