Kiwi, the Korean Tokenizer for Python
Kiwipiepy is a fast and accurate Korean morphological analyzer (tokenizer) for Python, wrapping the high-performance C++ library Kiwi. It supports various features like part-of-speech tagging, named entity recognition, dialect analysis, and typo correction. The library is actively maintained with frequent updates, often aligning with the core Kiwi library's releases.
Common errors
- ModuleNotFoundError: No module named 'kiwipiepy'
  Cause: The kiwipiepy library is not installed in the current Python environment.
  Fix: Run `pip install kiwipiepy` to install the library.
- TypeError: Kiwi.__init__() got an unexpected keyword argument 'oov_handling'
  Cause: `oov_handling` was passed to the `Kiwi` constructor in version 0.23.0 or later.
  Fix: Move the `oov_handling` argument to the `kiwi.tokenize()` method: `kiwi.tokenize(text, oov_handling='new_strategy')`.
- TypeError: Kiwi.__init__() got an unexpected keyword argument 'typos'
  Cause: Typo correction options (`typos`, `match_typo_with_stem`, etc.) were passed to the `Kiwi` constructor in version 0.23.0 or later.
  Fix: Move typo correction options to the `kiwi.tokenize()` method: `kiwi.tokenize(text, typos=True, match_typo_with_stem=True)`.
- ValueError: invalid model_type 'knlm'
  Cause: An outdated or unrecognized `model_type` was used when initializing `Kiwi`.
  Fix: Remove the `model_type` argument to use the default, or use a currently supported model type like `model_type='sbg'` or `model_type='ngram'`.
- segmentation fault (core dumped)
  Cause: Many segfaults tied to specific inputs, pretokenized spans, or typo correction were fixed in later releases (e.g., v0.20.1, v0.20.4, v0.22.0), but some edge cases may still trigger one, often involving complex inputs or concurrent dictionary modifications.
  Fix: Upgrade to the latest `kiwipiepy` version. If the issue persists, simplify the input, avoid concurrent dictionary modifications, or report the specific input that causes the crash to the library maintainers.
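Several of the errors above depend on whether kiwipiepy is installed at all and, if so, which version. The following is a minimal diagnostic sketch using only the standard library. The helper names (`installed_version`, `version_tuple`, `options_belong_on_tokenize`) are hypothetical, and the 0.23.0 threshold is taken from the error descriptions above rather than verified independently.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Parse up to three leading numeric components, e.g. '0.23.0rc1' -> (0, 23, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def options_belong_on_tokenize(version):
    """True if, per the notes above, oov_handling/typo options go on tokenize()."""
    # Assumption: 0.23.0 is the cutoff, as stated in the error descriptions.
    return version_tuple(version) >= (0, 23, 0)


version = installed_version("kiwipiepy")
if version is None:
    print("kiwipiepy is not installed; run: pip install kiwipiepy")
elif options_belong_on_tokenize(version):
    print(f"kiwipiepy {version}: pass oov_handling and typo options to tokenize()")
else:
    print(f"kiwipiepy {version}: older than 0.23.0")
```

Running this before changing any call sites shows which of the entries above can apply to the current environment.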
Warnings
- breaking The `oov_handling` parameter has moved from the `Kiwi` constructor to the `tokenize()` method and now supports new strategies. Code that still passes `oov_handling` to `Kiwi()` raises a `TypeError`.
- breaking Typo correction options such as `typos` and `match_typo_with_stem` have moved from the `Kiwi` constructor to the `tokenize()` method. Passing them during initialization raises a `TypeError`.
- deprecated The older, smaller `knlm` and `sbg` model types are no longer the default. Explicitly specifying `model_type='knlm'` or `model_type='sbg'` may produce warnings or unexpected behavior.
- gotcha While v0.22.0 improved multithread safety for `Kiwi` objects, concurrent modifications to user dictionaries or other internal states shared across threads using a single `Kiwi` instance can still lead to unexpected behavior or race conditions. Creating a `Kiwi` instance per thread is generally safer for heavy concurrent use cases.
- gotcha In earlier versions, operations like `Kiwi.join()` could potentially fail or lead to incorrect results if the `Kiwi` instance or its associated `MorphemeSet` was modified or deleted after tokenization, due to lingering references.
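The per-thread advice in the multithreading warning above can be sketched with `threading.local`. This is one possible arrangement, not a pattern prescribed by the library: `PerThread` and `tokenize_all` are hypothetical names, and each worker thread lazily builds its own `Kiwi` instance so no analyzer state is shared between threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PerThread:
    """Lazily create one object per thread from a zero-argument factory."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        # Each thread sees its own attribute namespace on self._local.
        obj = getattr(self._local, "obj", None)
        if obj is None:
            obj = self._factory()
            self._local.obj = obj
        return obj


def tokenize_all(texts, max_workers=4):
    """Tokenize texts in a thread pool, one Kiwi instance per worker thread."""
    # Imported lazily so PerThread stays usable without kiwipiepy installed.
    from kiwipiepy import Kiwi

    kiwis = PerThread(Kiwi)

    def work(text):
        return [(token.form, token.tag) for token in kiwis.get().tokenize(text)]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(work, texts))
```

Calling `tokenize_all(["안녕하세요", "한국어 형태소 분석기"])` returns one list of `(form, tag)` pairs per input. The trade-off is memory and start-up time: every worker loads its own model, so keep `max_workers` small.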
Install
- pip install kiwipiepy
Imports
- Kiwi
from kiwipiepy import Kiwi
Quickstart
from kiwipiepy import Kiwi

# Initialize the Kiwi tokenizer
kiwi = Kiwi()

# Analyze a Korean sentence
text = "안녕하세요 한국어 형태소 분석기 키위입니다."
result = kiwi.tokenize(text)

# Print the analysis result
for token in result:
    print(f"Token: {token.form}, Tag: {token.tag}, Start: {token.start}, Len: {token.len}")

# Example with additional options (e.g., split complex words)
text_complex = "그녀는책을읽었다"
result_complex = kiwi.tokenize(text_complex, split_complex=True)
print("\nComplex word analysis:")
for token in result_complex:
    print(f"Token: {token.form}, Tag: {token.tag}")
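The `start` and `len` values printed in the quickstart can be used to map a token back to the span of input it covers. The sketch below assumes they are character offsets into the original string. `FakeToken` and `surface` are hypothetical names, and the token values are hand-written for illustration, not actual Kiwi output.

```python
from collections import namedtuple

# Stand-in carrying only the fields used here; real tokens come from kiwi.tokenize().
FakeToken = namedtuple("FakeToken", "form tag start len")


def surface(text, token):
    """Return the slice of `text` covered by `token` (assumes character offsets)."""
    return text[token.start:token.start + token.len]


text = "한국어 형태소"
# Hand-written offsets for illustration only.
tokens = [
    FakeToken("한국어", "NNG", 0, 3),
    FakeToken("형태소", "NNG", 4, 3),
]

for token in tokens:
    print(token.form, token.tag, repr(surface(text, token)))
```

The surface slice can differ from `token.form` when the analyzer normalizes a morpheme, which is why the offsets are useful for highlighting or aligning results with the original text.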