PyThaiNLP
PyThaiNLP is a Python library for natural language processing (NLP) of the Thai language. It provides standard NLP functions such as word and sentence segmentation, part-of-speech tagging, and transliteration, along with various utilities. The library is actively maintained: 5.3.4 is the current stable release, minor updates to the 5.x series are still being published, and a major 6.0 release is expected to introduce breaking changes.
Warnings
- breaking The upcoming PyThaiNLP 6.0 release is expected to introduce breaking changes. Both PyThaiNLP 5.x and the upcoming 6.x require Python 3.9 or later.
- deprecated Environment variables `PYTHAINLP_DATA_DIR` and `PYTHAINLP_READ_MODE` are deprecated. Use `PYTHAINLP_DATA` to specify the data directory and `PYTHAINLP_READ_ONLY` for read-only mode. Setting a deprecated variable together with its replacement raises a `ValueError`.
- gotcha PyThaiNLP lazy-loads word lists and other resources. This can result in a "cold start" delay during the first function call, especially for tokenizers. Subsequent calls will perform at full speed.
- gotcha Installing optional dependencies like `PyICU` (for the `icu` extra) on Windows can be challenging. It may require finding pre-built wheel packages or setting the `ICU_VERSION` environment variable for a source build. Additionally, `python-crfsuite` (a dependency for some features) has known build issues with Python 3.10+.
- gotcha When using PyThaiNLP in distributed computing environments (e.g., Apache Spark), the `PYTHAINLP_DATA` environment variable must be set *inside* the function that will be distributed to worker nodes, not in the driver program. The default data directory (`~/pythainlp-data`) may not be writable on executor nodes, leading to `PermissionError`.
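The distributed-computing warning above can be sketched as follows. This is a minimal illustration, not the library's prescribed setup: the helper names `configure_data_dir` and `tokenize_partition` are hypothetical, and using the system temp directory as a writable location is an assumption that may not suit every cluster. The point is only that the environment variable is set, and PyThaiNLP imported, inside the function that runs on the worker.

```python
import os
import tempfile


def configure_data_dir(path=None):
    """Point PYTHAINLP_DATA at a writable directory and return that directory.

    Hypothetical helper: the temp-dir default is an assumption, chosen because
    the home directory may not be writable on executor nodes.
    """
    data_dir = path or os.path.join(tempfile.gettempdir(), "pythainlp-data")
    os.makedirs(data_dir, exist_ok=True)
    os.environ["PYTHAINLP_DATA"] = data_dir
    return data_dir


def tokenize_partition(rows):
    """Tokenize an iterable of Thai strings; meant to run on a worker node."""
    # Set the variable inside the distributed function, not in the driver.
    configure_data_dir()
    # Import only after the variable is set, so the worker process sees it.
    from pythainlp.tokenize import word_tokenize

    for text in rows:
        yield word_tokenize(text)
```

With PySpark, a function shaped like this would typically be passed to something like `rdd.mapPartitions(tokenize_partition)`, so the setup runs once per partition rather than once per row.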
Install
- pip install pythainlp
- pip install "pythainlp[compact]"
- pip install "pythainlp[full]"
Imports
- word_tokenize
from pythainlp.tokenize import word_tokenize
- sent_tokenize
from pythainlp.tokenize import sent_tokenize
- isthai
from pythainlp.util import isthai
Quickstart
from pythainlp.tokenize import sent_tokenize, word_tokenize
text = "ฉันรักภาษาไทย"
tokens = word_tokenize(text)
print(tokens)
# Output example: ['ฉัน', 'รัก', 'ภาษาไทย']
sentences = sent_tokenize("สวัสดีครับ. สบายดีไหมครับ?")
print(sentences)
# Output example: ['สวัสดีครับ.', 'สบายดีไหมครับ?']
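Because resources are lazy-loaded, the first tokenizer call can be noticeably slower than later ones. One way to keep that delay out of request handling is to make a throwaway call at startup. The sketch below assumes nothing about PyThaiNLP internals; `timed_call`, `warm_up`, and `demo` are hypothetical helper names, and the sample strings are arbitrary.

```python
import time


def timed_call(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def warm_up(tokenize, sample="สวัสดี"):
    """Run one throwaway call so lazy-loaded resources are in memory.

    Returns the time that first call took, in seconds.
    """
    _, elapsed = timed_call(tokenize, sample)
    return elapsed


def demo():
    """Compare a cold and a warm call; requires pythainlp to be installed."""
    from pythainlp.tokenize import word_tokenize

    cold = warm_up(word_tokenize)
    _, warm = timed_call(word_tokenize, "ฉันรักภาษาไทย")
    print(f"first call: {cold:.4f}s, later call: {warm:.4f}s")
```

Calling `demo()` in an environment with PyThaiNLP installed prints both timings; the actual numbers depend on the engine and the machine.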