{"id":4217,"library":"pythainlp","title":"PyThaiNLP","description":"PyThaiNLP is a Python library for natural language processing (NLP) of the Thai language. It provides standard NLP functions such as word and sentence segmentation, part-of-speech tagging, and transliteration, along with various utilities. The library is actively maintained: version 5.3.4 is the current stable release, minor updates to the 5.x series are still being published, and a major 6.0 release is expected to introduce breaking changes.","status":"active","version":"5.3.4","language":"en","source_language":"en","source_url":"https://github.com/PyThaiNLP/pythainlp","tags":["nlp","thai","tokenization","sentiment-analysis","machine-translation","linguistic-analysis"],"install":[{"cmd":"pip install pythainlp","lang":"bash","label":"Base Installation"},{"cmd":"pip install \"pythainlp[compact]\"","lang":"bash","label":"Recommended: Stable and small subset of dependencies"},{"cmd":"pip install \"pythainlp[full]\"","lang":"bash","label":"Install all optional dependencies (may introduce conflicts)"}],"dependencies":[{"reason":"Required for some functionality on Windows systems.","package":"tzdata","optional":true},{"reason":"Required for 'icu' extra; installation on Windows can be complex (needs pre-built wheels or manual build).","package":"PyICU","optional":true}],"imports":[{"symbol":"word_tokenize","correct":"from pythainlp.tokenize import word_tokenize"},{"symbol":"sent_tokenize","correct":"from pythainlp.tokenize import sent_tokenize"},{"note":"Moved from `pythainlp.util` to `pythainlp.morpheme` in PyThaiNLP 5.0.","wrong":"from pythainlp.util import is_native_thai","symbol":"is_native_thai","correct":"from pythainlp.morpheme import is_native_thai"}],"quickstart":{"code":"from pythainlp.tokenize import sent_tokenize, word_tokenize\n\n# Split Thai text into words with the default engine.\ntext = \"ฉันรักภาษาไทย\"\ntokens = word_tokenize(text)\nprint(tokens)\n# Output example: ['ฉัน', 'รัก', 'ภาษาไทย']\n\n# Split text into sentences.\nsentences = sent_tokenize(\"สวัสดีครับ. 
สบายดีไหมครับ?\")\nprint(sentences)\n# Output example: ['สวัสดีครับ.', 'สบายดีไหมครับ?']","lang":"python","description":"This quickstart demonstrates basic word and sentence tokenization using PyThaiNLP's default engines. Other tokenization engines can be selected with the `engine` parameter (e.g., `engine=\"icu\"`, which needs the optional PyICU dependency)."},"warnings":[{"fix":"Review the migration guide for PyThaiNLP 6.0 once it is released, and make sure your environment uses Python 3.9 or newer.","message":"The upcoming PyThaiNLP 6.0 release is expected to introduce breaking changes. PyThaiNLP 5.x and the upcoming 6.x require Python 3.9 or newer.","severity":"breaking","affected_versions":"5.x (upcoming 6.0)"},{"fix":"Update environment variable usage: `PYTHAINLP_DATA` instead of `PYTHAINLP_DATA_DIR`, and `PYTHAINLP_READ_ONLY` instead of `PYTHAINLP_READ_MODE`.","message":"Environment variables `PYTHAINLP_DATA_DIR` and `PYTHAINLP_READ_MODE` are deprecated. Use `PYTHAINLP_DATA` to specify the data directory and `PYTHAINLP_READ_ONLY` for read-only mode. Setting a deprecated variable together with its replacement raises a `ValueError`.","severity":"deprecated","affected_versions":"4.x+"},{"fix":"Account for this initial delay in performance-sensitive applications. If consistent, immediate response times are critical, pre-load the required models or data at startup.","message":"PyThaiNLP lazy-loads word lists and other resources. This can cause a \"cold start\" delay on the first function call, especially for tokenizers. Subsequent calls run at full speed.","severity":"gotcha","affected_versions":"5.x+"},{"fix":"For PyICU on Windows, check `https://www.lfd.uci.edu/~gohlke/pythonlibs/` for pre-built wheels. 
For `python-crfsuite` on Python 3.10+, see PyThaiNLP's FAQ for workarounds, or use a Python version or installation method that the dependency supports.","message":"Installing optional dependencies such as `PyICU` (for the `icu` extra) on Windows can be challenging. It may require finding pre-built wheel packages or setting the `ICU_VERSION` environment variable for a source build. Additionally, `python-crfsuite` (a dependency for some features) has known build issues with Python 3.10+.","severity":"gotcha","affected_versions":"All versions, specifically on Windows or Python 3.10+"},{"fix":"Set `PYTHAINLP_DATA` to a writable local directory (e.g., `./pythainlp-data`) within the distributed function on each worker node before any data access.","message":"When using PyThaiNLP in distributed computing environments (e.g., Apache Spark), the `PYTHAINLP_DATA` environment variable must be set *inside* the function that is distributed to worker nodes, not in the driver program. The default data directory (`~/pythainlp-data`) may not be writable on executor nodes, which leads to a `PermissionError`.","severity":"gotcha","affected_versions":"All versions in distributed environments"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}