{"id":6459,"library":"spacy-pkuseg","title":"spaCy Chinese Word Segmentation (pkuseg)","description":"spacy-pkuseg is a Chinese word segmentation toolkit forked from pkuseg-python and packaged for use with spaCy. spaCy's Chinese tokenizer uses it when configured with the `pkuseg` segmenter, and it can also be used directly through the `spacy_pkuseg.pkuseg` class. The current stable version is 1.0.1; recent releases focus mainly on compatibility with newer Python versions and core dependencies such as NumPy.","status":"active","version":"1.0.1","language":"en","source_language":"en","source_url":"https://github.com/explosion/spacy-pkuseg","tags":["spaCy","NLP","Chinese","segmentation","pkuseg"],"install":[{"cmd":"pip install spacy-pkuseg","lang":"bash","label":"Install spacy-pkuseg"}],"dependencies":[{"reason":"Needed to use the segmenter through spaCy's Chinese tokenizer.","package":"spacy"},{"reason":"Required for numerical operations; specific versions can cause breakage.","package":"numpy"}],"imports":[{"note":"The fork installs as the `spacy_pkuseg` module; `pkuseg` is the module name of the original `pkuseg-python` package.","wrong":"from pkuseg import pkuseg","symbol":"pkuseg","correct":"from spacy_pkuseg import pkuseg"}],"quickstart":{"code":"from spacy.lang.zh import Chinese\n\n# Build a blank Chinese pipeline whose tokenizer uses the pkuseg segmenter.\n# Requires both spacy and spacy-pkuseg to be installed.\ncfg = {\"segmenter\": \"pkuseg\"}\nnlp = Chinese.from_config({\"nlp\": {\"tokenizer\": cfg}})\n\n# Load a segmentation model; spacy-pkuseg's default is 'spacy_ontonotes'\nnlp.tokenizer.initialize(pkuseg_model=\"spacy_ontonotes\")\n\n# To specify a different model or user dictionary:\n# nlp.tokenizer.initialize(\n#     pkuseg_model=\"web\",\n#     pkuseg_user_dict=\"path/to/your_dict.txt\"\n# )\n\ntext = \"北京大学地球与空间科学学院\"\ndoc = nlp(text)\n\nprint(f\"Original text: {text}\")\nprint(f\"Tokens: {[token.text for token in doc]}\")","lang":"python","description":"This quickstart builds a blank Chinese spaCy pipeline whose tokenizer uses the pkuseg segmenter, loads a segmentation model, and processes Chinese text to 
get word-segmented tokens. Ensure spaCy is installed first."},"warnings":[{"fix":"If using spacy-pkuseg v1.0.0 or later, ensure NumPy is v2.0 or higher. If using an older version of spacy-pkuseg, pin NumPy to `<2.0` (e.g., `pip install 'numpy<2.0'`).","message":"NumPy 2.0 compatibility breakage: spacy-pkuseg v1.0.0 and later require NumPy>=2.0. Earlier versions (<1.0.0) are incompatible with NumPy 2.0 due to binary interface changes.","severity":"breaking","affected_versions":"<1.0.0"},{"fix":"Update import statements from `pkuseg` to `spacy_pkuseg` (e.g., `import spacy_pkuseg as pkuseg`). If you relied on a different default model, specify it explicitly (e.g., `pkuseg.pkuseg(model_name='web')`). Custom user dictionaries saved with older versions (before the fix in v0.0.30) might need to be re-created.","message":"Fork and renaming from `pkuseg-python`: The `spacy-pkuseg` package (from v0.0.26) is a fork. The import path changed from `pkuseg` to `spacy_pkuseg`, the default model changed, and serialization for custom user dictionaries switched from `pickle` to `msgpack` (fixed for custom dictionaries in v0.0.30).","severity":"breaking","affected_versions":"Users migrating from `pkuseg-python` or `spacy-pkuseg<0.0.26`"},{"fix":"Configure segmentation on the tokenizer rather than with `nlp.add_pipe`: create the pipeline from a config containing `{\"nlp\": {\"tokenizer\": {\"segmenter\": \"pkuseg\"}}}`, then call `nlp.tokenizer.initialize(pkuseg_model=\"spacy_ontonotes\")` before processing text.","message":"Incorrect pipeline integration: pkuseg segmentation runs inside spaCy's Chinese tokenizer, not as a pipeline component. Tokenization happens before any pipeline component runs, so the segmenter cannot be added with `nlp.add_pipe`. By default, spaCy's Chinese tokenizer segments by character, so pkuseg must be selected explicitly.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If the default `spacy_ontonotes` model is not suitable, explicitly specify the desired model when initializing the tokenizer: `nlp.tokenizer.initialize(pkuseg_model=\"web\")`.","message":"Default model and explicit selection: `spacy-pkuseg` defaults to the `spacy_ontonotes` model if not specified. 
Users expecting a different model (e.g., 'web', 'news') might not get desired results without explicit configuration.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z"}