spaCy Chinese Word Segmentation (pkuseg)
spacy-pkuseg is a Chinese word segmentation toolkit for spaCy, forked from pkuseg-python. It provides the `pkuseg` segmenter used by spaCy's Chinese (`zh`) tokenizer, integrating robust Chinese segmentation directly into spaCy's NLP pipeline. The current stable version is 1.0.1, with recent releases primarily focused on compatibility with newer Python and core-dependency (e.g., NumPy) releases.
Warnings
- breaking NumPy 2.0 compatibility: spacy-pkuseg v1.0.0 and later require NumPy >= 2.0. Earlier versions (< 1.0.0) are incompatible with NumPy 2.0 due to binary interface (ABI) changes.
- breaking Fork and rename of `pkuseg-python`: the package `spacy-pkuseg` (from v0.0.26) is a fork. The import path changed from `pkuseg` to `spacy_pkuseg`, the default model changed, and serialization of custom user dictionaries switched from `pickle` to `msgpack` (fixed for custom dicts in v0.0.30).
- gotcha Tokenizer, not a pipeline component: spacy-pkuseg replaces spaCy's default Chinese word segmenter rather than running as an added pipe. Attaching it after another tokenizer has already run leads to unexpected tokenization or errors; configure it through the `zh` tokenizer's `segmenter` setting instead.
- gotcha Default model and explicit selection: `spacy-pkuseg` defaults to the `spacy_ontonotes` model if not specified. Users expecting a different model (e.g., 'web', 'news') might not get desired results without explicit configuration.
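The effect of a user dictionary can be pictured with a toy greedy longest-match pass. This is an illustration only: pkuseg's actual segmenter is a statistical model, and `segment_with_user_dict` below is a hypothetical helper, not part of the library.

```python
def segment_with_user_dict(text, user_dict):
    """Toy greedy forward maximum matching over a user dictionary.

    At each position, prefer the longest dictionary entry; characters
    not covered by any entry are emitted one at a time.
    """
    max_len = max((len(w) for w in user_dict), default=1)
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in user_dict:
                tokens.append(text[i:i + length])
                i += length
                break
        else:
            # No dictionary hit: emit a single character
            tokens.append(text[i])
            i += 1
    return tokens

# Dictionary entries are kept together; everything else falls apart
print(segment_with_user_dict("北京大学地球与空间科学学院", {"北京大学", "空间科学"}))
```

A real user dictionary plays a similar role: it biases the segmenter toward keeping listed multi-character terms intact.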
Install
-
pip install spacy-pkuseg
Imports
- pkuseg
import spacy_pkuseg as pkuseg
Quickstart
import spacy

# Configure spaCy's Chinese tokenizer to use the pkuseg segmenter
cfg = {"nlp": {"tokenizer": {"segmenter": "pkuseg"}}}
nlp = spacy.blank("zh", config=cfg)

# Load the default model ('spacy_ontonotes'); downloaded on first use
nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes")

# To use a different model or a user dictionary:
# nlp.tokenizer.initialize(pkuseg_model="web",
#                          pkuseg_user_dict="path/to/your_dict.txt")

text = "北京大学地球与空间科学学院"
doc = nlp(text)
print(f"Original text: {text}")
print(f"Tokens: {[token.text for token in doc]}")