spaCy Chinese Word Segmentation (pkuseg)

1.0.1 · active · verified Wed Apr 15

spacy-pkuseg is a Chinese word segmentation toolkit for spaCy, forked from pkuseg-python. It serves as the `pkuseg` segmenter backend for spaCy's Chinese tokenizer, integrating robust Chinese word segmentation directly into spaCy's NLP pipeline. The current stable version is 1.0.1; recent releases have focused primarily on compatibility with newer Python and core-dependency (e.g. NumPy) versions.

Warnings

Install
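The package is published on PyPI as spacy-pkuseg and installs with pip:

```shell
pip install spacy-pkuseg
```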

Imports

Quickstart

This quickstart demonstrates how to create a Chinese spaCy pipeline that uses pkuseg for word segmentation, then process Chinese text to get word-segmented tokens. Make sure both spacy and spacy-pkuseg are installed first.

import spacy

# Create a blank Chinese pipeline whose tokenizer uses pkuseg for segmentation
cfg = {"nlp": {"tokenizer": {"segmenter": "pkuseg"}}}
nlp = spacy.blank("zh", config=cfg)

# Initialize the tokenizer with the default pretrained model,
# "spacy_ontonotes" (downloaded on first use)
nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes")

# To use a different pretrained model or a user dictionary instead:
# nlp.tokenizer.initialize(
#     pkuseg_model="web",
#     pkuseg_user_dict="path/to/your_dict.txt",
# )

text = "北京大学地球与空间科学学院"
doc = nlp(text)

print(f"Original text: {text}")
print(f"Tokens: {[token.text for token in doc]}")
