Lindera WASM with Jieba Dictionary (Bundler)

2.3.4 · active · verified Tue Apr 21

lindera-wasm-jieba-bundler is a specialized npm package that provides WebAssembly-based morphological analysis for Chinese text using the Jieba dictionary. It is part of the Lindera project, which compiles Rust to WebAssembly for high-performance text segmentation. The current stable series is `3.x`; the latest release, `3.0.5`, focuses on safety and refactoring.

This package is optimized for JavaScript bundler environments such as Webpack or Rollup, offering a compact, efficient solution for Chinese text processing on the client or, via bundlers, on the server. Lindera's key differentiators are its Rust-based performance, WASM portability across JavaScript runtimes (browser, Node.js via bundlers), and a modular design with separate packages per dictionary and target environment (web, nodejs, bundler). The release cadence is active, with several recent minor updates in the 3.0.x series.
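Because this package targets bundler environments, the build tool must know how to load WebAssembly modules. A minimal sketch of a Webpack 5 configuration with async WASM enabled is shown below; the file name `webpack.config.ts` and the entry path are assumptions, but `experiments.asyncWebAssembly` is the standard Webpack 5 flag for this.

```typescript
// webpack.config.ts — minimal sketch; assumes Webpack 5 (and ts-node, if you
// keep the config in TypeScript). The key line is the asyncWebAssembly
// experiment, which lets Webpack load the .wasm module this package ships.
import type { Configuration } from "webpack";

const config: Configuration = {
    entry: "./src/index.ts", // hypothetical entry point
    mode: "production",
    experiments: {
        // Required so Webpack 5 handles the bundled WebAssembly module.
        asyncWebAssembly: true,
    },
};

export default config;
```

Rollup users would reach for an equivalent WASM plugin instead; the point is the same: the bundler, not your application code, is responsible for wiring up the `.wasm` asset.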

Install

npm install lindera-wasm-jieba-bundler

Imports

import __wbg_init, { TokenizerBuilder, type Token } from 'lindera-wasm-jieba-bundler';

Quickstart

This example demonstrates how to initialize the Lindera WASM module, build a tokenizer with the embedded Jieba dictionary, and perform morphological analysis on Chinese text in a bundler environment.

import __wbg_init, { TokenizerBuilder, type Token } from 'lindera-wasm-jieba-bundler';

async function main() {
    // Initialize the WebAssembly module. This must be awaited.
    await __wbg_init();

    // Create a new TokenizerBuilder instance.
    const builder = new TokenizerBuilder();

    // Specify the Jieba dictionary to use.
    // 'embedded://jieba' refers to the dictionary bundled with this package.
    builder.setDictionary("embedded://jieba");

    // Set the tokenization mode. 'normal' is a common default.
    builder.setMode("normal");

    // Build the tokenizer.
    const tokenizer = builder.build();

    // Text to tokenize.
    const textToAnalyze = "上海东方明珠广播电视塔";

    // Perform tokenization.
    const tokens: Token[] = tokenizer.tokenize(textToAnalyze);

    console.log(`Tokens for: "${textToAnalyze}"`);
    tokens.forEach(token => {
        // Each token has a surface form and detailed information.
        console.log(`- ${token.surface}: ${token.details.join(" | ")}`);
    });

    // Example with another Chinese sentence
    const anotherText = "我爱北京天安门";
    const moreTokens: Token[] = tokenizer.tokenize(anotherText);
    console.log(`\nTokens for: "${anotherText}"`);
    moreTokens.forEach(token => {
        console.log(`- ${token.surface}: ${token.details.join(" | ")}`);
    });
}

main().catch(console.error);
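Once tokenization works, a common next step is post-processing the tokens in plain TypeScript. The helper below is a sketch that depends only on the `surface`/`details` shape used in the quickstart above, not on the wasm module itself; the `TokenLike` interface and the sample `details` values are illustrative assumptions.

```typescript
// A minimal token shape matching the fields the quickstart reads
// (surface and details); independent of the wasm module.
interface TokenLike {
    surface: string;
    details: string[];
}

// Join token surfaces with a separator, e.g. to produce a
// whitespace-segmented string for a search index.
function joinSurfaces(tokens: TokenLike[], sep = " "): string {
    return tokens.map(t => t.surface).join(sep);
}

// Hypothetical sample data; real detail strings come from the Jieba dictionary.
const sample: TokenLike[] = [
    { surface: "上海", details: ["ns"] },
    { surface: "东方明珠", details: ["nr"] },
];

console.log(joinSurfaces(sample)); // → "上海 东方明珠"
```

In a real application you would pass `tokenizer.tokenize(text)` straight into such a helper after the tokenizer has been built.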
