Lindera WASM with Jieba Dictionary (Bundler)
lindera-wasm-jieba-bundler is a specialized npm package that provides a WebAssembly-based morphological analysis library for Chinese text, using the Jieba dictionary. It is part of the Lindera project, which compiles Rust code to WebAssembly to offer high-performance text segmentation. The current stable series is `3.x`; the latest release, `3.0.5`, focuses on safety improvements and refactoring. This particular package is optimized for JavaScript bundler environments such as Webpack or Rollup, providing a compact and efficient solution for Chinese text processing on the client, or on the server via bundlers. Lindera's key differentiators are its Rust-based performance, WASM portability across JavaScript runtimes (browser, Node.js via bundlers), and its modular approach, with separate packages for different dictionaries and target environments (web, nodejs, bundler). The release cadence appears active, with several recent patch updates in the 3.0.x series.
Common errors
- TypeError: TokenizerBuilder is not a constructor
  - Cause: The WebAssembly module was not initialized correctly, or the `__wbg_init()` promise was not awaited, so `TokenizerBuilder` is not yet available.
  - Fix: Call `await __wbg_init();` before attempting to instantiate `TokenizerBuilder`.
- Error: Failed to fetch dictionary
  - Cause: The dictionary identifier passed to `setDictionary()` does not match the dictionary bundled with the package, or is misspelled.
  - Fix: Use `builder.setDictionary("embedded://jieba")` with the `lindera-wasm-jieba-bundler` package. Other dictionary names will not work.
- SyntaxError: require is not defined (for bundler or web packages)
  - Cause: Using CommonJS `require()` syntax with packages intended for ESM (`import`) or bundler environments, especially after `v3.0.0`, which consolidated on ESM.
  - Fix: Switch to ES module `import` statements (e.g., `import { TokenizerBuilder } from 'lindera-wasm-jieba-bundler';`) and ensure your build environment supports ESM.
Warnings
- breaking Version 3.0.0 introduced significant changes, including the removal of the direct Node.js WASM target and a renaming of npm packages. Users migrating from `v2.x` to `v3.x` should review the new package naming conventions (e.g., `-web`, `-nodejs`, `-bundler`) and adapt their imports accordingly.
- gotcha Confusing the `-web`, `-nodejs`, and `-bundler` packages for different environments can lead to runtime errors or suboptimal performance. Each package is optimized for its target environment.
- gotcha The `__wbg_init()` function, which initializes the WebAssembly module, must be called and awaited before any other Lindera WASM functionality can be used. Forgetting to await it will lead to runtime errors.
- gotcha Specifying the correct dictionary via `builder.setDictionary("embedded://<dictionary-name>")` is crucial. Using the wrong identifier (e.g., `"embedded://ipadic"` for a `jieba` package) will cause dictionary loading failures.
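The init-before-use requirement behind the first two gotchas can be illustrated with a generic sketch. This is a hypothetical stand-in, not the library's code: `initWasm` mimics `__wbg_init` and `MockTokenizer` mimics a class that only works after initialization.

```typescript
// Hypothetical sketch of the init-before-use pattern that __wbg_init enforces.
// initWasm and MockTokenizer are stand-ins, not the real lindera-wasm API.
let ready = false;

function initWasm(): Promise<void> {
  // Simulates asynchronous WebAssembly compilation/instantiation.
  return new Promise((resolve) => {
    ready = true;
    resolve();
  });
}

class MockTokenizer {
  constructor() {
    if (!ready) {
      // Mirrors the "TokenizerBuilder is not a constructor" failure mode.
      throw new Error("WASM module not initialized: await initWasm() first");
    }
  }
  tokenize(text: string): string[] {
    // Trivial per-character split, standing in for real segmentation.
    return text.split("");
  }
}

async function main() {
  await initWasm(); // must complete before constructing anything
  const tokenizer = new MockTokenizer(); // safe: module is ready
  console.log(tokenizer.tokenize("abc").length); // 3
}

main().catch(console.error);
```

The same ordering applies to the real package: every call into the WASM module must happen after the awaited initializer resolves.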
Install
- npm install lindera-wasm-jieba-bundler
- yarn add lindera-wasm-jieba-bundler
- pnpm add lindera-wasm-jieba-bundler
Imports
- __wbg_init
  `import __wbg_init from 'lindera-wasm-jieba-bundler';`
- TokenizerBuilder (a named export, not a default export)
  `import { TokenizerBuilder } from 'lindera-wasm-jieba-bundler';`
- Token
  `import type { Token } from 'lindera-wasm-jieba-bundler';`
Quickstart
```typescript
import __wbg_init, { TokenizerBuilder, type Token } from 'lindera-wasm-jieba-bundler';

async function main() {
  // Initialize the WebAssembly module. This must be awaited.
  await __wbg_init();

  // Create a new TokenizerBuilder instance.
  const builder = new TokenizerBuilder();

  // Specify the Jieba dictionary to use.
  // 'embedded://jieba' refers to the dictionary bundled with this package.
  builder.setDictionary("embedded://jieba");

  // Set the tokenization mode. 'normal' is a common default.
  builder.setMode("normal");

  // Build the tokenizer.
  const tokenizer = builder.build();

  // Text to tokenize.
  const textToAnalyze = "上海东方明珠广播电视塔";

  // Perform tokenization.
  const tokens: Token[] = tokenizer.tokenize(textToAnalyze);

  console.log(`Tokens for: "${textToAnalyze}"`);
  tokens.forEach(token => {
    // Each token has a surface form and detailed information.
    console.log(`- ${token.surface}: ${token.details.join(" | ")}`);
  });

  // Example with another Chinese sentence.
  const anotherText = "我爱北京天安门";
  const moreTokens: Token[] = tokenizer.tokenize(anotherText);

  console.log(`\nTokens for: "${anotherText}"`);
  moreTokens.forEach(token => {
    console.log(`- ${token.surface}: ${token.details.join(" | ")}`);
  });
}

main().catch(console.error);
```
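When bundling with Webpack 5, packages that import `.wasm` modules generally require the async WebAssembly experiment to be enabled. A minimal config fragment, assuming Webpack 5 (adapt to your existing configuration):

```javascript
// webpack.config.js — minimal sketch; assumes Webpack 5.
module.exports = {
  experiments: {
    // Treat imported .wasm files as async WebAssembly modules.
    asyncWebAssembly: true,
  },
};
```

Without this, Webpack may fail to resolve the package's `.wasm` asset at build time. Other bundlers (e.g., Vite, Rollup) have their own WASM handling options; consult their documentation.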