Unicode Codepoint Database Parser
The `codepoints` package parses the Unicode Character Database (UCD) files into a large array of JavaScript objects, one per Unicode codepoint. Each object carries properties such as name, category, block, script, bidi class, and the casing and decomposition mappings. The package bundles a default UCD and also accepts a path to a custom UCD. The current stable version is 1.3.0. It is intended for build scripts rather than production applications, because the parsers are not optimized for speed and the resulting array has a significant memory footprint. For applications that need Unicode data at runtime, the maintainers recommend modules that ship precompiled, compressed data, such as `unicode-properties`. Releases follow no strict cadence and updates have been infrequent, reflecting the package's stable but specialized role.
Common errors
- ReferenceError: require is not defined
  cause: `require()` was called in an ECMAScript Module (ESM) context, such as a `.mjs` file or a file in a package whose `package.json` sets `"type": "module"`.
  fix: Either make the file CommonJS (for example, set `"type": "commonjs"` in `package.json` or use a `.cjs` extension), or, in an ES module, create a `require` function with `createRequire` from `node:module`. Dynamic `import()` is an option only if the package supports it; `codepoints` is designed for `require()`.
- Error: Cannot find module 'codepoints' or 'codepoints/parser'
  cause: The `codepoints` package is not installed, or Node.js cannot resolve its path. This usually means `npm install` was not run, or the module is being loaded from the wrong location.
  fix: Run `npm install codepoints` in your project directory. Verify that `node_modules` contains the `codepoints` package, and check that the module specifier is exactly `codepoints` or `codepoints/parser`.
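The "Cannot find module" case can also be caught inside a build script so that it fails with a clearer message. The following is a minimal sketch, assuming a CommonJS script; `loadOptional` is a hypothetical helper and not part of `codepoints`. It relies only on the `MODULE_NOT_FOUND` error code that Node.js sets when `require()` cannot resolve a module.

```javascript
// Hypothetical helper (not part of codepoints): load a module by name and
// return null when Node.js cannot resolve it.
function loadOptional(name) {
  try {
    return require(name);
  } catch (err) {
    // Node.js sets err.code to 'MODULE_NOT_FOUND' when resolution fails.
    if (err && err.code === 'MODULE_NOT_FOUND') {
      return null;
    }
    throw err; // any other failure is unexpected, so surface it
  }
}

const codepoints = loadOptional('codepoints');
if (codepoints === null) {
  console.error('codepoints is not installed; run `npm install codepoints`');
} else {
  console.log(`Loaded ${codepoints.length} entries`);
}
```

Note that a missing dependency nested inside an installed package raises the same error code, so this sketch would report that case as "not installed" too.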
Warnings
- gotcha This package is explicitly designed for 'BUILD SCRIPTS ONLY'. It is not recommended for use in production applications due to performance and memory limitations. The parsers are not optimized for speed, and the resulting codepoint array consumes a substantial amount of memory.
- gotcha The output is a 'giant array of codepoint objects' which can consume a huge amount of memory. Loading the entire Unicode database into memory is suitable for offline processing or build-time data generation but not for typical runtime usage in resource-constrained environments.
- gotcha The `codepoints` package is CommonJS-oriented, as its `require()` examples indicate. ES module `import` syntax may cause interoperability problems in some Node.js setups unless the project is configured for it (for example, `"type": "module"` in `package.json` together with any loader or transpilation step the toolchain needs).
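One way to act on the build-script advice above is to reduce the large array to a compact table at build time and ship only that table. The sketch below is illustrative only: `toCategoryRanges` is a hypothetical helper, and it assumes the array is indexed by codepoint, that unassigned codepoints are missing entries, and that each entry has a `category` string as in the Quickstart. A small hand-made array stands in for the real data.

```javascript
// Hypothetical build step: collapse a codepoint-indexed array into
// [start, end, category] runs of consecutive codepoints sharing a category.
// Assumption: entries are either missing or objects with a `category` string.
function toCategoryRanges(codepoints) {
  const ranges = [];
  let current = null; // the run currently being extended, if any
  for (let code = 0; code < codepoints.length; code++) {
    const entry = codepoints[code];
    const category = entry ? entry.category : null;
    if (current !== null && current[2] === category) {
      current[1] = code; // same category as the previous codepoint: extend
    } else if (category === null) {
      current = null; // gap (unassigned codepoint) closes the open run
    } else {
      current = [code, code, category]; // start a new run
      ranges.push(current);
    }
  }
  return ranges;
}

// Stand-in data; a real build script would pass the array from codepoints.
const sample = [];
sample[0x41] = { code: 0x41, category: 'Lu' };
sample[0x42] = { code: 0x42, category: 'Lu' };
sample[0x61] = { code: 0x61, category: 'Ll' };

console.log(JSON.stringify(toCategoryRanges(sample)));
// prints [[65,66,"Lu"],[97,97,"Ll"]]
```

The resulting ranges can be written to a JSON file with `fs.writeFileSync` and loaded at runtime instead of the full database.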
Install
- npm install codepoints
- yarn add codepoints
- pnpm add codepoints
Imports
- codepoints
  import codepoints from 'codepoints';
  const codepoints = require('codepoints');
- parser
  import parser from 'codepoints/parser';
  const parser = require('codepoints/parser');
- CodepointData
  // No direct import for type; data is an array of objects.
  // Example accessing a property:
  // const category = codepoints[65].category;
Quickstart
const fs = require('fs');
const path = require('path');
const parser = require('codepoints/parser');

// In a real build script, you would download and extract the UCD yourself.
// For this example, we simulate a UCD directory holding only a two-record
// UnicodeData.txt. A complete UCD contains many more files, and the parser
// may need them.
const mockUCDPath = path.join(__dirname, 'mock-ucd');
if (!fs.existsSync(mockUCDPath)) {
  fs.mkdirSync(mockUCDPath);
  // Each UnicodeData.txt record has exactly 15 semicolon-separated fields.
  fs.writeFileSync(
    path.join(mockUCDPath, 'UnicodeData.txt'),
    '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n' +
      '0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042\n'
  );
}

// Parse a custom version of the UCD from the specified directory
try {
  const codepointData = parser(mockUCDPath);
  console.log(`Total codepoints parsed: ${codepointData.length}`);
  console.log('Data for LATIN CAPITAL LETTER A (U+0041):', {
    code: codepointData[0x41].code,
    name: codepointData[0x41].name,
    category: codepointData[0x41].category,
    lowercase: codepointData[0x41].lowercase
  });
  console.log('Data for LATIN SMALL LETTER B (U+0062):', {
    code: codepointData[0x62].code,
    name: codepointData[0x62].name,
    category: codepointData[0x62].category,
    uppercase: codepointData[0x62].uppercase
  });
} catch (error) {
  console.error('Error parsing UCD:', error.message);
  console.log('Ensure that the mock-ucd directory contains necessary UCD files, e.g., UnicodeData.txt');
} finally {
  // Clean up the mock UCD directory whether or not parsing succeeded
  fs.rmSync(mockUCDPath, { recursive: true, force: true });
}
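
The mock file in the Quickstart uses the `UnicodeData.txt` record layout: 15 semicolon-separated fields per line, with the code in field 0, the name in field 1, the general category in field 2, and the simple uppercase, lowercase, and titlecase mappings in fields 12 to 14. The sketch below reads one such record to make that layout concrete. It is not the package's parser: `parseUnicodeDataLine` is a hypothetical function, and returning the mappings as numbers (or `null` when absent) is an assumption made for illustration.

```javascript
// Hypothetical illustration of the UnicodeData.txt record layout.
// Not the codepoints parser; the output shape is an assumption.
function parseUnicodeDataLine(line) {
  const fields = line.split(';');
  if (fields.length !== 15) {
    throw new Error(`expected 15 fields, got ${fields.length}`);
  }
  // Empty mapping fields mean "no mapping"; others are hexadecimal codepoints.
  const hex = (s) => (s === '' ? null : parseInt(s, 16));
  return {
    code: parseInt(fields[0], 16),
    name: fields[1],
    category: fields[2],
    combiningClass: Number(fields[3]),
    bidiClass: fields[4],
    uppercase: hex(fields[12]),
    lowercase: hex(fields[13]),
    titlecase: hex(fields[14]),
  };
}

const a = parseUnicodeDataLine('0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;');
console.log(a.name, a.category, a.lowercase.toString(16));
// prints: LATIN CAPITAL LETTER A Lu 61
```

Checking the field count up front catches records with stray or missing semicolons, which are easy to introduce when writing test data by hand.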