Grapheme Splitter
grapheme-splitter is a JavaScript library that segments strings into user-perceived characters, known as extended grapheme clusters, following the Default Grapheme Cluster Boundaries of Unicode Standard Annex #29 (UAX #29). It addresses a basic limitation of JavaScript's native string handling: the `length` property and index-based access count UTF-16 code units, so they misreport the number of visible characters for multi-codepoint emojis (e.g., `🏳️‍🌈`), combining marks (as in German 'ü' or Spanish 'ñ' when written in decomposed form, or in Hindi text), and 'Zalgo' text. `String.prototype.normalize()` and libraries like `punycode.js` cover only part of these cases; `grapheme-splitter` is designed to handle all of them. The current stable version is 1.0.4, and releases are infrequent and focused on maintenance rather than new features. Its key differentiator is close adherence to the Unicode grapheme cluster rules, which makes it useful for text processing, input field validation, and display logic in internationalized applications.
Common errors
- myString.length is incorrect for emojis and international text
cause JavaScript's `length` property counts UTF-16 code units, not user-perceived characters (grapheme clusters). Emojis and combining marks often consist of multiple code units.
fix Use `const splitter = new GraphemeSplitter(); const correctLength = splitter.countGraphemes(myString);`
- String.prototype.slice() or substring() produces malformed characters
cause Slicing a string by JavaScript character index can cut through a multi-codepoint grapheme cluster (e.g., an emoji or a base character with a combining mark), resulting in a broken visual character.
fix Transform the string into an array of grapheme clusters first: `const splitter = new GraphemeSplitter(); const graphemes = splitter.splitGraphemes(myString); const slicedGraphemes = graphemes.slice(0, N); const result = slicedGraphemes.join('');`
- Iterating over string (for...of) or Array.from() yields incorrect 'characters'
cause `for...of` and `Array.from()` iterate over Unicode *code points*, but a single user-perceived character (grapheme cluster) can be composed of multiple code points (e.g., 'A' + combining acute accent).
fix To iterate over user-perceived characters, create `const splitter = new GraphemeSplitter();` and use `for (const grapheme of splitter.iterateGraphemes(myString))` or `splitter.splitGraphemes(myString).forEach(...)`.
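The three errors above come from counting in different units. The sketch below uses only built-in string operations (no library calls) to show the gap between code units, code points, and what a reader sees. The characters are written as escape sequences so the code points are unambiguous, and the variable names are illustrative.

```javascript
// One user-perceived character: thumbs up (U+1F44D) + medium skin tone modifier (U+1F3FD).
const thumbsUp = '\u{1F44D}\u{1F3FD}';

// Another single user-perceived character: 'e' followed by a combining acute accent (U+0301).
const accentedE = 'e\u0301';

// .length counts UTF-16 code units: each of the two emoji code points needs a surrogate pair.
const codeUnits = thumbsUp.length; // 4

// Array.from (like for...of) iterates code points, which is still more than one per grapheme.
const codePoints = Array.from(thumbsUp).length; // 2
const accentedCodePoints = Array.from(accentedE).length; // 2

// Slicing by code-unit index cuts the first surrogate pair in half,
// leaving a lone high surrogate (0xD83D) that cannot be displayed on its own.
const broken = thumbsUp.slice(0, 1);
const isLoneSurrogate = broken.length === 1 && broken.charCodeAt(0) === 0xd83d; // true

console.log({ codeUnits, codePoints, accentedCodePoints, isLoneSurrogate });
```

A grapheme-aware count of either string should be 1, which is the number `countGraphemes` is meant to report.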
Warnings
- gotcha Relying on `String.length` or simple iteration (e.g., `for...of` loops over code points) for user-perceived character counts will yield incorrect results for strings containing multi-codepoint emojis, combining marks, or other extended grapheme clusters.
- gotcha JavaScript's `String.prototype.normalize()` cannot collapse every base-plus-combining-mark sequence into a single code point, because many combinations have no precomposed Unicode character. This is common in Hindi and in 'Zalgo' text, so normalizing a string before measuring its length does not give a grapheme count.
- gotcha When truncating or slicing strings for display (e.g., limiting character input, fitting text into a UI element), using `substring` or `slice` with raw JavaScript character indices can inadvertently split a grapheme cluster in half, leading to corrupted or unreadable text.
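The normalization gotcha can be checked with built-ins alone. This is a minimal sketch: it normalizes three decomposed sequences to NFC and compares the resulting lengths. The sample sequences are chosen for illustration and are not taken from the library's documentation.

```javascript
// 'e' + combining acute accent (U+0301) has a precomposed form (U+00E9),
// so NFC shrinks it to a single code unit.
const composable = 'e\u0301'.normalize('NFC');

// 'm' + combining overline (U+0305) has no precomposed form;
// NFC leaves both code points in place.
const overlined = 'm\u0305'.normalize('NFC');

// Devanagari letter NA (U+0928) + vowel sign U (U+0941) likewise stays as two code points.
const devanagari = '\u0928\u0941'.normalize('NFC');

console.log(composable.length, overlined.length, devanagari.length); // 1 2 2
```

Each of the three inputs is one user-perceived character, yet only the first has length 1 after normalization.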
Install
- npm install grapheme-splitter
- yarn add grapheme-splitter
- pnpm add grapheme-splitter
Imports
- GraphemeSplitter (CommonJS)
const GraphemeSplitter = require('grapheme-splitter');
- GraphemeSplitter (ESM via bundler)
import GraphemeSplitter from 'grapheme-splitter';
- Instance methods
wrong GraphemeSplitter.splitGraphemes('string');
correct const splitter = new GraphemeSplitter(); splitter.splitGraphemes('string');
Quickstart
const GraphemeSplitter = require('grapheme-splitter');
const splitter = new GraphemeSplitter();
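// The rainbow flag is written with escapes so its zero-width joiner (U+200D) survives copy/paste.
const textWithEmojis = "🌷🎁💩😜👍\u{1F3F3}\uFE0F\u200D\u{1F308}";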
const textWithDiacritics = "Ĺo͂řȩm̅"; // 10 JavaScript chars
const hindiText = "अनुच्छेद"; // Hindi word, 8 JavaScript chars
const zalgoText = "Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘"; // 58 JavaScript chars
console.log('--- Splitting Emojis ---');
console.log(`Original: "${textWithEmojis}" (length: ${textWithEmojis.length})`);
console.log(`Split: ${JSON.stringify(splitter.splitGraphemes(textWithEmojis))} (count: ${splitter.countGraphemes(textWithEmojis)})`);
// Expected: ["🌷","🎁","💩","😜","👍","🏳️🌈"] (count: 6)
console.log('\n--- Splitting Diacritics ---');
console.log(`Original: "${textWithDiacritics}" (length: ${textWithDiacritics.length})`);
console.log(`Split: ${JSON.stringify(splitter.splitGraphemes(textWithDiacritics))} (count: ${splitter.countGraphemes(textWithDiacritics)})`);
// Expected: ["Ĺ","o͂","ř","ȩ","m̅"] (count: 5)
console.log('\n--- Splitting Hindi ---');
console.log(`Original: "${hindiText}" (length: ${hindiText.length})`);
console.log(`Split: ${JSON.stringify(splitter.splitGraphemes(hindiText))} (count: ${splitter.countGraphemes(hindiText)})`);
// Expected: ["अ","नु","च्","छे","द"] (count: 5)
console.log('\n--- Splitting Zalgo ---');
console.log(`Original: "${zalgoText}" (length: ${zalgoText.length})`);
console.log(`Split: ${JSON.stringify(splitter.splitGraphemes(zalgoText))} (count: ${splitter.countGraphemes(zalgoText)})`);
// Expected: ["Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍","A̴̵̜̰͔ͫ͗͢","L̠ͨͧͩ͘","G̴̻͈͍͔̹̑͗̎̅͛́","Ǫ̵̹̻̝̳͂̌̌͘"] (count: 5)
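Building on the quickstart, truncation for display can be sketched as a small helper that walks grapheme clusters and stops at a limit. The helper name `takeGraphemes` is hypothetical and not part of the library. It assumes only that the object passed in exposes `iterateGraphemes(text)` returning an iterable, as the `for...of` usage described under Common errors implies.

```javascript
/**
 * Returns at most `limit` grapheme clusters from the start of `text`.
 * `splitter` is expected to be a GraphemeSplitter instance; only its
 * iterateGraphemes(text) method is used. Hypothetical helper, not a library API.
 */
function takeGraphemes(splitter, text, limit) {
  if (limit <= 0) return '';
  const kept = [];
  for (const grapheme of splitter.iterateGraphemes(text)) {
    kept.push(grapheme);
    // Stop early instead of segmenting the whole string.
    if (kept.length >= limit) break;
  }
  return kept.join('');
}
```

With the quickstart's `splitter`, a call such as `takeGraphemes(splitter, textWithEmojis, 3)` would be expected to return the first three emoji intact. Iterating lazily avoids building the full array that `splitGraphemes` returns, which may matter for long inputs.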