{"id":10963,"library":"grapheme-splitter","title":"Grapheme Splitter","description":"grapheme-splitter is a JavaScript library designed to accurately segment strings into user-perceived characters, known as extended grapheme clusters, as defined by Unicode Standard Annex #29 (UAX #29) Default Grapheme Cluster Boundaries. It addresses fundamental issues in JavaScript's native string handling, where `String.length` and simple character iteration can misrepresent visual character counts due to multi-codepoint emojis (e.g., `🏳️‍🌈`), combining marks (like in German 'ü', Spanish 'ñ', or Hindi text), and 'Zalgo' text. Unlike `String.normalize()` or libraries like `punycode.js`, `grapheme-splitter` provides a comprehensive solution for these complex Unicode cases. The current stable version is 1.0.4, indicating a mature and stable codebase with an infrequent release cadence focused on maintenance rather than rapid feature additions. Its key differentiator is precise adherence to Unicode grapheme cluster rules, making it essential for text processing, input field validation, and display logic in internationalized applications.","status":"active","version":"1.0.4","language":"javascript","source_language":"en","source_url":"https://github.com/orling/grapheme-splitter","tags":["javascript","utf-8","strings","emoji","split"],"install":[{"cmd":"npm install grapheme-splitter","lang":"bash","label":"npm"},{"cmd":"yarn add grapheme-splitter","lang":"bash","label":"yarn"},{"cmd":"pnpm add grapheme-splitter","lang":"bash","label":"pnpm"}],"dependencies":[],"imports":[{"note":"Primarily a CommonJS library; use `require` for Node.js environments. 
Modern bundlers might allow `import` but it's not a native ESM module.","wrong":"import GraphemeSplitter from 'grapheme-splitter';","symbol":"GraphemeSplitter","correct":"const GraphemeSplitter = require('grapheme-splitter');"},{"note":"For browser environments or projects configured for ESM via bundlers like Webpack/Rollup, this import style typically works. However, the package itself is not a native ES module.","wrong":"const GraphemeSplitter = require('grapheme-splitter');","symbol":"GraphemeSplitter (ESM via bundler)","correct":"import GraphemeSplitter from 'grapheme-splitter';"},{"note":"Methods like `splitGraphemes`, `iterateGraphemes`, and `countGraphemes` are instance methods and must be called on an instantiated `GraphemeSplitter` object.","wrong":"GraphemeSplitter.splitGraphemes('string');","symbol":"Instance methods","correct":"const splitter = new GraphemeSplitter();\nsplitter.splitGraphemes('string');"}],"quickstart":{"code":"const GraphemeSplitter = require('grapheme-splitter');\n\nconst splitter = new GraphemeSplitter();\n\nconst textWithEmojis = \"🌷🎁💩😜👍🏳️‍🌈\";\nconst textWithDiacritics = \"Ĺo͂řȩm̅\"; // 10 JavaScript chars\nconst hindiText = \"अनुच्छेद\"; // Hindi word, 8 JavaScript chars\nconst zalgoText = \"Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘\"; // 58 JavaScript chars\n\nconsole.log('--- Splitting Emojis ---');\nconsole.log(`Original: \"${textWithEmojis}\" (length: ${textWithEmojis.length})`);\nconsole.log(`Split: ${JSON.stringify(splitter.splitGraphemes(textWithEmojis))} (count: ${splitter.countGraphemes(textWithEmojis)})`);\n// Expected: [\"🌷\",\"🎁\",\"💩\",\"😜\",\"👍\",\"🏳️‍🌈\"] (count: 6)\n\nconsole.log('\\n--- Splitting Diacritics ---');\nconsole.log(`Original: \"${textWithDiacritics}\" (length: ${textWithDiacritics.length})`);\nconsole.log(`Split: ${JSON.stringify(splitter.splitGraphemes(textWithDiacritics))} (count: ${splitter.countGraphemes(textWithDiacritics)})`);\n// Expected: [\"Ĺ\",\"o͂\",\"ř\",\"ȩ\",\"m̅\"] 
(count: 5)\n\nconsole.log('\\n--- Splitting Hindi ---');\nconsole.log(`Original: \"${hindiText}\" (length: ${hindiText.length})`);\nconsole.log(`Split: ${JSON.stringify(splitter.splitGraphemes(hindiText))} (count: ${splitter.countGraphemes(hindiText)})`);\n// Expected: [\"अ\",\"नु\",\"च्\",\"छे\",\"द\"] (count: 5)\n\nconsole.log('\\n--- Splitting Zalgo ---');\nconsole.log(`Original: \"${zalgoText}\" (length: ${zalgoText.length})`);\nconsole.log(`Split: ${JSON.stringify(splitter.splitGraphemes(zalgoText))} (count: ${splitter.countGraphemes(zalgoText)})`);\n// Expected: [\"Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍\",\"A̴̵̜̰͔ͫ͗͢\",\"L̠ͨͧͩ͘\",\"G̴̻͈͍͔̹̑͗̎̅͛́\",\"Ǫ̵̹̻̝̳͂̌̌͘\"] (count: 5)","lang":"javascript","description":"This example demonstrates how to initialize `GraphemeSplitter` and use its `splitGraphemes` and `countGraphemes` methods across various complex Unicode strings, including multi-codepoint emojis, diacritics, Hindi text, and Zalgo text, showing the correct user-perceived character counts and segments."},"warnings":[{"fix":"Use `new GraphemeSplitter().countGraphemes(myString)` to get the accurate number of user-perceived characters, or `splitGraphemes` for an array of these clusters.","message":"Relying on `String.length` or simple iteration (e.g., `for...of` loops over code points) for user-perceived character counts will yield incorrect results for strings containing multi-codepoint emojis, combining marks, or other extended grapheme clusters.","severity":"gotcha","affected_versions":"all"},{"fix":"While `String.normalize()` can resolve some canonical equivalences, for full grapheme cluster segmentation, `grapheme-splitter` is required as it implements the UAX #29 standard.","message":"JavaScript's `String.normalize()` method is insufficient for correctly combining all types of combining marks into single user-perceived characters, especially in languages like Hindi or with 'Zalgo' text, where many combinations lack a single dedicated Unicode 
codepoint.","severity":"gotcha","affected_versions":"all"},{"fix":"First split the string into an array of grapheme clusters using `splitter.splitGraphemes(myString)`, then slice the resulting array, and finally `join('')` the array back into a string if needed. For example, `splitter.splitGraphemes(myString).slice(0, maxLength).join('')`.","message":"When truncating or slicing strings for display (e.g., limiting character input, fitting text into a UI element), using `substring` or `slice` with raw UTF-16 code unit indices can cut a grapheme cluster partway through, leading to corrupted or unreadable text.","severity":"gotcha","affected_versions":"all"}],"env_vars":null,"last_verified":"2026-04-19T00:00:00.000Z","next_check":"2026-07-18T00:00:00.000Z","problems":[{"fix":"Use `const splitter = new GraphemeSplitter(); const correctLength = splitter.countGraphemes(myString);`","cause":"JavaScript's `String.length` counts UTF-16 code units, not user-perceived characters (grapheme clusters). 
Many emoji, and base characters followed by combining marks, span several code units.","error":"myString.length is incorrect for emojis and international text"},{"fix":"Transform the string into an array of grapheme clusters first: `const splitter = new GraphemeSplitter(); const graphemes = splitter.splitGraphemes(myString); const slicedGraphemes = graphemes.slice(0, N); const result = slicedGraphemes.join('');`","cause":"Slicing a string by UTF-16 code unit index can cut through a multi-codepoint grapheme cluster (e.g., an emoji or a base character with a combining mark), resulting in a broken visual character.","error":"String.prototype.slice() or substring() produces malformed characters"},{"fix":"To iterate over user-perceived characters, use `for (const grapheme of splitter.iterateGraphemes(myString))` or `splitter.splitGraphemes(myString).forEach(...)`.","cause":"While `for...of` and `Array.from()` iterate over Unicode *code points*, a single user-perceived character (grapheme cluster) can be composed of multiple code points (e.g., 'A' + combining acute accent).","error":"Iterating over string (for...of) or Array.from() yields incorrect 'characters'"}],"ecosystem":"npm"}