{"id":13297,"library":"himalaya","title":"Himalaya: HTML to JSON Parser","description":"Himalaya is a JavaScript library designed to synchronously parse HTML documents into a structured JSON Abstract Syntax Tree (AST) and to convert that JSON AST back into HTML. Its current stable version is 1.1.1. The library maintains a steady release cadence, with minor releases adding features like source position tracking (v1.1.0) and patch releases addressing bugs, such as issues with malformed closing tags (v1.0.1) or CSS value parsing (v0.2.13). A significant breaking change occurred in v1.0.0, which dropped the older v0 specification in favor of a new, standardized v1 AST format, making the v1 spec the default. Key differentiators include its synchronous operation, robust handling of \"weird HTML\" edge cases (like unclosed tags, extra closing tags, void/self-closing tags, doctypes, and comments), and its ability to accurately preserve whitespace from the original HTML. It also offers a `stringify` method for converting the JSON AST back to HTML, facilitating round-trip transformations. The parser explicitly does not process the content of `<script>`, `<style>`, and `<template>` tags, treating them as raw text.","status":"active","version":"1.1.1","language":"javascript","source_language":"en","source_url":"https://github.com/andrejewski/himalaya","tags":["javascript","ast","html","json","parser"],"install":[{"cmd":"npm install himalaya","lang":"bash","label":"npm"},{"cmd":"yarn add himalaya","lang":"bash","label":"yarn"},{"cmd":"pnpm add himalaya","lang":"bash","label":"pnpm"}],"dependencies":[],"imports":[{"note":"Himalaya primarily uses ES Modules for modern Node.js and bundlers. For older CommonJS environments, ensure proper transpilation or use the specific CommonJS export if available.","wrong":"const parse = require('himalaya').parse","symbol":"parse","correct":"import { parse } from 'himalaya'"},{"note":"Used to convert a Himalaya JSON AST back into an HTML string. 
Follows the same module import conventions as `parse`.","wrong":"const stringify = require('himalaya').stringify","symbol":"stringify","correct":"import { stringify } from 'himalaya'"},{"note":"An object containing default parsing options, useful for spreading when custom options are needed, e.g., enabling `includePositions`.","wrong":"const parseDefaults = require('himalaya').parseDefaults","symbol":"parseDefaults","correct":"import { parseDefaults } from 'himalaya'"}],"quickstart":{"code":"import fs from 'fs';\nimport { parse, stringify } from 'himalaya';\n\n// Imagine webpage.html contains:\n// <div class=\"container\">\n//   <p>Hello, <b>world</b>!</p>\n//   <!-- some comment -->\n// </div>\n\nconst exampleHtml = '<div class=\"container\"><p>Hello, <b>world</b>!</p><!-- some comment --></div>';\n\n// Simulate reading from a file for the quickstart example\n// const html = fs.readFileSync('/webpage.html', { encoding: 'utf8' });\nconst html = exampleHtml;\n\nconsole.log('Original HTML:\\n', html);\n\n// Parse HTML into JSON AST\nconst json = parse(html);\nconsole.log('\\nParsed JSON AST:\\n', JSON.stringify(json, null, 2));\n\n// Modify the JSON (e.g., change content of the bold tag)\nif (json[0]?.children[0]?.children[1]?.tagName === 'b') {\n  json[0].children[0].children[1].children[0].content = 'Himalaya';\n}\n\n// Stringify JSON AST back to HTML\nconst newHtml = stringify(json);\nconsole.log('\\nStringified HTML (after modification):\\n', newHtml);\n// fs.writeFileSync('/new_webpage.html', newHtml);\n","lang":"javascript","description":"Demonstrates parsing HTML to a JSON AST, making a simple modification to the AST, and then stringifying it back to HTML using `parse` and `stringify`."},"warnings":[{"fix":"Review the v1 AST specification (`ast-spec-v1.md`) and refactor code that processes the parsed JSON output. 
The `v0.3.0` release allowed opting into v1 early, providing a migration path.","message":"Version 1.0.0 dropped support for the old v0 specification. If you were relying on the v0 AST format or APIs, you must update your code to conform to the v1 specification.","severity":"breaking","affected_versions":">=1.0.0"},{"fix":"Ensure custom or non-standard HTML-like tags start with an alphanumeric character. For cases requiring more permissive parsing of tag names, consider pre-processing the input HTML or using an alternative parser.","message":"Himalaya v0.3.1 introduced stricter parsing for HTML tag names, requiring them to start with an alphanumeric character. This aligns more closely with the HTML5 spec but might break parsing for highly unusual or malformed custom tags that previously worked.","severity":"breaking","affected_versions":">=0.3.1"},{"fix":"To include position data, pass an options object to `parse`: `parse(html, { ...parseDefaults, includePositions: true })`. This will add a `position` field to each node in the AST.","message":"The `parse` function in versions >=1.1.0 can emit nodes with `position` fields (start/end index, line, column), but this feature is opt-in. Unless it is enabled, nodes carry no position data.","severity":"gotcha","affected_versions":">=1.1.0"},{"fix":"For whitespace removal, post-process the resulting JSON AST. For parsing script/style contents, extract the `content` property of these nodes and parse them with a dedicated JavaScript or CSS parser.","message":"Himalaya explicitly *preserves whitespace* and does not parse the contents of `<script>`, `<style>`, and `<template>` tags, treating them as raw text. If you need whitespace trimmed or the contents of these tags parsed, post-processing is required.","severity":"gotcha","affected_versions":">=0.2.10"}],"env_vars":null,"last_verified":"2026-04-19T00:00:00.000Z","next_check":"2026-07-18T00:00:00.000Z","problems":[{"fix":"Upgrade to Himalaya v1.0.1 or newer. 
This issue was fixed in PR #86, which ensured the parser correctly handles and recovers from such scenarios without premature termination.","cause":"Earlier versions could exit parsing prematurely when encountering unnecessary closing tags, leaving subsequent HTML unparsed.","error":"Unparsed HTML content or empty output when input contains malformed closing tags, such as `</i>x`."},{"fix":"Update to Himalaya v0.2.13 or newer. This patch release resolved the issue by correctly parsing the entire value of CSS attributes even when they contain multiple colons.","cause":"Previous versions of Himalaya only parsed the segment before the second colon in CSS attribute values, leading to incomplete attribute data.","error":"CSS attribute values containing colons (e.g., `background-image: url(data:image/png...)`) are incorrectly truncated or parsed."},{"fix":"Upgrade to Himalaya v0.2.12 or newer. Version 0.2.11 provided a fix for `tbody`, `thead`, `tfoot`, and 0.2.12 extended it to `td` and `tr` to properly handle nested tables.","cause":"Versions prior to v0.2.11 had a bug where `tbody`, `thead`, and `tfoot` would incorrectly auto-close if a `table` tag was nested inside them.","error":"HTML tables with nested `<table>` elements within `<tbody>`, `<thead>`, or `<tfoot>` tags are incorrectly auto-closed, leading to malformed AST."},{"fix":"Update to Himalaya v0.2.10 or newer. This release improved whitespace parsing by recognizing any character defined by the RegExp metacharacter `\\s`.","cause":"Earlier lexer implementations only recognized standard spaces as whitespace, ignoring other `\\s` characters like tabs, newlines, etc.","error":"Whitespace characters other than a standard space (' ') are not recognized correctly, leading to incorrect AST representation of spacing."}],"ecosystem":"npm","meta_description":null,"install_score":null,"install_tag":null,"quickstart_score":null,"quickstart_tag":null,"pypi_latest":null,"cli_name":"","cli_version":null}