HTML Parser with Fault Tolerance and Sanitization
The `html-parser` library provides a fault-tolerant parser for HTML and XML, designed to process malformed input without failing. It also includes configurable sanitization for stripping unwanted elements, attributes, and comments from untrusted HTML content. The library uses a callback-based API that gives granular control over how each token type (elements, attributes, text, comments, CDATA, doctype) is handled during parsing. The current version is 0.11.0; it was last published over nine years ago, and the package is no longer actively maintained. Its historical differentiators were resilience to invalid markup and built-in sanitization, which made it suitable for preparing user-generated HTML for display. Given its age, it may not hold up against modern XSS techniques.
Common errors
- TypeError: htmlParser.parse is not a function
  cause: Attempting to use `import { parse } from 'html-parser';` or not correctly requiring the module.
  fix: Use CommonJS `require` and call `parse` as a method of the exported object: `const htmlParser = require('html-parser'); htmlParser.parse(...)`
- ReferenceError: htmlParser is not defined
  cause: The module was not correctly `require`d or is out of scope.
  fix: Add `const htmlParser = require('html-parser');` at the top of your file so the module is loaded and accessible.
- Sanitized HTML still contains unwanted elements/attributes.
  cause: Incorrect configuration of the `elements` or `attributes` options passed to `sanitize`, or passing `comments: false`.
  fix: Review the `sanitize` options carefully. The `elements` and `attributes` options specify *what to remove*: each is either an array of names or a callback that returns `true` for items to be removed. Set `comments: true` if comments should be stripped.
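The callback form of the `sanitize` options can be sketched as follows. This is a minimal illustration, not the library's documented usage: it assumes the attribute callback receives the attribute name and value (the text above only states that the callback returns `true` for items to remove), and `shouldRemoveAttribute` and `sanitizeOptions` are hypothetical names.

```javascript
// Sketch: sanitize options that mix an element array with an attribute predicate.
// Assumption: the attribute callback is called as (name, value) and returning
// true marks the attribute for removal.

function shouldRemoveAttribute(name, value) {
  // Inline event handlers such as onclick, onload, onerror.
  const isEventHandler = /^on/i.test(String(name));
  // URLs that would execute script, e.g. href="javascript:...".
  const isScriptUrl = /^\s*javascript:/i.test(String(value == null ? '' : value));
  return isEventHandler || isScriptUrl;
}

const sanitizeOptions = {
  elements: ['script'],              // array form: element names to remove
  attributes: shouldRemoveAttribute, // callback form: return true to remove
  comments: true                     // strip comments
};

// Usage (requires the package to be installed):
// const htmlParser = require('html-parser');
// const clean = htmlParser.sanitize(untrustedHtml, sanitizeOptions);
```

The predicate can be unit-tested on its own, which helps rule out misconfiguration before blaming the parser.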
Warnings
- breaking This package is pre-1.0 (v0.11.0) and abandoned. Its API was never declared stable, so minor versions may have introduced breaking changes, and backward compatibility is not guaranteed.
- gotcha The package is CommonJS-only and does not provide ES module exports. Attempting to use `import` statements will result in errors.
- gotcha The `sanitize` function, while provided, relies on simple element/attribute blacklists or callback logic. Given the library's abandonment, it is highly unlikely to be robust against modern XSS vectors and other security vulnerabilities. It should not be solely relied upon for security-critical sanitization without thorough, independent auditing.
- gotcha This package is over nine years old and has not been updated. It may contain unpatched security vulnerabilities, performance issues, or incompatibilities with newer Node.js versions or browser environments.
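One way to reduce reliance on blacklists is to invert the removal callbacks so that everything not explicitly allowed is removed. The sketch below is an illustration under assumptions, not an audited sanitizer: it assumes element callbacks receive the element name and attribute callbacks receive the name and value, and the allowed lists and all identifiers (`ALLOWED_ELEMENTS`, `ALLOWED_ATTRIBUTES`, `isSafeUrl`, `allowlistOptions`) are hypothetical.

```javascript
// Sketch: allowlist behaviour built on top of "return true to remove" callbacks.
// Assumption: callbacks are called as elements(name) and attributes(name, value).

const ALLOWED_ELEMENTS = new Set(['p', 'br', 'b', 'i', 'em', 'strong', 'ul', 'ol', 'li', 'a']);
const ALLOWED_ATTRIBUTES = new Set(['href', 'title']);

// Accept only http(s) and mailto URLs; everything else is treated as unsafe.
function isSafeUrl(value) {
  return /^(https?:|mailto:)/i.test(String(value).trim());
}

const allowlistOptions = {
  // Remove every element that is not on the allowlist.
  elements: (name) => !ALLOWED_ELEMENTS.has(String(name).toLowerCase()),
  // Remove unknown attributes, and href values that are not safe URLs.
  attributes: (name, value) => {
    const key = String(name).toLowerCase();
    if (!ALLOWED_ATTRIBUTES.has(key)) return true;
    return key === 'href' && !isSafeUrl(value);
  },
  comments: true
};

// Usage (requires the package to be installed):
// const clean = require('html-parser').sanitize(untrustedHtml, allowlistOptions);
```

Even with an allowlist, the output still depends on how this unmaintained parser tokenizes hostile input, so the warnings above continue to apply.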
Install
- npm install html-parser
- yarn add html-parser
- pnpm add html-parser
Imports
- htmlParser
  avoid (ESM): import htmlParser from 'html-parser';
  use (CommonJS): const htmlParser = require('html-parser');
- parse
  avoid (ESM): import { parse } from 'html-parser';
  use (CommonJS): const htmlParser = require('html-parser'); htmlParser.parse(htmlString, callbacks);
- sanitize
  avoid (ESM): import { sanitize } from 'html-parser';
  use (CommonJS): const htmlParser = require('html-parser'); const sanitizedHtml = htmlParser.sanitize(htmlString, options);
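A small guard around the `require` call can turn a confusing "is not a function" failure into an explicit message. This is a sketch only: `loadHtmlParser` and its `requireFn` parameter are hypothetical helpers introduced for illustration, and the check covers only the two methods named in this section.

```javascript
// Sketch: load html-parser through CommonJS and verify the expected methods exist.
// `requireFn` is injectable so the check can be exercised without the package.
function loadHtmlParser(requireFn = require) {
  const mod = requireFn('html-parser');
  for (const method of ['parse', 'sanitize']) {
    if (!mod || typeof mod[method] !== 'function') {
      throw new TypeError(
        'html-parser did not expose ' + method + '(); load it with require(), not import'
      );
    }
  }
  return mod;
}

// Usage (requires the package to be installed):
// const htmlParser = loadHtmlParser();
// htmlParser.parse(htmlString, callbacks);
```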
Quickstart
const htmlParser = require('html-parser');

const html = '<!doctype html><html><body onload="alert(\'hello\');">Hello<br />world</body></html>';

console.log('--- Parsing Example ---');
htmlParser.parse(html, {
  openElement: function(name) { console.log('open: %s', name); },
  closeOpenedElement: function(name, token, unary) { console.log('token: %s, unary: %s', token, unary); },
  closeElement: function(name) { console.log('close: %s', name); },
  comment: function(value) { console.log('comment: %s', value); },
  cdata: function(value) { console.log('cdata: %s', value); },
  attribute: function(name, value) { console.log('attribute: %s=%s', name, value); },
  docType: function(value) { console.log('doctype: %s', value); },
  text: function(value) { console.log('text: %s', value); }
});

const maliciousHtml = '<script>alert(\'danger!\')</script><p onclick="alert(\'danger!\')">blah blah<!-- useless comment --></p>';

console.log('\n--- Sanitization Example ---');
const sanitized = htmlParser.sanitize(maliciousHtml, {
  elements: [ 'script' ], // Elements to remove
  attributes: [ 'onclick' ], // Attributes to remove
  comments: true // Remove comments
});

console.log('Original: %s', maliciousHtml);
console.log('Sanitized: %s', sanitized);
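The parsing example above logs each token as it arrives. When the tokens need to be inspected afterwards, the same callbacks can record them instead. The following is a minimal sketch: `createCollector` is a hypothetical helper, the callback names are taken from the Quickstart, and the arguments are stored as received without assuming their meaning.

```javascript
// Sketch: collect parser events into an array instead of logging them.
const CALLBACK_NAMES = [
  'openElement', 'closeOpenedElement', 'closeElement',
  'comment', 'cdata', 'attribute', 'docType', 'text'
];

function createCollector() {
  const events = [];
  const callbacks = {};
  for (const type of CALLBACK_NAMES) {
    // Each callback records its type and whatever arguments the parser passes.
    callbacks[type] = (...args) => { events.push({ type, args }); };
  }
  return { events, callbacks };
}

// Usage (requires the package to be installed):
// const { events, callbacks } = createCollector();
// require('html-parser').parse(html, callbacks);
// console.log(events.filter((e) => e.type === 'attribute'));
```

Keeping the recorder separate from the parser call makes it easy to test the handling logic with hand-fed events.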