HTML Parser with Fault Tolerance and Sanitization

0.11.0 · abandoned · verified Tue Apr 21

The `html-parser` library provides a fault-tolerant parser for HTML and XML, designed to process even malformed input without failing. Its main feature is configurable sanitization: developers can strip unwanted elements, attributes, and comments from untrusted HTML content. The API is callback-based, giving fine-grained control over how each token type (elements, attributes, text, comments, CDATA, doctype) is handled during parsing. The package is at version 0.11.0, was last published over nine years ago, and is no longer maintained. Historically, its key differentiators were resilience to invalid markup and built-in sanitization, which made it suitable for preparing user-generated HTML for safe display. Given its age, however, it may not account for security issues discovered since its last release.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart demonstrates both the callback-based parsing and the sanitization features of the library. The first part parses an HTML string and logs an event for each token type; the second removes `script` elements, `onclick` attributes, and comments from a malicious snippet.

const htmlParser = require('html-parser');

const html = '<!doctype html><html><body onload="alert(\'hello\');">Hello<br />world</body></html>';

console.log('--- Parsing Example ---');
htmlParser.parse(html, {
	openElement: function(name) { console.log('open: %s', name); },
	closeOpenedElement: function(name, token, unary) { console.log('token: %s, unary: %s', token, unary); },
	closeElement: function(name) { console.log('close: %s', name); },
	comment: function(value) { console.log('comment: %s', value); },
	cdata: function(value) { console.log('cdata: %s', value); },
	attribute: function(name, value) { console.log('attribute: %s=%s', name, value); },
	docType: function(value) { console.log('doctype: %s', value); },
	text: function(value) { console.log('text: %s', value); }
});

const maliciousHtml = '<script>alert(\'danger!\')</script><p onclick="alert(\'danger!\')">blah blah<!-- useless comment --></p>';
console.log('\n--- Sanitization Example ---');
const sanitized = htmlParser.sanitize(maliciousHtml, {
	elements: [ 'script' ], // Elements to remove
	attributes: [ 'onclick' ], // Attributes to remove
	comments: true // Remove comments
});
console.log('Original: %s', maliciousHtml);
console.log('Sanitized: %s', sanitized);
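
The callbacks in the parsing example only log each token. As a rough sketch of how the same handler names could accumulate state instead, the snippet below builds a callbacks object that counts opened elements, records event-handler attributes, and gathers text. `createCollector` and its `state` shape are hypothetical helpers, not part of `html-parser`, and the handlers are driven by hand here so the snippet runs without the package installed.

```javascript
// Hypothetical helper (not part of html-parser): returns a callbacks object
// using handler names from the quickstart, plus the state they fill in.
function createCollector() {
	const state = { tags: new Map(), eventAttributes: [], text: [] };
	const callbacks = {
		// Count how many times each element name is opened.
		openElement(name) {
			const key = name.toLowerCase();
			state.tags.set(key, (state.tags.get(key) || 0) + 1);
		},
		// Record attributes whose name starts with "on" (a simple heuristic
		// for event handlers such as onload or onclick).
		attribute(name, value) {
			if (/^on/i.test(name)) {
				state.eventAttributes.push({ name, value });
			}
		},
		// Keep non-empty text nodes, trimmed of surrounding whitespace.
		text(value) {
			const trimmed = value.trim();
			if (trimmed) {
				state.text.push(trimmed);
			}
		}
	};
	return { state, callbacks };
}

// Invoke the handlers manually, in the order a parser might plausibly emit
// them for: <body onload="alert('hello');">Hello<br />world</body>
const { state, callbacks } = createCollector();
callbacks.openElement('body');
callbacks.attribute('onload', "alert('hello');");
callbacks.text('Hello');
callbacks.openElement('br');
callbacks.text('world');

console.log(state.tags);
console.log(state.eventAttributes);
console.log(state.text.join(' ')); // prints "Hello world"
```

With the package installed, the same object could presumably be passed as the second argument in `htmlParser.parse(html, callbacks)`, following the call shape in the quickstart. This assumes the parser tolerates handlers that are left out, which the text above does not confirm; adding no-op functions for the remaining handler names avoids relying on that.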
