parse5-sax-parser
parse5-sax-parser is a streaming, SAX-style HTML parser for event-driven processing of HTML documents without building a full Document Object Model (DOM). It is part of the `parse5` toolset, which is known for its high conformance to the WHATWG HTML Living Standard. The current stable version is 8.0.1. The project has an active release cadence: major versions (such as v7.0.0 and v8.0.0) introduce significant architectural changes and features, while frequent minor and patch releases cover dependency updates and bug fixes. Its key differentiators are its streaming design, its SAX (Simple API for XML) event model, and its HTML5 spec compliance, which make it suitable where memory efficiency and raw content inspection matter more than DOM manipulation. It is often used alongside other `parse5` modules or as a standalone component for tasks like data extraction or content inspection. It reports what it sees but does not change the markup, so sanitization needs an additional rewriting or serialization step.
Common errors
- ReferenceError: require is not defined
  cause: Calling `require()` to load `parse5-sax-parser` from an ECMAScript Module (ESM) file, or in a Node.js project configured for ESM.
  fix: Change `const { SAXParser } = require('parse5-sax-parser');` to `import { SAXParser } from 'parse5-sax-parser';`. Make sure the file is treated as ESM: set `"type": "module"` in `package.json` or use the `.mjs` file extension.
- TypeError: SAXParser is not a constructor
  cause: Importing `SAXParser` as a default import, or using an import style that does not match the export shape of the installed version.
  fix: Check the import statement. For v7 and later, use the named export: `import { SAXParser } from 'parse5-sax-parser';`. Releases before v7 were CommonJS and exported the class directly: `const SAXParser = require('parse5-sax-parser');`.
- My parser isn't emitting all expected events or seems to hang after processing some input.
  cause: The end of the input was never signalled to the SAXParser. This usually happens when chunks are written manually with `write()` instead of being piped from another stream.
  fix: If you write data manually with `parser.write(chunk)`, call `parser.end()` once all data has been written. If you use `pipe()`, make sure the source stream signals its end (for example, by pushing `null` from a `Readable` stream).
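The manual `write()`/`end()` fix can be sketched as a small helper. `feedChunks` is a hypothetical name, not part of the package. It relies only on the standard Node.js writable-stream interface (`write`, `end`, `once`), which a transform stream such as `SAXParser` provides, and it ignores backpressure for brevity, so treat it as suitable for small inputs only.

```javascript
// Hypothetical helper (not part of parse5-sax-parser): writes every chunk
// to a Writable-style parser, then signals the end of the input.
function feedChunks(parser, chunks) {
  return new Promise((resolve, reject) => {
    parser.once('error', reject);
    parser.once('finish', resolve);

    for (const chunk of chunks) {
      // Backpressure (a false return value) is ignored in this sketch.
      parser.write(chunk);
    }

    // Without end(), the parser keeps waiting for more input and never
    // emits 'finish'.
    parser.end();
  });
}

// Usage inside an async function, assuming `parser` is a SAXParser
// instance with its event handlers already attached:
//   await feedChunks(parser, ['<p>Hel', 'lo</p>']);
```

Registering the `finish` and `error` listeners before the first write means neither event can be missed, even if the stream completes synchronously.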
Warnings
- breaking: Starting with v7.0.0, all `parse5` packages, including `parse5-sax-parser`, are published as ECMAScript Modules (ESM) only. Direct CommonJS `require()` statements are no longer supported by default.
- breaking: As of v7.0.0, `parse5` and its sub-packages ship their own TypeScript definitions. Remove any `@types/parse5-sax-parser` package from your project; it is no longer needed and can cause type conflicts.
- gotcha: parse5-sax-parser is a pass-through transform stream. It emits events but does *not* modify the HTML content. If you pipe data through it, the output is identical to the input. It therefore cannot sanitize or rewrite HTML on its own; for that, consider `parse5-html-rewriting-stream`, or build a DOM with `parse5` and then serialize it.
- breaking: The underlying `parse5` core package, on which `parse5-sax-parser` relies, was updated in v7.0.0 to catch up with the latest HTML Living Standard. Parsing results for certain edge cases may differ subtly from previous versions.
- breaking: In `parse5` v6.0.0, the `TreeAdapter` interface gained a new mandatory method, `updateNodeSourceCodeLocation`. `parse5-sax-parser` does not build a DOM tree, so it is not affected directly. Projects that use it alongside the core `parse5` package with a custom `TreeAdapter` may still need to update that adapter.
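For code that has to stay in CommonJS, one possible workaround for an ESM-only dependency is dynamic `import()`, which Node.js supports in both module systems. The sketch below uses a hypothetical helper, `loadNamedExport`, which also fails early when the requested export is missing instead of surfacing later as an "is not a constructor" error. Whether this fits a given project depends on its Node.js version and build setup.

```javascript
// Hypothetical helper: loads an ES module from CommonJS or ESM code and
// returns one of its named exports.
async function loadNamedExport(specifier, exportName) {
  const mod = await import(specifier);
  if (!(exportName in mod)) {
    throw new Error(`"${specifier}" has no export named "${exportName}"`);
  }
  return mod[exportName];
}

// Usage inside an async function, assuming parse5-sax-parser is installed:
//   const SAXParser = await loadNamedExport('parse5-sax-parser', 'SAXParser');
//   const parser = new SAXParser();
```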
Install
- npm install parse5-sax-parser
- yarn add parse5-sax-parser
- pnpm add parse5-sax-parser
Imports
- SAXParser
  incorrect: const SAXParser = require('parse5-sax-parser').SAXParser;
  correct: import { SAXParser } from 'parse5-sax-parser';
- SAXParserOptions
  incorrect: import { SAXParserOptions } from 'parse5-sax-parser';
  correct: import type { SAXParserOptions } from 'parse5-sax-parser';
- StartTag
  incorrect: import { StartTag } from 'parse5-sax-parser/lib/tokens';
  correct: import type { StartTag } from 'parse5-sax-parser';
Quickstart
import { SAXParser } from 'parse5-sax-parser';
import { Readable } from 'stream';

// Simulate an HTML input stream
const htmlStream = new Readable({
  read() {
    this.push('<!DOCTYPE html><html><head><title>Test</title></head><body>');
    this.push('<h1>Hello, <b>world</b>!</h1><p>This is a <a href="#">link</a>.</p>');
    this.push('<!-- a comment --><br>');
    this.push('</body></html>');
    this.push(null); // No more data
  }
});

const parser = new SAXParser();

parser.on('doctype', (doctype) => {
  console.log('DOCTYPE:', doctype.name);
});

// Start and end tag tokens expose the element name as `tagName`.
parser.on('startTag', (tag) => {
  console.log(`Start Tag: <${tag.tagName}> Attributes:`, tag.attrs.map(attr => `${attr.name}="${attr.value}"`).join(' '));
});

parser.on('endTag', (tag) => {
  console.log(`End Tag: </${tag.tagName}>`);
});

// Skip whitespace-only text nodes
parser.on('text', (text) => {
  if (text.text.trim().length > 0) {
    console.log('Text:', JSON.stringify(text.text));
  }
});

parser.on('comment', (comment) => {
  console.log('Comment:', comment.text);
});

parser.on('error', (err) => {
  console.error('Parsing error:', err);
});

parser.on('finish', () => {
  console.log('Parsing finished!');
});

// Pipe the HTML stream through the parser. SAXParser is a passthrough stream.
// It emits events but passes the original data unchanged, allowing further piping.
htmlStream.pipe(parser);
// If you wanted to, you could pipe it further, e.g., parser.pipe(anotherWritableStream);
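
Building on the quickstart, data extraction usually means keeping state across events. The sketch below collects `href` values from `<a>` start tags. `collectLinks` is a hypothetical helper, and it assumes start-tag tokens expose `tagName` and `attrs` (an array of `{ name, value }` pairs), as in current parse5-sax-parser releases.

```javascript
// Hypothetical helper: registers a 'startTag' listener on a SAX-style parser
// and returns an array that fills with link targets as parsing proceeds.
function collectLinks(parser) {
  const links = [];
  parser.on('startTag', (tag) => {
    if (tag.tagName !== 'a') return;
    const href = tag.attrs.find((attr) => attr.name === 'href');
    if (href) links.push(href.value);
  });
  return links;
}

// Usage, assuming `parser` is a SAXParser instance and `htmlStream` is a
// readable stream of HTML:
//   const links = collectLinks(parser);
//   parser.on('finish', () => console.log(links));
//   htmlStream.pipe(parser);
```

The returned array is only complete once the parser has emitted `finish`, so read it from a `finish` handler rather than immediately after piping.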