parse5-sax-parser

8.0.0 · active · verified Tue Apr 21

parse5-sax-parser is a streaming, SAX-style HTML parser for event-driven processing of HTML documents without building a full Document Object Model (DOM). It is part of the `parse5` toolset, which is known for its high conformance to the WHATWG HTML Living Standard. The current stable version is 8.0.0. The project is actively maintained: major versions (such as v7.0.0 and v8.0.0) introduce significant architectural changes and features, while patch and minor releases deliver dependency updates and bug fixes. Its key differentiators are its streaming design, its SAX (Simple API for XML) event model, and its HTML5 spec compliance, which make it suitable when memory efficiency and raw content inspection matter more than DOM manipulation. It is often used alongside other `parse5` modules, or on its own for tasks such as data extraction or sanitization.
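
As a rough illustration of the data-extraction use case, the sketch below gathers `href` values from `<a>` start tags. `collectHrefs` is a hypothetical helper, not part of the library; it only assumes that its argument has an `on(event, handler)` method and emits `startTag` tokens with `tagName` and `attrs` fields, as `SAXParser` does.

```javascript
// Hypothetical helper (not part of parse5-sax-parser).
// `source` is any object with an `on(event, handler)` method that emits
// 'startTag' tokens shaped like { tagName, attrs: [{ name, value }] }.
function collectHrefs(source) {
  const hrefs = [];

  source.on('startTag', (tag) => {
    if (tag.tagName !== 'a') return; // only anchors are of interest

    const href = tag.attrs.find((attr) => attr.name === 'href');
    if (href) hrefs.push(href.value);
  });

  // The array is filled in as events arrive, so read it once parsing ends.
  return hrefs;
}

// Possible wiring with the real parser (assumes the package is installed):
//   const parser = new SAXParser();
//   const hrefs = collectHrefs(parser);
//   parser.on('finish', () => console.log(hrefs));
//   parser.end('<p>This is a <a href="#">link</a>.</p>');
```

Because the helper depends only on the event interface, it can be exercised with a stub emitter in tests and attached to a real parser in production code.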

Quickstart

Demonstrates how to use parse5-sax-parser as a streaming event emitter: HTML is piped through the parser while handlers listen for SAX-style events such as `doctype`, `startTag`, `endTag`, `text`, and `comment`.

import { SAXParser } from 'parse5-sax-parser';
import { Readable } from 'stream';

// Simulate an HTML input stream
const htmlStream = new Readable({
  read() {
    this.push('<!DOCTYPE html><html><head><title>Test</title></head><body>');
    this.push('<h1>Hello, <b>world</b>!</h1><p>This is a <a href="#">link</a>.</p>');
    this.push('<!-- a comment --><br>');
    this.push('</body></html>');
    this.push(null); // No more data
  }
});

const parser = new SAXParser();

parser.on('doctype', (doctype) => {
  console.log('DOCTYPE:', doctype.name);
});

// Start and end tag tokens expose the element name as `tagName` (not `name`).
parser.on('startTag', (tag) => {
  const attrs = tag.attrs.map((attr) => `${attr.name}="${attr.value}"`).join(' ');
  console.log(`Start Tag: <${tag.tagName}> Attributes:`, attrs);
});

parser.on('endTag', (tag) => {
  console.log(`End Tag: </${tag.tagName}>`);
});

parser.on('text', (text) => {
  if (text.text.trim().length > 0) {
    console.log('Text:', JSON.stringify(text.text));
  }
});

parser.on('comment', (comment) => {
  console.log('Comment:', comment.text);
});

parser.on('error', (err) => {
  console.error('Parsing error:', err);
});

parser.on('finish', () => {
  console.log('Parsing finished!');
});

// Pipe the HTML stream through the parser. SAXParser is a passthrough stream.
// It emits events but passes the original data unchanged, allowing further piping.
htmlStream.pipe(parser);
// If you wanted to, you could pipe it further, e.g., parser.pipe(anotherWritableStream);
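
The handlers above log each event independently. A common SAX pattern is to keep a little state across events, for example to gather the text found inside a given element. The sketch below shows one way this might look; `collectTextInside` is a hypothetical helper, and it assumes only the `startTag`, `endTag`, and `text` token shapes used in the quickstart.

```javascript
// Hypothetical helper (not part of parse5-sax-parser).
// Collects the text content of every `tagName` element seen by `source`,
// where `source` emits 'startTag', 'endTag', and 'text' events.
function collectTextInside(source, tagName) {
  const texts = [];
  let depth = 0;   // number of matching elements currently open
  let buffer = ''; // text gathered for the outermost open element

  source.on('startTag', (tag) => {
    if (tag.tagName !== tagName) return;
    if (depth === 0) buffer = '';
    depth += 1;
  });

  source.on('text', (token) => {
    // Text may arrive in more than one event, so accumulate it.
    if (depth > 0) buffer += token.text;
  });

  source.on('endTag', (tag) => {
    if (tag.tagName !== tagName || depth === 0) return;
    depth -= 1;
    if (depth === 0) texts.push(buffer);
  });

  return texts; // filled in as events arrive
}

// Possible wiring with the quickstart's parser:
//   const headings = collectTextInside(parser, 'h1');
//   parser.on('finish', () => console.log(headings));
```

Text from nested elements such as `<b>` is included because the buffer stays open until the matching end tag closes the outermost element. The depth counter is a simplification: it relies on explicit end tags and does not model the implied end tags a full tree builder would infer.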
