HTML5 Parser for Python

1.1 · active · verified Sun Mar 29

html5lib is a pure-Python library for parsing HTML documents. It is designed to conform to the WHATWG HTML specification, as implemented by major web browsers. The current stable version is 1.1, released in June 2020; development on version 1.2 is ongoing but unreleased. Releases are irregular, with long gaps between stable versions.

Warnings

Install

Imports

Quickstart

This quickstart parses an HTML string with `html5lib.parse`, which returns an `xml.etree` element by default, and then uses `html5lib.HTMLParser` with the `dom` treebuilder to produce an `xml.dom.minidom` document instead. It also shows strict parsing, which raises an exception on the first parse error rather than recovering silently. Besides `xml.etree` and `xml.dom.minidom`, `lxml.etree` is supported through its own treebuilder.

import html5lib

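# Parse a simple HTML string; the parser adds the implied <html>, <head>, and <body>
document = html5lib.parse("<p>Hello <b>World</b>!</p>")
# Tags are namespaced by default, e.g. {http://www.w3.org/1999/xhtml}html
print(f"Parsed document tag: {document.tag}")
# document[0] is the implied, empty <head>, so look up the <p> explicitly
ns = {"h": "http://www.w3.org/1999/xhtml"}
paragraph = document.find(".//h:p", ns)
print(f"Paragraph tag: {paragraph.tag}")
print(f"Paragraph text: {paragraph.text}")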

# Parse with a specific treebuilder (e.g., xml.dom.minidom)
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Another <i>example</i>.</p>")
# Accessing an element using minidom structure
p_tag = minidom_document.getElementsByTagName('p')[0]
print(f"Minidom document P tag: {p_tag.tagName}")

# Example with strict parsing (raises ParseError on the first parse error).
# This input has no doctype, so that error is reported before the unclosed tags.
try:
    strict_parser = html5lib.HTMLParser(strict=True)
    strict_parser.parse("<div><p>Missing close tag")
except html5lib.html5parser.ParseError as e:
    print(f"Strict parsing error: {e}")
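
Strict mode stops at the first error. As a complement, the sketch below uses the default non-strict parser, which recovers from malformed input and records each problem in `parser.errors`. It also passes `namespaceHTMLElements=False` so that element tags come back as plain names. The variable names are illustrative, and the layout of each error entry (position, error code, details) is assumed from html5lib 1.1.

```python
import html5lib

# Non-strict parser (the default): errors are recorded instead of raised.
# namespaceHTMLElements=False yields plain tag names such as "p".
parser = html5lib.HTMLParser(namespaceHTMLElements=False)
document = parser.parse("<div><p>Missing close tag")

# The tree is still built; the unclosed elements are closed implicitly.
paragraph = document.find(".//p")
paragraph_text = paragraph.text

# Each recorded error is a (position, error_code, details) tuple.
error_codes = [code for _position, code, _details in parser.errors]

print(f"Recovered text: {paragraph_text}")
print(f"Error codes: {error_codes}")
```

This approach suits cases where the parsed tree is needed regardless of markup quality and the errors are only logged or reported.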
