HTML5 RDF Parser

1.2.1 · active · verified Thu Apr 16

html5rdf is a pure-Python library for parsing HTML to DOMFragment objects, primarily intended for use within RDFLib. It is a fork of `html5lib-python` and `html5lib-modern`, designed to conform to the WHATWG HTML specification. Maintained by the RDFLib team, it serves as a drop-in replacement for `html5lib`, minus Python 2 support and legacy dependencies such as `six` and `webencodings`. The current version is 1.2.1; releases are made as needed for bug fixes and RDFLib integration.

Quickstart

This example parses HTML from a string and from a file-like object using `html5rdf.parse`. By default, the call returns an `xml.etree` element instance. A different treebuilder, such as 'lxml' or 'dom' (for `xml.dom.minidom`), can be specified when parsing.

import io

import html5rdf

# Parse a string; the returned object is the root element of the parsed tree
document_from_string = html5rdf.parse("<p>Hello World!</p>")
print(f"Parsed from string: {document_from_string.tag}")

# Parse from a file-like object (here an in-memory bytes buffer)
html_content = b"<html><body><h1>Test</h1></body></html>"
with io.BytesIO(html_content) as f:
    document_from_file = html5rdf.parse(f)
    print(f"Parsed from file: {document_from_file.tag}")
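The quickstart mentions alternative treebuilders and prints the root element's tag. The sketch below shows one way to work with both. It rests on two assumptions not confirmed by the text above: that `html5rdf.parse` keeps `html5lib`'s `treebuilder` keyword argument, and that the default `xml.etree` output uses namespaced tags of the form `{namespace}name`. The helper names `local_name`, `summarize`, `parse_with`, and `demo` are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1] if tag.startswith("{") else tag


def summarize(root: ET.Element) -> list:
    """Return local tag names of root and its descendants in document order.

    Comments and processing instructions carry a non-string tag, so they
    are skipped.
    """
    return [local_name(el.tag) for el in root.iter() if isinstance(el.tag, str)]


def parse_with(treebuilder: str, html: str):
    """Parse html with the named treebuilder.

    Assumption: html5rdf.parse accepts the `treebuilder` keyword, as
    html5lib.parse does. The import is local so the helpers above can be
    used without html5rdf installed.
    """
    import html5rdf

    return html5rdf.parse(html, treebuilder=treebuilder)


def demo() -> None:
    """Illustrative usage; requires html5rdf to be installed."""
    # Default-style etree output: list the element names without namespaces.
    root = parse_with("etree", "<p>Hello <b>World</b>!</p>")
    print(summarize(root))

    # 'dom' treebuilder: the result is an xml.dom.minidom document.
    dom = parse_with("dom", "<p>Hello World!</p>")
    print(dom.getElementsByTagName("p")[0].toxml())
```

Calling `demo()` in an environment with html5rdf installed prints the element names and the serialized paragraph. The namespace-stripping helpers depend only on the standard library, so they can be reused on any `xml.etree` tree.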
