HTML5 RDF Parser
html5rdf is a pure-Python library for parsing HTML to DOMFragment objects, primarily intended for use within RDFLib. It is a fork of `html5lib-python` and `html5lib-modern`, designed to conform to the WHATWG HTML specification. Maintained by the RDFLib team, it serves as a drop-in replacement for `html5lib` that drops Python 2 support and the legacy dependencies `six` and `webencodings`. The current version is 1.2.1; releases occur as needed for bug fixes and RDFLib integration.
Common errors
-
html5lib.html5parser.ParseError: Unexpected DOCTYPE. Ignored.
cause `html5rdf` (inheriting from `html5lib`) raises `ParseError` when the parser is initialized with `strict=True` and the input contains an unexpected or malformed DOCTYPE declaration or any other parse error.
fix Initialize the parser with `strict=False` (the default) to avoid these exceptions: `parser = html5rdf.HTMLParser(strict=False)`. If strict parsing is desired, catch `html5lib.html5parser.ParseError` in your code instead.
-
Tests are failing in html5rdf 1.2.1 after installation.
cause A known issue in version 1.2.1 where the included unit tests may fail because of specific changes or the way test data is packaged.
fix This is likely a packaging or test-specific issue that does not necessarily reflect on the core parsing functionality. If you encounter it, monitor the GitHub issues for a fix, or check out the repository and try the development branch if one is available. For normal usage, parsing should still be stable.
-
Casting an HTML literal with certain content to `rdf:HTML` datatype leads to an empty literal or incorrect output when used with RDFLib.
cause `html5rdf`'s underlying `html5lib` parsing of HTML fragments, particularly fragments that are not valid standalone documents (e.g., `<body>` or `<tr>` without proper parent elements), can produce a fragment with no children or with incorrect child nodes.
fix Make the HTML fragments as structurally complete as possible, and be aware of how `html5rdf` handles incomplete or invalid fragments. A full fix may require upstream changes to the fragment parsing logic. If results are unexpected, verify the output structure after parsing.
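One way to verify the output structure of a parsed fragment is to inspect its child nodes before handing it to RDFLib. The sketch below assumes a DOM-style fragment (an `xml.dom.minidom.DocumentFragment`, which a `dom` treebuilder is expected to produce). `fragment_summary` is a hypothetical helper name, and the fragments are built by hand so the example does not depend on any parser call.

```python
from xml.dom import minidom


def fragment_summary(fragment):
    """Return (is_empty, child_names) for a DOM-style fragment."""
    names = [node.nodeName for node in fragment.childNodes]
    return (len(names) == 0, names)


# Hand-built fragments standing in for parser output (assumption: the
# parser returns an xml.dom.minidom.DocumentFragment).
doc = minidom.Document()

empty_fragment = doc.createDocumentFragment()

filled_fragment = doc.createDocumentFragment()
paragraph = doc.createElement("p")
paragraph.appendChild(doc.createTextNode("Hello"))
filled_fragment.appendChild(paragraph)

print(fragment_summary(empty_fragment))   # (True, [])
print(fragment_summary(filled_fragment))  # (False, ['p'])
```

With a real parse, the fragment would come from the parser itself, for instance `html5rdf.parseFragment(html, treebuilder="dom")` if the fork keeps `html5lib`'s `parseFragment` signature (an assumption, not confirmed here).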
Warnings
- breaking Do not install `html5rdf` alongside older `html5lib` or `html5lib-modern` packages. `html5rdf` is a fork and exposes the module under the same name internally, leading to aliasing issues and unexpected behavior if both are present in the dependency tree.
- gotcha The `lxml` treebuilder (e.g., `html5rdf.parse(html, treebuilder='lxml')`) is supported under CPython but is known to cause segfaults under PyPy.
- deprecated The `html5lib` sanitizer functionality (e.g., `html5lib.serialize(sanitize=True)` or `html5lib.filters.sanitizer`) has been removed from `html5rdf` as it was deprecated in the upstream `html5lib` project.
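The `lxml` warning above can be turned into a small guard that picks a treebuilder name at runtime. This is a minimal sketch: `choose_treebuilder` is a hypothetical helper, and the fallback name `"etree"` is assumed from `html5lib`'s treebuilder names rather than confirmed for `html5rdf`.

```python
import importlib.util
import platform


def choose_treebuilder(implementation=None, lxml_available=None):
    """Return "lxml" only on CPython with lxml importable, else "etree"."""
    if implementation is None:
        implementation = platform.python_implementation()
    if lxml_available is None:
        lxml_available = importlib.util.find_spec("lxml") is not None
    if implementation == "CPython" and lxml_available:
        return "lxml"
    # Assumed fallback name, taken from html5lib's treebuilders.
    return "etree"


# Result for the running interpreter (depends on the environment).
print(choose_treebuilder())
# Simulated PyPy with lxml installed: the guard avoids lxml.
print(choose_treebuilder(implementation="PyPy", lxml_available=True))  # etree
```

The returned name would then be passed as the `treebuilder` argument, e.g. `html5rdf.parse(html, treebuilder=choose_treebuilder())`.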
Install
-
pip install html5rdf
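Because the Warnings section advises against installing `html5rdf` alongside `html5lib` or `html5lib-modern`, a post-install check of the environment may be useful. The sketch below uses only the standard library's `importlib.metadata`; the helper names and the `CONFLICTING` set are illustrative, not part of `html5rdf`.

```python
from importlib import metadata

# Distributions the Warnings section says should not sit next to html5rdf.
CONFLICTING = {"html5lib", "html5lib-modern"}


def find_conflicts(installed_names):
    """Return the sorted conflicting names found in installed_names."""
    normalized = {name.lower().replace("_", "-") for name in installed_names}
    if "html5rdf" not in normalized:
        return []
    return sorted(normalized & CONFLICTING)


def installed_distribution_names():
    """List the names of distributions visible to this interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            names.append(name)
    return names


# Check the current environment (output depends on what is installed).
print(find_conflicts(installed_distribution_names()))
# Check a hand-written list of names.
print(find_conflicts(["html5rdf", "html5lib", "rdflib"]))  # ['html5lib']
```

An empty list means either that `html5rdf` is not installed or that none of the listed packages is present next to it.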
Imports
- html5rdf
import html5rdf
- HTMLParser
from html5rdf import HTMLParser
- getTreeBuilder
from html5rdf import getTreeBuilder  # not `import html5rdf.getTreeBuilder`: getTreeBuilder is a function, not a submodule
Quickstart
import io

import html5rdf

# Parse a string
document_from_string = html5rdf.parse("<p>Hello World!</p>")
print(f"Parsed from string: {document_from_string.tag}")

# Parse from a file-like object
html_content = b"<html><body><h1>Test</h1></body></html>"
with io.BytesIO(html_content) as f:
    document_from_file = html5rdf.parse(f)
    print(f"Parsed from file: {document_from_file.tag}")
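The strict-mode behaviour described under Common errors can be handled with a small wrapper. This is a sketch under assumptions: `parse_strictly` is a hypothetical helper, and it takes the parser module as an argument because it assumes `html5rdf` exposes `HTMLParser` and `html5parser.ParseError` the way `html5lib` does, which is not confirmed here.

```python
def parse_strictly(parser_module, html):
    """Parse html in strict mode.

    Returns (tree, None) on success or (None, message) on a parse error.
    """
    parser = parser_module.HTMLParser(strict=True)
    try:
        return parser.parse(html), None
    except parser_module.html5parser.ParseError as exc:
        return None, str(exc)


try:
    import html5rdf
except ImportError:
    html5rdf = None  # package not installed; the demo below is skipped

if html5rdf is not None:
    # Assumption: html5rdf mirrors html5lib's module layout.
    tree, error = parse_strictly(html5rdf, "<p>Hello World!</p>")
    print("tree:", tree)
    print("error:", error)
```

Passing the module in keeps the helper usable with `html5lib` as well, and makes it straightforward to swap in a stub when testing.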