{"id":7291,"library":"html5rdf","title":"HTML5 RDF Parser","description":"html5rdf is a pure-Python library for parsing HTML to DOMFragment objects, primarily intended for use within RDFLib. It is a fork of `html5lib-python` and `html5lib-modern`, designed to conform to the WHATWG HTML specification. Maintained by the RDFLib team, it serves as a drop-in replacement for `html5lib`, without Python 2 support or legacy dependencies such as `six` and `webencodings`. The current version is 1.2.1, with releases occurring as needed for bug fixes and RDFLib integration.","status":"active","version":"1.2.1","language":"en","source_language":"en","source_url":"https://github.com/RDFLib/html5rdf","tags":["HTML","parser","RDF","WHATWG","html5lib"],"install":[{"cmd":"pip install html5rdf","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Optional, for an accelerated ElementTree implementation and the `lxml.etree` treebuilder.","package":"lxml","optional":true},{"reason":"Optional, for a treewalker.","package":"genshi","optional":true},{"reason":"Optional, as a fallback for character encoding detection.","package":"chardet","optional":true}],"imports":[{"symbol":"html5rdf","correct":"import html5rdf"},{"symbol":"HTMLParser","correct":"from html5rdf import HTMLParser"},{"note":"getTreeBuilder is typically accessed directly from the top-level html5rdf module or via a parser instance.","wrong":"import html5rdf.getTreeBuilder","symbol":"getTreeBuilder","correct":"from html5rdf import getTreeBuilder"}],"quickstart":{"code":"import io\n\nimport html5rdf\n\n# Parse a string\ndocument_from_string = html5rdf.parse(\"<p>Hello World!</p>\")\nprint(f\"Parsed from string: {document_from_string.tag}\")\n\n# Parse from a file-like object\nhtml_content = b\"<html><body><h1>Test</h1></body></html>\"\nwith io.BytesIO(html_content) as f:\n    document_from_file = html5rdf.parse(f)\n    print(f\"Parsed from file: {document_from_file.tag}\")","lang":"python","description":"This example 
demonstrates basic HTML parsing from both a string and a file-like object using `html5rdf.parse`. By default, it returns an `xml.etree` element instance. To get a different tree type, pass the `treebuilder` argument, for example 'lxml' or 'dom' (for `xml.dom.minidom`)."},"warnings":[{"fix":"Keep either `html5rdf` or a compatible `html5lib` version in your project's dependencies, not both, whenever the two would alias each other.","message":"Do not install `html5rdf` alongside older `html5lib` or `html5lib-modern` packages. `html5rdf` is a fork and exposes the module under the same name internally, leading to aliasing issues and unexpected behavior if both are present in the dependency tree.","severity":"breaking","affected_versions":"All versions of html5rdf (when co-installed with aliasing html5lib versions)"},{"fix":"Avoid the `lxml` treebuilder when running under PyPy. Use `xml.etree` (default) or `xml.dom.minidom` instead.","message":"The `lxml` treebuilder (e.g., `html5rdf.parse(html, treebuilder='lxml')`) is supported under CPython but is known to cause segfaults under PyPy.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For HTML sanitization, migrate to a dedicated library such as `Bleach`, which is recommended by the original `html5lib` project. `html5rdf` focuses solely on parsing HTML to DOM fragments.","message":"The `html5lib` sanitizer functionality (e.g., `html5lib.serialize(sanitize=True)` or `html5lib.filters.sanitizer`) has been removed from `html5rdf` because it was deprecated in the upstream `html5lib` project.","severity":"deprecated","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"To prevent these exceptions, initialize the parser with `strict=False` (the default): `parser = html5rdf.HTMLParser(strict=False)`. 
Alternatively, if you want strict parsing, catch `html5lib.html5parser.ParseError` exceptions in your code.","cause":"`html5rdf` (inheriting from `html5lib`) raises `ParseError` when the parser is initialized with `strict=True` and the input contains an unexpected or malformed DOCTYPE declaration or any other parse error.","error":"html5lib.html5parser.ParseError: Unexpected DOCTYPE. Ignored."},{"fix":"This is likely a packaging or test-specific issue that does not necessarily reflect on the core parsing functionality. If you encounter it, watch the project's GitHub issues for a fix, or check out the repository and try the development branch if one is available. For normal usage, parsing functionality should still be stable.","cause":"A known issue in version 1.2.1 can cause the included unit tests to fail, owing to specific changes or to how the test data is packaged.","error":"Tests are failing in html5rdf 1.2.1 after installation."},{"fix":"Make the HTML fragments being parsed as semantically complete as possible, or account for `html5rdf`'s behavior with incomplete or invalid fragments. The bug might require upstream fixes in the fragment-parsing logic. If results look wrong, verify the output structure after parsing.","cause":"This issue arises from how the `html5lib` parsing code underlying `html5rdf` handles HTML fragments, particularly those that are not valid standalone documents (e.g., `<body>` or `<tr>` without proper parent elements). It may result in fragments with no children or incorrect child nodes.","error":"Casting an HTML literal with certain content to the `rdf:HTML` datatype leads to an empty literal or incorrect output when used with RDFLib."}]}