tinyhtml5
tinyhtml5 is an HTML5 parser, currently at version 2.1.0, that transforms a possibly malformed HTML document into an ElementTree tree. It is a simplified and modernized fork of the unmaintained `html5lib` library that focuses solely on parsing HTML and producing `ElementTree` output. New releases typically add support for new Python versions and minor feature enhancements.
Warnings
- breaking: Python 3.9 support was dropped in tinyhtml5 2.1.0, which requires Python 3.10 or newer. Users on older Python versions need tinyhtml5 2.0.0 or earlier.
- breaking: tinyhtml5 is a simplified fork of `html5lib`. It exposes a single `tinyhtml5.parse()` function that returns an `ElementTree` tree. Many `html5lib` features, such as tree walkers, adapters, filters, and alternative tree builders (e.g., DOM, BeautifulSoup), are not supported in tinyhtml5.
- gotcha: ElementTree is the only output format `tinyhtml5` supports. If you are used to the object models of other HTML parsing libraries such as `BeautifulSoup` or `lxml`, note that you will receive an `ElementTree` tree instead, and will need to either convert it or use the `ElementTree` API for further processing.
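As an illustration of processing the result with the `ElementTree` API alone, here is a minimal sketch that collects links from a parsed tree. The helper names `extract_links` and `links_from_html` are hypothetical and not part of tinyhtml5. The `{*}` wildcard is a standard-library `ElementTree` feature that matches a tag in any namespace or none, so the helper does not assume whether the parsed tags are namespace-qualified.

```python
from xml.etree import ElementTree as ET


def extract_links(root):
    """Return (text, href) pairs for every <a> with an href below root.

    Hypothetical helper: works on any ElementTree element, whether or not
    its tags carry a namespace, thanks to the "{*}" wildcard.
    """
    links = []
    for anchor in root.iterfind(".//{*}a"):
        href = anchor.get("href")
        if href is not None:
            # itertext() also gathers text nested in child elements.
            text = "".join(anchor.itertext()).strip()
            links.append((text, href))
    return links


def links_from_html(html):
    """Parse html with tinyhtml5 and collect its links (hypothetical helper)."""
    # Imported lazily so extract_links() stays usable without tinyhtml5.
    from tinyhtml5 import parse

    return extract_links(parse(html))
```

Calling `links_from_html('<p><a href="/docs">Docs</a></p>')` should then yield the link text and target, assuming `parse()` returns the root element as in the Quickstart.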
Install
pip install tinyhtml5
Imports
- parse
from tinyhtml5 import parse
Quickstart
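from tinyhtml5 import parse

html_string = '<html><body><p>Hello, tinyhtml5!</p></body></html>'
parsed_tree = parse(html_string)

# parsed_tree is the root <html> element of an ElementTree tree
print(parsed_tree)
print(parsed_tree.tag)
# The HTML5 parsing algorithm always inserts a <head>, so it is the first child
print(parsed_tree[0].tag)
# <body> is the second child; its first child is the <p> element
print(parsed_tree[1][0].text)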
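To turn a parsed tree back into markup, the standard-library serializer can be used. The sketch below assumes the parsed tags may be qualified with a namespace (written as `{namespace}tag` in `ElementTree`), and strips that qualification first so the output uses plain tag names. `strip_namespaces` and `normalize_html` are hypothetical helpers, not tinyhtml5 functions.

```python
from xml.etree import ElementTree as ET


def strip_namespaces(root):
    """Remove any "{namespace}" prefix from element tags, in place."""
    for element in root.iter():
        # Comments and processing instructions have a non-string tag.
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


def normalize_html(html):
    """Parse html with tinyhtml5 and serialize it again (hypothetical helper)."""
    # Imported lazily so strip_namespaces() stays usable without tinyhtml5.
    from tinyhtml5 import parse

    root = strip_namespaces(parse(html))
    # method="html" writes void elements such as <br> without a closing tag.
    return ET.tostring(root, encoding="unicode", method="html")
```

Because the parser repairs malformed input, passing a fragment such as `'<p>Unclosed paragraph'` to `normalize_html` should return a complete document with `html`, `head`, and `body` elements.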