Resiliparse
Resiliparse is a collection of robust and fast processing tools for parsing and analyzing web archive data, encompassing utilities for character encoding, HTML parsing, content extraction, and process guarding. It is currently at version 0.16.0 and is actively maintained as part of the ChatNoir web analytics toolkit.
Warnings
- breaking: DOMNode objects become invalid after any DOM tree manipulation (modification or deallocation of the parent tree). Continuing to use existing `DOMNode` instances after such a manipulation can crash the Python interpreter or open security vulnerabilities through dangling pointers (use-after-free).
- gotcha: The HTML parsing module is currently marked as experimental. While generally well-tested, it may be affected by upstream Lexbor bugs that are already fixed but not yet included in a Resiliparse release. Building against the latest Lexbor Git master might offer a more stable experience.
- gotcha: When parsing HTML from bytes using `parse_from_bytes()`, the `encoding` parameter is only a 'best guess'. Internally, Resiliparse remaps the encoding according to the WHATWG specification and uses `bytes_to_str()`, which attempts fallback encodings if the primary one fails.
- gotcha: FastWARC is a separate package and must be installed independently if WARC file parsing is required. It is not bundled with the main `resiliparse` package.
Install
- pip install resiliparse
- pip install fastwarc
- pip install 'resiliparse[cli]'
Imports
- HTMLTree
from resiliparse.parse.html import HTMLTree
- detect_encoding
from resiliparse.parse.encoding import detect_encoding
- bytes_to_str
from resiliparse.parse.encoding import bytes_to_str
- extract_plain_text
from resiliparse.extract.html2text import extract_plain_text
Quickstart
from resiliparse.parse.html import HTMLTree
from resiliparse.parse.encoding import detect_encoding, bytes_to_str
html_content = """<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Example page</title>
</head>
<body>
<main id="foo">
<p id="a">Hello <span class="bar">world</span>!</p>
</main>
</body>
</html>"""
# Parse from a Unicode string
tree = HTMLTree.parse(html_content)
print(f"Document title: {tree.title}")
# Find an element by CSS selector (selector methods live on DOM nodes such as tree.body)
span = tree.body.query_selector('span.bar')
if span is not None:
    print(f"First element with class 'bar': {span.text}")
# Parse from bytes with encoding detection
html_bytes = html_content.encode('utf-16')
encoding = detect_encoding(html_bytes)
decoded_html = bytes_to_str(html_bytes, encoding)
tree_from_bytes = HTMLTree.parse(decoded_html)
print(f"Title from bytes: {tree_from_bytes.title}")
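To make the fallback behaviour of `bytes_to_str()` more concrete, here is a pure-Python sketch of the general idea: remap the encoding label, try it, then try fallbacks, and only replace undecodable bytes as a last resort. This is not Resiliparse's implementation. The names `decode_with_fallback` and `WHATWG_REMAP` are hypothetical, and the remapping table is a tiny subset of the WHATWG label table.

```python
# Hypothetical, partial label remapping in the spirit of the WHATWG Encoding Standard,
# which treats these labels as windows-1252.
WHATWG_REMAP = {
    "ascii": "windows-1252",
    "us-ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
}


def decode_with_fallback(data, encoding="utf-8", fallbacks=("utf-8", "windows-1252")):
    """Decode bytes with a best-guess encoding, trying fallbacks on failure."""
    primary = WHATWG_REMAP.get(encoding.lower(), encoding)
    for candidate in (primary, *fallbacks):
        try:
            return data.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next candidate.
            continue
    # Last resort: never raise, replace undecodable bytes instead.
    return data.decode("utf-8", errors="replace")


# The guess "utf-8" is wrong for these bytes, so the windows-1252 fallback is used.
print(decode_with_fallback(b"caf\xe9", "utf-8"))
```

In real code, prefer `bytes_to_str()` itself; the sketch only shows why a wrong encoding guess does not necessarily lead to an exception.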