Resiliparse

0.16.0 · active · verified Tue Apr 14

Resiliparse is a collection of robust and fast processing tools for parsing and analyzing web archive data, encompassing utilities for character encoding, HTML parsing, content extraction, and process guarding. It is currently at version 0.16.0 and is actively maintained as part of the ChatNoir web analytics toolkit.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates parsing HTML from both Unicode strings and byte strings with encoding detection, and then performing basic DOM selection to extract information.

from resiliparse.parse.html import HTMLTree
from resiliparse.parse.encoding import detect_encoding, bytes_to_str

html_content = """<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Example page</title>
</head>
<body>
  <main id="foo">
    <p id="a">Hello <span class="bar">world</span>!</p>
  </main>
</body>
</html>"""

# Parse from a Unicode string
tree = HTMLTree.parse(html_content)
print(f"Document title: {tree.title}")

# Find an element by CSS selector.
# Selectors run on a DOM node (here tree.body), not on the HTMLTree itself.
paragraph = tree.body.query_selector('p#a')
if paragraph:
    print(f"Paragraph with id 'a': {paragraph.text}")

# Parse from bytes with encoding detection
html_bytes = html_content.encode('utf-16')
encoding = detect_encoding(html_bytes)
decoded_html = bytes_to_str(html_bytes, encoding)
tree_from_bytes = HTMLTree.parse(decoded_html)
print(f"Title from bytes: {tree_from_bytes.title}")
