{"id":6222,"library":"resiliparse","title":"Resiliparse","description":"Resiliparse is a collection of robust and fast processing tools for parsing and analyzing web archive data, encompassing utilities for character encoding, HTML parsing, content extraction, and process guarding. It is currently at version 0.16.0 and is actively maintained as part of the ChatNoir web analytics toolkit.","status":"active","version":"0.16.0","language":"en","source_language":"en","source_url":"https://github.com/chatnoir-eu/chatnoir-resiliparse","tags":["web scraping","HTML parsing","web archives","NLP","encoding detection","DOM manipulation"],"install":[{"cmd":"pip install resiliparse","lang":"bash","label":"Install main package"},{"cmd":"pip install fastwarc","lang":"bash","label":"Install FastWARC (high-performance WARC parsing)"},{"cmd":"pip install 'resiliparse[cli]'","lang":"bash","label":"Install with CLI tools"}],"dependencies":[{"reason":"High-performance WARC parsing library, often used in conjunction with Resiliparse for web archive data.","package":"fastwarc","optional":true},{"reason":"Used by EncodingDetector for universal character encoding detection (C wrapper).","package":"uchardet","optional":false}],"imports":[{"symbol":"HTMLTree","correct":"from resiliparse.parse.html import HTMLTree"},{"symbol":"detect_encoding","correct":"from resiliparse.parse.encoding import detect_encoding"},{"symbol":"bytes_to_str","correct":"from resiliparse.parse.encoding import bytes_to_str"},{"symbol":"extract_plain_text","correct":"from resiliparse.extract.html2text import extract_plain_text"}],"quickstart":{"code":"from resiliparse.parse.html import HTMLTree\nfrom resiliparse.parse.encoding import detect_encoding, bytes_to_str\n\nhtml_content = \"\"\"<!doctype html>\n<html lang=\"en\">\n<head>\n  <meta charset=\"utf-8\">\n  <title>Example page</title>\n</head>\n<body>\n  <main id=\"foo\">\n    <p id=\"a\">Hello <span class=\"bar\">world</span>!</p>\n  
</main>\n</body>\n</html>\"\"\"\n\n# Parse from a Unicode string\ntree = HTMLTree.parse(html_content)\nprint(f\"Document title: {tree.title}\")\n\n# Find an element by CSS selector (query methods are called on a DOM node such as tree.body)\nelement = tree.body.query_selector('span.bar')\nif element is not None:\n    print(f\"First element with class 'bar': {element.text}\")\n\n# Parse from bytes with encoding detection\nhtml_bytes = html_content.encode('utf-16')\nencoding = detect_encoding(html_bytes)\ndecoded_html = bytes_to_str(html_bytes, encoding)\ntree_from_bytes = HTMLTree.parse(decoded_html)\nprint(f\"Title from bytes: {tree_from_bytes.title}\")","lang":"python","description":"This quickstart demonstrates parsing HTML from both Unicode strings and byte strings with encoding detection, and then performing basic DOM selection to extract information."},"warnings":[{"fix":"Re-obtain or re-query `DOMNode` instances after any modification to the HTMLTree or its nodes. Avoid storing `DOMNode` references across modification operations.","message":"DOMNode objects become invalid after any DOM tree manipulation (modification or deallocation of the parent tree). Continuing to use existing `DOMNode` instances after manipulation can lead to Python crashes or security vulnerabilities due to dangling pointers (use-after-free).","severity":"breaking","affected_versions":"All versions up to 0.16.0"},{"fix":"For critical applications, consider building Resiliparse binaries with the latest Lexbor Git master or be aware of potential edge-case parsing issues.","message":"The HTML parsing module is currently marked as experimental. While generally well-tested, it may contain upstream Lexbor bugs that are fixed but not yet released in Resiliparse. Building from the latest Lexbor Git master might offer a more stable experience.","severity":"gotcha","affected_versions":"All versions up to 0.16.0"},{"fix":"While a 'best guess' is often sufficient, be aware of the internal remapping and fallback logic. 
For highly critical encoding scenarios, manual pre-processing with `detect_encoding()` and `bytes_to_str()` might offer more granular control.","message":"When parsing HTML from bytes using `parse_from_bytes()`, the `encoding` parameter is a 'best guess'. Internally, Resiliparse will remap the encoding according to the WHATWG specification and use `bytes_to_str()` which attempts fallback encodings if the primary one fails.","severity":"gotcha","affected_versions":"All versions up to 0.16.0"},{"fix":"Ensure `pip install fastwarc` is run if you plan to work with WARC archives.","message":"FastWARC is a separate package and needs to be installed independently if WARC file parsing is required. It is not bundled with the main `resiliparse` package.","severity":"gotcha","affected_versions":"All versions up to 0.16.0"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z","problems":[]}