selectolax
Selectolax is a fast, lightweight Python library for parsing HTML5 documents and querying them with CSS selectors. It provides Cython bindings to the Modest and Lexbor engines; Lexbor is the recommended and actively developed backend. The project sees frequent releases that fix bugs and add features.
Warnings
- breaking Empty attribute values are now serialized as `<tag value="">` instead of `<tag value>`. Output HTML that contains such attributes will differ from what earlier versions produced.
- deprecated The `HTMLParser` (Modest backend) from `selectolax.parser` is deprecated. Users should migrate to `LexborHTMLParser` from `selectolax.lexbor` for improved performance, features, and continued support.
- gotcha The `css_first()` method returns `None` if no element matches the given CSS selector. Failing to check for `None` before accessing attributes or methods (e.g., `.text()`, `.attrs`) will result in an `AttributeError`.
- gotcha Version `0.4.5` was a buggy release and was yanked from PyPI. Do not install or pin this version.
- gotcha Older versions of selectolax contained memory leaks in the fragment parser and potential segfaults when accessing attributes or performing DOM modifications such as `decompose()` or `unwrap()`. The affected releases are those prior to 0.4.0 and 0.4.6, so upgrading to 0.4.6 or later avoids both.
- gotcha Installation via `pip install selectolax` might fail with compilation errors, especially if installing an outdated version on a newer Python environment, or if Cython is not readily available.
Install
- pip install selectolax
- pip install selectolax[cython]
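After installing, it can be useful to confirm that the environment did not end up with the yanked `0.4.5` release or a version older than the fixes mentioned in the warnings. The following is a minimal sketch using only the standard library; `classify_version`, `release_tuple`, and the choice of `0.4.6` as a floor are assumptions made for illustration, not something selectolax provides.

```python
import re
from importlib.metadata import PackageNotFoundError, version

YANKED_VERSIONS = {"0.4.5"}   # yanked from PyPI, per the warnings
MIN_RECOMMENDED = (0, 4, 6)   # assumption: treat 0.4.6 as the lowest acceptable release


def release_tuple(version_string):
    """Turn '0.4.6' into (0, 4, 6); only the first three numeric parts are used."""
    return tuple(int(part) for part in re.findall(r"\d+", version_string)[:3])


def classify_version(version_string):
    """Return 'yanked', 'outdated', or 'ok' for a selectolax version string."""
    if version_string in YANKED_VERSIONS:
        return "yanked"
    if release_tuple(version_string) < MIN_RECOMMENDED:
        return "outdated"
    return "ok"


try:
    installed = version("selectolax")
except PackageNotFoundError:
    installed = None  # the package is not installed in this environment

if installed is None:
    print("selectolax is not installed")
else:
    print(f"selectolax {installed}: {classify_version(installed)}")
```

The version parsing here is deliberately simple and ignores pre-release suffixes; a project that already depends on `packaging` could compare `packaging.version.Version` objects instead.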
Imports
- LexborHTMLParser
from selectolax.lexbor import LexborHTMLParser
Quickstart
from selectolax.lexbor import LexborHTMLParser
html_content = """
<html>
<head><title>My Awesome Page</title></head>
<body>
<h1 id="main-title" data-version="1.0">Welcome!</h1>
<div class="post">
<p>This is the first post.</p>
<a href="/post/1">Read more</a>
</div>
<div class="post">
<p>This is the second post.</p>
<a href="/post/2">Read more</a>
</div>
<p class="footer">© 2026</p>
</body>
</html>
"""
tree = LexborHTMLParser(html_content)
# Get the title
title = tree.css_first('title').text() if tree.css_first('title') else 'No Title'
print(f"Page Title: {title}")
# Get the text of the main heading
main_heading = tree.css_first('h1#main-title').text() if tree.css_first('h1#main-title') else 'N/A'
print(f"Main Heading: {main_heading}")
# Get all post paragraphs and their links
posts_data = []
for post_node in tree.css('.post'):
    paragraph_text = post_node.css_first('p').text() if post_node.css_first('p') else ''
    link_href = post_node.css_first('a').attrs.get('href') if post_node.css_first('a') else ''
    posts_data.append({'paragraph': paragraph_text, 'link': link_href})
print("\nPosts Found:")
for post in posts_data:
    print(f"- {post['paragraph']} (Link: {post['link']})")