selectolax

0.4.7 · active · verified Thu Apr 09

Selectolax is a fast, lightweight Python library for parsing HTML5 documents and querying them with CSS selectors. It provides Cython bindings to the Modest and Lexbor engines; Lexbor is the recommended and actively developed backend. Releases are frequent and typically bring bug fixes and new features.

Warnings

Install

Imports

Quickstart

This example demonstrates how to parse an HTML string using `LexborHTMLParser`, extract text from specific elements using CSS selectors, and iterate through multiple matching elements to gather data. It also shows how to access attributes and safely handle cases where an element might not be found.

from selectolax.lexbor import LexborHTMLParser

html_content = """
<html>
<head><title>My Awesome Page</title></head>
<body>
    <h1 id="main-title" data-version="1.0">Welcome!</h1>
    <div class="post">
        <p>This is the first post.</p>
        <a href="/post/1">Read more</a>
    </div>
    <div class="post">
        <p>This is the second post.</p>
        <a href="/post/2">Read more</a>
    </div>
    <p class="footer">© 2026</p>
</body>
</html>
"""

tree = LexborHTMLParser(html_content)

# Get the title; css_first returns None when nothing matches
title_node = tree.css_first('title')
title = title_node.text() if title_node is not None else 'No Title'
print(f"Page Title: {title}")

# Get the text of the main heading, querying the tree only once
heading_node = tree.css_first('h1#main-title')
main_heading = heading_node.text() if heading_node is not None else 'N/A'
print(f"Main Heading: {main_heading}")

# Collect each post's paragraph text and link target
posts_data = []
for post_node in tree.css('.post'):
    p_node = post_node.css_first('p')
    a_node = post_node.css_first('a')
    paragraph_text = p_node.text() if p_node is not None else ''
    # Fall back to '' when the link or its href attribute is missing
    link_href = a_node.attrs.get('href', '') if a_node is not None else ''
    posts_data.append({'paragraph': paragraph_text, 'link': link_href})

print("\nPosts Found:")
for post in posts_data:
    print(f"- {post['paragraph']} (Link: {post['link']})")
