Parsel
Parsel is a Python library for extracting data from HTML and XML documents using XPath and CSS selectors. It is a common dependency of web scraping tools. The current version is 1.11.0. The project is actively developed, and releases often change the supported Python versions and the minimum versions of its dependencies.
Warnings
- breaking Support for Python 3.9 and PyPy 3.10 was removed in 1.11.0. Support for Python 3.8, 3.7, 3.6, 3.5, and 2.7 was removed in earlier releases.
- breaking The `Selector.remove()` and `SelectorList.remove()` methods, deprecated in 1.7.0, have been removed entirely. Use `Selector.drop()` and `SelectorList.drop()` instead.
- breaking The default encoding name used when parsing documents, including those passed via `body`, changed from `"utf8"` to `"utf-8"` because of compatibility issues in some environments.
- breaking Minimum supported versions for core dependencies like `lxml`, `packaging`, `jmespath`, and `cssselect` have been bumped.
Install
pip install parsel
Imports
- Selector
from parsel import Selector
- SelectorList
from parsel import SelectorList
Quickstart
from parsel import Selector
html_doc = '''
<html>
<head><title>My Awesome Page</title></head>
<body>
<div id="main">
<h1>Hello Parsel!</h1>
<p class="intro">This is an <a href="/example">introductory</a> paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</div>
</body>
</html>
'''
# Create a Selector from HTML text
selector = Selector(text=html_doc)
# Extract title using CSS selector
title = selector.css('title::text').get()
print(f"Title: {title}")
# Extract H1 text using XPath
h1_text = selector.xpath('//h1/text()').get()
print(f"H1 Text: {h1_text}")
# Extract all list items using CSS selector
list_items = selector.css('ul li::text').getall()
print(f"List Items: {list_items}")
# Extract attribute using CSS selector
link_href = selector.css('.intro a::attr(href)').get()
print(f"Link href: {link_href}")
# Example with JSON and JMESPath (Parsel >= 1.8.0)
json_doc = '{"data": {"products": [{"id": 1, "name": "Laptop"}, {"id": 2, "name": "Mouse"}]}}'
json_selector = Selector(text=json_doc, type='json')
product_names = json_selector.jmespath('data.products[*].name').getall()
print(f"Product names (JMESPath): {product_names}")