requests-html
requests-html is a Python library designed for web scraping and HTML parsing, extending the capabilities of the popular `requests` library. It provides HTML parsing with CSS selectors (jQuery-style) and XPath, automatic encoding detection, mocked user-agents, and crucially, full JavaScript support via Headless Chromium (Pyppeteer). The current version is 0.10.0, with its last PyPI release in February 2019, suggesting a slower release cadence, though the underlying `requests` library is actively maintained.
Warnings
- gotcha JavaScript rendering (using `r.html.render()`) requires `pyppeteer` and will automatically download a Chromium browser into your home directory the first time it's invoked. This can take some time and consume disk space.
- gotcha The `requests-html` library's latest release on PyPI is from February 2019. While functional, it might not receive frequent updates compared to its core dependency `requests`. Community contributions via GitHub are ongoing, but new features or critical bug fixes may not be immediately released to PyPI.
- gotcha Asynchronous support (`AsyncHTMLSession`) requires Python 3.6+ and the `requests-html[async]` installation. The `.run()` method for `AsyncHTMLSession` executes coroutines and its results list order reflects the completion order, not the order coroutines were passed.
- breaking The project historically advertised official support for Python 3.6 only. `requests-html` generally works on newer Python 3 versions (3.7+), but since compatibility guarantees were tied to 3.6, test thoroughly with your specific Python version.
Install
- Standard
pip install requests-html
- With async support
pip install requests-html[async]
Imports
- HTMLSession
from requests_html import HTMLSession
- AsyncHTMLSession
from requests_html import AsyncHTMLSession
Quickstart
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.python.org/')
# Extract title using CSS selector
title = r.html.find('title', first=True).text
print(f"Page title: {title}")
# Extract all absolute links
print("Absolute links:")
for link in r.html.absolute_links:
    if 'docs' in link:
        print(link)
# Example for JavaScript rendering (requires pyppeteer and Chromium)
# To run this, ensure pyppeteer is installed and Chromium is downloaded.
# r_js = session.get('https://pyppeteer.github.io/')
# r_js.html.render(sleep=1)
# js_content = r_js.html.find('#example-id', first=True).text
# print(f"JS rendered content: {js_content}")
session.close()