justext
justext is a heuristic-based boilerplate removal tool for HTML documents. It extracts the main content from web pages and discards navigation, advertisements, and other extraneous elements. The current version is 3.0.2; releases mainly address bug fixes and compatibility issues.
Warnings
- breaking justext v3.0.0 dropped support for Python 3.4 and older versions (including Python 2.x). Attempts to install or run on these versions will fail or lead to unexpected behavior.
- gotcha Pass raw HTML bytes (e.g., `response.content`) to `justext.justext()` rather than a decoded string (e.g., `response.text`). A string that was decoded with the wrong charset can lead to parsing errors or garbled results once `lxml` processes it.
- gotcha Older versions of justext (specifically before v3.0.1) had compatibility issues with newer versions of `lxml`, leading to parsing errors. Similarly, versions before v3.0.0 would fail on Python 3.8+ due to the removal of `cgi.escape`.
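The bytes-versus-string warning above comes down to who decides the charset. The following stdlib-only sketch does not call justext at all; it only illustrates, under the assumption that an HTTP client guessed Latin-1 for a UTF-8 page, how a wrong decode damages the text before any parser sees it.

```python
# Raw bytes as a server might send them: UTF-8 encoded HTML.
original = "<p>Café crème</p>"
raw = original.encode("utf-8")

# A wrong charset guess (Latin-1 is assumed here for illustration) still
# decodes without raising, but every accented character is mangled.
wrong = raw.decode("latin-1")

# Decoding with the charset the page was actually written in round-trips.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

Handing the untouched bytes to the extractor avoids committing to a charset guess this early.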
Install
pip install justext
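After installing, the version constraints listed under Warnings can be checked from Python. This is a minimal sketch, not part of justext: the helper names and the `MIN_PYTHON` value (taken to be 3.5, the first version above "3.4 and older") are assumptions, and the script itself needs Python 3.8+ because it uses `importlib.metadata`.

```python
import itertools
import sys
from importlib import metadata

MIN_PYTHON = (3, 5)       # assumption: lowest version above "3.4 and older"
MIN_JUSTEXT = (3, 0, 1)   # release the warnings name as fixing newer-lxml issues


def version_tuple(text):
    """Convert a version string such as '3.0.1' into a tuple like (3, 0, 1)."""
    parts = []
    for piece in text.split(".")[:3]:
        # Keep only the leading digits so suffixes like '1rc1' do not break int().
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(digits or 0))
    return tuple(parts)


def installed_justext_version():
    """Return the installed justext version string, or None if it is missing."""
    try:
        return metadata.version("justext")
    except metadata.PackageNotFoundError:
        return None


def check_environment(python_version, justext_version):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append("Python is too old for justext 3.x")
    if justext_version is None:
        problems.append("justext is not installed")
    elif version_tuple(justext_version) < MIN_JUSTEXT:
        problems.append("justext predates 3.0.1; newer lxml may break parsing")
    return problems


if __name__ == "__main__":
    found = check_environment(sys.version_info[:2], installed_justext_version())
    for problem in found:
        print("warning:", problem)
```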
Imports
- justext
import justext
Quickstart
import requests
import justext

# Example URL (replace with the page you want to process)
url = "https://www.python.org"

try:
    response = requests.get(url, timeout=5)
    # Treat HTTP error statuses (4xx/5xx) as failures instead of parsing error pages
    response.raise_for_status()

    # Pass the raw bytes, not response.text
    html_content = response.content

    # Get the English stoplist
    stoplist = justext.get_stoplist("English")

    # Process the HTML content
    paragraphs = justext.justext(html_content, stoplist)

    print(f"Extracted text from {url}:")
    for paragraph in paragraphs:
        # Keep only paragraphs classified as main content
        if not paragraph.is_boilerplate:
            print(paragraph.text)
except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")
except Exception as e:
    print(f"An error occurred: {e}")
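The filtering loop in the quickstart can be factored into a small reusable helper. The sketch below is one possible arrangement, not an API that justext provides: `main_text` and `extract_main_text` are hypothetical names, and the demo uses stand-in objects carrying only the two attributes the quickstart relies on (`text` and `is_boilerplate`) so it runs without network access.

```python
from types import SimpleNamespace


def main_text(paragraphs, separator="\n\n"):
    """Join the text of every paragraph not flagged as boilerplate."""
    return separator.join(p.text for p in paragraphs if not p.is_boilerplate)


def extract_main_text(html_bytes, language="English"):
    """Run justext on raw HTML bytes and return the main text as one string."""
    import justext  # imported here so main_text() stays usable without the library

    stoplist = justext.get_stoplist(language)
    return main_text(justext.justext(html_bytes, stoplist))


# Stand-in paragraphs (hypothetical data) mimicking what the quickstart iterates over.
sample = [
    SimpleNamespace(text="Home | About | Contact", is_boilerplate=True),
    SimpleNamespace(text="The article body starts here.", is_boilerplate=False),
    SimpleNamespace(text="It continues in a second paragraph.", is_boilerplate=False),
]

print(main_text(sample))
```

With justext installed, `extract_main_text(response.content)` would replace the loop in the quickstart; the stoplist language should match the language of the page.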