justext

3.0.2 · active · verified Thu Apr 09

justext is a heuristic-based boilerplate removal tool for HTML documents. It extracts the main content of a web page and discards navigation, advertisements, and other extraneous elements. The current version is 3.0.2; releases mainly deliver bug fixes and compatibility updates.
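To make "heuristic-based" more concrete, here is a deliberately simplified sketch of one kind of signal such a tool can use: the share of common function words (stopwords) in a block of text. Navigation bars and link lists tend to contain few of them, while running prose contains many. The function names, the tiny stoplist, and the thresholds below are all hypothetical and chosen for illustration; this is not justext's actual implementation, which combines several signals.

```python
def stopword_density(text, stopwords):
    """Fraction of whitespace-separated words in `text` found in `stopwords`."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for word in words if word in stopwords) / len(words)


def looks_like_content(text, stopwords, min_density=0.3, min_length=40):
    """Toy rule: a block counts as content if it is long enough and
    contains a sufficient share of stopwords. Thresholds are illustrative."""
    return len(text) >= min_length and stopword_density(text, stopwords) >= min_density


# A tiny hand-picked stoplist, for illustration only.
TOY_STOPWORDS = frozenset({"the", "a", "of", "and", "is", "to", "in", "it", "that"})

nav = "Home | Products | Pricing | Contact | Login"
prose = ("The parser walks the tree and marks each block that is "
         "likely to be part of the article.")

print(looks_like_content(nav, TOY_STOPWORDS))
print(looks_like_content(prose, TOY_STOPWORDS))
```

The menu line contains no stopwords and is rejected, while the sentence passes both checks. A real classifier would need to handle punctuation, links, and neighbouring blocks as well.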

Warnings

Install

Imports

Quickstart

This quickstart fetches an HTML document with `requests`, passes the raw bytes to `justext.justext()` together with a built-in stoplist (here, 'English'), and prints the paragraphs that are not classified as boilerplate.

import requests
import justext

# Example URL (replace with the page you want to process)
url = "https://www.python.org"

try:
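    response = requests.get(url, timeout=5)
    # Treat HTTP error statuses (4xx/5xx) as failures instead of parsing an error page
    response.raise_for_status()
    # Pass the raw bytes so justext can work out the character encoding itself
    html_content = response.content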

    # Get the English stoplist
    stoplist = justext.get_stoplist("English")

    # Process the HTML content
    paragraphs = justext.justext(html_content, stoplist)

    print(f"Extracted text from {url}:")
    for paragraph in paragraphs:
        if not paragraph.is_boilerplate:
            print(paragraph.text)

except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")
except Exception as e:
    print(f"An error occurred: {e}")
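If the extracted text is needed as a single string rather than printed line by line, the filtering loop from the quickstart can be wrapped in a small helper. The sketch below is one way to do that; `main_text` is a hypothetical name, not part of justext, and the sample data uses stand-in objects that expose only the two attributes the quickstart reads (`text` and `is_boilerplate`), so it runs without network access.

```python
from types import SimpleNamespace


def main_text(paragraphs, separator="\n\n"):
    """Join the text of every paragraph that is not flagged as boilerplate."""
    return separator.join(p.text for p in paragraphs if not p.is_boilerplate)


# Stand-in paragraph objects for demonstration; in real use, pass the
# value returned by justext.justext(html_content, stoplist) instead.
sample = [
    SimpleNamespace(text="Home | About | Contact", is_boilerplate=True),
    SimpleNamespace(text="First real paragraph.", is_boilerplate=False),
    SimpleNamespace(text="Second real paragraph.", is_boilerplate=False),
]

article = main_text(sample)
print(article)
```

Keeping the filter in one function makes it easy to change the separator or to reuse the same logic across many pages.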
