BoilerPy3

1.0.7 · active · verified Thu Apr 16

BoilerPy3 is an actively maintained Python port of Christian Kohlschütter's Boilerpipe library for removing HTML boilerplate and extracting the main text of web pages. The current version is 1.0.7, and its functionality is based on Boilerpipe 1.2. The library aims for a more Pythonic interface, with type hints and snake_case naming.

Common errors

Warnings

Install

Imports

Quickstart

This example extracts the main content from a raw HTML string and from a URL. For URLs, fetch the page with the `requests` library, which gives explicit control over timeouts and HTTP error handling, then pass the returned HTML to the extractor.

from boilerpy3 import extractors
import requests

# Example 1: Extract from raw HTML string
html_content = "<html><body><h1>Title</h1><p>Main content here.</p><footer>Footer</footer></body></html>"
extractor = extractors.ArticleExtractor()
content_from_html = extractor.get_content(html_content)
print(f"Content from HTML: {content_from_html}")

# Example 2: Extract from a URL (recommended to use 'requests' for robustness)
# Replace with a real URL for testing
url = "https://example.com"

try:
    response = requests.get(url, timeout=5)
    response.raise_for_status() # Raise an exception for HTTP errors
    html_from_url = response.text
    content_from_url = extractor.get_content(html_from_url)
    print(f"\nContent from URL: {content_from_url}")
except requests.exceptions.RequestException as e:
    print(f"\nError fetching URL {url}: {e}")
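The fetch-then-extract pattern above can be wrapped in a small helper. The following is a minimal sketch, not part of BoilerPy3: `extract_main_text`, `ContentExtractor`, and `fetch_html` are hypothetical names introduced here for illustration. It assumes only that the extractor exposes `get_content(html)`, as `ArticleExtractor` does in the quickstart.

```python
from typing import Callable, Optional, Protocol


class ContentExtractor(Protocol):
    """Hypothetical interface: any object with get_content(html) -> str,
    such as the ArticleExtractor used in the quickstart."""

    def get_content(self, html: str) -> str: ...


def extract_main_text(
    url: str,
    fetch_html: Callable[[str], str],
    extractor: ContentExtractor,
) -> Optional[str]:
    """Fetch a page with fetch_html, then extract its main text.

    Returns None when the fetched page is empty or no text is extracted.
    Exceptions raised by fetch_html or the extractor propagate to the caller.
    """
    html = fetch_html(url)
    if not html or not html.strip():
        return None
    text = extractor.get_content(html).strip()
    return text or None
```

Keeping the fetcher and the extractor as parameters means the helper can be exercised without network access, and the fetching policy (timeouts, retries, headers) stays in one place. With the libraries from the quickstart, `fetch_html` could be a function that calls `requests.get(url, timeout=5)`, then `raise_for_status()`, and returns `response.text`, while `extractor` could be `extractors.ArticleExtractor()`.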
