BoilerPy3
BoilerPy3 is an active Python port of Christian Kohlschütter's Boilerpipe library, designed for robust HTML boilerplate removal and main text extraction from web pages. It is currently at version 1.0.7 and is based on Boilerpipe 1.2 functionality. The library focuses on providing a more Pythonic interface, including type-hinting and snake_case conventions.
Common errors
-
ImportError: No module named boilerpipe
cause This error often occurs when importing `boilerpipe` instead of `boilerpy3`, or when an older `boilerpipe-py3` package (which might have been Java-dependent) is installed. `boilerpy3` is a distinct, native Python port.
fix Install `boilerpy3` with `pip install boilerpy3` and make sure your import statements read `from boilerpy3 import extractors`.
-
HTMLExtractionError: Could not extract content from the given HTML.
cause This exception (or a similar error) can occur when the input HTML is malformed, unexpectedly structured, or when no discernible main content can be identified by the chosen extractor. Prior to v1.0.4, this would always raise an exception.
fix Consider passing `raise_on_failure=False` to your `Extractor` constructor (e.g., `extractors.ArticleExtractor(raise_on_failure=False)`) to handle errors gracefully and retrieve any partial content. Also try different extractors, such as `DefaultExtractor` or `KeepEverythingExtractor`.
-
AttributeError: 'TextBlock' object has no attribute 'set_is_content'
cause This specific method (`TextBlock.set_is_content()`) was temporarily missing or broken in some versions (e.g., prior to v1.0.5) and then restored. Other `AttributeError`s can arise from accessing internal camelCase variables after v1.0.5's snake_case conversion.
fix Upgrade to the latest `boilerpy3` version (v1.0.5 or later fixed `set_is_content`). For other `AttributeError`s related to internal names, make sure you are using the documented public API and check whether a camelCase attribute was converted to snake_case.
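The advice to try different extractors can be sketched as a small fallback chain. This is a minimal illustration, not part of BoilerPy3: `first_nonempty_content` and `default_chain` are hypothetical helper names, and the sketch catches `Exception` broadly on the assumption that you do not want to depend on the exact exception class raised by your installed version.

```python
from typing import Iterable, List, Protocol


class SupportsGetContent(Protocol):
    """Anything with a get_content(html) -> str method, e.g. a BoilerPy3 extractor."""

    def get_content(self, text: str) -> str: ...


def first_nonempty_content(html: str, candidates: Iterable[SupportsGetContent]) -> str:
    """Return the first non-blank result from the given extractors, else ''."""
    for extractor in candidates:
        try:
            content = extractor.get_content(html)
        except Exception:
            # Broad catch is an assumption of this sketch: any failure just
            # means "move on to the next, more permissive extractor".
            continue
        if content and content.strip():
            return content
    return ""


def default_chain() -> List[SupportsGetContent]:
    """Hypothetical ordering: strictest extractor first, most permissive last."""
    # Imported lazily so the helper above can be reused without boilerpy3 installed.
    from boilerpy3 import extractors

    return [
        extractors.ArticleExtractor(),
        extractors.DefaultExtractor(),
        extractors.KeepEverythingExtractor(),
    ]
```

With `boilerpy3` installed, a call such as `first_nonempty_content(html, default_chain())` would try `ArticleExtractor` first and fall back only if it fails or returns blank text. The ordering is a design choice for this sketch, not a recommendation from the library.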
Warnings
- gotcha The `*_from_url()` methods (e.g., `get_content_from_url`) provided by `BoilerPy3` are intended for testing purposes only. For robust, production-grade URL content fetching, it is strongly recommended to use a dedicated HTTP library like `requests` to retrieve the HTML, then pass the raw HTML to `extractor.get_content()`.
- gotcha From v1.0.4, the `Extractor` classes accept a `raise_on_failure` parameter, which defaults to `True`: if an HTML extraction error is encountered, an exception is raised. Passing `False` makes the extractor handle the error internally and return any partially extracted text instead.
- gotcha Version 1.0.5 converted camelCase variable names to snake_case within the library. Although these names are internal, code that accessed private or undocumented attributes in older versions may now raise `AttributeError`.
- gotcha BoilerPy3 is a native Python port of Boilerpipe 1.2. It does not include features from Boilerpipe 1.3, as testing showed 1.3 performed worse in the Python port. Users expecting the latest Boilerpipe features may find the behavior different.
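The two `raise_on_failure` modes described above can be compared with a short wrapper. This is a sketch under stated assumptions: `extract_text` is a hypothetical helper name, and the broad `except Exception` is used only because the sketch does not rely on a specific exception class.

```python
from typing import Optional


def extract_text(html: str, lenient: bool = True) -> Optional[str]:
    """Extract main text; return None only if strict extraction raises.

    lenient=True  -> raise_on_failure=False (partial text is returned on failure)
    lenient=False -> raise_on_failure=True  (the library default)
    """
    # Imported lazily so this sketch can be loaded where boilerpy3 is absent.
    from boilerpy3 import extractors

    extractor = extractors.ArticleExtractor(raise_on_failure=not lenient)
    try:
        return extractor.get_content(html)
    except Exception as exc:  # broad catch is an assumption of this sketch
        print(f"Extraction failed: {exc}")
        return None
```

In lenient mode the caller always gets a string back (possibly empty), which suits batch jobs where one bad page should not stop the run. Strict mode is more useful when a failed extraction needs to be logged or retried.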
Install
-
pip install boilerpy3
Imports
- ArticleExtractor
from boilerpy3.extractors import ArticleExtractor
from boilerpy3 import extractors
extractor = extractors.ArticleExtractor()
- DefaultExtractor
from boilerpy3 import extractors
extractor = extractors.DefaultExtractor()
Quickstart
from boilerpy3 import extractors
import requests
# Example 1: Extract from raw HTML string
html_content = "<html><body><h1>Title</h1><p>Main content here.</p><footer>Footer</footer></body></html>"
extractor = extractors.ArticleExtractor()
content_from_html = extractor.get_content(html_content)
print(f"Content from HTML: {content_from_html}")
# Example 2: Extract from a URL (recommended to use 'requests' for robustness)
# Replace with a real URL for testing
url = "https://example.com"
try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # Raise an exception for HTTP errors
    html_from_url = response.text
    content_from_url = extractor.get_content(html_from_url)
    print(f"\nContent from URL: {content_from_url}")
except requests.exceptions.RequestException as e:
    print(f"\nError fetching URL {url}: {e}")