{"id":7989,"library":"boilerpy3","title":"BoilerPy3","description":"BoilerPy3 is an active Python port of Christian Kohlschütter's Boilerpipe library, designed for robust HTML boilerplate removal and main text extraction from web pages. It is currently at version 1.0.7 and is based on Boilerpipe 1.2 functionality. The library focuses on providing a more Pythonic interface, including type-hinting and snake_case conventions.","status":"active","version":"1.0.7","language":"en","source_language":"en","source_url":"https://github.com/jmriebold/BoilerPy3","tags":["HTML","text extraction","boilerplate removal","web scraping","content extraction"],"install":[{"cmd":"pip install boilerpy3","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Recommended for robust fetching of URL content, as built-in `_from_url()` methods are for testing only.","package":"requests","optional":true}],"imports":[{"note":"While direct import works, the recommended pattern from the official examples is `from boilerpy3 import extractors`.","wrong":"from boilerpy3.extractors import ArticleExtractor","symbol":"ArticleExtractor","correct":"from boilerpy3 import extractors\nextractor = extractors.ArticleExtractor()"},{"note":"Used for general text extraction when `ArticleExtractor` is not suitable.","symbol":"DefaultExtractor","correct":"from boilerpy3 import extractors\nextractor = extractors.DefaultExtractor()"}],"quickstart":{"code":"from boilerpy3 import extractors\nimport requests\n\n# Example 1: Extract from raw HTML string\nhtml_content = \"<html><body><h1>Title</h1><p>Main content here.</p><footer>Footer</footer></body></html>\"\nextractor = extractors.ArticleExtractor()\ncontent_from_html = extractor.get_content(html_content)\nprint(f\"Content from HTML: {content_from_html}\")\n\n# Example 2: Extract from a URL (recommended to use 'requests' for robustness)\n# Replace with a real URL for testing\nurl = \"https://example.com\"\n\ntry:\n    response = requests.get(url, 
timeout=5)\n    response.raise_for_status()  # Raise an exception for HTTP errors\n    html_from_url = response.text\n    content_from_url = extractor.get_content(html_from_url)\n    print(f\"\\nContent from URL: {content_from_url}\")\nexcept requests.exceptions.RequestException as e:\n    print(f\"\\nError fetching URL {url}: {e}\")","lang":"python","description":"Demonstrates extracting content from a raw HTML string and from a URL. For URLs, fetch the page with the `requests` library, then pass the HTML to the extractor."},"warnings":[{"fix":"Use `requests` or a similar library to fetch HTML, then process with `extractor.get_content(html_string)`.","message":"The `*_from_url()` methods (e.g., `get_content_from_url`) in BoilerPy3 are intended for testing only. In production, fetch the HTML with a dedicated HTTP library such as `requests`, then pass it to `extractor.get_content()`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Pass `raise_on_failure=True` or `raise_on_failure=False` explicitly to the extractor constructor, depending on the error handling you want. If your code relies on exceptions to detect extraction failures, keep it `True`.","message":"Since v1.0.4, the extractor classes accept a `raise_on_failure` parameter that defaults to `True`, so an HTML extraction error raises an exception. Setting it to `False` makes the extractor handle the exception internally and return whatever text was successfully extracted.","severity":"gotcha","affected_versions":">=1.0.4"},{"fix":"Review any code that might have accessed internal or undocumented camelCase attributes and update them to their snake_case equivalents if they were exposed. 
Stick to documented public APIs to avoid such issues.","message":"Version 1.0.5 renamed camelCase variables inside the library to snake_case. The change is internal, but code that accessed private or undocumented attributes in older versions may now raise `AttributeError`.","severity":"gotcha","affected_versions":">=1.0.5"},{"fix":"Be aware that the underlying algorithm is equivalent to Boilerpipe 1.2. If a specific Boilerpipe 1.3 feature is needed, it might not be available in `boilerpy3`.","message":"BoilerPy3 is a native Python port of Boilerpipe 1.2. It does not include features from Boilerpipe 1.3, as testing showed 1.3 performed worse in the Python port. Users expecting the latest Boilerpipe features may find the behavior different.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Install the package with `pip install boilerpy3` and import it with `from boilerpy3 import extractors`.","cause":"This error typically occurs when code imports `boilerpipe` instead of `boilerpy3`, or when an older `boilerpipe-py3` package (which might have been Java-dependent) was installed. `boilerpy3` is a distinct, native Python port.","error":"ImportError: No module named boilerpipe"},{"fix":"Consider passing `raise_on_failure=False` to your `Extractor` constructor (e.g., `extractors.ArticleExtractor(raise_on_failure=False)`) to gracefully handle errors and retrieve any partial content. Also, try different extractors like `DefaultExtractor` or `KeepEverythingExtractor`.","cause":"This exception (or a similar error) can occur when the input HTML is malformed, unexpectedly structured, or when no discernible main content can be identified by the chosen extractor. 
Prior to v1.0.4, this would always raise an exception.","error":"HTMLExtractionError: Could not extract content from the given HTML."},{"fix":"Upgrade to the latest `boilerpy3` version (v1.0.5 or later fixed `set_is_content`). For other `AttributeError`s related to internal names, ensure you are using the documented public API and check if a camelCase attribute was converted to snake_case.","cause":"This specific method (`TextBlock.set_is_content()`) was temporarily missing or broken in some versions (e.g., prior to v1.0.5) and then restored. Other `AttributeError`s could arise from accessing internal camelCase variables after v1.0.5's snake_case conversion.","error":"AttributeError: 'TextBlock' object has no attribute 'set_is_content'"}]}