html2text
html2text is a Python library that converts HTML into clean, easy-to-read plain ASCII text that is also valid Markdown. The conversion can be customized through options set on a converter object. The project is actively maintained and publishes regular releases.
Common errors
- ModuleNotFoundError: No module named 'html2text'
  cause The `html2text` library is not installed in the Python environment that runs the code, or a different Python interpreter is being used than the one it was installed for.
  fix Install the library with pip: `pip install html2text`. If several Python versions are present, install it for the interpreter you actually run (e.g., `pip3 install html2text` or `python -m pip install html2text`).
- TypeError: a bytes-like object is required, not 'str'
  cause `html2text.html2text()` or an `HTML2Text` instance's `handle()` method received a `bytes` object where it expects a `str`. This typically happens when the raw result of a network read (e.g., `urllib.request.urlopen().read()`) is passed in without being decoded.
  fix Pass a Unicode string to `html2text`. If you obtain bytes from a source such as `urllib.request.urlopen().read()`, decode them first: `html_content_bytes = website.read(); html_content_str = html_content_bytes.decode('utf-8'); text = html2text.html2text(html_content_str)`.
- AttributeError: 'str' object has no attribute 'decode'
  cause In Python 3, `.decode()` was called on a `str` object. Python 3 strings are already Unicode; `.decode()` exists only on `bytes` objects, where it converts them to `str`.
  fix Remove the `.decode()` call if the data is already a string. If you intend to convert bytes to a string, make sure the variable holds a `bytes` object before calling `.decode()`.
- ImportError: cannot import name 'unescape' from 'html2text'
  cause `unescape` cannot be imported from the top-level `html2text` module in current versions of the library; it may have been an internal utility in older versions.
  fix Use Python's built-in `html.unescape` for general HTML entity unescaping, or rely on `html2text`'s main conversion methods, which handle unescaping implicitly: `import html; cleaned_text = html.unescape(some_html_string)` or `import html2text; converter = html2text.HTML2Text(); result = converter.handle(some_html_string)`.
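The fixes above can be sketched with the standard library alone. This is a minimal illustration, not part of html2text: `is_installed` and `ensure_str` are hypothetical helper names introduced here, and UTF-8 is only an assumed default encoding.

```python
import html
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can find the module."""
    return importlib.util.find_spec(module_name) is not None


def ensure_str(data, encoding: str = "utf-8") -> str:
    """Return text for bytes or str input, decoding only when needed.

    Hypothetical helper: it avoids both the bytes/str TypeError and the
    "'str' object has no attribute 'decode'" AttributeError.
    """
    if isinstance(data, (bytes, bytearray)):
        return bytes(data).decode(encoding, errors="replace")
    if isinstance(data, str):
        return data
    raise TypeError(f"expected str or bytes, got {type(data).__name__}")


# Bytes (e.g. from a network read) are decoded; strings pass through unchanged.
print(ensure_str(b"<p>Fish &amp; chips</p>"))
print(ensure_str("<p>Fish &amp; chips</p>"))

# Entity unescaping without importing anything from html2text.
print(html.unescape("Fish &amp; chips &lt;fresh&gt;"))

# Check whether html2text is importable before relying on it.
print(is_installed("html2text"))
```

The result of `ensure_str(...)` is what would then be passed to `html2text.html2text()` or `HTML2Text().handle()`.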
Warnings
- breaking Support for Python 2.x and older Python 3 versions was removed in release 2019.8.11. The library now officially requires Python 3.9 or newer.
- breaking The functionality to retrieve HTML over the network by passing URLs directly to the library was removed in release 2019.8.11. Earlier versions issued deprecation warnings for this feature.
- gotcha To configure conversion options (e.g., `ignore_links`, `body_width`, `images_as_html`), create an instance of `html2text.HTML2Text()`, set the options as attributes on it, then call its `handle()` method. The top-level `html2text.html2text()` function is a convenience wrapper that takes only `baseurl` and `bodywidth` in addition to the HTML; it does not accept the other configuration options.
- gotcha By default, `html2text` wraps long lines (the default `body_width` is 78 characters). To disable wrapping, which is often desirable for programmatic parsing or specific Markdown formatting, set `body_width` to `0`.
Install
- pip install html2text
Imports
- html2text
import html2text
text = html2text.html2text(html_content)
- HTML2Text
from html2text import HTML2Text
h = HTML2Text()
h.ignore_links = True
text = h.handle(html_content)
Quickstart
import html2text
html_content = """
<h1>Welcome</h1>
<p>Hello, <b>world</b>! This is a <a href="https://example.com">link</a>.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
"""
# Basic conversion
plain_text = html2text.html2text(html_content)
print("--- Basic Conversion ---")
print(plain_text)
# Custom conversion with options (e.g., ignoring links and no line wrapping)
h = html2text.HTML2Text()
h.ignore_links = True # Do not include link URLs
h.body_width = 0 # Disable line wrapping
custom_text = h.handle(html_content)
print("\n--- Custom Conversion (No links, no wrap) ---")
print(custom_text)