html2text
html2text is a Python library that converts HTML into clean, easy-to-read plain ASCII text that is also valid Markdown. It offers many options for customizing the conversion. The project is actively maintained, with regular releases.
Warnings
- breaking Support for Python 2 was removed in release 2019.8.11, and later releases dropped older Python 3 versions. The library now requires Python 3.9 or newer.
- breaking Retrieving HTML over the network by passing a URL directly to the library was removed in release 2019.8.11. Earlier versions issued deprecation warnings for this feature. Fetch the HTML yourself and pass the resulting string to the library.
- gotcha To configure conversion options (e.g., `ignore_links`, `body_width`, `images_as_html`), create an instance of `html2text.HTML2Text()`, set attributes on it, then call its `handle()` method. The top-level `html2text.html2text()` function accepts only the optional `baseurl` and `bodywidth` arguments, not the other configuration options.
- gotcha By default, `html2text` wraps long lines at 78 characters. To disable wrapping, which is often desirable for programmatic parsing or specific Markdown formatting, set the `body_width` option to `0`.
Install
pip install html2text
Imports
- html2text
import html2text
text = html2text.html2text(html_content)
- HTML2Text
from html2text import HTML2Text
h = HTML2Text()
h.ignore_links = True
text = h.handle(html_content)
Quickstart
import html2text
html_content = """
<h1>Welcome</h1>
<p>Hello, <b>world</b>! This is a <a href="https://example.com">link</a>.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
"""
# Basic conversion
plain_text = html2text.html2text(html_content)
print("--- Basic Conversion ---")
print(plain_text)
# Custom conversion with options (e.g., ignoring links and no line wrapping)
h = html2text.HTML2Text()
h.ignore_links = True # Do not include link URLs
h.body_width = 0 # Disable line wrapping
custom_text = h.handle(html_content)
print("\n--- Custom Conversion (No links, no wrap) ---")
print(custom_text)