HTML to Markdown Converter
html-to-markdown is a high-performance Python library for converting HTML to Markdown, powered by a Rust core. Currently at version 3.1.0, it offers a clean Python API and aims for consistent output across multiple language bindings. The library is actively maintained with ongoing development and performance enhancements.
Warnings
- breaking Version 2.x introduced a complete rewrite with a Rust core, leading to significant performance gains but also breaking changes in the API. While a `v1_compat` module was provided, users upgrading from 1.x should review the changelog for necessary code adjustments.
- gotcha Markdown is a less expressive format than HTML. Complex HTML structures and inline styles are simplified, and non-content tags (e.g., `<script>`, `<style>`) are simplified or removed entirely during conversion, so the original formatting or functionality may be lost.
- gotcha Conversion of complex HTML tables (e.g., with `colspan`, `rowspan`, nested elements) and `<code>`/`<pre>` blocks might not perfectly retain original formatting or indentation in Markdown. This can lead to less readable or incorrectly structured output.
- gotcha The primary `convert()` function returns only the Markdown string. To extract structured metadata such as titles, links, or headings from the HTML during conversion, use `convert_with_metadata()`, which returns a dictionary containing both the content and the metadata.
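Because the losses described in the warnings above happen silently, it can help to scan the input before converting it. The following is a minimal sketch using only the Python standard library; it is not part of html-to-markdown, and the names `LossyConstructScanner` and `scan_lossy_constructs` are hypothetical helpers introduced for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class LossyConstructScanner(HTMLParser):
    """Counts HTML constructs that Markdown cannot represent faithfully."""

    DROPPED_TAGS = {"script", "style"}   # tags the warnings say may be removed
    SPAN_ATTRS = {"colspan", "rowspan"}  # table features with no Markdown equivalent

    def __init__(self):
        super().__init__()
        self.findings = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        names = {name for name, _ in attrs}
        if tag in self.DROPPED_TAGS:
            self.findings[f"<{tag}> element"] += 1
        if "style" in names:
            self.findings["inline style attribute"] += 1
        if tag in ("td", "th") and names & self.SPAN_ATTRS:
            self.findings["table cell with colspan/rowspan"] += 1


def scan_lossy_constructs(html):
    """Return a dict mapping each risky construct to how often it occurs."""
    scanner = LossyConstructScanner()
    scanner.feed(html)
    scanner.close()
    return dict(scanner.findings)


sample = (
    '<p style="color:red">Hi</p><script>alert(1)</script>'
    '<table><tr><td colspan="2">x</td></tr></table>'
)
for construct, count in scan_lossy_constructs(sample).items():
    print(f"{count} x {construct}")
```

An empty result does not guarantee a lossless conversion; the scan only flags the constructs named in the warnings.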
Install
pip install html-to-markdown
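Since the API changed between 1.x and 2.x, it may be worth confirming which major version pip actually installed. This sketch relies only on the standard library's `importlib.metadata`; the helper name `installed_major_version` is hypothetical, and the distribution name is assumed to match the pip package name above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_major_version(dist_name="html-to-markdown"):
    """Return the installed major version as an int, or None if not installed."""
    try:
        # Assumes a conventional "MAJOR.MINOR.PATCH" version string.
        return int(version(dist_name).split(".")[0])
    except PackageNotFoundError:
        return None


major = installed_major_version()
if major is None:
    print("html-to-markdown is not installed")
elif major < 2:
    print("1.x API detected; review the changelog before upgrading")
else:
    print(f"Rust-core release detected (major version {major})")
```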
Imports
- convert
from html_to_markdown import convert
- ConversionOptions
from html_to_markdown import ConversionOptions
- convert_with_metadata
from html_to_markdown import convert_with_metadata
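The warnings describe `convert_with_metadata()` as returning both content and metadata. A small helper can isolate calling code from the exact shape of that result. This is a sketch under assumptions: the key names `content` and `metadata` are not confirmed by this page, the tuple branch is a purely defensive fallback, and `unpack_conversion_result` is a hypothetical name.

```python
def unpack_conversion_result(result):
    """Return (markdown, metadata) from a dict- or tuple-shaped result."""
    if isinstance(result, dict):
        # ASSUMPTION: keys are named "content" and "metadata".
        return result["content"], result.get("metadata", {})
    if isinstance(result, tuple) and len(result) == 2:
        # ASSUMPTION: defensive fallback for a (markdown, metadata) pair.
        return result
    raise TypeError(f"unexpected result type: {type(result).__name__}")


# Intended use with the library (not executed here):
#   markdown, metadata = unpack_conversion_result(convert_with_metadata(html))
stub = {"content": "# Welcome\n", "metadata": {"title": "Welcome"}}
markdown, metadata = unpack_conversion_result(stub)
print(markdown, metadata)
```

Failing loudly on an unknown shape keeps a future API change from passing unnoticed.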
Quickstart
from html_to_markdown import convert, ConversionOptions
html_content = """
<h1>Welcome</h1>
<p>This is <strong>bold</strong> and <em>italic</em> text.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
"""
# Basic conversion
markdown_output = convert(html_content)
print(f"Default Markdown:\n{markdown_output}")
# Conversion with options
options = ConversionOptions(
    heading_style="atx",
    list_indent_width=2,
    output_format="commonmark",
)
formatted_markdown = convert(html_content, options)
print(f"\nFormatted Markdown (CommonMark):\n{formatted_markdown}")
# Example for Djot output (another lightweight markup language)
djot_options = ConversionOptions(output_format="djot")
djot_output = convert(html_content, djot_options)
print(f"\nDjot Output:\n{djot_output}")