html2docx
html2docx is a Python library that converts valid HTML input into Microsoft Word (.docx) documents. It uses `python-docx` to generate the document and `BeautifulSoup` to parse the HTML, and it translates common HTML structures and basic styling into an editable Word format. The current version is 1.6.0; releases focus on bug fixes and minor feature enhancements rather than frequent major API changes.
Common errors
-
ModuleNotFoundError: No module named 'html2docx'
cause The `html2docx` package is not installed in the current Python environment, or the environment where the script is being run is not the one where the package was installed.
fix Run `pip install html2docx` in your terminal or ensure your virtual environment is activated.
-
AttributeError: 'NoneType' object has no attribute 'find_all'
cause This typically occurs when the HTML input provided to `parse_html_section` is empty, malformed, or could not be successfully parsed by `BeautifulSoup`, leading to an attempt to call methods on a `None` object.
fix Ensure the `html_content` string is not empty and contains valid, well-formed HTML. Print `html_content` before parsing to debug its contents.
-
ValueError: Invalid image path or data URL
cause An `<img>` tag in the HTML specifies a source (`src`) that is not a valid remote URL or local file path, is an improperly formatted data URL, or points to an image file that cannot be accessed or downloaded.
fix Verify all image `src` attributes. For local files, ensure paths are absolute or correct relative to the script. For remote URLs, check network connectivity and URL validity. Ensure data URLs are base64 encoded correctly.
-
The output DOCX document contains unstyled text or misses formatting (not an error, but a common issue)
cause The source HTML relies heavily on CSS for styling that `html2docx` does not support or cannot translate directly into `python-docx` styles due to limitations of the DOCX format.
fix Simplify HTML and CSS to use basic semantic tags (e.g., `<strong>`, `<em>`, `<h1>`-`<h6>`, `<ul>`, `<ol>`) and minimal inline styles that map well to Word document capabilities (e.g., `text-align`). Avoid complex CSS properties like `float`, `position`, or `display: flex`.
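The empty-input `AttributeError` and the image `ValueError` described above can often be caught before conversion. The following is a minimal sketch that uses only the standard library. `preflight`, `classify_src`, and the `base_dir` parameter are hypothetical helpers introduced here for illustration, not part of html2docx, and the check assumes data URLs are base64 encoded. Remote URLs are only recognized by scheme; their reachability is not tested.

```python
import base64
import binascii
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse


class _ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # A missing or valueless src is recorded as an empty string.
            self.sources.append(dict(attrs).get("src") or "")


def classify_src(src, base_dir="."):
    """Return 'data', 'remote', 'local', or 'invalid' for an <img> src value."""
    if src.startswith("data:"):
        header, sep, payload = src.partition(",")
        # Assumption: only base64-encoded data URLs are considered usable.
        if not sep or not payload or not header.endswith(";base64"):
            return "invalid"
        try:
            base64.b64decode(payload, validate=True)
        except binascii.Error:
            return "invalid"
        return "data"
    if urlparse(src).scheme in ("http", "https"):
        return "remote"  # network reachability is not checked here
    # Relative paths are resolved against base_dir; absolute paths are kept.
    if src and (Path(base_dir) / src).is_file():
        return "local"
    return "invalid"


def preflight(html_content, base_dir="."):
    """Return a list of problems found before the HTML reaches a converter."""
    if not html_content or not html_content.strip():
        return ["HTML input is empty"]
    collector = _ImgSrcCollector()
    collector.feed(html_content)
    return [
        f"Unusable image src: {src!r}"
        for src in collector.sources
        if classify_src(src, base_dir) == "invalid"
    ]
```

Calling `preflight(html_content)` before the conversion step and stopping when the returned list is non-empty gives a clearer message than the exceptions listed above. Passing `base_dir` explicitly makes relative image paths independent of the directory the script happens to run from.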
Warnings
- gotcha html2docx primarily translates HTML *structure and semantics* (e.g., `<h1>`, `<p>`, `<ul>`, `<strong>`) rather than the exact *visual layout* dictated by complex CSS. Advanced CSS layout features (e.g., `float`, `position`, `display: flex`) are often ignored or translated imperfectly due to DOCX format limitations.
- gotcha Relative image paths in the HTML may not resolve correctly when converting, especially if the DOCX is generated in a different context than the HTML was intended to be viewed. The library needs direct access to image files.
- gotcha The library expects well-formed and valid HTML. Malformed tags, unclosed elements, or overly complex/non-standard HTML structures can lead to unexpected output, missing content, or errors during parsing.
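One way to act on the first warning is to strip inline CSS declarations that are unlikely to survive conversion before handing the HTML over. The sketch below is an assumption-laden illustration: `simplify_inline_styles`, `filter_declarations`, and the `ALLOWED_PROPERTIES` allow-list are hypothetical names, the allow-list contains only `text-align` because that is the one property this page names as mapping well, and the regular expression handles only double-quoted `style` attributes.

```python
import re

# Assumption: illustrative allow-list, not the library's documented support matrix.
ALLOWED_PROPERTIES = frozenset({"text-align"})

# Matches a double-quoted style attribute together with its leading whitespace.
_STYLE_ATTR = re.compile(r'\s+style\s*=\s*"([^"]*)"', re.IGNORECASE)


def filter_declarations(style_value, allowed=ALLOWED_PROPERTIES):
    """Keep only the 'property: value' pairs whose property is allowed."""
    kept = []
    for declaration in style_value.split(";"):
        name, sep, value = declaration.partition(":")
        name, value = name.strip().lower(), value.strip()
        if sep and value and name in allowed:
            kept.append(f"{name}: {value}")
    return "; ".join(kept)


def simplify_inline_styles(html_content, allowed=ALLOWED_PROPERTIES):
    """Rewrite every style attribute, dropping it when nothing allowed remains."""

    def _replace(match):
        filtered = filter_declarations(match.group(1), allowed)
        return f' style="{filtered}"' if filtered else ""

    return _STYLE_ATTR.sub(_replace, html_content)
```

For example, `simplify_inline_styles('<p style="float: left; text-align: center;">x</p>')` keeps only the `text-align` declaration. A regular expression is a deliberate simplification here; for untrusted or irregular markup, a real HTML parser would be the safer choice.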
Install
-
pip install html2docx
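If the import still fails after installation, the package was probably installed for a different interpreter than the one running the script. The following sketch, which uses only the standard library, reports which interpreter is active and whether it can find the package; `is_installed` is a hypothetical helper name.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if the running interpreter can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # sys.executable shows which Python is actually running this script.
    print(f"Interpreter: {sys.executable}")
    print(f"html2docx importable: {is_installed('html2docx')}")
```

When the check reports that the package is not importable, running `python -m pip install html2docx` with that same interpreter installs it into the matching environment.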
Imports
- Html2Docx
from html2docx import Html2Docx
Quickstart
import os
import tempfile
from html2docx import Html2Docx
# Example HTML content with basic structure and an external image
html_content = """
<h1>Hello World!</h1>
<p>This is a paragraph with some <strong>bold</strong> and <em>italic</em> text.</p>
<ul>
<li>List Item 1</li>
<li>List Item 2</li>
</ul>
<p style="text-align: center;">A centered paragraph.</p>
<img src="https://www.python.org/static/community_logos/python-logo-only.png" width="100px" alt="Python Logo">
"""
# Initialize the parser
new_parser = Html2Docx()
# Parse the HTML and get a python-docx Document object
docx = new_parser.parse_html_section(html_content)
# Save the document to a temporary file
with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as temp_file:
    docx.save(temp_file.name)
    print(f"Generated DOCX saved to: {temp_file.name}")
# To view the file, uncomment the following line (might not work on all OSs)
# os.startfile(temp_file.name) # On Windows
# import subprocess; subprocess.call(['open', temp_file.name]) # On macOS
# In a real application, you might want to clean up the temp file after use
# os.remove(temp_file.name)