HTML to DOCX Converter
html-for-docx is a Python library that converts HTML content into Microsoft Word (.docx) documents. It is an actively maintained fork of the discontinued `pqzx/html2docx` project and aims to be a more reliable way to generate Word documents from HTML. The current version is 1.1.4; releases focus on bug fixes and feature enhancements, including improved CSS and HTML tag support.
Common errors
- FileNotFoundError: [Errno 2] No such file or directory: 'your_file_name.docx'
  cause: The output DOCX path is invalid, or the directory where the file should be saved does not exist.
  fix: Make sure the target directory exists before saving. Python's `os.makedirs(directory, exist_ok=True)` creates it if needed.
- Tables are not showing borders or other expected styles in the output DOCX.
  cause: The `html-for-docx` library does not apply a default style to tables.
  fix: Set the `table_style` attribute on your `HtmlToDocx` parser instance before processing, e.g. `parser.table_style = 'Table Grid'` or `parser.table_style = 'Light Shading Accent 1'`. Refer to the `python-docx` documentation or Word itself for available table style names.
- Specific HTML tags or inline CSS styles (e.g., `color`, `font-size`) are not being applied, or render incorrectly in the DOCX output.
  cause: The library does not support every CSS property, and style precedence can let one rule override another.
  fix: Consult the `html-for-docx` documentation for the list of supported HTML tags and CSS properties. For custom class-based styling, use the `style_map` option. For highest precedence, apply inline CSS with `!important`.
- Crash or incorrect rendering when processing images with RGBA color profiles.
  cause: Versions prior to 1.1.3 had a bug handling specific image formats, notably those with RGBA color profiles.
  fix: Upgrade `html-for-docx` to version 1.1.3 or higher; the bug was fixed in that release.
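The `FileNotFoundError` case above can be avoided by creating the output directory before saving. The following is a minimal sketch using only the standard library; the helper name `prepare_output_path` is hypothetical and not part of `html-for-docx`.

```python
from pathlib import Path


def prepare_output_path(path_str):
    """Return an absolute Path for the output file, creating its parent directory."""
    path = Path(path_str).expanduser().resolve()
    # parents=True creates intermediate directories; exist_ok=True makes
    # the call a no-op when the directory is already there.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

The returned path can then be passed to `document.save(str(path))`. The helper only creates the directory; it does not create or touch the `.docx` file itself.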
Warnings
- gotcha HTML to DOCX conversion inherently carries limitations, especially with complex CSS layouts, responsive designs, or intricate styling. The output DOCX might not perfectly match the browser's rendering of the HTML.
- gotcha By default, tables in the output DOCX will not have any specific styling (e.g., borders).
- gotcha If you are using `python-docx` templates with custom styles, these custom styles will not be present if you initialize `document = Document()` without loading the template. This can lead to missing styles when adding HTML content.
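The table-style and template warnings above can be handled together. This is a minimal sketch, assuming `python-docx` and `html-for-docx` are installed; the helper names and the `template.docx` file name are hypothetical. It also assumes the `'Table Grid'` style exists in the document, which holds for the default `python-docx` template but may not hold for a custom one.

```python
def new_document(template_path=None):
    """Return a python-docx Document, loading a template when one is given."""
    # Imported lazily so this module loads even where python-docx is absent.
    from docx import Document

    # Document(None) and Document() both start from the default template,
    # which does not contain custom styles from your own template file.
    return Document(template_path) if template_path else Document()


def add_html_with_table_style(html, document, table_style="Table Grid"):
    """Append HTML to the document, styling any tables with table_style."""
    from html4docx import HtmlToDocx

    parser = HtmlToDocx()
    parser.table_style = table_style  # must be set before processing
    parser.add_html_to_document(html, document)
    return document


# Usage (not run here):
# doc = new_document("template.docx")
# add_html_with_table_style("<table><tr><td>A</td><td>B</td></tr></table>", doc)
# doc.save("styled.docx")
```

Loading the template first means its custom styles are available when the HTML is added; the table style name has to match a style defined in whichever document is loaded.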
Install
- pip install html-for-docx
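Since the RGBA image fix landed in 1.1.3, it can be useful to confirm the installed version after installing. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+); `parse_version` and `installed_at_least` are hypothetical helper names, and the version parsing is deliberately simple (numeric components only).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '1.1.4' into (1, 1, 4), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def installed_at_least(dist_name, minimum):
    """Return True if dist_name is installed at version >= minimum."""
    try:
        installed = version(dist_name)
    except PackageNotFoundError:
        return False
    return parse_version(installed) >= parse_version(minimum)


# Note the distribution name (html-for-docx) differs from the import name (html4docx).
print(installed_at_least("html-for-docx", "1.1.3"))
```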
Imports
- HtmlToDocx
from html4docx import HtmlToDocx
Quickstart
from docx import Document
from html4docx import HtmlToDocx
from io import BytesIO
# Example 1: Add HTML to an existing Document object and save
document = Document() # Or load an existing .docx: Document('template.docx')
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1><p>This is a <strong>paragraph</strong> with some <em>formatting</em>.</p>'
parser.add_html_to_document(html_string, document)
document.save('output.docx')
print("Saved 'output.docx' with basic HTML content.")
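# Example 2: Build the DOCX in memory and write it to a BytesIO buffer
buffer = BytesIO()
document_in_memory = Document()
parser_in_memory = HtmlToDocx()
html_string_2 = '<p style="color: blue;">This text is blue.</p>'
# add_html_to_document expects a Document (or table cell), not a buffer
parser_in_memory.add_html_to_document(html_string_2, document_in_memory)
document_in_memory.save(buffer)  # python-docx can save to a file-like object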
# To read from the buffer again, reset its position
buffer.seek(0)
print(f"Generated DOCX in memory, size: {len(buffer.getvalue())} bytes.")
# Example 3: Convert an HTML file directly
# Create a dummy HTML file for demonstration
with open('input.html', 'w', encoding='utf-8') as f:
    f.write('<h2>Content from file</h2><p>This was converted from an HTML file.</p>')
file_parser = HtmlToDocx()
file_parser.parse_html_file('input.html', 'output_from_file.docx')
print("Saved 'output_from_file.docx' from 'input.html'.")
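A `.docx` file is a ZIP package, so bytes produced in memory (as in Example 2) can be sanity-checked without opening Word. This is a rough sketch using only the standard library; `looks_like_docx` is a hypothetical helper, and it only checks for the expected package members rather than validating the document.

```python
import zipfile
from io import BytesIO


def looks_like_docx(data: bytes) -> bool:
    """Return True if data is a ZIP archive containing the core DOCX parts."""
    stream = BytesIO(data)
    if not zipfile.is_zipfile(stream):
        return False
    stream.seek(0)
    with zipfile.ZipFile(stream) as package:
        names = set(package.namelist())
    # Every WordprocessingML package carries these two members.
    return "[Content_Types].xml" in names and "word/document.xml" in names
```

With the Quickstart buffer, the check would be called as `looks_like_docx(buffer.getvalue())`.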