{"id":9816,"library":"html2docx","title":"html2docx","description":"html2docx is a Python library that converts valid HTML input into Microsoft Word (.docx) documents. It builds the document with `python-docx` and parses the input with Python's standard-library HTML parser, translating common HTML structures and basic styling into an editable Word format. The current version is 1.6.0; releases focus on bug fixes and minor feature enhancements rather than major API changes.","status":"active","version":"1.6.0","language":"en","source_language":"en","source_url":"https://github.com/erezlife/html2docx","tags":["html","docx","word","conversion","document-generation","python-docx"],"install":[{"cmd":"pip install html2docx","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core dependency for generating .docx files.","package":"python-docx","optional":false}],"imports":[{"symbol":"html2docx","correct":"from html2docx import html2docx"}],"quickstart":{"code":"import os\nimport tempfile\nfrom html2docx import html2docx\n\n# Example HTML content with basic structure and an external image\nhtml_content = \"\"\"\n<h1>Hello World!</h1>\n<p>This is a paragraph with some <strong>bold</strong> and <em>italic</em> text.</p>\n<ul>\n    <li>List Item 1</li>\n    <li>List Item 2</li>\n</ul>\n<p style=\"text-align: center;\">A centered paragraph.</p>\n<img src=\"https://www.python.org/static/community_logos/python-logo-only.png\" width=\"100\" alt=\"Python Logo\">\n\"\"\"\n\n# Convert the HTML; html2docx() returns an io.BytesIO buffer holding the .docx data\nbuf = html2docx(html_content, title=\"Hello World\")\n\n# Write the buffer to a temporary file\nwith tempfile.NamedTemporaryFile(suffix=\".docx\", delete=False) as temp_file:\n    temp_file.write(buf.getvalue())\n\nprint(f\"Generated DOCX saved to: {temp_file.name}\")\n\n# To view the file, uncomment one of the following lines (platform-specific)\n# os.startfile(temp_file.name) # On Windows\n# import subprocess; subprocess.call(['open', temp_file.name]) # On macOS\n\n# In a real application, you might want to clean up the temp file after use\n# os.remove(temp_file.name)","lang":"python","description":"This quickstart converts a string of HTML into a .docx file using `html2docx`. It calls `html2docx()` with the HTML and a document title, receives an `io.BytesIO` buffer containing the generated document, and writes that buffer to a temporary file for viewing or further processing."},"warnings":[{"fix":"Simplify HTML and CSS to use basic semantic tags and inline styles that map well to Word document capabilities. Do not rely on complex CSS for layout; use simpler constructs such as `text-align` or basic list and paragraph styles.","message":"html2docx primarily translates HTML *structure and semantics* (e.g., `<h1>`, `<p>`, `<ul>`, `<strong>`) rather than the exact *visual layout* dictated by complex CSS. Layout-oriented CSS (e.g., `float`, `position`, flexbox) is often ignored or translated imperfectly because of DOCX format limitations.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Always use absolute URLs for remote images, or ensure local image paths are absolute or correctly resolvable from the script's execution directory. Verify that the image files exist and are accessible.","message":"Relative image paths in the HTML may not resolve correctly during conversion, especially if the DOCX is generated in a different context than the one the HTML was written for. The library needs direct access to image files.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Pre-process HTML with an HTML linter or a library such as `BeautifulSoup` to clean it up and ensure it is well-formed before passing it to `html2docx`. 
Avoid non-standard or deprecated HTML tags.","message":"The library expects well-formed and valid HTML. Malformed tags, unclosed elements, or overly complex or non-standard HTML structures can lead to unexpected output, missing content, or errors during parsing.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Activate the virtual environment your script runs in, then run `pip install html2docx` in your terminal.","cause":"The `html2docx` package is not installed in the Python environment that runs the script; it may have been installed into a different environment.","error":"ModuleNotFoundError: No module named 'html2docx'"},{"fix":"Ensure the `html_content` string is not empty and contains valid, well-formed HTML. Print `html_content` before conversion to inspect its contents, and check that any element lookup returned a result before calling methods on it.","cause":"This typically occurs when the HTML input is empty or malformed, so that a `BeautifulSoup` lookup returns `None` and a method such as `find_all` is then called on that `None` object.","error":"AttributeError: 'NoneType' object has no attribute 'find_all'"},{"fix":"Verify all image `src` attributes. For local files, ensure paths are absolute or correct relative to the script. For remote URLs, check network connectivity and URL validity. Ensure data URLs are base64-encoded correctly.","cause":"An `<img>` tag in the HTML has a source (`src`) that is not a valid URL or file path, is an improperly formatted data URL, or refers to an image that cannot be accessed or downloaded.","error":"ValueError: Invalid image path or data URL"},{"fix":"Simplify HTML and CSS to use basic semantic tags (e.g., `<strong>`, `<em>`, `<h1>`-`<h6>`, `<ul>`, `<ol>`) and minimal inline styles that map well to Word document capabilities (e.g., `text-align`). 
Avoid layout-oriented CSS such as `float`, `position`, or `display: flex`.","cause":"The source HTML relies heavily on CSS styling that `html2docx` does not support or cannot translate directly into `python-docx` styles because of limitations of the DOCX format.","error":"The output DOCX document contains unstyled text or is missing formatting (not an error, but a common issue)"}]}