Mammoth
Mammoth is an open-source Python library that converts Microsoft Word `.docx` documents into clean, semantic HTML. Markdown output is also available but deprecated. It preserves the semantic structure of the document (e.g., headings, lists, tables) rather than replicating its exact visual formatting. The current version is 1.12.0. Releases are regular and cover both new features and maintenance.
Warnings
- gotcha Mammoth prioritizes semantic conversion over exact visual fidelity. It converts styles like 'Heading 1' to `<h1>` elements, ignoring precise font sizes or colors. Users expecting a pixel-perfect rendition of their Word document may be disappointed by the 'clean' HTML output.
- breaking Markdown support is deprecated. The `convert_to_markdown` function still exists but is discouraged. Future versions may remove or significantly change this functionality. Generating HTML and then using a separate library for HTML to Markdown conversion is recommended for better results.
- breaking Mammoth performs no sanitization of the source `.docx` document. Converting documents from untrusted users can introduce security vulnerabilities, such as `javascript:` links in the output HTML.
- gotcha WMF images are not handled by default. If your `.docx` documents contain WMF images, they will not be correctly converted or embedded in the output HTML.
- gotcha Custom style mappings such as `p[style-name='...'] => ...:fresh` can produce unexpected HTML structures if the `:fresh` modifier is misunderstood. `:fresh` forces Mammoth to create a new HTML element for each matching paragraph. Without it, Mammoth reuses an already open matching element where possible, so content can be appended to the previous element instead of starting a new one.
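The `:fresh` behaviour described in the last warning is easier to see with a concrete style map. The following is a minimal sketch, not a definitive recipe: the Word style names (`Section Title`, `Aside Heading`, `Aside Text`) and the helper names `build_style_map` and `convert_with_style_map` are hypothetical and chosen only for illustration. The `style_map` argument of `convert_to_html` is part of Mammoth's API.

```python
def build_style_map(paragraph_styles):
    """Build a Mammoth style map from {Word paragraph style name: HTML path}.

    Style names containing a single quote are not escaped in this sketch.
    """
    return "\n".join(
        "p[style-name='{0}'] => {1}".format(style_name, html_path)
        for style_name, html_path in paragraph_styles.items()
    )


# Hypothetical Word style names; use the names defined in your own document.
STYLE_MAP = build_style_map({
    "Section Title": "h1:fresh",
    "Aside Heading": "div.aside > h2:fresh",
    "Aside Text": "div.aside > p:fresh",
})


def convert_with_style_map(path, style_map=STYLE_MAP):
    """Convert the .docx at `path`, applying `style_map` on top of Mammoth's defaults."""
    import mammoth  # third-party: pip install mammoth

    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)
    return result.value, result.messages
```

With a map like this, consecutive `Aside Text` paragraphs should end up inside a single `div.aside`, each in its own `<p>`, because only the last element of the path is marked `:fresh`. Checking the returned messages is worthwhile, since Mammoth reports styles it could not map there.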
Install
pip install mammoth
Imports
- convert_to_html
import mammoth
mammoth.convert_to_html(...)
- extract_raw_text
import mammoth
mammoth.extract_raw_text(...)
- images
import mammoth.images
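As a rough illustration of how these three entry points fit together, the sketch below extracts plain text and converts a document with a custom image handler. It assumes the image objects Mammoth passes to the handler expose `open()` and `content_type`, as in the library's README. The function names `to_data_uri_attributes`, `convert_with_inline_images`, and `extract_text` are hypothetical.

```python
import base64


def to_data_uri_attributes(image):
    """Return <img> attributes that embed the image as a base64 data URI.

    `image` is expected to provide `open()` (a context manager yielding a
    binary stream) and `content_type`, like the objects Mammoth passes in.
    """
    with image.open() as image_bytes:
        encoded = base64.b64encode(image_bytes.read()).decode("ascii")
    return {"src": "data:{0};base64,{1}".format(image.content_type, encoded)}


def convert_with_inline_images(path):
    """Convert a .docx file to HTML, routing images through the handler above."""
    import mammoth  # third-party: pip install mammoth

    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_html(
            docx_file,
            convert_image=mammoth.images.img_element(to_data_uri_attributes),
        )
    return result.value


def extract_text(path):
    """Return the document's raw text, ignoring all formatting."""
    import mammoth  # third-party: pip install mammoth

    with open(path, "rb") as docx_file:
        return mammoth.extract_raw_text(docx_file).value
```

The handler is the place to change image behaviour, for example by writing each image to disk and returning a file URL in `src` instead of a data URI.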
Quickstart
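import os

import mammoth
from docx import Document  # python-docx, installed separately: pip install python-docx

# Create a dummy docx file for demonstration
document = Document()
document.add_heading('My Document Title', level=1)
# python-docx does not interpret Markdown markers such as **bold**,
# so bold and italic are applied through runs instead.
paragraph = document.add_paragraph('This is a paragraph with some ')
paragraph.add_run('bold').bold = True
paragraph.add_run(' and ')
paragraph.add_run('italic').italic = True
paragraph.add_run(' text.')
document.add_paragraph('A second paragraph.')
document.save('document.docx')

# Convert docx to HTML
with open('document.docx', 'rb') as docx_file:
    result = mammoth.convert_to_html(docx_file)
html = result.value  # the generated HTML
messages = result.messages  # warnings or errors raised during conversion

print('Generated HTML:')
print(html)
if messages:
    print('\nMessages during conversion:')
    for message in messages:
        print(message)

# Clean up the dummy file
os.remove('document.docx')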
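Because Mammoth does not sanitize its input, HTML generated from untrusted documents deserves a check before it is served. The sketch below uses only the standard library to list link targets whose URL scheme is not on an allow-list. It is an audit aid under stated assumptions, not a sanitizer: the allow-list `ALLOWED_SCHEMES` and the function name `find_unsafe_hrefs` are hypothetical, and a dedicated HTML sanitizing library is the safer choice for production use.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumption: schemes considered acceptable. "" covers relative links and fragments.
ALLOWED_SCHEMES = {"", "http", "https", "mailto"}


class _HrefAuditor(HTMLParser):
    """Collects href values on <a> tags whose URL scheme is not allowed."""

    def __init__(self):
        super().__init__()
        self.unsafe_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name != "href" or value is None:
                continue
            try:
                scheme = urlsplit(value.strip()).scheme.lower()
            except ValueError:
                scheme = "invalid"  # unparseable URLs are reported as unsafe
            if scheme not in ALLOWED_SCHEMES:
                self.unsafe_hrefs.append(value)


def find_unsafe_hrefs(html):
    """Return the href values in `html` that use a scheme outside the allow-list."""
    auditor = _HrefAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor.unsafe_hrefs
```

Calling `find_unsafe_hrefs(html)` on the Quickstart output should return an empty list, since that document contains no links. A non-empty result is a signal to reject or clean the HTML before publishing it.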