Mammoth

1.12.0 · active · verified Thu Apr 09

Mammoth is an open-source Python library that converts Microsoft Word `.docx` documents into clean, semantic HTML or Markdown (the project README marks Markdown output as deprecated and recommends generating HTML and converting that instead). It preserves the semantic structure of the document (e.g., headings, lists, tables) rather than trying to replicate its exact visual formatting. The current version is 1.12.0, and the library is actively maintained.

Warnings

Install

Imports

Quickstart

This quickstart converts a `.docx` file to HTML with `mammoth.convert_to_html` and prints any messages (warnings or errors) generated during conversion. The input file must be opened in binary read mode (`'rb'`). The example creates a dummy `.docx` file with the `python-docx` library, which is a separate package and must be installed alongside Mammoth.

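import os

import mammoth
from docx import Document  # provided by the separate python-docx package

# Create a dummy .docx file for demonstration
document = Document()
document.add_heading('My Document Title', level=1)
# python-docx does not interpret Markdown markers, so bold and italic are set on runs
paragraph = document.add_paragraph('This is a paragraph with some ')
paragraph.add_run('bold').bold = True
paragraph.add_run(' and ')
paragraph.add_run('italic').italic = True
paragraph.add_run(' text.')
document.add_paragraph('A second paragraph.')
document.save('document.docx')

# Convert the .docx file to HTML; the file must be opened in binary mode
with open('document.docx', 'rb') as docx_file:
    result = mammoth.convert_to_html(docx_file)

html = result.value          # the generated HTML
messages = result.messages   # warnings or errors raised during conversion

print('Generated HTML:')
print(html)

if messages:
    print('\nMessages during conversion:')
    for message in messages:
        print(message)

# Clean up the dummy file
os.remove('document.docx')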
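
The quickstart prints every message as-is. When a conversion produces many messages, it can help to group them by kind first. The following is a minimal sketch, assuming each message exposes a `type` attribute (such as `'warning'` or `'error'`) and a `message` attribute holding the text. The `Message` tuple, the sample messages, and the `group_messages` helper are hypothetical stand-ins so the snippet runs without Mammoth or a `.docx` file.

```python
from collections import defaultdict, namedtuple

# Stand-in for a conversion message (assumption: objects with .type and .message)
Message = namedtuple("Message", ["type", "message"])


def group_messages(messages):
    """Group message texts by their type, preserving the original order."""
    grouped = defaultdict(list)
    for message in messages:
        grouped[message.type].append(message.message)
    return dict(grouped)


# Hypothetical messages for illustration only
sample = [
    Message("warning", "first hypothetical warning"),
    Message("error", "hypothetical error"),
    Message("warning", "second hypothetical warning"),
]

grouped = group_messages(sample)
for kind, texts in grouped.items():
    print(f"{kind}: {len(texts)}")
    for text in texts:
        print(f"  - {text}")
```

With a real conversion, the same helper would be called as `group_messages(result.messages)`, provided the message objects carry the attributes assumed above.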
