{"id":2111,"library":"mammoth","title":"Mammoth","description":"Mammoth is an open-source Python library designed to convert Microsoft Word `.docx` documents into clean, semantic HTML (Markdown output is also available but deprecated). It focuses on preserving the semantic structure of the document (e.g., headings, lists, tables) rather than attempting to replicate exact visual formatting. The current version is 1.12.0. The library has a steady release cadence, with updates addressing features and maintenance.","status":"active","version":"1.12.0","language":"en","source_language":"en","source_url":"https://github.com/mwilliamson/python-mammoth","tags":["docx","html","markdown","document conversion","word processing"],"install":[{"cmd":"pip install mammoth","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Internal dependency for document processing.","package":"cobble","optional":false}],"imports":[{"note":"Primary function for DOCX to HTML conversion.","symbol":"convert_to_html","correct":"import mammoth\n\nmammoth.convert_to_html(...)"},{"note":"Used for extracting plain text, ignoring all formatting.","symbol":"extract_raw_text","correct":"import mammoth\n\nmammoth.extract_raw_text(...)"},{"note":"Accesses image handling utilities, e.g., for custom image converters.","symbol":"images","correct":"import mammoth.images"}],"quickstart":{"code":"import mammoth\n\n# Create a sample .docx file for demonstration (requires python-docx)\nfrom docx import Document\ndocument = Document()\ndocument.add_heading('My Document Title', level=1)\n# python-docx does not parse Markdown, so bold and italic text are added as runs\nparagraph = document.add_paragraph('This is a paragraph with some ')\nparagraph.add_run('bold').bold = True\nparagraph.add_run(' and ')\nparagraph.add_run('italic').italic = True\nparagraph.add_run(' text.')\ndocument.add_paragraph('A second paragraph.')\ndocument.save('document.docx')\n\n# Convert docx to HTML\nwith open('document.docx', 'rb') as docx_file:\n    result = mammoth.convert_to_html(docx_file)\n    html = result.value\n    messages = result.messages\n\nprint('Generated HTML:')\nprint(html)\n\nif messages:\n    print('\\nMessages during conversion:')\n    for message in 
messages:\n        print(message)\n\n# Clean up the dummy file\nimport os\nos.remove('document.docx')","lang":"python","description":"This quickstart demonstrates how to convert a `.docx` file to HTML using `mammoth.convert_to_html`. It also shows how to access any messages (warnings or errors) generated during the conversion process. The input file must be opened in binary read mode (`'rb'`). For this example, a dummy `.docx` file is created using the `python-docx` library, which must be installed separately (`pip install python-docx`)."},"warnings":[{"fix":"Understand Mammoth's design philosophy: it converts semantic meaning. Use custom style mappings to fine-tune HTML output based on your DOCX styles, focusing on structure rather than visual appearance. Review the generated HTML to ensure it meets your structural requirements.","message":"Mammoth prioritizes semantic conversion over exact visual fidelity. It converts styles like 'Heading 1' to `<h1>` elements, ignoring precise font sizes or colors. Users expecting a pixel-perfect rendition of their Word document may be disappointed by the 'clean' HTML output.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Convert DOCX to HTML with `mammoth.convert_to_html`, then use a dedicated HTML-to-Markdown conversion library if Markdown is required.","message":"Markdown support is deprecated. The `convert_to_markdown` function still exists but is discouraged. Future versions may remove or significantly change this functionality. Generating HTML and then using a separate library for HTML-to-Markdown conversion is recommended for better results.","severity":"breaking","affected_versions":">=1.11.0"},{"fix":"Always sanitize the HTML that Mammoth generates, especially when the source documents come from untrusted users. Run the output through an HTML sanitization library (e.g., `bleach`) after conversion to remove potentially malicious content.","message":"Mammoth performs no sanitization of the source `.docx` document. 
Converting documents from untrusted users can introduce security vulnerabilities, such as `javascript:` links in the output HTML.","severity":"breaking","affected_versions":"All versions"},{"fix":"Convert WMF images to a supported format (e.g., PNG) before conversion, or implement a custom image converter using `mammoth.images.img_element` that leverages external tools like LibreOffice as demonstrated in the Mammoth recipes.","message":"WMF images are not handled by default. If your `.docx` documents contain WMF images, they will not be correctly converted or embedded in the output HTML.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Carefully test custom style mappings, especially when using `:fresh`. Decide whether content should be appended to an existing HTML element or placed in a newly created one. Refer to the official documentation for advanced style mapping syntax.","message":"Custom style mappings using `p[style-name='...'] => ...:fresh` can lead to unexpected HTML structures if not fully understood. The `:fresh` modifier forces Mammoth to create a new HTML element rather than reusing an already open one, which changes how consecutive paragraphs are grouped in the output.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}