docx2python
docx2python is a Python library for extracting structured content from .docx files. It can extract headers, footers, formatted text, footnotes, endnotes, comments, document properties, and images, converting them into a Python object. The library is also capable of preserving document structure, including numbered and bulleted lists, and handling tables. It is currently at version 3.6.2 and receives active maintenance.
Common errors
- AttributeError: partially initialized module 'docx' has no attribute 'Document'
cause: This error, often seen with `python-docx`, can occur with `docx2python` if a user's script or a file in the import path is named `docx.py`. This creates a module name collision, causing Python to import the user's file instead of the actual library.
fix: Rename your Python script or any conflicting file from `docx.py` to something else (e.g., `extract_doc.py`) to avoid shadowing the library module.
- Failed to parse .docx file (often with a complex traceback including `KeyError` or XML parsing errors)
cause: `.docx` files downloaded directly from services like Google Sheets may not conform to the Open XML standard in the way `docx2python` expects. They might lack certain metadata or structural elements.
fix: Open the problematic `.docx` file in Microsoft Word (or a compatible word processor) and re-save it. This often repairs the underlying XML structure, making it parsable by `docx2python`.
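To check for the name collision described in the first error, a small helper can scan a directory for files that would shadow the library. This is a minimal sketch, not part of `docx2python`: the function name `find_shadowing_files` and the default list of module names are hypothetical, chosen for illustration.

```python
from pathlib import Path


def find_shadowing_files(directory, module_names=("docx", "docx2python")):
    """Return local files or packages in `directory` whose names collide
    with the given module names (hypothetical helper, not a library API)."""
    directory = Path(directory)
    hits = []
    for name in module_names:
        # A module can be shadowed by `name.py` or by a package `name/__init__.py`.
        for candidate in (directory / f"{name}.py", directory / name / "__init__.py"):
            if candidate.exists():
                hits.append(candidate)
    return hits


# Report any suspicious files next to the script being run.
for path in find_shadowing_files(Path.cwd()):
    print(f"Possible shadowing file: {path}")
```

If the helper reports a file, renaming it and deleting any stale `__pycache__` entry for it is usually enough to restore the real import.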
Warnings
- breaking: In version 3.0, the `html` and `duplicate_merged_cells` arguments to the `docx2python` function became keyword-only. Passing them positionally raises a TypeError.
- breaking: Version 3.0 changed table output: tables are now consistently `nxm` (rows x columns) nested lists. If `duplicate_merged_cells=True` (default), merged cells are duplicated to fill the `nxm` structure. This makes processing more consistent but changes the raw data structure compared to older versions.
- gotcha: The primary output (e.g., `docx_content.body`) is a deeply nested list in which paragraphs are consistently found at depth 4 (e.g., `docx_content.body[i][j][k][l]` is a paragraph string). This structure can be awkward to navigate directly.
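The depth-4 layout from the last warning can be walked with four nested loops. The sketch below uses `sample_body`, a hand-written stand-in for `docx_content.body` (an assumption for illustration, not real library output), and a hypothetical helper `iter_depth4`; it shows only the indexing pattern table, row, cell, paragraph.

```python
# Hypothetical stand-in for `docx_content.body`:
# tables -> rows -> cells -> paragraphs (strings at depth 4).
sample_body = [
    [[["Hello World"]]],                        # plain text, held in a 1x1 table
    [[["A1"], ["B1"]], [["A2"], ["B2", ""]]],   # a 2x2 table
]


def iter_depth4(body):
    """Yield ((table, row, cell, paragraph) indices, text) for every paragraph."""
    for t, table in enumerate(body):
        for r, row in enumerate(table):
            for c, cell in enumerate(row):
                for p, text in enumerate(cell):
                    yield (t, r, c, p), text


# Collect the non-empty paragraphs in document order.
paragraphs = [text for _, text in iter_depth4(sample_body) if text]
print(paragraphs)  # ['Hello World', 'A1', 'B1', 'A2', 'B2']
```

Keeping the index tuple alongside each string makes it possible to tell which table cell a paragraph came from, which a flat list of text alone would lose.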
Install
- pip install docx2python
Imports
- docx2python
from docx2python import docx2python
- iter_paragraphs
from docx2python.iterators import iter_paragraphs
Quickstart
import os
from docx2python import docx2python

# This quickstart expects 'example.docx' to exist in the working directory,
# e.g. a document containing 'Hello World' and a simple 2x2 table.
docx_file = 'example.docx'

if not os.path.exists(docx_file):
    print(f"Please create a file named '{docx_file}' with some content for the quickstart.")
else:
    try:
        with docx2python(docx_file) as docx_content:
            print("--- Extracted Document Text ---")
            print(docx_content.text)

            print("\n--- Document Body Structure (nested list) ---")
            # The body is a nested list, with paragraphs at depth 4
            print(docx_content.body[:1])  # Print first element for brevity

            if docx_content.images:
                print("\n--- Extracted Images (names only) ---")
                for name in docx_content.images.keys():
                    print(name)
            else:
                print("\nNo images found.")
    except Exception as e:
        print(f"An error occurred: {e}")
        print(f"Ensure '{docx_file}' is a valid and readable .docx file.")
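The quickstart prints image names only. If the image data is needed on disk, a helper along these lines could write it out. This is a sketch under the assumption that `docx_content.images` maps file names to raw bytes; `write_image_bytes` is a hypothetical name, and the dictionary in the demo is made-up stand-in data, not real extraction output.

```python
import tempfile
from pathlib import Path


def write_image_bytes(images, out_dir):
    """Write each name -> bytes pair into out_dir and return the written paths
    (hypothetical helper; assumes `images` maps file names to raw bytes)."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, data in images.items():
        # Keep only the final path component so files stay inside out_dir.
        target = out_path / Path(name).name
        target.write_bytes(data)
        written.append(target)
    return written


# Demo with stand-in data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fake_images = {"image1.png": b"\x89PNG placeholder"}
    for path in write_image_bytes(fake_images, tmp):
        print(f"Wrote {path.name} ({path.stat().st_size} bytes)")
```

In the quickstart, the call would sit inside the `with` block, for example `write_image_bytes(docx_content.images, 'images')`.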