docx2python

3.6.2 · active · verified Thu Apr 16

docx2python is a Python library that extracts structured content from .docx files: headers, footers, formatted text, footnotes, endnotes, comments, document properties, and images, all returned in a single Python object. It preserves document structure, including numbered and bulleted lists, and handles tables. The current version is 3.6.2, and the project is actively maintained.

Common errors

Warnings

Install

Imports

Quickstart

Extracts all text from a .docx file as a single string, accesses the nested-list representation of the document body, and lists the filenames of any extracted images. The example assumes an 'example.docx' file exists.

import os
from docx2python import docx2python

# This script does not create the file. Place an 'example.docx' in the
# working directory first, for instance one containing 'Hello World'
# and a simple 2x2 table.
docx_file = 'example.docx'

if not os.path.exists(docx_file):
    print(f"Please create a file named '{docx_file}' with some content for the quickstart.")
else:
    try:
        with docx2python(docx_file) as docx_content:
            print("--- Extracted Document Text ---")
            print(docx_content.text)
            print("\n--- Document Body Structure (nested list) ---")
            # The body is a nested list, with paragraphs at depth 4
            print(docx_content.body[:1]) # Print first element for brevity

            if docx_content.images:
                print("\n--- Extracted Images (names only) ---")
                for name in docx_content.images.keys():
                    print(name)
            else:
                print("\nNo images found.")

    except Exception as e:
        print(f"An error occurred: {e}")
        print(f"Ensure '{docx_file}' is a valid and readable .docx file.")
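
The quickstart notes that the body is a nested list with paragraphs at depth 4. As a rough sketch of what that means, the snippet below walks a hand-written list of the same shape (table, row, cell, paragraph) and collects the paragraph strings. The sample data and the helper name flatten_paragraphs are hypothetical and not part of docx2python; real output from a document may contain additional empty strings or formatting markers.

```python
from typing import Iterator, List

# Hypothetical sample shaped like the nested list described in the quickstart:
# body[table][row][cell][paragraph], where each paragraph is a string.
sample_body: List[List[List[List[str]]]] = [
    [  # table 0: a single row with a single cell holding one paragraph
        [["Hello World"]],
    ],
    [  # table 1: a 2x2 table
        [["A1"], ["B1"]],
        [["A2"], ["B2", "second paragraph in B2"]],
    ],
]


def flatten_paragraphs(body: List[List[List[List[str]]]]) -> Iterator[str]:
    """Yield every paragraph string, in document order (illustrative helper)."""
    for table in body:
        for row in table:
            for cell in row:
                for paragraph in cell:
                    yield paragraph


paragraphs = list(flatten_paragraphs(sample_body))
print(paragraphs)
```

With a real document, the same loop could be pointed at docx_content.body inside the with block, assuming the body keeps this four-level shape.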
