docx2txt

0.9 · active · verified Thu Apr 09

docx2txt is a pure-Python utility for extracting text and images from .docx files. Its code is adapted from `python-docx`, but it does not depend on that library or on `Pillow`: it reads the .docx archive with the standard library's `zipfile` and `xml.etree.ElementTree` modules, and it also picks up text from headers, footers, and hyperlinks. The current version is 0.9; releases are infrequent and mostly contain minor updates.

Warnings

Install

Imports

Quickstart

This quickstart extracts the text of a .docx file and, optionally, saves its embedded images to a directory you specify. It expects a file named my_document.docx in the current directory.

import os
import sys

import docx2txt

docx_path = "my_document.docx"
image_dir = "extracted_images"

# The example needs an existing .docx file, so stop early if it is missing.
# To create a small test file (requires the separate python-docx library):
#     from docx import Document
#     document = Document()
#     document.add_paragraph('This is a test document for docx2txt.')
#     document.save('my_document.docx')
if not os.path.exists(docx_path):
    sys.exit(f"Please create '{docx_path}' before running this example.")

# Extract text only
text = docx2txt.process(docx_path)
print("Extracted Text:\n", text)

# Extract text and save embedded images to image_dir
os.makedirs(image_dir, exist_ok=True)
text_with_images = docx2txt.process(docx_path, image_dir)
print(f"\nExtracted Text (images saved to {image_dir}):\n", text_with_images)
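For context on what a text extractor of this kind does, the sketch below shows the general approach using only the standard library: a .docx file is a ZIP archive, and the body text lives in `word/document.xml` as `<w:t>` runs inside `<w:p>` paragraphs. This is an illustration of the technique, not docx2txt's actual code. The names `extract_body_text` and `build_minimal_docx` are hypothetical, and the demo archive contains only the one XML part, so Word itself would not open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by elements in word/document.xml
W_URI = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_NS = "{" + W_URI + "}"


def extract_body_text(docx_file):
    """Return the main document text, one line per paragraph.

    docx_file may be a filesystem path or a binary file-like object.
    Simplification: headers, footers, images, tabs, and line breaks are ignored.
    """
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # Join every text run (<w:t>) found inside this paragraph.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(runs))
    return "\n".join(lines)


def build_minimal_docx(paragraphs):
    """Hypothetical demo helper: an in-memory archive with only word/document.xml."""
    body = "".join(
        "<w:p><w:r><w:t>{}</w:t></w:r></w:p>".format(escape(text))
        for text in paragraphs
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="' + W_URI + '">'
        "<w:body>" + body + "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = build_minimal_docx(["First paragraph.", "Second paragraph."])
    print(extract_body_text(demo))
```

In practice `docx2txt.process` is the better choice, since it also covers headers, footers, hyperlinks, and image extraction; the sketch only makes the underlying file layout concrete.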
