docx2txt

0.9 · active · verified Thu Apr 09

docx2txt is a pure-Python utility that extracts text and images from .docx files. Its parsing code was adapted from `python-docx`, but it does not depend on that library or on `Pillow`: it opens the .docx ZIP archive and parses the XML with the Python standard library, and it copies embedded image files out unchanged. The current version is 0.9, and the project appears to be in maintenance mode with infrequent, mostly minor releases.
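
A .docx file is a ZIP archive whose main body is stored in `word/document.xml`. As a rough illustration of how text can be pulled out of such a file with only the standard library, the sketch below collects the text runs of each paragraph. It is a simplified illustration, not docx2txt's actual code: the helper name `extract_paragraphs` is hypothetical, and it ignores headers, footers, hyperlinks, and images.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraphs(docx_path):
    """Return the text of each paragraph in word/document.xml.

    Hypothetical helper for illustration; not part of docx2txt.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

Because this approach reads only text runs, anything that is not stored as a run in the body (formatting, embedded objects, layout) is simply not seen, which is the same kind of limitation noted in the warnings below.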

Warnings

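gotcha Input paths must point to an existing, valid .docx file. A missing path raises `FileNotFoundError`. A corrupted or non-.docx file typically fails inside the standard library: `zipfile.BadZipFile` if the file is not a ZIP archive, or `KeyError` if the archive has no `word/document.xml`.
Fix: Check that the path exists and is a readable .docx before calling `docx2txt.process()`, and catch these exceptions when handling untrusted input.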
gotcha Images are extracted only when you pass a directory path as `img_dir`; if `img_dir` is omitted, images are skipped. Do not rely on `docx2txt.process()` to create a missing directory. The process also needs write permission to `img_dir`.
Fix: Create the directory yourself (for example with `os.makedirs(img_dir, exist_ok=True)`) and pass its path as a string to the `img_dir` argument of `docx2txt.process()`.
gotcha docx2txt returns plain text only and may not handle every complex .docx feature (e.g., embedded objects, intricate formatting, specific table layouts, or non-standard XML structures). Formatting is lost and some content types may be omitted.
Fix: For critical applications, verify the extracted text against the original document. For highly complex documents, consider a library that exposes the full document structure, such as `python-docx`.
gotcha PyPI reports `requires_python: None`, so `pip` will install the package on any interpreter without a version check. docx2txt has no `python-docx` dependency, so no minimum Python version is inherited from one either. Compatibility with a given interpreter is therefore not guaranteed by the package metadata.
Fix: Use a currently supported Python 3 release and run a quick extraction test in your target environment after installing `docx2txt`.
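
The input and output checks described in the warnings above can be sketched with the standard library. The helper names `check_docx` and `ensure_img_dir` are hypothetical and not part of docx2txt; the sketch assumes a usable .docx is a ZIP archive containing `word/document.xml`.

```python
import os
import zipfile


def check_docx(path):
    """Return None if path looks like a readable .docx, else a short reason.

    Hypothetical helper for illustration; not part of docx2txt.
    """
    if not os.path.isfile(path):
        return "file not found"
    if not zipfile.is_zipfile(path):
        return "not a ZIP archive"
    with zipfile.ZipFile(path) as archive:
        if "word/document.xml" not in archive.namelist():
            return "missing word/document.xml"
    return None


def ensure_img_dir(img_dir):
    """Create img_dir if needed and confirm it is writable, then return it."""
    os.makedirs(img_dir, exist_ok=True)
    if not os.access(img_dir, os.W_OK):
        raise PermissionError(f"cannot write to {img_dir}")
    return img_dir
```

A caller might then run `docx2txt.process(path, ensure_img_dir("extracted_images"))` only when `check_docx(path)` returns `None`, and report the returned reason otherwise.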

Install

pip install docx2txt Install latest version

Imports

process

import docx2txt
text = docx2txt.process('document.docx')

Quickstart

This quickstart demonstrates how to extract text from a .docx file and optionally extract embedded images to a specified directory. Ensure a .docx file exists for the example to run correctly.

import os
import sys

import docx2txt

DOCX_PATH = "my_document.docx"
IMAGE_DIR = "extracted_images"

# Stop early with a clear message if the input file is missing.
# To create a test file with python-docx (a separate library, not required by docx2txt):
#     from docx import Document
#     document = Document()
#     document.add_paragraph('This is a test document for docx2txt.')
#     document.save('my_document.docx')
if not os.path.exists(DOCX_PATH):
    sys.exit(f"Please create '{DOCX_PATH}' before running this example.")

# Extract text only
text = docx2txt.process(DOCX_PATH)
print("Extracted Text:\n", text)

# Extract text and also save embedded images to IMAGE_DIR
os.makedirs(IMAGE_DIR, exist_ok=True)
text_with_images = docx2txt.process(DOCX_PATH, IMAGE_DIR)
print(f"\nExtracted Text (images saved to {IMAGE_DIR}):\n", text_with_images)
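
If no .docx file is at hand, a small test fixture can be written with the standard library alone. The sketch below is an assumption-laden minimal package (content types, root relationships, and a one-paragraph body); the helper name `write_minimal_docx` is hypothetical. It is meant as input for text-extraction tests and is not guaranteed to open cleanly in every word processor.

```python
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)

DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:r><w:t xml:space="preserve">{text}</w:t></w:r></w:p></w:body>'
    '</w:document>'
)


def write_minimal_docx(path, text):
    """Write a one-paragraph .docx test fixture (hypothetical helper)."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", ROOT_RELS)
        # Escape &, <, > so arbitrary text stays well-formed XML.
        archive.writestr("word/document.xml", DOCUMENT.format(text=escape(text)))
```

Calling `write_minimal_docx("my_document.docx", "This is a test document for docx2txt.")` before the quickstart gives it a file to process without installing any extra library.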
