docx2txt
docx2txt is a pure-Python utility that extracts text and images from .docx files. It has no third-party runtime dependencies: it opens the .docx archive with the standard-library `zipfile` module and parses the XML parts with `xml.etree.ElementTree`. The extraction code was adapted from `python-docx`, but that library is not required. The current version is 0.9, and the project appears to be in maintenance mode, with infrequent releases that mainly contain minor updates.
Warnings
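- gotcha Input files must exist and be valid .docx files. A non-existent path raises `FileNotFoundError`. A file that is not a valid ZIP archive typically raises `zipfile.BadZipFile`, and an archive that lacks the expected document parts can raise other exceptions such as `KeyError`.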
- gotcha To extract images, pass a directory path as `img_dir`. If `img_dir` is omitted, images are skipped. Create the directory before calling `docx2txt.process()` rather than relying on the library to create it, and make sure the process has write permission there.
- gotcha docx2txt reads the document XML directly and may not handle every complex .docx feature (e.g., embedded objects, intricate formatting, specific table layouts, or non-standard XML structures). The result is plain text, so formatting is lost and some content types may be omitted.
- gotcha PyPI lists `requires_python: None`, so pip does not enforce a minimum Python version at install time. The package does not declare `python-docx` as a dependency, so it inherits no version floor from that library. Verify behavior on the interpreter you actually deploy to.
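The input and `img_dir` warnings above can be handled with a small pre-flight check. The following is a minimal sketch, not part of docx2txt: `check_docx` and `extract_text` are hypothetical helper names, and the check assumes that a usable .docx is a ZIP archive containing a `word/document.xml` part.

```python
import os
import zipfile


def check_docx(path):
    """Return None if path looks like a readable .docx, else a short reason."""
    if not os.path.isfile(path):
        return "file not found"
    if not zipfile.is_zipfile(path):
        return "not a ZIP archive"
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main document part lives at word/document.xml.
        if "word/document.xml" not in archive.namelist():
            return "no word/document.xml part"
    return None


def extract_text(path, img_dir=None):
    """Hypothetical wrapper: validate the input, then call docx2txt.process."""
    problem = check_docx(path)
    if problem is not None:
        raise ValueError(f"{path}: {problem}")
    if img_dir is not None:
        # Create the image directory up front instead of relying on the library.
        os.makedirs(img_dir, exist_ok=True)
    import docx2txt  # imported lazily so check_docx works without the package
    return docx2txt.process(path, img_dir)
```

Calling `extract_text('my_document.docx', 'extracted_images')` then fails early with a single `ValueError` for the common bad-input cases instead of surfacing whichever lower-level exception happens to occur first.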
Install
pip install docx2txt
Imports
- process
import docx2txt
text = docx2txt.process('document.docx')
Quickstart
import docx2txt
import os
# Assumes 'my_document.docx' exists in the current directory.
# Images will be written to the 'extracted_images' directory.
if not os.path.exists('my_document.docx'):
    # To create a test file (requires the separate python-docx library):
    #   from docx import Document
    #   document = Document()
    #   document.add_paragraph('This is a test document for docx2txt.')
    #   document.save('my_document.docx')
    raise SystemExit("Please create 'my_document.docx' before running this example.")
# Extract text
text = docx2txt.process("my_document.docx")
print("Extracted Text:\n", text)
# Extract text and save embedded images to a specified directory
image_dir = 'extracted_images'
os.makedirs(image_dir, exist_ok=True)  # create the directory if it is missing
text_with_images = docx2txt.process("my_document.docx", image_dir)
print(f"\nExtracted Text (images saved to {image_dir}):\n", text_with_images)
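After the quickstart has run, the returned string and the image directory usually need a little post-processing. The sketch below is illustrative only: `split_paragraphs` and `saved_files` are hypothetical helpers, and the paragraph split assumes the extracted text separates paragraphs with blank lines, which you should confirm against your own documents.

```python
import os


def split_paragraphs(text):
    """Split extracted text on blank lines, dropping empty chunks."""
    chunks = (chunk.strip() for chunk in text.split("\n\n"))
    return [chunk for chunk in chunks if chunk]


def saved_files(img_dir):
    """Return the sorted names of the files found in img_dir."""
    return sorted(os.listdir(img_dir))
```

For example, `split_paragraphs(text)` gives a list you can iterate over paragraph by paragraph, and `saved_files('extracted_images')` shows which image files were written by the second `process` call.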