textract
textract is a Python library that extracts text from a wide variety of document formats, including PDFs, Word documents, images (via OCR), and audio files, through a single unified interface. The current stable version is 1.6.5, released in March 2022. While releases aren't on a strict schedule, the project is actively maintained with bug fixes and feature additions.
Warnings
- breaking textract relies heavily on external system-level libraries and executables (e.g., `pdftotext` for PDFs, `antiword` for .doc, `tesseract-ocr` for images, `sox` for audio). Without these, extraction for certain file types will fail with a `ShellError` or `FileNotFoundError`.
- deprecated `textract 1.6.5` declares a non-standard dependency specifier (`extract-msg<=0.29.*`). pip versions before 24.1 print a `DEPRECATION` warning for it during installation, and pip 24.1 and later enforce standard specifiers, so installation may fail there.
- gotcha Special characters in filenames (e.g., spaces, non-ASCII characters) can lead to `FileNotFoundError` or `ShellError` when `textract` passes the filename to the underlying command-line utilities.
- gotcha textract 1.5.0 and newer officially support Python 3; older versions primarily targeted Python 2. Migrating a very old codebase may expose subtle compatibility issues unless textract itself is upgraded as part of the migration.
- gotcha `UnicodeDecodeError` can occur, especially in non-standard environments or with files in unusual encodings: `textract` relies on `chardet` to infer the input encoding and returns byte strings that the caller must decode.
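The first warning can be checked before any extraction is attempted. The sketch below uses only the standard library to report which external tools are missing from `PATH`. The `REQUIRED_TOOLS` mapping and the `missing_tools` helper are hypothetical names introduced for illustration, not part of textract. The executable for OCR is assumed to be `tesseract` (the binary usually shipped by the `tesseract-ocr` package).

```python
import shutil

# Illustrative mapping (an assumption, not a textract API): file type -> executable.
# Tool names follow the warning list; 'tesseract' is the binary name assumed
# for the 'tesseract-ocr' package.
REQUIRED_TOOLS = {
    "pdf": "pdftotext",
    "doc": "antiword",
    "image": "tesseract",
    "audio": "sox",
}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return {file_type: executable} for every executable not found on PATH."""
    return {kind: exe for kind, exe in tools.items() if shutil.which(exe) is None}


if __name__ == "__main__":
    # Print one line per missing tool; print nothing if all are present.
    for kind, exe in missing_tools().items():
        print(f"{kind}: '{exe}' not found on PATH")
```

Running this check at startup gives a clearer message than a `ShellError` raised in the middle of a batch job, though it only confirms that an executable exists, not that it works.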
Install
pip install textract
Imports
- process
import textract
text = textract.process('path/to/file.extension')
Quickstart
import textract
import os

# For demonstration, create a dummy text file
dummy_file_path = 'example.txt'
with open(dummy_file_path, 'w') as f:
    f.write('This is some sample text in a TXT file.')

try:
    # Extract text from the dummy file
    text_bytes = textract.process(dummy_file_path)
    text_decoded = text_bytes.decode('utf-8')
    print(f"Extracted text: {text_decoded}")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up the dummy file
    if os.path.exists(dummy_file_path):
        os.remove(dummy_file_path)

# Example for a PDF (requires pdftotext system dependency)
# try:
#     pdf_text = textract.process('path/to/document.pdf')
#     print(pdf_text.decode('utf-8'))
# except Exception as e:
#     print(f"Could not process PDF: {e}. Is pdftotext installed and in PATH?")
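Two of the gotchas listed under Warnings (awkward filenames and `UnicodeDecodeError`) can be worked around with a thin wrapper. This is a minimal sketch, not a textract feature: `decode_output` and `extract_via_safe_copy` are hypothetical helper names, and the `extractor` parameter exists only so the wrapper can be exercised without textract installed. The sketch assumes that copying the input to a temporary file with a plain generated name, while keeping the original extension, is enough to avoid the filename problems.

```python
import os
import shutil
import tempfile


def decode_output(data, encoding="utf-8"):
    """Decode extracted bytes, replacing undecodable bytes instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return data.decode(encoding, errors="replace")


def extract_via_safe_copy(path, extractor=None):
    """Copy `path` to a temp file with a plain name, extract, and decode.

    `extractor` defaults to textract.process; it is a parameter only so the
    helper can be tested with a stand-in callable (an illustrative choice).
    """
    if extractor is None:
        import textract  # imported lazily so the helpers load without textract
        extractor = textract.process

    # Keep the original extension so the matching parser is still selected.
    suffix = os.path.splitext(path)[1]
    fd, tmp_path = tempfile.mkstemp(suffix=suffix, prefix="textract_")
    os.close(fd)
    try:
        shutil.copyfile(path, tmp_path)
        return decode_output(extractor(tmp_path))
    finally:
        os.remove(tmp_path)
```

A call such as `extract_via_safe_copy('my report (v2).pdf')` would then hand the command-line tools a generated name instead of the original one. The copy costs extra I/O for large files, and the temporary directory itself may still contain spaces on some systems, so this is a mitigation and not a guarantee.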