textract

1.6.5 · active · verified Wed Apr 15

textract is a Python library that extracts text from a wide variety of document formats, including PDFs, Word documents, images (via OCR), and audio files, through a single unified interface. The current stable version is 1.6.5, released in March 2022. Releases do not follow a strict schedule, but the project is actively maintained, with bug fixes and feature additions.

Warnings

Install

Imports

Quickstart

Extracts text from a file using `textract.process()`. For many file types (like PDF, DOCX, images), the corresponding system-level dependencies must be installed for extraction to succeed. The function returns a byte string, which typically needs to be decoded from UTF-8 into a `str`.

import textract
import os

# For demonstration, create a dummy text file
dummy_file_path = 'example.txt'
with open(dummy_file_path, 'w') as f:
    f.write('This is some sample text in a TXT file.')

try:
    # Extract text from the dummy file
    text_bytes = textract.process(dummy_file_path)
    text_decoded = text_bytes.decode('utf-8')
    print(f"Extracted text: {text_decoded}")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up the dummy file
    if os.path.exists(dummy_file_path):
        os.remove(dummy_file_path)

# Example for a PDF (requires pdftotext system dependency)
# try:
#     pdf_text = textract.process('path/to/document.pdf')
#     print(pdf_text.decode('utf-8'))
# except Exception as e:
#     print(f"Could not process PDF: {e}. Is pdftotext installed and in PATH?")
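Because a missing system dependency only surfaces as an error at extraction time, it can help to check for the external command before calling `textract.process()`. The sketch below is one possible pre-flight check using only the standard library. The names `REQUIRED_TOOLS` and `missing_tool` are hypothetical helpers, not part of textract, and the mapping lists only the `pdftotext` requirement mentioned in the quickstart; other formats would need their own entries.

```python
import os
import shutil

# Hypothetical mapping from file extension to the external command it needs.
# Only the pdftotext entry is taken from the quickstart; extend as required.
REQUIRED_TOOLS = {".pdf": "pdftotext"}


def missing_tool(path, which=shutil.which):
    """Return the required command name if it is not on PATH, else None.

    `which` is injectable so the lookup can be replaced in tests.
    """
    ext = os.path.splitext(path)[1].lower()
    tool = REQUIRED_TOOLS.get(ext)
    if tool is None:
        # No known external dependency for this extension.
        return None
    return None if which(tool) else tool
```

A caller could run `missing_tool('path/to/document.pdf')` first and skip the extraction, or report the missing command by name, when the result is not `None`.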
