{"id":6913,"library":"textract","title":"textract","description":"textract is a Python library that extracts text from a wide variety of document formats, including PDFs, Word documents, images (via OCR), and audio files, through a single unified interface. The current stable version is 1.6.5, released in March 2022. While releases aren't on a strict schedule, the project is actively maintained with bug fixes and feature additions.","status":"active","version":"1.6.5","language":"en","source_language":"en","source_url":"https://github.com/deanmalmgren/textract","tags":["text extraction","document processing","OCR","PDF","DOCX","TXT","email","audio","unstructured data"],"install":[{"cmd":"pip install textract","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Used for .docx parsing; requires system libraries like libxml2 and libxslt1.","package":"lxml","optional":false},{"reason":"Pure-Python .pdf parser; the default method is `pdftotext` from system-level poppler-utils, with pdfminer.six as the alternative.","package":"pdfminer.six","optional":false},{"reason":"Used for .xls and .xlsx parsing.","package":"xlrd","optional":false},{"reason":"Used for .msg parsing; see the warnings for its version constraint.","package":"extract-msg","optional":false},{"reason":"Used for audio file parsing.","package":"SpeechRecognition","optional":true}],"imports":[{"symbol":"process","correct":"import textract\ntext = textract.process('path/to/file.extension')"}],"quickstart":{"code":"import textract\nimport os\n\n# For demonstration, create a dummy text file\ndummy_file_path = 'example.txt'\nwith open(dummy_file_path, 'w') as f:\n    f.write('This is some sample text in a TXT file.')\n\ntry:\n    # Extract text from the dummy file\n    text_bytes = textract.process(dummy_file_path)\n    text_decoded = text_bytes.decode('utf-8')\n    print(f\"Extracted text: {text_decoded}\")\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\nfinally:\n    # Clean up the dummy file\n   
 if os.path.exists(dummy_file_path):\n        os.remove(dummy_file_path)\n\n# Example for a PDF (requires pdftotext system dependency)\n# try:\n#     pdf_text = textract.process('path/to/document.pdf')\n#     print(pdf_text.decode('utf-8'))\n# except Exception as e:\n#     print(f\"Could not process PDF: {e}. Is pdftotext installed and in PATH?\")","lang":"python","description":"Demonstrates how to extract text from a file using `textract.process()`. Note that many file types (like PDF, DOCX, images) require the corresponding system-level dependencies for extraction to succeed. The return value is a byte string, UTF-8 encoded by default, which typically needs to be decoded to obtain a `str`."},"warnings":[{"fix":"Install the necessary system dependencies for the file types you intend to process. Refer to the official textract documentation for a comprehensive list based on your operating system (e.g., `apt-get` for Debian/Ubuntu, `brew` for macOS). For instance, for PDF files, ensure `poppler-utils` (which provides `pdftotext`) is installed.","message":"textract relies heavily on external system-level libraries and executables (e.g., `pdftotext` for PDFs, `antiword` for .doc, `tesseract-ocr` for images, `sox` for audio). Without these, extraction for certain file types will fail with a `ShellError` or `FileNotFoundError`.","severity":"breaking","affected_versions":"All versions"},{"fix":"While textract itself needs an update to fix this, users can often mitigate by pinning `pip` to a version older than 24.1 or, if possible, by manually installing `extract-msg` at the specified version before installing `textract`. Consider monitoring the project for an update addressing this, or using a fork like `textract-py3` if it resolves this issue.","message":"As of pip 24.1, `textract 1.6.5` has a non-standard dependency specifier (`extract-msg<=0.29.*`). 
Older pip versions only emit a `DEPRECATION` warning for it during installation; pip 24.1 and later reject the specifier, so installation can fail.","severity":"deprecated","affected_versions":"1.6.5 (and potentially earlier versions with similar specifiers)"},{"fix":"Ensure filenames are URL-encoded if being passed via web contexts, or consider sanitizing/simplifying filenames to alphanumeric characters and underscores before processing, especially on certain operating systems or with specific parsers.","message":"Special characters in filenames (e.g., spaces, non-ASCII characters) can sometimes lead to `FileNotFoundError` or `ShellError` when `textract` passes the filename to underlying command-line utilities.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Always use textract version 1.5.0 or higher for Python 3 projects. Ensure all project dependencies are also Python 3 compatible. Refer to release notes for `1.5.0` for detailed Python 3 migration notes.","message":"textract 1.5.0 and newer officially support Python 3; older versions primarily targeted Python 2. Migrating a very old codebase directly may expose subtle compatibility issues.","severity":"gotcha","affected_versions":"<1.5.0 for Python 3 incompatibility; potential minor issues in 1.5.0-1.6.5 for specific edge cases."},{"fix":"Explicitly specify the desired output encoding when calling `textract.process()`; the keyword appears to be `output_encoding` in recent releases and `encoding` in older ones, so check the signature of your installed version. If `chardet` struggles, pre-process the file to a known encoding or try different decoding strategies in your application.","message":"`UnicodeDecodeError` can occur, especially in non-standard environments or with files containing unusual encodings, as `textract` relies on `chardet` for input encoding inference and outputs byte strings that need proper decoding.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}