OCRmyPDF

17.4.1 · active · verified Sun Apr 12

OCRmyPDF is a Python library and application that adds an invisible OCR text layer to scanned PDF files, making them searchable. It uses the Tesseract OCR engine and other external tools to process documents, and it can produce highly optimized, archive-ready (PDF/A) files. The project is actively maintained, with major releases roughly once a year and minor and patch releases more often.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates the modern API introduced in OCRmyPDF v17.0.0: build an `OcrOptions` object and pass it to `ocrmypdf.ocr()`. This approach provides better type hinting and argument validation. The example includes basic error handling and writes a minimal placeholder PDF so that the script is self-contained; substitute a real scanned PDF to get meaningful OCR output. Note that `ocrmypdf` relies heavily on external system dependencies (such as Tesseract and Ghostscript), which must be installed separately.

import ocrmypdf
from ocrmypdf import OcrOptions
import os

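# Create a dummy input.pdf for demonstration.
# Note: this hand-written file contains a text operator, not a scanned image;
# replace it with a real scanned PDF for meaningful OCR results.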
with open('input.pdf', 'wb') as f:
    f.write(b'%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/MediaBox[0 0 612 792]/Contents 4 0 R>>endobj 4 0 obj<</Length 11>>stream\nBT /F1 12 Tf 72 712 Td (Hello World)Tj ET\nendstream\nendobj\nxref\n0 5\n0000000000 65535 f\n0000000009 00000 n\n0000000055 00000 n\n0000000109 00000 n\n0000000171 00000 n\ntrailer<</Size 5/Root 1 0 R>>startxref\n200\n%%EOF')

# The recommended way to call ocrmypdf.ocr() is to construct an OcrOptions object.
# This provides type hints and validation. (v17.0.0+)
options = OcrOptions(
    input_file='input.pdf',
    output_file='output_ocr.pdf',
    deskew=True,
    languages=['eng'],
    # Example: use environment variable for Tesseract path if needed for CI/local testing
    # tesseract_path=os.environ.get('TESSERACT_PATH', None)
)

try:
    ocrmypdf.ocr(options)
    print("OCR processing complete. Output saved to output_ocr.pdf")
except ocrmypdf.exceptions.BadArgsError as e:
    print(f"Error with OCRmyPDF arguments: {e}")
except ocrmypdf.exceptions.InputFileError as e:
    print(f"Error with input file: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
finally:
    # Clean up dummy files
    if os.path.exists('input.pdf'):
        os.remove('input.pdf')
    if os.path.exists('output_ocr.pdf'):
        os.remove('output_ocr.pdf')
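
Because the quickstart depends on external tools that are installed separately, it can help to check for them before calling `ocrmypdf.ocr()`. The following is a minimal sketch using only the standard library. The helper `find_missing_tools` and the `REQUIRED_TOOLS` tuple are hypothetical names introduced for illustration, and the executable names `tesseract` and `gs` are assumptions that may differ by platform (for example, Ghostscript is commonly installed as `gswin64c` on Windows).

```python
import shutil

# Assumed executable names; adjust for your platform and installation.
REQUIRED_TOOLS = ("tesseract", "gs")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be located on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All assumed external tools were found on PATH.")
```

This check only confirms that the executables are discoverable on `PATH`; it does not verify their versions or that the required Tesseract language data is installed.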
