OCRmyPDF
OCRmyPDF is a Python library and command-line application that adds an invisible OCR text layer to scanned PDF files, making them searchable. It uses the Tesseract OCR engine and other external tools to process documents, and it can produce optimized, archive-ready (PDF/A) output. The project is actively maintained: major versions are typically released annually, with minor and patch releases more often.
Warnings
- breaking OCRmyPDF relies heavily on external system dependencies (e.g., Tesseract OCR, Ghostscript). These are NOT installed by `pip install ocrmypdf` and must be provided by the operating system package manager (e.g., `apt`, `brew`, `choco`). Without them, the library will not function, often resulting in 'file not found' errors.
- breaking Starting with v17.0.0, the recommended way to call `ocrmypdf.ocr()` is to pass a single `OcrOptions` object that carries all parameters. The legacy positional argument style is still supported, but `OcrOptions` offers better type hinting, validation, and clarity.
- deprecated As of v17.0.0, the command-line flags `--force-ocr`, `--skip-text`, and `--redo-ocr` are consolidated under the new `--mode` argument (`--mode force`, `--mode skip`, `--mode redo`). The old flags remain as silent aliases but are deprecated; prefer `--mode` in new scripts.
- gotcha OCRmyPDF maintains global state, meaning only one OCR operation can reliably run per Python process at a time. Attempting parallel `ocrmypdf.ocr()` calls within a single process can lead to unexpected behavior or deadlocks.
- gotcha A known issue in Ghostscript (a key dependency) can corrupt JPEG images. The warning was updated in v17.4.1 to confirm that the issue persists in Ghostscript 10.7.0.
- gotcha Running OCRmyPDF on a PDF that already contains text (either digital or a hidden OCR layer) will by default raise an error: 'Page already has text!'. This is a safety mechanism.
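The mode flags described above decide what happens when a PDF already has text. The sketch below builds the command line for either spelling. `build_command` and `LEGACY_FLAGS` are hypothetical names introduced here for illustration, and the `--mode` form is assumed to behave as this section describes for v17.0.0 and later.

```python
import shlex

# Mapping from the --mode values named in this section to the older flags.
LEGACY_FLAGS = {
    "force": "--force-ocr",
    "skip": "--skip-text",
    "redo": "--redo-ocr",
}


def build_command(input_pdf, output_pdf, mode=None, legacy=False):
    """Return an argv list for the ocrmypdf CLI (hypothetical helper).

    mode=None leaves the default behavior, which rejects PDFs that
    already contain text.
    """
    cmd = ["ocrmypdf"]
    if mode is not None:
        if mode not in LEGACY_FLAGS:
            raise ValueError(f"unknown mode: {mode!r}")
        if legacy:
            # Older spelling, kept as an alias according to this section.
            cmd.append(LEGACY_FLAGS[mode])
        else:
            # Assumed v17.0.0+ spelling.
            cmd.extend(["--mode", mode])
    cmd.extend([input_pdf, output_pdf])
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_command("scan.pdf", "out.pdf", mode="skip")))
    print(shlex.join(build_command("scan.pdf", "out.pdf", mode="skip", legacy=True)))
```

The returned list can be passed to `subprocess.run`. Keeping the legacy spelling behind a switch may help when the same script has to run against installations older than v17.0.0.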
Install
-
pip install ocrmypdf
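Because `pip` does not install the external tools, a quick pre-flight check can catch a missing dependency before the first OCR run. This is a minimal sketch: `missing_tools` is a hypothetical helper, and the executable names `tesseract` and `gs` are assumptions that fit typical Linux and macOS installs (Windows builds of Ghostscript often use a different name, such as `gswin64c`).

```python
import shutil

# Assumed executable names for Tesseract OCR and Ghostscript.
REQUIRED_TOOLS = ("tesseract", "gs")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All external tools found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; normal callers can ignore it.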
Imports
- ocr
from ocrmypdf import ocr
- OcrOptions
from ocrmypdf import OcrOptions
Quickstart
import ocrmypdf
from ocrmypdf import OcrOptions
import os
# Create a dummy input.pdf for demonstration. This hand-written file is only a
# placeholder (it is not a scanned page and already contains a text operator),
# so substitute a real scanned PDF for meaningful results.
with open('input.pdf', 'wb') as f:
    f.write(b'%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/MediaBox[0 0 612 792]/Contents 4 0 R>>endobj 4 0 obj<</Length 11>>stream\nBT /F1 12 Tf 72 712 Td (Hello World)Tj ET\nendstream\nendobj\nxref\n0 5\n0000000000 65535 f\n0000000009 00000 n\n0000000055 00000 n\n0000000109 00000 n\n0000000171 00000 n\ntrailer<</Size 5/Root 1 0 R>>startxref\n200\n%%EOF')
# The recommended way to call ocrmypdf.ocr() is to construct an OcrOptions object.
# This provides type hints and validation. (v17.0.0+)
options = OcrOptions(
    input_file='input.pdf',
    output_file='output_ocr.pdf',
    deskew=True,
    languages=['eng'],
    # Example: use environment variable for Tesseract path if needed for CI/local testing
    # tesseract_path=os.environ.get('TESSERACT_PATH', None)
)
try:
    ocrmypdf.ocr(options)
    print("OCR processing complete. Output saved to output_ocr.pdf")
except ocrmypdf.exceptions.BadArgsError as e:
    print(f"Error with OCRmyPDF arguments: {e}")
except ocrmypdf.exceptions.InputFileError as e:
    print(f"Error with input file: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
finally:
    # Clean up the demo files (remove this block to keep output_ocr.pdf)
    if os.path.exists('input.pdf'):
        os.remove('input.pdf')
    if os.path.exists('output_ocr.pdf'):
        os.remove('output_ocr.pdf')
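The quickstart processes a single file. Because OCRmyPDF keeps global state, one way to handle several files concurrently is to give each job its own process. The sketch below does this by launching the `ocrmypdf` command once per file. `run_job` and `run_jobs` are hypothetical helpers, and the sketch assumes the `ocrmypdf` executable is on PATH. The threads only wait on child processes, so no two OCR operations share a Python process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_job(job, runner=subprocess.run):
    """Run one OCR job in a separate ocrmypdf process; return its exit code."""
    input_pdf, output_pdf = job
    result = runner(["ocrmypdf", input_pdf, output_pdf])
    return result.returncode


def run_jobs(jobs, max_workers=2, runner=subprocess.run):
    """Run (input, output) pairs concurrently, one child process per job.

    Exit codes are returned in the same order as `jobs`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: run_job(job, runner), jobs))
```

A call such as `run_jobs([("a.pdf", "a_ocr.pdf"), ("b.pdf", "b_ocr.pdf")])` returns one exit code per job, where a nonzero code signals a failed run. Keep `max_workers` small, since each OCR job may already use several CPU cores.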