{"id":4993,"library":"ocrmypdf","title":"OCRmyPDF","description":"OCRmyPDF is a Python library and command-line application that adds an invisible OCR text layer to scanned PDF files, making them searchable. It uses the Tesseract OCR engine and other external tools to process documents and can produce optimized, archive-ready (PDF/A) files. The project is actively maintained with frequent updates, typically seeing major version releases annually and minor/patch releases more often.","status":"active","version":"17.4.1","language":"en","source_language":"en","source_url":"https://github.com/ocrmypdf/OCRmyPDF","tags":["pdf","ocr","document-processing","automation","tesseract","pdf/a"],"install":[{"cmd":"pip install ocrmypdf","lang":"bash","label":"Install Python package"}],"dependencies":[{"reason":"Required for text layer rendering in PDFs.","package":"fpdf2"},{"reason":"Required for advanced text layer rendering.","package":"uharfbuzz"},{"reason":"Core library for PDF manipulation, developed by the same author.","package":"pikepdf"},{"reason":"Optional Python dependency for PDF rasterization, serving as an alternative to Ghostscript. 
Recommended for best compatibility.","package":"pypdfium2","optional":true}],"imports":[{"symbol":"ocr","correct":"from ocrmypdf import ocr"},{"note":"As of v17.0.0, OcrOptions is exported directly from the top-level 'ocrmypdf' module for cleaner API usage.","wrong":"from ocrmypdf._options import OcrOptions","symbol":"OcrOptions","correct":"from ocrmypdf import OcrOptions"}],"quickstart":{"code":"import ocrmypdf\nfrom ocrmypdf import OcrOptions\nimport os\n\n# Create dummy input.pdf for demonstration\nwith open('input.pdf', 'wb') as f:\n    f.write(b'%PDF-1.4\\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/MediaBox[0 0 612 792]/Contents 4 0 R>>endobj 4 0 obj<</Length 11>>stream\\nBT /F1 12 Tf 72 712 Td (Hello World)Tj ET\\nendstream\\nendobj\\nxref\\n0 5\\n0000000000 65535 f\\n0000000009 00000 n\\n0000000055 00000 n\\n0000000109 00000 n\\n0000000171 00000 n\\ntrailer<</Size 5/Root 1 0 R>>startxref\\n200\\n%%EOF')\n\n# The recommended way to call ocrmypdf.ocr() is to construct an OcrOptions object.\n# This provides type hints and validation. (v17.0.0+)\noptions = OcrOptions(\n    input_file='input.pdf',\n    output_file='output_ocr.pdf',\n    deskew=True,\n    languages=['eng'],\n    # Example: use environment variable for Tesseract path if needed for CI/local testing\n    # tesseract_path=os.environ.get('TESSERACT_PATH', None)\n)\n\ntry:\n    ocrmypdf.ocr(options)\n    print(\"OCR processing complete. 
Output saved to output_ocr.pdf\")\nexcept ocrmypdf.exceptions.BadArgsError as e:\n    print(f\"Error with OCRmyPDF arguments: {e}\")\nexcept ocrmypdf.exceptions.InputFileError as e:\n    print(f\"Error with input file: {e}\")\nexcept Exception as e:\n    print(f\"An unexpected error occurred: {e}\")\nfinally:\n    # Clean up demo files (remove this block to keep output_ocr.pdf)\n    if os.path.exists('input.pdf'):\n        os.remove('input.pdf')\n    if os.path.exists('output_ocr.pdf'):\n        os.remove('output_ocr.pdf')\n","lang":"python","description":"This quickstart demonstrates the API introduced in OCRmyPDF v17.0.0: build an `OcrOptions` object and pass it to `ocrmypdf.ocr()`, which gives better type hinting and argument validation. It includes basic error handling and creates a placeholder input file so the script is self-contained; the demo files are deleted at the end. `ocrmypdf` relies on external system dependencies (such as Tesseract and Ghostscript), which must be installed separately."},"warnings":[{"fix":"Install the required system dependencies (Tesseract, Ghostscript, etc.) for your operating system; the optional rasterizer pypdfium2 is a Python package and is installed with pip. Consult the official OCRmyPDF installation documentation for instructions specific to your platform.","message":"OCRmyPDF relies heavily on external system dependencies (e.g., Tesseract OCR, Ghostscript). These are NOT installed by `pip install ocrmypdf` and must be provided by the operating system package manager (e.g., `apt`, `brew`, `choco`). Without them, the library will not function, often resulting in 'file not found' errors.","severity":"breaking","affected_versions":"<=17.x.x"},{"fix":"Refactor calls to `ocrmypdf.ocr()` to construct and pass an `OcrOptions` instance: `options = OcrOptions(input_file='...', output_file='...', ...); ocrmypdf.ocr(options)`.","message":"Starting with v17.0.0, the recommended way to call `ocrmypdf.ocr()` is to pass an `OcrOptions` object that carries all parameters. 
While the legacy positional argument style is still supported, using `OcrOptions` offers improved type hinting, validation, and clarity.","severity":"breaking","affected_versions":">=17.0.0"},{"fix":"Replace individual flags such as `--force-ocr` with the unified `--mode` argument on the command line, and use the corresponding mode setting in the Python API's `OcrOptions`.","message":"As of v17.0.0, command-line flags like `--force-ocr`, `--skip-text`, and `--redo-ocr` are consolidated under the new `--mode` argument (e.g., `--mode force`, `--mode skip`, `--mode redo`). The old flags remain as silent aliases but are deprecated in favor of `--mode`.","severity":"deprecated","affected_versions":">=17.0.0"},{"fix":"To process multiple PDFs in parallel, run each OCRmyPDF task in its own Python process, for example with `multiprocessing` or by running the `ocrmypdf` command in subprocesses.","message":"OCRmyPDF maintains global state, meaning only one OCR operation can reliably run per Python process at a time. Attempting parallel `ocrmypdf.ocr()` calls within a single process can lead to unexpected behavior or deadlocks.","severity":"gotcha","affected_versions":"<=17.x.x"},{"fix":"If you encounter JPEG corruption, consider using `pypdfium2` as the PDF rasterizer instead of Ghostscript (if compatible with your setup), or use a Ghostscript version known not to have the bug. On the command line the rasterizer is selected with `--rasterizer pypdfium2`; the equivalent option can be set in `OcrOptions`.","message":"A known issue with Ghostscript (a key dependency) can lead to JPEG corruption. 
This warning was updated in v17.4.1 to confirm that the issue persists in Ghostscript 10.7.0.","severity":"gotcha","affected_versions":">=17.x.x (depending on Ghostscript version)"},{"fix":"If you intend to re-OCR or otherwise process such files, use `--mode force` (formerly `--force-ocr`), `--mode skip` (formerly `--skip-text`), or `--mode redo` (formerly `--redo-ocr`), depending on the desired behavior.","message":"Running OCRmyPDF on a PDF that already contains text (either born-digital text or a hidden OCR layer) raises an error by default: 'Page already has text!'. This is a safety mechanism.","severity":"gotcha","affected_versions":"<=17.x.x"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}