PyMuPDF

1.27.2.2 · active · verified Sat Mar 28

PyMuPDF is a high-performance Python library for extracting, analyzing, converting, and manipulating data in PDF and other document formats (XPS, OpenXPS, CBZ, CBR, FB2, EPUB, and various image formats). It is a thin wrapper around Artifex Software's MuPDF engine. The library is actively maintained, with frequent releases that are often synchronized with updates to the underlying MuPDF library.

Warnings

Install

Imports

Quickstart

This quickstart opens a PDF, iterates over its pages, extracts the text of each page, and prints the collected text. To stay self-contained, it first creates a small sample PDF, then makes sure the document is closed and the sample file is deleted afterward.

import os

import pymupdf

# Create a small one-page PDF so the example is self-contained.
# Building it with PyMuPDF avoids hand-written PDF bytes, whose stream length,
# xref offsets, and font resources are easy to get wrong.
with pymupdf.open() as sample:
    page = sample.new_page()
    page.insert_text((72, 72), "Hello, PyMuPDF!")
    sample.save("example.pdf")

# Open the PDF and extract the text of every page
try:
    # The context manager closes the document even if extraction fails
    with pymupdf.open("example.pdf") as doc:
        full_text = [page.get_text() for page in doc]
    print("Extracted text:\n" + "\n".join(full_text))
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    if os.path.exists("example.pdf"):
        os.remove("example.pdf")  # Clean up the sample file
