PyMuPDF
PyMuPDF is a high-performance Python library for extracting data from, analyzing, converting, and manipulating PDFs and other documents (XPS, OpenXPS, CBZ, CBR, FB2, EPUB, and various image formats). It is a thin wrapper around Artifex Software's MuPDF engine. The library is actively maintained, and its releases are often synchronized with updates to the underlying MuPDF library.
Warnings
- deprecated The legacy `import fitz` still works for backward compatibility, but the recommended and official import is `import pymupdf`. If an unrelated, unmaintained `fitz` package from PyPI is installed, `import fitz` can conflict with it and fail with `ModuleNotFoundError`.
- gotcha Close every `Document` object when you are done with it, especially in long-running applications or when processing many files. Documents left open continue to hold memory and can consume significant system resources.
- breaking Many methods and properties were renamed from `camelCase` (e.g., `pageCount`, `newPage`) to `snake_case` (e.g., `page_count`, `new_page`) starting around version 1.18.14. The old names are deprecated, with removal planned for the release after 1.19.0 (around 1.20.0).
- breaking PyMuPDF regularly drops support for older Python versions, sometimes in a minor release, as happened with Python 3.8. Upgrading PyMuPDF can therefore break older Python environments unexpectedly.
- gotcha On Windows, users may encounter `ImportError: DLL load failed while importing _extra` because `MSVCP140.dll` is missing. This usually means the Microsoft Visual C++ Redistributable package is missing or outdated.
Install
pip install --upgrade pymupdf
Imports
- pymupdf
import pymupdf
Quickstart
import os

import pymupdf

# Create a small one-page PDF for demonstration purposes
demo = pymupdf.open()  # no argument: a new, empty PDF
page = demo.new_page()
page.insert_text((72, 72), "Hello, PyMuPDF!", fontsize=24)
demo.save("example.pdf")
demo.close()

# Open the PDF and extract the text of every page
doc = None  # stays None if opening fails
try:
    doc = pymupdf.open("example.pdf")
    full_text = [page.get_text() for page in doc]
    print("Extracted text:\n" + "\n".join(full_text))
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    if doc is not None and not doc.is_closed:
        doc.close()  # ensure the document is closed
    if os.path.exists("example.pdf"):
        os.remove("example.pdf")  # clean up the demo file