PyMuPDF Layout
PyMuPDF Layout is a fast, lightweight Python package that adds AI-driven layout analysis for PDFs to PyMuPDF. It converts PDFs into structured data (Markdown, JSON, or plain text) using Graph Neural Networks trained on PDF internals, which the project reports as roughly 10x faster than vision-based tools and which does not require a GPU. The current version is 1.27.2.2. Updates are frequent and are often released alongside the companion library, PyMuPDF4LLM.
Warnings
- gotcha The `pymupdf.layout` module *must* be imported before `pymupdf4llm` to ensure that PyMuPDF's layout analysis features are activated. If the order is incorrect, `pymupdf4llm` will run without layout enhancement.
- gotcha The `header=False` and `footer=False` parameters for omitting headers and footers are not applicable when extracting data using `pymupdf4llm.to_json()`. The JSON output is designed to be a comprehensive representation of all page data.
- breaking Prior to `pymupdf4llm` version 1.27, `pymupdf-layout` had to be explicitly installed and imported. Since `pymupdf4llm` v1.27, `pymupdf-layout` is automatically installed and used, simplifying the setup but changing the dependency structure.
- gotcha PyMuPDF Layout is licensed under PolyForm Noncommercial, which restricts commercial use. Review the license terms carefully for your specific application.
- gotcha For advanced document types (e.g., Office documents like DOCX, XLSX, PPTX), `PyMuPDF Pro` is required in addition to `PyMuPDF4LLM` to enable processing. PyMuPDF Layout itself primarily enhances PDF processing.
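One way to keep the `header`/`footer` caveat out of calling code is a small helper that builds keyword arguments per output format. This is a minimal sketch: `extraction_kwargs` and its `keep_header`/`keep_footer` parameters are hypothetical names, not part of `pymupdf4llm`. Only the `header` and `footer` keywords come from the warning above.

```python
def extraction_kwargs(output_format, keep_header=True, keep_footer=True):
    """Build keyword arguments for an extraction call (hypothetical helper).

    `header` and `footer` are passed only for Markdown output, because the
    warning above says they do not apply to `to_json()`.
    """
    if output_format not in {"markdown", "json"}:
        raise ValueError(f"unsupported output format: {output_format!r}")
    if output_format == "json":
        # JSON output always carries all page data, so no filter flags are sent.
        return {}
    return {"header": keep_header, "footer": keep_footer}


# Intended use (not executed here, since it needs the library and a document):
#   md = pymupdf4llm.to_markdown(doc, **extraction_kwargs("markdown", False, False))
#   js = pymupdf4llm.to_json(doc, **extraction_kwargs("json"))
print(extraction_kwargs("markdown", keep_header=False, keep_footer=False))
print(extraction_kwargs("json", keep_header=False))
```

Returning an empty dictionary for JSON makes the omission explicit at the call site instead of relying on the flags being silently ignored.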
Install
- pip install pymupdf-layout pymupdf4llm
- pip install pymupdf4llm
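Which command applies depends on the installed `pymupdf4llm` version, per the 1.27 change noted under Warnings. The sketch below checks this with the standard library only. The helper names (`parse_version`, `layout_is_bundled`, `installed_version`) are hypothetical, and the parsing assumes plain dot-separated numeric versions such as `1.27.2.2`.

```python
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as ints."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component
        parts.append(int(piece))
    return tuple(parts)


def layout_is_bundled(pymupdf4llm_version):
    """True if the version is 1.27 or later (threshold taken from Warnings)."""
    return parse_version(pymupdf4llm_version)[:2] >= (1, 27)


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


current = installed_version("pymupdf4llm")
if current is None:
    print("pymupdf4llm is not installed")
elif layout_is_bundled(current):
    print(f"pymupdf4llm {current}: pymupdf-layout should be installed with it")
else:
    print(f"pymupdf4llm {current}: install pymupdf-layout explicitly")
```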
Imports
- layout
import pymupdf.layout
- pymupdf4llm
import pymupdf4llm
- pymupdf
import pymupdf
Quickstart
import pymupdf # For document opening
import pymupdf.layout # Crucial: import layout before pymupdf4llm
import pymupdf4llm # For structured data extraction
import os
# Create a dummy PDF file for demonstration
doc = pymupdf.open()
p = doc.new_page()
p.insert_text((50, 50), "Header text\n", fontname="helv", fontsize=12)
p.insert_text((50, 80), "This is a sample paragraph. It demonstrates basic text extraction.")
p.insert_text((50, 110), "Another paragraph with some more content to showcase layout analysis.")
doc.save("sample_document.pdf")
doc.close()
# Open the document with PyMuPDF
document = pymupdf.open("sample_document.pdf")
# Extract structured data as Markdown
markdown_output = pymupdf4llm.to_markdown(document)
print("\n--- Markdown Output ---")
print(markdown_output)
# Extract structured data as JSON (note: header/footer filtering not applicable to JSON)
json_output = pymupdf4llm.to_json(document)
print("\n--- JSON Output ---")
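# to_json() may return a JSON string; parse it first so it is not double-encoded
import json
parsed = json.loads(json_output) if isinstance(json_output, str) else json_output
# For brevity, print only the first 1000 characters of the pretty-printed JSON
print(json.dumps(parsed, indent=2)[:1000])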
# Close the document first (an open handle blocks deletion on Windows), then remove the dummy file
document.close()
os.remove("sample_document.pdf")
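When the JSON output is large, a small helper can print a bounded preview. This is a minimal sketch, assuming the result may be either a JSON string or an already-parsed object. `preview_json` is a hypothetical helper, and the sample string is a made-up stand-in, not real library output.

```python
import json


def preview_json(result, limit=500):
    """Return at most `limit` characters of a pretty-printed JSON value.

    Accepts a JSON string or an already-parsed object, since the return
    type of the extraction call is treated as unknown in this sketch.
    """
    data = json.loads(result) if isinstance(result, str) else result
    text = json.dumps(data, indent=2, ensure_ascii=False)
    if len(text) <= limit:
        return text
    return text[:limit] + "\n... (truncated)"


# Made-up stand-in for extraction output; the real structure may differ.
sample = '{"pages": [{"number": 1, "text": "Header text"}]}'
print(preview_json(sample, limit=40))
```

Parsing before re-serializing avoids printing an escaped string when the input is already JSON text.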