PyMuPDF Layout

1.27.2.2 · active · verified Wed Apr 01

PyMuPDF Layout is a fast, lightweight Python package that integrates with PyMuPDF to provide AI-driven layout analysis for PDFs. It converts PDFs into structured data (Markdown, JSON, or plain text) using Graph Neural Networks trained on PDF internals rather than on rendered page images, which makes it 10x faster than vision-based tools and removes the need for a GPU. The current version is 1.27.2.2. Updates are frequent and are often released alongside its companion library, PyMuPDF4LLM.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to use PyMuPDF Layout together with PyMuPDF4LLM to extract structured data from a PDF. The key detail is import order: `pymupdf.layout` must be imported before `pymupdf4llm` for layout analysis to be enabled. The example creates a simple PDF, then extracts its content as Markdown and as JSON.

import json
import os

import pymupdf  # For document creation and opening
import pymupdf.layout  # Crucial: import layout before pymupdf4llm
import pymupdf4llm  # For structured data extraction

# Create a dummy PDF file for demonstration
doc = pymupdf.open()
p = doc.new_page()
p.insert_text((50, 50), "Header text\n", fontname="helv", fontsize=12)
p.insert_text((50, 80), "This is a sample paragraph. It demonstrates basic text extraction.")
p.insert_text((50, 110), "Another paragraph with some more content to showcase layout analysis.")
doc.save("sample_document.pdf")
doc.close()

# Open the document with PyMuPDF
document = pymupdf.open("sample_document.pdf")

# Extract structured data as Markdown
markdown_output = pymupdf4llm.to_markdown(document)
print("\n--- Markdown Output ---")
print(markdown_output)

# Extract structured data as JSON (note: header/footer filtering not applicable to JSON)
json_output = pymupdf4llm.to_json(document)
print("\n--- JSON Output ---")
# to_json is expected to return a JSON string; parse it before pretty-printing
# so the text is not encoded twice. An already-parsed object is passed through.
parsed = json.loads(json_output) if isinstance(json_output, str) else json_output
print(json.dumps(parsed, indent=2))

# Close the document before deleting the file (an open handle blocks removal on Windows)
document.close()
os.remove("sample_document.pdf")
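
Because the quickstart depends on `pymupdf.layout` being imported before `pymupdf4llm`, a small runtime check can catch an accidental reordering (for example, after an import sorter rewrites the file). The sketch below is a heuristic, not part of either library: the helper name `layout_imported_first` is hypothetical, and it assumes only that `sys.modules` is a regular dict whose key order follows the order in which modules were first imported.

```python
import sys
from typing import Mapping, Optional


def layout_imported_first(modules: Optional[Mapping[str, object]] = None) -> bool:
    """Return True if pymupdf.layout was registered before pymupdf4llm.

    Hypothetical helper: it inspects the module table only and does not
    import anything itself.
    """
    # sys.modules is a plain dict, so its key order follows first-import order.
    names = list(sys.modules if modules is None else modules)
    if "pymupdf.layout" not in names:
        return False  # layout was never imported
    if "pymupdf4llm" not in names:
        return True  # pymupdf4llm not imported yet, so nothing is out of order
    return names.index("pymupdf.layout") < names.index("pymupdf4llm")


# Demonstration with stand-in module tables (no real imports needed).
good = {"pymupdf": None, "pymupdf.layout": None, "pymupdf4llm": None}
bad = {"pymupdf": None, "pymupdf4llm": None, "pymupdf.layout": None}
print(layout_imported_first(good))  # True
print(layout_imported_first(bad))   # False
```

In a real script, calling `layout_imported_first()` with no argument right after the imports would inspect the live `sys.modules` table. A `False` result suggests that the imports should be reordered before any extraction is attempted.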
