PyMuPDF Layout

1.27.2.2 verified Tue May 12 auth: no python install: stale

PyMuPDF Layout is a fast, lightweight Python package that integrates with PyMuPDF to provide AI-driven layout analysis for PDFs. It converts PDFs into structured data (Markdown, JSON, or plain text) using Graph Neural Networks trained on PDF internals, and is reported to run about 10x faster than vision-based tools without requiring a GPU. The current version is 1.27.2.2; updates are frequent and often ship alongside its companion library, PyMuPDF4LLM.

pip install pymupdf-layout pymupdf4llm
error ModuleNotFoundError: No module named 'pymupdf_layout'
cause The `pymupdf-layout` library is intended to be imported as a submodule of `pymupdf` (i.e., `pymupdf.layout`), not as a top-level module named `pymupdf_layout`. Additionally, it needs to be imported before `pymupdf4llm` to activate its layout features.
fix
Use `import pymupdf.layout` in your Python script, and make sure it appears before `import pymupdf4llm` if you are using PyMuPDF4LLM for extraction.
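The import-order rule can be checked mechanically. The following is a minimal sketch, not part of either library: `layout_imported_first` is a hypothetical helper that parses a script with the standard `ast` module and reports whether `pymupdf.layout` is imported before `pymupdf4llm`. It only inspects top-level import statements and assumes nothing about the packages being installed.

```python
import ast


def layout_imported_first(source: str) -> bool:
    """Return True if pymupdf.layout is imported before pymupdf4llm.

    Hypothetical helper: only top-level import statements are inspected.
    A script that never imports pymupdf4llm passes trivially.
    """
    order = []  # imported module names, in source order
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            order.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # "from pymupdf import layout" is recorded as "pymupdf.layout"
            order.extend(f"{node.module}.{alias.name}" for alias in node.names)

    llm = [i for i, name in enumerate(order)
           if name.split(".")[0] == "pymupdf4llm"]
    if not llm:
        return True  # nothing to order against
    layout = [i for i, name in enumerate(order)
              if name == "pymupdf.layout" or name.startswith("pymupdf.layout.")]
    return bool(layout) and layout[0] < llm[0]


good = "import pymupdf.layout\nimport pymupdf4llm\n"
bad = "import pymupdf4llm\nimport pymupdf.layout\n"
print(layout_imported_first(good), layout_imported_first(bad))
```

A check like this could run in a test suite or pre-commit hook, where an accidental reordering by an import sorter would otherwise go unnoticed.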
error AttributeError: 'Page' object has no attribute 'extract_layout'
cause The `pymupdf-layout` features for structured data extraction (like Markdown, JSON, or plain text) are typically accessed through functions in the companion library `pymupdf4llm`, such as `to_markdown()`, `to_json()`, or `to_text()`, after `pymupdf.layout` has been properly imported to enable the underlying layout analysis.
fix
First, import `pymupdf.layout` and `pymupdf4llm`. Then call the `pymupdf4llm` functions on a Document object obtained via `pymupdf.open()`, for example: `md = pymupdf4llm.to_markdown(doc)`.
error ERROR: Failed building wheel for pymupdf
cause This error occurs during installation when `pip` cannot find a pre-compiled binary wheel for PyMuPDF (which `pymupdf-layout` depends on) for your system, and attempts to build it from source. Building from source requires C/C++ development tools (like Visual Studio on Windows or build-essential on Linux) which are often not present by default.
fix
Ensure your pip is up to date (`python -m pip install --upgrade pip`). If the error persists, install the necessary C/C++ build tools for your operating system: for Windows, install Visual Studio 2019 (Community edition is sufficient); for Linux, install build-essential (e.g., `sudo apt-get install build-essential`).
gotcha The `pymupdf.layout` module *must* be imported before `pymupdf4llm` to ensure that PyMuPDF's layout analysis features are activated. If the order is incorrect, `pymupdf4llm` will run without layout enhancement.
fix Ensure `import pymupdf.layout` appears before `import pymupdf4llm` in your code.
gotcha The `header=False` and `footer=False` parameters for omitting headers and footers are not applicable when extracting data using `pymupdf4llm.to_json()`. The JSON output is designed to be a comprehensive representation of all page data.
fix If header/footer exclusion is needed, process JSON output manually or use `to_markdown()` or `to_text()` with the respective parameters.
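Manual post-processing of the JSON could look like the sketch below. The JSON schema is not described here, so the key name `boxclass` and the labels `page-header` and `page-footer` are assumptions for illustration only; inspect the actual `to_json()` output and adjust them. `drop_boxes` is a hypothetical helper, not a `pymupdf4llm` function.

```python
def drop_boxes(node, unwanted, class_key="boxclass"):
    """Recursively remove dicts whose class_key value is in `unwanted`.

    Hypothetical helper: `class_key` and the labels passed in `unwanted`
    are assumed names, not a documented schema.
    """
    unwanted = tuple(unwanted)  # tuple membership also tolerates unhashable values
    if isinstance(node, dict):
        return {k: drop_boxes(v, unwanted, class_key) for k, v in node.items()}
    if isinstance(node, list):
        return [
            drop_boxes(item, unwanted, class_key)
            for item in node
            if not (isinstance(item, dict) and item.get(class_key) in unwanted)
        ]
    return node


# Made-up data in an assumed shape, used only to show the filtering.
sample = {
    "pages": [
        {
            "boxes": [
                {"boxclass": "page-header", "text": "Report title"},
                {"boxclass": "text", "text": "Body paragraph."},
                {"boxclass": "page-footer", "text": "Page 1"},
            ]
        }
    ]
}
cleaned = drop_boxes(sample, {"page-header", "page-footer"})
print(cleaned["pages"][0]["boxes"])
```

If `to_json()` returns a string rather than a dict, parse it with `json.loads()` before passing it to the helper.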
breaking Prior to `pymupdf4llm` version 1.27, `pymupdf-layout` had to be explicitly installed and imported. Since `pymupdf4llm` v1.27, `pymupdf-layout` is automatically installed and used, simplifying the setup but changing the dependency structure.
fix For older `pymupdf4llm` versions, explicitly `pip install pymupdf-layout` and `import pymupdf.layout`. For v1.27+, `pip install pymupdf4llm` is often sufficient, but explicitly importing `pymupdf.layout` is still good practice to ensure activation.
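The version cutoff described above can be turned into a small runtime check. This is a sketch under the assumption that the 1.27 boundary stated in this note is accurate; `version_tuple`, `needs_explicit_layout`, and `installed_needs_explicit_layout` are hypothetical helpers, and only the standard library (`importlib.metadata`) is used.

```python
from importlib import metadata
from itertools import takewhile


def version_tuple(text):
    """Parse the leading numeric components of a version string."""
    parts = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def needs_explicit_layout(pymupdf4llm_version):
    """True if this pymupdf4llm version predates 1.27 (cutoff taken from the note above)."""
    return version_tuple(pymupdf4llm_version) < (1, 27)


def installed_needs_explicit_layout():
    """Check the installed pymupdf4llm; returns None if it is not installed."""
    try:
        return needs_explicit_layout(metadata.version("pymupdf4llm"))
    except metadata.PackageNotFoundError:
        return None


print(needs_explicit_layout("1.26.5"), needs_explicit_layout("1.27.2.2"))
```

Whatever the result, keeping the explicit `import pymupdf.layout` remains the safer default, as the fix above recommends.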
gotcha PyMuPDF Layout is licensed under PolyForm Noncommercial, which restricts commercial use. Review the license terms carefully for your specific application.
fix Consult the PolyForm Noncommercial license for details. For commercial use, contact Artifex Software for alternative licensing options.
gotcha For advanced document types (e.g., Office documents like DOCX, XLSX, PPTX), `PyMuPDF Pro` is required in addition to `PyMuPDF4LLM` to enable processing. PyMuPDF Layout itself primarily enhances PDF processing.
fix If processing non-PDF document formats, ensure you have the appropriate `PyMuPDF Pro` license and package installed alongside `pymupdf4llm`.
pip install pymupdf4llm
python os / libc variant status wheel install import disk mem side effects
3.10 alpine (musl) pymupdf-layout build_error - - - - - -
3.10 alpine (musl) pymupdf-layout - - - - - -
3.10 alpine (musl) pymupdf4llm wheel - - 175.9M - broken
3.10 alpine (musl) pymupdf4llm - - - - - -
3.10 slim (glibc) pymupdf-layout wheel 12.2s 1.02s 298M 34.4M clean
3.10 slim (glibc) pymupdf-layout - - 1.09s 298M 34.4M -
3.10 slim (glibc) pymupdf4llm wheel 11.7s 0.98s 298M 34.4M clean
3.10 slim (glibc) pymupdf4llm - - 1.06s 298M 34.4M -
3.11 alpine (musl) pymupdf-layout build_error - - - - - -
3.11 alpine (musl) pymupdf-layout - - - - - -
3.11 alpine (musl) pymupdf4llm wheel - - 159.8M - broken
3.11 alpine (musl) pymupdf4llm - - - - - -
3.11 slim (glibc) pymupdf-layout wheel 7.4s 3.58s 255M 39.1M clean
3.11 slim (glibc) pymupdf-layout - - 3.94s 254M 39.1M -
3.11 slim (glibc) pymupdf4llm wheel 7.2s 3.64s 255M 39.1M clean
3.11 slim (glibc) pymupdf4llm - - 3.90s 254M 39.1M -
3.12 alpine (musl) pymupdf-layout build_error - - - - - -
3.12 alpine (musl) pymupdf-layout - - - - - -
3.12 alpine (musl) pymupdf4llm wheel - - 150.2M - broken
3.12 alpine (musl) pymupdf4llm - - - - - -
3.12 slim (glibc) pymupdf-layout wheel 7.3s 2.65s 242M 36.5M clean
3.12 slim (glibc) pymupdf-layout - - 2.96s 241M 36.5M -
3.12 slim (glibc) pymupdf4llm wheel 7.2s 2.72s 242M 36.5M clean
3.12 slim (glibc) pymupdf4llm - - 3.04s 241M 36.5M -
3.13 alpine (musl) pymupdf-layout build_error - - - - - -
3.13 alpine (musl) pymupdf-layout - - - - - -
3.13 alpine (musl) pymupdf4llm wheel - - 146.7M - broken
3.13 alpine (musl) pymupdf4llm - - - - - -
3.13 slim (glibc) pymupdf-layout wheel 7.6s 2.51s 241M 36.5M clean
3.13 slim (glibc) pymupdf-layout - - 2.92s 240M 36.5M -
3.13 slim (glibc) pymupdf4llm wheel 7.3s 2.38s 241M 36.5M clean
3.13 slim (glibc) pymupdf4llm - - 2.89s 240M 36.5M -
3.9 alpine (musl) pymupdf-layout build_error - - - - - -
3.9 alpine (musl) pymupdf-layout - - - - - -
3.9 alpine (musl) pymupdf4llm wheel - - 147.6M - broken
3.9 alpine (musl) pymupdf4llm - - - - - -
3.9 slim (glibc) pymupdf-layout build_error - 1.7s - - - -
3.9 slim (glibc) pymupdf-layout - - - - - -
3.9 slim (glibc) pymupdf4llm wheel 3.3s - 218M - broken
3.9 slim (glibc) pymupdf4llm - - - - - -
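In the table above, every alpine (musl) row ends in a build error or a broken install, while the slim (glibc) rows for Python 3.10 to 3.13 are clean. A script could warn early when it is not running on glibc. The sketch below uses `platform.libc_ver()` from the standard library; `libc_flavor` is a hypothetical helper, and treating an empty result as "unknown" is an assumption, since musl is often not identified by name.

```python
import platform
import sys


def libc_flavor():
    """Best-effort guess of the C library the interpreter is linked against."""
    if not sys.platform.startswith("linux"):
        return "not-linux"
    name, _version = platform.libc_ver()
    if name == "glibc":
        return "glibc"
    if "musl" in name.lower():
        return "musl"
    # On musl-based images the name is frequently empty, so "unknown" on
    # Linux is a hint (not proof) that prebuilt wheels may not match.
    return "unknown"


flavor = libc_flavor()
if flavor in ("musl", "unknown"):
    print("Warning: non-glibc Linux detected; installation may fall back to a source build.")
print(flavor)
```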

This quickstart demonstrates how to use PyMuPDF Layout in conjunction with PyMuPDF4LLM to extract structured data from a PDF. It highlights the critical import order of `pymupdf.layout` before `pymupdf4llm` to enable layout analysis. The example creates a simple PDF, then extracts its content into Markdown and JSON formats.

import json  # For pretty-printing the JSON output
import os

import pymupdf  # For document creation and opening
import pymupdf.layout  # Crucial: import layout before pymupdf4llm
import pymupdf4llm  # For structured data extraction

# Create a dummy PDF file for demonstration
doc = pymupdf.open()
p = doc.new_page()
p.insert_text((50, 50), "Header text\n", fontname="helv", fontsize=12)
p.insert_text((50, 80), "This is a sample paragraph. It demonstrates basic text extraction.")
p.insert_text((50, 110), "Another paragraph with some more content to showcase layout analysis.")
doc.save("sample_document.pdf")
doc.close()

# Open the document with PyMuPDF
document = pymupdf.open("sample_document.pdf")

# Extract structured data as Markdown
markdown_output = pymupdf4llm.to_markdown(document)
print("\n--- Markdown Output ---")
print(markdown_output)

# Extract structured data as JSON (note: header/footer filtering not applicable to JSON)
json_output = pymupdf4llm.to_json(document)
print("\n--- JSON Output ---")
# to_json() may return a JSON string; parse it first so json.dumps() pretty-prints
# the data rather than an escaped string.
parsed = json.loads(json_output) if isinstance(json_output, str) else json_output
# For brevity, print only the first 1000 characters of the formatted JSON
print(json.dumps(parsed, indent=2)[:1000])

# Close the document before deleting the file (an open handle blocks removal on Windows)
document.close()

# Remove the dummy file
os.remove("sample_document.pdf")