PyMuPDF Utilities for LLM/RAG
PyMuPDF4LLM (also aliased as `pdf4llm`) is a Python library built on PyMuPDF that converts PDF documents into clean, structured formats (Markdown, JSON, and plain text) suited to Large Language Model (LLM) and Retrieval-Augmented Generation (RAG) pipelines. It includes layout analysis and automatic OCR for scanned pages, and it supports multi-column layouts and image extraction. The library is actively maintained and frequently updated; the current stable version is 1.27.2.2.
Warnings
- gotcha Page numbering in PyMuPDF4LLM (and PyMuPDF) is 0-based. Users expecting 1-based indexing for page selection or output references should adjust their logic accordingly.
- gotcha For full OCR functionality (e.g., on scanned PDFs), an external Tesseract OCR engine must be installed and accessible on the system PATH, even though PyMuPDF4LLM handles automatic detection and invocation.
- gotcha PyMuPDF4LLM excels at structured extraction, but Markdown output does not always preserve complex structures perfectly. Known weak spots include deeply nested lists, certain table types without clear vertical borders, and links (an entire line may become a hyperlink).
- gotcha The exclusion of page headers and footers (e.g., using `header=False`, `footer=False`) is currently not applicable when generating JSON output, as JSON aims to represent all data for the included pages.
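Two of the gotchas above (0-based page numbers and the external Tesseract dependency) can be guarded against before conversion starts. The following is a minimal sketch using only the standard library; the helper names `to_zero_based` and `tesseract_available` are hypothetical and not part of PyMuPDF4LLM.

```python
import shutil


def to_zero_based(pages_one_based):
    """Convert human-facing 1-based page numbers to 0-based indices."""
    indices = []
    for n in pages_one_based:
        if n < 1:
            raise ValueError(f"page numbers start at 1, got {n}")
        indices.append(n - 1)
    return indices


def tesseract_available():
    """Return True if a `tesseract` executable is found on the system PATH."""
    return shutil.which("tesseract") is not None


# Pages 1 and 3 as a reader would count them become indices 0 and 2.
pages = to_zero_based([1, 3])

if not tesseract_available():
    print("Tesseract not found on PATH; OCR of scanned pages may not work.")
```

The resulting list can then be passed as the `pages` argument, for example `pymupdf4llm.to_markdown("input.pdf", pages=pages)`.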
Install
- pip install -U pymupdf4llm
- pip install -U 'pymupdf4llm[ocr,layout]'
Imports
- to_markdown
    import pymupdf4llm
    md_text = pymupdf4llm.to_markdown("input.pdf")
- to_json
    import pymupdf4llm
    json_text = pymupdf4llm.to_json("input.pdf")
- to_text
    import pymupdf4llm
    plain_text = pymupdf4llm.to_text("input.pdf")
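Because the three entry points listed above take the same kind of input, a caller can pick the output format at runtime. This is a minimal sketch under that assumption; the `convert` helper, the `CONVERTERS` table, and the `backend` parameter are hypothetical names introduced for illustration. A stub stands in for the library in the demo so the snippet runs without a PDF.

```python
from types import SimpleNamespace

# Maps an output format to the PyMuPDF4LLM function name listed above.
CONVERTERS = {"markdown": "to_markdown", "json": "to_json", "text": "to_text"}


def convert(pdf_path, fmt, backend=None):
    """Call the converter matching `fmt` on `pdf_path` and return its result."""
    if fmt not in CONVERTERS:
        raise ValueError(f"unsupported format {fmt!r}; choose from {sorted(CONVERTERS)}")
    if backend is None:
        # Imported lazily so the helper can be exercised with a stub.
        import pymupdf4llm as backend
    return getattr(backend, CONVERTERS[fmt])(pdf_path)


# Demo with a stand-in backend (an assumption for illustration only).
stub = SimpleNamespace(
    to_markdown=lambda p: f"md:{p}",
    to_json=lambda p: f"json:{p}",
    to_text=lambda p: f"txt:{p}",
)
result = convert("input.pdf", "json", backend=stub)
```

With the real library installed, `convert("input.pdf", "markdown")` would dispatch to `pymupdf4llm.to_markdown`.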
Quickstart
import pathlib

import pymupdf4llm

# Path of the PDF to convert. For real-world use, replace this with a valid
# path or a PyMuPDF Document object.
input_pdf_path = "example.pdf"

# Create a dummy PDF for demonstration if the file doesn't exist yet.
# In a real scenario, you would have your actual PDF file.
if not pathlib.Path(input_pdf_path).exists():
    try:
        import fitz  # PyMuPDF

        doc = fitz.open()
        page = doc.new_page()
        page.insert_text((72, 72), "# Hello, PyMuPDF4LLM!\n\nThis is a sample PDF content.\n\n- Item 1\n- Item 2\n\n| Header 1 | Header 2 |\n|----------|----------|\n| Data 1 | Data 2 |", fontsize=12)
        doc.save(input_pdf_path)
        doc.close()
    except ImportError:
        print("PyMuPDF not installed, cannot create dummy PDF. Please provide a real PDF.")
        input_pdf_path = None

if input_pdf_path and pathlib.Path(input_pdf_path).exists():
    # Convert the PDF content to Markdown
    md_text = pymupdf4llm.to_markdown(input_pdf_path)

    # Print the converted Markdown content
    print("\n--- Markdown Output ---")
    print(md_text)

    # Optionally, write it to a Markdown file
    output_md_path = pathlib.Path("output.md")
    output_md_path.write_text(md_text, encoding="utf-8")
    print(f"\nMarkdown saved to {output_md_path.absolute()}")
else:
    print("Skipping quickstart as no PDF file is available.")
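In a RAG setting, the Markdown string produced by the quickstart is usually split into smaller pieces before indexing. The sketch below shows one simple approach, splitting at ATX headings (`#` through `######`); `split_by_headings` is a hypothetical helper, not a PyMuPDF4LLM function, and it assumes headings do not appear inside fenced code blocks.

```python
import re


def split_by_headings(md_text):
    """Split Markdown into chunks, starting a new chunk at each ATX heading."""
    chunks = []
    current = []
    for line in md_text.splitlines():
        # A heading closes the chunk collected so far and opens a new one.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contained only whitespace.
    return [c for c in chunks if c]


sample = "# Intro\n\nFirst paragraph.\n\n## Details\n\nSecond paragraph.\n"
chunks = split_by_headings(sample)
```

Each element of `chunks` keeps its heading line, which gives the retriever some context about where the text came from.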