PyMuPDF Utilities for LLM/RAG

1.27.2.2 · active · verified Sun Apr 05

PyMuPDF4LLM (also available under the alias `pdf4llm`) is a Python library built on PyMuPDF that converts PDF documents into clean, structured formats such as Markdown, JSON, and plain text for use with Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) pipelines. It performs layout analysis, applies OCR automatically to scanned pages, handles multi-column layouts, and extracts images. The library is actively maintained and frequently updated; the current stable version is 1.27.2.2.

Warnings

Install

Imports

Quickstart

This quickstart converts a PDF document to Markdown with `pymupdf4llm.to_markdown()` and saves the result to a file. The library can also produce JSON and plain text via `to_json()` and `to_text()`, respectively.

import pymupdf4llm
import pathlib

# Path to the PDF to convert; for real-world use, replace it with
# a valid path or pass a PyMuPDF Document object instead
input_pdf_path = "example.pdf"

# Create a dummy PDF for demonstration, but only if the file doesn't already exist
# Note: insert_text() writes the Markdown-like characters below literally, as plain text
if not pathlib.Path(input_pdf_path).exists():
    try:
        import pymupdf  # PyMuPDF (older releases expose this module as `fitz`)
        doc = pymupdf.open()
        page = doc.new_page()
        page.insert_text((72, 72), "# Hello, PyMuPDF4LLM!\n\nThis is a sample PDF content.\n\n- Item 1\n- Item 2\n\n| Header 1 | Header 2 |\n|----------|----------|\n| Data 1   | Data 2   |", fontsize=12)
        doc.save(input_pdf_path)
        doc.close()
    except ImportError:
        print("PyMuPDF not installed, cannot create dummy PDF. Please provide a real PDF.")
        input_pdf_path = None

if input_pdf_path and pathlib.Path(input_pdf_path).exists():
    # Convert the PDF content to Markdown
    md_text = pymupdf4llm.to_markdown(input_pdf_path)

    # Print the converted markdown content
    print("\n--- Markdown Output ---")
    print(md_text)

    # Optionally, write it to a markdown file
    output_md_path = pathlib.Path("output.md")
    output_md_path.write_text(md_text, encoding="utf-8")
    print(f"\nMarkdown saved to {output_md_path.absolute()}")
else:
    print("Skipping quickstart as no PDF file is available.")
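
Because the Markdown output is meant for RAG pipelines, a common next step is to split it into chunks before embedding. The following is a minimal sketch of one way to do that, using only the Python standard library. It is not part of PyMuPDF4LLM: `split_markdown_by_heading` is a hypothetical helper name, and the sample string stands in for the `md_text` returned by `to_markdown()`.

```python
import re


def split_markdown_by_heading(md_text: str) -> list[dict]:
    """Split Markdown into chunks, starting a new chunk at each ATX heading.

    Hypothetical helper for illustration; not part of PyMuPDF4LLM.
    """
    heading = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
    chunks = []
    current = {"heading": None, "lines": []}
    for line in md_text.splitlines():
        match = heading.match(line)
        if match:
            # Close the previous chunk if it holds any non-blank text
            if any(item.strip() for item in current["lines"]):
                chunks.append(current)
            current = {"heading": match.group(2), "lines": [line]}
        else:
            current["lines"].append(line)
    # Keep the final chunk as well
    if any(item.strip() for item in current["lines"]):
        chunks.append(current)
    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


# Stand-in for the md_text produced by pymupdf4llm.to_markdown()
sample_md = "# Title\n\nIntro text.\n\n## Section A\n\n- Item 1\n- Item 2\n"
for chunk in split_markdown_by_heading(sample_md):
    print(chunk["heading"], "->", len(chunk["text"]), "characters")
```

This sketch treats every line that starts with one to six `#` characters as a heading, so it would also split on such lines inside fenced code blocks. A production pipeline would likely need to handle that case and cap chunk sizes.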
