{"library":"pymupdf4llm","title":"PyMuPDF Utilities for LLM/RAG","description":"PyMuPDF4LLM (also aliased as `pdf4llm`) is a Python library built on PyMuPDF, specialized in converting PDF documents into clean, structured data formats like Markdown, JSON, and plain text, specifically optimized for Large Language Model (LLM) and Retrieval-Augmented Generation (RAG) environments. It includes layout analysis, automatic OCR for scanned pages, and supports multi-column layouts and image extraction. The library is actively maintained and frequently updated, with the current stable version being 1.27.2.2.","status":"active","version":"1.27.2.2","language":"en","source_language":"en","source_url":"https://github.com/pymupdf/pymupdf4llm/tree/main/pymupdf4llm","tags":["pdf","llm","rag","markdown","json","text-extraction","ocr","document-processing"],"install":[{"cmd":"pip install -U pymupdf4llm","lang":"bash","label":"Basic installation"},{"cmd":"pip install -U 'pymupdf4llm[ocr,layout]'","lang":"bash","label":"With OCR and enhanced layout features"}],"dependencies":[{"reason":"Core dependency for PDF processing, automatically installed.","package":"PyMuPDF","optional":false},{"reason":"For advanced layout analysis, automatically installed and activated with pymupdf4llm.","package":"pymupdf-layout","optional":false},{"reason":"Required for automatic OCR detection heuristics in PyMuPDF-Layout mode. 
Part of the '[ocr]' extra.","package":"opencv-python","optional":true},{"reason":"External dependency required on the system for OCR functionality.","package":"Tesseract OCR engine","optional":true},{"reason":"Optional OCR plugin for improved OCR via the '[ocr]' extra.","package":"RapidOCR","optional":true},{"reason":"Optional OCR plugin, especially for CJK languages, via the '[ocr]' extra.","package":"PaddleOCR","optional":true}],"imports":[{"symbol":"to_markdown","correct":"import pymupdf4llm\nmd_text = pymupdf4llm.to_markdown(\"input.pdf\")"},{"symbol":"to_json","correct":"import pymupdf4llm\njson_text = pymupdf4llm.to_json(\"input.pdf\")"},{"symbol":"to_text","correct":"import pymupdf4llm\nplain_text = pymupdf4llm.to_text(\"input.pdf\")"}],"quickstart":{"code":"import pymupdf4llm\nimport pathlib\n\n# Path of the demo PDF that is created below\n# For real-world use, replace with a valid path or PyMuPDF Document object\ninput_pdf_path = \"example.pdf\"\n\n# Create a dummy PDF for demonstration (this overwrites any existing example.pdf)\n# In a real scenario, you would have your actual PDF file\ntry:\n    import fitz  # PyMuPDF\n    doc = fitz.open()\n    page = doc.new_page()\n    page.insert_text((72, 72), \"# Hello, PyMuPDF4LLM!\\n\\nThis is sample PDF content.\\n\\n- Item 1\\n- Item 2\\n\\n| Header 1 | Header 2 |\\n|----------|----------|\\n| Data 1   | Data 2   |\", fontsize=12)\n    doc.save(input_pdf_path)\n    doc.close()\nexcept ImportError:\n    print(\"PyMuPDF not installed, cannot create dummy PDF. 
Please provide a real PDF.\")\n    input_pdf_path = None\n\nif input_pdf_path and pathlib.Path(input_pdf_path).exists():\n    # Convert the PDF content to Markdown\n    md_text = pymupdf4llm.to_markdown(input_pdf_path)\n\n    # Print the converted markdown content\n    print(\"\\n--- Markdown Output ---\")\n    print(md_text)\n\n    # Optionally, write it to a markdown file\n    output_md_path = pathlib.Path(\"output.md\")\n    output_md_path.write_text(md_text, encoding=\"utf-8\")\n    print(f\"\\nMarkdown saved to {output_md_path.absolute()}\")\nelse:\n    print(\"Skipping quickstart as no PDF file is available.\")","lang":"python","description":"This quickstart demonstrates how to convert a PDF document into Markdown format using `pymupdf4llm.to_markdown()`. It also shows how to save the output to a file. The library can also convert to JSON and plain text using `to_json()` and `to_text()` respectively."},"warnings":[{"fix":"Remember that the first page is page 0. When using the `pages` parameter, provide 0-based page numbers (e.g., `pages=[0, 2, 4]` for the 1st, 3rd, and 5th pages).","message":"Page numbering in PyMuPDF4LLM (and PyMuPDF) is 0-based. Users expecting 1-based indexing for page selection or output references should adjust their logic accordingly.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Install Tesseract OCR on your operating system. 
For Python-side integration, install the `[ocr]` extra (`pip install 'pymupdf4llm[ocr]'`), which includes `opencv-python` for detection heuristics and optional OCR plugins such as RapidOCR/PaddleOCR.","message":"For full OCR functionality (e.g., on scanned PDFs), an external Tesseract OCR engine must be installed and accessible on the system PATH, even though PyMuPDF4LLM handles automatic detection and invocation.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Review the generated Markdown for documents with highly intricate or unconventional layouts and manually adjust if specific formatting is critical. The JSON output for tables often provides more structured data if fidelity is paramount.","message":"While PyMuPDF4LLM excels at structured extraction, deeply nested lists and certain table types without clear vertical borders may not always be perfectly preserved in Markdown output, and link conversion may turn an entire line into a hyperlink.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If header/footer exclusion is critical, process the document into Markdown or plain text, or perform post-processing on the JSON output to remove unwanted sections programmatically.","message":"The exclusion of page headers and footers (e.g., using `header=False`, `footer=False`) is currently not applicable when generating JSON output, as JSON aims to represent all data for the included pages.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-05T00:00:00.000Z","next_check":"2026-07-04T00:00:00.000Z"}