MinerU PDF to Markdown Converter

3.0.9 · active · verified Thu Apr 16

MinerU is a document parsing tool that converts PDF, image, DOCX, PPTX, and XLSX inputs into machine-readable Markdown and JSON. Its output is intended for downstream retrieval, extraction, and processing, particularly in LLM pipelines. The library is currently at version 3.0.9 and is actively maintained, with ongoing architectural and feature work focused on scientific literature and complex document structures.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to use MinerU's Python API to convert a local PDF file into Markdown. It creates an output directory and calls the `parse_doc` function from `mineru.utils.demo_utils`. Replace `your_document.pdf` with the actual path to your PDF file. The `backend` parameter selects CPU-only inference (`pipeline`) or GPU-accelerated inference (e.g., `vlm-transformers`).

import sys
from pathlib import Path

from mineru.utils.demo_utils import parse_doc

# Replace with the path to a real PDF file.
input_pdf_path = Path("your_document.pdf")
if not input_pdf_path.is_file():
    # A plain-text placeholder is not a valid PDF and cannot be parsed, so stop here.
    sys.exit(f"Error: '{input_pdf_path}' not found. Please provide a valid PDF for the quickstart.")

output_directory = Path("mineru_output")
output_directory.mkdir(parents=True, exist_ok=True)

print(f"Parsing {input_pdf_path} to Markdown...")

# Parse the document using the pipeline backend (CPU-friendly).
# 'lang' can be adjusted, e.g., 'en' for English.
# 'backend' can be 'vlm-transformers' for higher accuracy if a GPU is available.
parse_doc(
    path_list=[input_pdf_path],
    output_dir=output_directory,
    lang="en",
    backend="pipeline",  # Use 'vlm-transformers' or 'vlm-sglang-engine' if a GPU is available
    f_dump_md=True,  # Output Markdown files
)

print(f"Parsing complete. Check output in: {output_directory.resolve()}")
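
Once parsing finishes, the generated Markdown can be collected from the output directory. The following is a minimal sketch using only the standard library. It makes no assumption about how MinerU lays out its output folder and simply searches `mineru_output` recursively for `.md` files. The helper name `find_markdown_files` is hypothetical and not part of MinerU.

```python
from pathlib import Path


def find_markdown_files(output_dir):
    """Return all .md files under output_dir, sorted by path.

    Hypothetical helper, not part of MinerU: it searches recursively because
    the exact layout of the output folder is not assumed here.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.md") if p.is_file())


if __name__ == "__main__":
    # List every Markdown file produced by the quickstart and report its size.
    for md_path in find_markdown_files("mineru_output"):
        text = md_path.read_text(encoding="utf-8")
        print(f"{md_path}: {len(text)} characters")
```

Searching recursively keeps the sketch independent of any per-document or per-backend subfolders the parser may create. An empty list means no Markdown was written, which is worth checking before passing results downstream.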
