MinerU PDF to Markdown Converter
MinerU is a document parsing tool that converts PDF, image, DOCX, PPTX, and XLSX inputs into machine-readable Markdown and JSON. Its output is intended for downstream retrieval, extraction, and processing, particularly in LLM pipelines. Currently at version 3.0.9, the library is actively maintained, with ongoing architectural and feature work focused on scientific literature and complex document structures.
Common errors
- failed to read file, please check if the file is corrupted and try uploading again
  cause: The MinerU API backend could not access, download, or correctly parse the provided file, often due to an inaccessible URL, an invalid PDF, or attempting to pass a URL when the API expects multipart file content.
  fix: Ensure the file URL is publicly accessible and points directly to a valid PDF. If using an API, ensure you are uploading the *file content* via `multipart/form-data` to the `/file_parse` endpoint, not just the URL or metadata. Verify your MinerU environment is up to date with all required models.
- {'error': {'message': 'Either text, input_ids or input_embeds should be provided. '}}
  cause: This error occurs when an API request to a text/image prompt endpoint (like `/generate`) is made without providing the required input fields, but is often seen when mistakenly sending a PDF extraction request to this endpoint instead of the dedicated PDF parsing endpoint.
  fix: For PDF extraction, ensure you are sending requests to the correct endpoint, typically `/file_parse` or `/tasks` for asynchronous processing, and that you are providing the actual PDF file content as `multipart/form-data`.
- Error: 1 task(s) failed while processing documents: task#1 (your_document.pdf): Timed out while polling task status for task
  cause: Tasks stuck in 'processing' or timing out, especially with `vllm-async-engine`, can be due to missing environment variables, GPU memory issues, or CPU inference timeouts for large documents.
  fix: If using `vllm-async-engine`, ensure `VLLM_USE_V1=1` is explicitly set as an environment variable. If on CPU, increase the timeout with `MINERU_PDF_RENDER_TIMEOUT=600` (for 600 seconds). For GPU issues, monitor memory usage and try switching to the `pipeline` backend (`-b pipeline`) for basic CPU processing.
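The first two fixes above both come down to sending the PDF bytes themselves as `multipart/form-data` to `/file_parse`. The sketch below builds such a request with only the standard library so the shape of the body is visible. The form field name `files` and the base URL `http://localhost:8000` are assumptions for illustration, not confirmed by this page; check your server's API description for the actual field name.

```python
import uuid
import urllib.request
from pathlib import Path


def build_file_parse_request(
    base_url: str, pdf_path: Path, field_name: str = "files"
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying the PDF bytes.

    `field_name` is an assumed form field name, not taken from the MinerU docs.
    """
    boundary = uuid.uuid4().hex
    pdf_bytes = pdf_path.read_bytes()
    # One form part: part headers, a blank line, the raw file bytes,
    # then the closing boundary marker.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (
            f'Content-Disposition: form-data; name="{field_name}"; '
            f'filename="{pdf_path.name}"\r\n'
        ).encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/file_parse",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Usage (hypothetical local server; uncomment to actually send):
# req = build_file_parse_request("http://localhost:8000", Path("your_document.pdf"))
# with urllib.request.urlopen(req, timeout=600) as resp:
#     print(resp.status)
```

The point of the sketch is that the request body contains the file content, not a URL or a JSON description of the file.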
Warnings
- breaking: Version 2.0 and later removed the `pymupdf` dependency. Code relying on direct `pymupdf` calls or specific behaviors might break.
- gotcha: Python 3.13 has limited support on Windows due to underlying dependencies like `ray`. Users on Windows should stick to Python 3.10, 3.11, or 3.12.
- gotcha: Parsing complex documents may result in inaccurate output for specific elements. Known limitations include reading order in extremely complex layouts, limited vertical text support, recognition issues with uncommon list formats, lack of code block recognition, poor parsing of comic books/art albums/primary school textbooks, table recognition errors in complex tables, and inaccurate OCR for lesser-known languages.
- gotcha: When parsing multiple PDFs in a loop, especially with the `vllm` backend, users might encounter `PdfiumError ("Failed to import pages")` due to resource exhaustion or `PDFium`'s non-thread-safe nature.
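Since the last warning attributes `PdfiumError` partly to PDFium not being thread-safe, one cautious pattern is to make sure only one parse call runs at a time and to record failures per file rather than aborting the whole batch. The helper below is a generic sketch: `parse_one_at_a_time` and `parse_fn` are hypothetical names, and `parse_fn` stands in for whatever call you use to parse a single document. It addresses concurrent access only and is not a confirmed fix for resource exhaustion.

```python
import threading
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional

# A single process-wide lock so that parse calls never overlap,
# even if this helper is invoked from several threads.
_PARSE_LOCK = threading.Lock()


def parse_one_at_a_time(
    paths: Iterable[Path], parse_fn: Callable[[Path], None]
) -> Dict[Path, Optional[Exception]]:
    """Call parse_fn for each path, never concurrently.

    Returns a mapping from path to None (success) or the exception raised.
    """
    results: Dict[Path, Optional[Exception]] = {}
    for path in paths:
        with _PARSE_LOCK:
            try:
                parse_fn(path)
                results[path] = None
            except Exception as exc:  # keep going; report the failure per file
                results[path] = exc
    return results
```

Collecting exceptions per file makes it easy to retry only the documents that failed, for example after restarting the process to release resources.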
Install
- pip install -U "mineru[all]"
- pip install "mineru[core]"
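After installing, it can help to confirm which version actually landed in the active environment, for instance when checking against the 3.0.9 release mentioned above. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "mineru") -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "mineru is not installed in this environment")
```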
Imports
- parse_doc
  wrong: from mineru import parse_doc
  correct: from mineru.utils.demo_utils import parse_doc
Quickstart
import sys
from pathlib import Path

from mineru.utils.demo_utils import parse_doc

# Replace 'your_document.pdf' with the path to a real PDF file.
input_pdf_path = Path("your_document.pdf")
if not input_pdf_path.is_file():
    # A placeholder text file is not a valid PDF and would fail to parse,
    # so stop here instead of creating a dummy file.
    sys.exit(f"'{input_pdf_path}' not found. Please provide a valid PDF for the quickstart.")

output_directory = Path("mineru_output")
output_directory.mkdir(exist_ok=True)

print(f"Parsing {input_pdf_path} to Markdown...")

# Parse the document using the pipeline backend (CPU-friendly).
# 'lang' can be adjusted, e.g., 'en' for English.
# 'backend' can be 'vlm-transformers' for higher accuracy if a GPU is available.
parse_doc(
    path_list=[input_pdf_path],
    output_dir=output_directory,
    lang="en",
    backend="pipeline",  # Use 'vlm-transformers' or 'vlm-sglang-engine' if a GPU is available
    f_dump_md=True,  # Output Markdown files
)

print(f"Parsing complete. Check output in: {output_directory.resolve()}")
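Once the quickstart has run, the generated Markdown sits somewhere under the output directory. The exact subdirectory layout is not described here, so the sketch below makes no assumption about it and simply searches recursively for `.md` files. The helper name `find_markdown_outputs` is hypothetical; `mineru_output` matches the directory used in the quickstart.

```python
from pathlib import Path
from typing import List


def find_markdown_outputs(output_dir: Path) -> List[Path]:
    """Return every .md file under output_dir, sorted for stable ordering."""
    return sorted(p for p in output_dir.rglob("*.md") if p.is_file())


if __name__ == "__main__":
    for md_path in find_markdown_outputs(Path("mineru_output")):
        text = md_path.read_text(encoding="utf-8")
        print(f"{md_path}: {len(text)} characters")
```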