{"library":"mineru","title":"MinerU PDF to Markdown Converter","description":"MinerU is a document parsing tool that converts PDF, image, DOCX, PPTX, and XLSX inputs into machine-readable Markdown and JSON. Its output is intended for downstream retrieval, extraction, and processing, including LLM pipelines. Currently at version 3.0.9, the library is actively maintained, with ongoing architectural and feature improvements, particularly in handling scientific literature and complex document structures.","language":"python","status":"active","last_verified":"Thu Apr 16","install":{"commands":["pip install -U \"mineru[all]\"","pip install \"mineru[core]\""],"cli":{"name":"mineru","version":""}},"imports":["from mineru.utils.demo_utils import parse_doc"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"from pathlib import Path\nfrom mineru.utils.demo_utils import parse_doc\n\n# Replace 'your_document.pdf' with the path to a real PDF file.\ninput_pdf_path = Path(\"your_document.pdf\")\nif not input_pdf_path.is_file():\n    # A text placeholder is not a valid PDF, so stop here instead of parsing a dummy file.\n    raise SystemExit(f\"'{input_pdf_path}' not found. Please provide a valid PDF for the quickstart.\")\n\noutput_directory = Path(\"mineru_output\")\noutput_directory.mkdir(parents=True, exist_ok=True)\n\nprint(f\"Parsing {input_pdf_path} to Markdown...\")\n\n# Parse the document using the pipeline backend (CPU-friendly).\n# 'lang' can be adjusted, e.g., 'en' for English.\n# 'backend' can be 'vlm-transformers' for higher accuracy if a GPU is available.\nparse_doc(\n    path_list=[input_pdf_path],\n    output_dir=output_directory,\n    lang=\"en\",\n    backend=\"pipeline\",  # Use 'vlm-transformers' or 'vlm-sglang-engine' if a GPU is available\n    f_dump_md=True,  # Output Markdown files\n)\n\nprint(f\"Parsing complete. Check output in: {output_directory.resolve()}\")","lang":"python","description":"This quickstart demonstrates how to use MinerU's Python API to convert a local PDF file into Markdown. It checks that the input file exists, creates an output directory, and calls the `parse_doc` function from `mineru.utils.demo_utils`. Replace `your_document.pdf` with the path to your own PDF file. The `backend` parameter selects CPU-only (`pipeline`) or GPU-accelerated (e.g., `vlm-transformers`) inference.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":null}