Marker PDF

1.10.2 · active · verified Wed Apr 15

Marker PDF is a Python library that converts PDF documents to markdown quickly and accurately. It uses OCR and layout-analysis models to preserve the structure and content of the original document. As of version 1.10.2, it is actively developed, with frequent minor releases focused on model improvements, performance, and bug fixes.

Warnings

Install

Imports

Quickstart

This quickstart converts a single PDF file to markdown with `PdfConverter`, the converter class in Marker 1.x. It reads the PDF path from the `MARKER_PDF_PATH` environment variable (defaulting to `sample.pdf`), then prints the start of the markdown output and the names of any extracted images. So that the example runs without a pre-existing PDF, it tries to create a blank single-page PDF with `pypdf`.

import os

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Path to the PDF to convert; override it with the MARKER_PDF_PATH environment variable.
pdf_path = os.environ.get('MARKER_PDF_PATH', 'sample.pdf')

# Create a blank one-page PDF if the file doesn't exist, so the example is runnable
if not os.path.exists(pdf_path):
    try:
        from pypdf import PdfWriter
        writer = PdfWriter()
        writer.add_blank_page(width=612, height=792)  # US Letter, in points
        with open(pdf_path, 'wb') as f:
            writer.write(f)
        print(f"Created a dummy PDF at {pdf_path} for quickstart.")
    except ImportError:
        print("To create a dummy PDF, install pypdf: `pip install pypdf`")
        print(f"Please replace '{pdf_path}' with a path to a real PDF file.")
        pdf_path = None  # Prevent execution if the dummy couldn't be created

if pdf_path and os.path.exists(pdf_path):
    print(f"Converting PDF: {pdf_path}")
    # Load the OCR/layout models (downloaded on first use) and build the converter
    converter = PdfConverter(artifact_dict=create_model_dict())
    rendered = converter(pdf_path)
    # text_from_rendered returns the markdown text, the output extension, and a dict of images
    full_text, _, images = text_from_rendered(rendered)

    print("--- Markdown Output ---")
    print(full_text[:500])  # Print the first 500 characters of markdown
    print(f"Extracted images: {list(images.keys())}")
else:
    print("Skipping conversion: PDF path not valid or dummy PDF creation failed.")