Marker PDF
Marker PDF is a Python library that converts PDF documents to Markdown and aims to be both fast and accurate. It uses OCR and layout analysis models to preserve the structure and content of the original document. As of version 1.10.2, it is actively developed, with frequent minor releases focused on model improvements, performance, and bug fixes.
Warnings
- breaking The `format_lines` parameter was removed from the Python API and CLI in `v1.8.3`. Calls and commands that still pass it to fine-tune output formatting must be updated.
- gotcha Marker PDF uses deep learning models for OCR and layout analysis, which can be computationally intensive. Conversion can consume significant CPU and RAM, especially for large, complex, or image-heavy PDFs. Performance might also be impacted by model updates (e.g., 'block mode' in `v1.9.0` made it 'a bit slower').
- gotcha The quality and exact formatting of the generated markdown can vary significantly based on the input PDF's structure, clarity, and the specific version of Marker PDF used. Frequent model updates (e.g., in `v1.10.0`, `v1.8.3`) aim to improve accuracy but can lead to subtle differences in output between versions.
- gotcha The license for Marker PDF changed to OpenRAIL-M-v1.0 around `v1.8.5`. This changes the usage rights and commercial terms for the library and its models, so review the license before upgrading past that version.
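Because several of the warnings above are tied to specific releases, it can help to check the installed version before relying on version-dependent behavior. The following is a minimal sketch, assuming the distribution name `marker-pdf` from the install command. The helper names `parse_version`, `format_lines_removed`, and `installed_marker_version` are hypothetical and not part of Marker PDF, and the `1.8.3` threshold is taken from the warning above.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '1.10.2' into (1, 10, 2); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def format_lines_removed(installed):
    """True if `installed` is at or past v1.8.3 (threshold from the warning above)."""
    return parse_version(installed) >= (1, 8, 3)


def installed_marker_version():
    """Return the installed marker-pdf version string, or None if not installed."""
    try:
        return version("marker-pdf")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_marker_version()
    if found is None:
        print("marker-pdf is not installed")
    else:
        print(f"marker-pdf {found}; format_lines removed: {format_lines_removed(found)}")
```

Versions are compared as integer tuples rather than strings because a plain string comparison would sort `1.10.0` before `1.8.3`.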
Install
pip install marker-pdf
Imports
- PdfConverter
from marker.converters.pdf import PdfConverter
- create_model_dict
from marker.models import create_model_dict
- text_from_rendered
from marker.output import text_from_rendered
Quickstart
import os

# Assumes the marker 1.x Python API (PdfConverter, create_model_dict, text_from_rendered).
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Path to the PDF to convert. Override it with the MARKER_PDF_PATH environment variable.
pdf_path = os.environ.get('MARKER_PDF_PATH', 'sample.pdf')

# Create a one-page blank PDF if the file is missing, so the example can run end to end.
if not os.path.exists(pdf_path):
    try:
        from pypdf import PdfWriter
        writer = PdfWriter()
        writer.add_blank_page(width=72, height=72)
        with open(pdf_path, 'wb') as f:
            writer.write(f)
        print(f"Created a dummy PDF at {pdf_path} for quickstart.")
    except ImportError:
        print("To create a dummy PDF, install pypdf: `pip install pypdf`")
        print(f"Please replace '{pdf_path}' with a path to a real PDF file.")
        pdf_path = None  # Skip conversion if the dummy could not be created

if pdf_path and os.path.exists(pdf_path):
    print(f"Converting PDF: {pdf_path}")
    # create_model_dict() loads the OCR and layout models (downloaded on first use).
    converter = PdfConverter(artifact_dict=create_model_dict())
    rendered = converter(pdf_path)
    # text_from_rendered returns the text, the output format extension, and the extracted images.
    full_text, _, images = text_from_rendered(rendered)
    print("--- Markdown Output ---")
    print(full_text[:500])  # Print first 500 characters of markdown
    print(f"Extracted images: {list(images)}")
else:
    print("Skipping conversion: PDF path not valid or dummy PDF creation failed.")
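The quickstart only prints the first 500 characters of the result. To keep the full output, the Markdown string can be written next to the source PDF. The following is a minimal sketch using only the standard library. `markdown_path_for` and `save_markdown` are hypothetical helper names, not part of Marker PDF, and `full_text` stands for whatever Markdown string the conversion returned.

```python
import tempfile
from pathlib import Path


def markdown_path_for(pdf_path):
    """Map 'docs/report.pdf' to 'docs/report.md' (same folder, .md suffix)."""
    return Path(pdf_path).with_suffix(".md")


def save_markdown(full_text, pdf_path):
    """Write the Markdown text as UTF-8 beside the PDF and return the output path."""
    out_path = markdown_path_for(pdf_path)
    out_path.write_text(full_text, encoding="utf-8")
    return out_path


if __name__ == "__main__":
    # Demonstration in a throwaway directory; no real PDF is needed to derive the path.
    with tempfile.TemporaryDirectory() as tmp:
        written = save_markdown("# Title\n\nBody text.\n", Path(tmp) / "sample.pdf")
        print(written.name)
```

In the quickstart, this would be called as `save_markdown(full_text, pdf_path)` after a successful conversion.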