PDFMiner.six

20260107 · active · verified Sun Mar 29

PDFMiner.six is a community-maintained fork of the original PDFMiner, a Python library for parsing and analyzing PDF documents. It extracts text, layout information, and other elements such as images, and supports various PDF specifications, CJK languages, and encrypted documents. The library is actively maintained, with frequent releases that deliver bug fixes, new features, and security fixes.

Warnings

Install

pip install pdfminer.six

Imports

from pdfminer.high_level import extract_text

Quickstart

This quickstart shows the simplest way to extract all text from a PDF file, using the high-level `extract_text` function. The example writes a small dummy PDF to disk so that it can run on its own, then deletes the file afterwards.

import os

from pdfminer.high_level import extract_text

# For demonstration, use a dummy PDF file path.
# In a real scenario, this would be the path to your .pdf file.
dummy_pdf_path = "example.pdf"

try:
    # Write a hand-made dummy PDF so the example can run on its own.
    # It is not strictly valid (no font resource, and the declared stream
    # length does not match the stream), so what gets extracted depends on
    # pdfminer's tolerant parsing. Use a real PDF in your application.
    with open(dummy_pdf_path, 'wb') as f:
        f.write(b'%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R>>endobj 4 0 obj<</Length 41>>stream\nBT /F1 24 Tf 100 700 Td (Hello, PDFMiner.six!) Tj ET\nendstream\nxref\n0 5\n0000000000 65535 f\n0000000009 00000 n\n0000000056 00000 n\n0000000114 00000 n\n0000000213 00000 n\ntrailer<</Size 5/Root 1 0 R>>startxref\n296\n%%EOF')

    # Extract all text from the PDF as a single string
    text = extract_text(dummy_pdf_path)
    print("Extracted Text:")
    print(text)
except Exception as e:
    # A broad catch keeps the demo short; handle specific errors in real code.
    print(f"An error occurred: {e}")
finally:
    # Clean up the dummy PDF file
    if os.path.exists(dummy_pdf_path):
        os.remove(dummy_pdf_path)
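
Because `extract_text` returns the whole document as one string, a common next step is to recover per-page text. The sketch below is one possible approach, not part of the library: `split_pages` and `text_by_page` are hypothetical helper names, and it assumes the text output ends each page with a form feed character (`\f`). The `page_numbers` argument of `extract_text` takes zero-indexed page numbers.

```python
from typing import List, Optional, Sequence


def split_pages(text: str) -> List[str]:
    """Split extract_text() output into one string per page.

    Assumption: each page in the output is terminated by a form feed.
    """
    pages = text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    # Trim the whitespace that surrounds each page's text.
    return [page.strip() for page in pages]


def text_by_page(
    pdf_path: str, page_numbers: Optional[Sequence[int]] = None
) -> List[str]:
    """Extract text from a PDF and return it as a list of page strings.

    page_numbers is zero-indexed; None extracts every page.
    """
    # Imported here so split_pages stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path, page_numbers=page_numbers)
    return split_pages(text)
```

Calling `text_by_page("report.pdf", page_numbers=[0, 1])` would then return the text of the first two pages as separate strings, where `report.pdf` stands in for your own file. If the form feed assumption does not hold for your version, `pdfminer.high_level.extract_pages` offers page-by-page layout objects instead.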
