PDFMiner

20191125 · maintenance · verified Thu Apr 16

PDFMiner is a Python library for extracting and analyzing text data from PDF documents, focusing on precise text location and layout information. The version `20191125` is the last release of the original `euske/pdfminer` project. It supports Python 3.6 and above, but has not been actively maintained since 2020. For ongoing development and community support, the `pdfminer.six` fork is recommended.
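
To make "text location and layout information" more concrete, here is a minimal sketch that lists the bounding box of each text box on every page. It uses `PDFPageAggregator` and `LTTextBox` from the original PDFMiner package. The helper names `format_box` and `iter_text_boxes` are hypothetical and only serve this illustration. Coordinates are in PDF points with the origin at the bottom-left of the page.

```python
def format_box(bbox, text):
    """Render one text box as 'x0,y0,x1,y1: text' (hypothetical helper)."""
    x0, y0, x1, y1 = bbox
    return f"{x0:.1f},{y0:.1f},{x1:.1f},{y1:.1f}: {text.strip()}"


def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for each text box found by layout analysis."""
    # Imported lazily so this module still loads when pdfminer is not installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    rsrcmgr = PDFResourceManager()
    # The aggregator collects layout objects per page instead of writing text.
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    with open(pdf_path, "rb") as fp:
        document = PDFDocument(PDFParser(fp))
        for page_number, page in enumerate(PDFPage.create_pages(document), start=1):
            interpreter.process_page(page)
            # get_result() returns the layout tree of the page just processed.
            for element in device.get_result():
                if isinstance(element, LTTextBox):
                    yield page_number, element.bbox, element.get_text()
```

A caller might loop over `iter_text_boxes("some.pdf")` and print `format_box(bbox, text)` for each item; the file name here is a placeholder.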

Quickstart

This quickstart extracts text from a PDF file using PDFMiner's core components. It sets up a resource manager, a text converter, and a page interpreter, then processes the document page by page. If `dummy.pdf` does not exist, the example tries to create it with `reportlab` (an optional, separately installed package) so that the code can run as a demonstration; without `reportlab` it returns an empty string. The verbose setup is typical of the original PDFMiner API, in contrast to the simplified `high_level` API found in `pdfminer.six`.

import os
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

def extract_text_from_pdf(pdf_path):
    # Demo convenience: if the file is missing, create a small placeholder PDF.
    # Real code should report the missing file instead of creating one.
    if not os.path.exists(pdf_path):
        print(f"PDF file not found at {pdf_path}. Creating a dummy PDF for demonstration.")
        try:
            from reportlab.pdfgen import canvas
            c = canvas.Canvas(pdf_path)
            c.drawString(100, 750, "Hello, PDFMiner!")
            c.drawString(100, 730, "This is a dummy PDF for testing.")
            c.save()
            print(f"Dummy PDF created at {pdf_path}")
        except ImportError:
            print("Please install reportlab (`pip install reportlab`) to create dummy PDF, or provide a real PDF.")
            return ""

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    
    with open(pdf_path, 'rb') as fp:
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
        text = retstr.getvalue()
    
    device.close()
    retstr.close()
    return text

if __name__ == '__main__':
    pdf_file = 'dummy.pdf'
    extracted_content = extract_text_from_pdf(pdf_file)
    print("\n--- Extracted Text ---")
    print(extracted_content)
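
The quickstart creates a placeholder file when the path is missing, and its own comments note that real code should handle that case differently. One possible approach is to fail early before handing the file to the parser. The sketch below uses only the standard library; the helper name `require_pdf` is hypothetical, and the check assumes the `%PDF-` header sits at the very start of the file, which may not hold for files with leading junk bytes.

```python
import os


def require_pdf(pdf_path):
    """Return pdf_path if it is an existing file that starts with the PDF header."""
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"No such PDF file: {pdf_path}")

    # PDF files normally begin with the ASCII bytes "%PDF-" followed by a version.
    with open(pdf_path, "rb") as fp:
        header = fp.read(5)
    if header != b"%PDF-":
        raise ValueError(f"Not a PDF (missing %PDF- header): {pdf_path}")

    return pdf_path
```

With this in place, a call such as `extract_text_from_pdf(require_pdf(path))` would raise a clear exception for a missing or non-PDF file instead of silently generating one.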
