pdfminer2

20151206 verified Fri May 01 auth: no python deprecated

A fork of PDFMiner for Python 3. Provides tools for extracting text, images, and metadata from PDF files. Version 20151206 is the last release; the project is largely superseded by pdfminer.six.

pip install pdfminer2
error AttributeError: 'PDFDocument' object has no attribute 'initialize'
cause The PDFDocument class does not have an 'initialize' method; the constructor handles initialization directly.
fix
Remove the call to .initialize(). Instantiate PDFDocument(parser) directly.
error ModuleNotFoundError: No module named 'pdfminer'
cause pdfminer2 is not installed or the import path is wrong.
fix
Install with 'pip install pdfminer2' in the environment that runs your code. The package is imported as 'pdfminer' (not 'pdfminer2'), as shown in the quickstart.
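To tell "not installed" apart from "installed, but a different fork", the import machinery can be queried without importing anything from the library by name. The following is a minimal sketch using only the standard library; pdfminer_status is a hypothetical helper name, and treating pdfminer.high_level as a marker for newer forks such as pdfminer.six is an assumption based on the migration notes on this page.

```python
from importlib.util import find_spec


def pdfminer_status():
    """Report which parts of the 'pdfminer' package can be located."""
    # find_spec returns None when the top-level package is not installed.
    if find_spec("pdfminer") is None:
        return "missing"
    # Looking up a submodule imports the parent package as a side effect.
    if find_spec("pdfminer.high_level") is None:
        return "low-level only"
    return "high-level available"


print(pdfminer_status())
```

A result of "missing" in an environment where pip reports the package as installed usually means the script is running under a different interpreter than the one pip installed into.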
deprecated pdfminer2 is deprecated. Use pdfminer.six for active maintenance and Python 3 support.
fix Replace pdfminer2 with pdfminer.six (pip install pdfminer.six) and update imports to pdfminer.high_level.
breaking pdfminer2 and newer forks like pdfminer.six both install a top-level package named 'pdfminer', but the modules and functions inside it differ. Code that works on one may break on the other.
fix If migrating from pdfminer2 to pdfminer.six, use 'from pdfminer.high_level import extract_text' for simpler extraction.
gotcha The password is passed to the PDFDocument constructor, not to a separate initialize() call (PDFDocument has no such method). A wrong password makes the constructor raise an exception.
fix Wrap PDFDocument(parser, password=...) in try-except or provide the correct password.
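Since PDFDocument has no initialize() method (see the AttributeError entry above), password failures can be handled around the constructor call itself. This is a sketch under stated assumptions: open_document is a hypothetical helper, and it assumes pdfminer2 exposes PDFPasswordIncorrect in pdfminer.pdfdocument and accepts a password keyword, as upstream PDFMiner does. The imports are deferred into the function so the snippet loads even where the library is absent.

```python
def open_document(fh, password=""):
    """Return a PDFDocument for the open binary file 'fh', or None on a bad password."""
    # Assumption: these names exist in pdfminer2 as they do in upstream PDFMiner.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument, PDFPasswordIncorrect

    parser = PDFParser(fh)
    try:
        # The password is supplied to the constructor; there is no initialize().
        return PDFDocument(parser, password=password)
    except PDFPasswordIncorrect:
        # Signal the failure to the caller instead of crashing.
        return None
```

The file handle must stay open for as long as the returned document is in use, so the helper is meant to be called inside a with open(..., 'rb') block like the one in the quickstart.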

Extract text from a PDF file using pdfminer2.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

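# Open the PDF in binary mode; the file must stay open while pages are read
with open('sample.pdf', 'rb') as fh:
    parser = PDFParser(fh)
    # The constructor initializes the document; there is no initialize() call
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # In-memory buffer that collects the extracted text
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Render each page through the text converter
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
    # Read the buffer before closing it
    text = retstr.getvalue()
    device.close()
    retstr.close()
    print(text)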
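For comparison, the migration suggested in the deprecation note replaces the whole pipeline above with one call. This sketch assumes pdfminer.six is installed in place of pdfminer2 (both provide the 'pdfminer' package, so they should not be installed side by side); extract_text_with_six is a hypothetical wrapper name, and the import is deferred so the snippet loads even where pdfminer.six is absent.

```python
def extract_text_with_six(path):
    """Return the text of the PDF at 'path' using pdfminer.six's high-level API."""
    # pdfminer.high_level.extract_text is provided by pdfminer.six.
    from pdfminer.high_level import extract_text

    return extract_text(path)
```

Calling extract_text_with_six('sample.pdf') would then stand in for the parser, resource manager, converter, and interpreter setup shown in the quickstart.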