pdfminer2
20151206 · verified Fri May 01 · auth: no · python · deprecated
A fork of PDFMiner for Python 3. Provides tools for extracting text, images, and metadata from PDF files. Version 20151206 is the last release; the project is largely superseded by pdfminer.six.
pip install pdfminer2
Common errors
error AttributeError: 'PDFDocument' object has no attribute 'initialize'
cause The PDFDocument class does not have an 'initialize' method; the constructor handles initialization directly.
fix Remove the call to .initialize() and instantiate PDFDocument(parser) directly.
error ModuleNotFoundError: No module named 'pdfminer'
cause pdfminer2 is not installed in the active environment, or the import path is wrong. The package is installed as pdfminer2 but imported as pdfminer.
fix Install with 'pip install pdfminer2' and use the import paths shown in the Quickstart.
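A quick way to tell the two causes apart is to check whether the top-level module can be found at all. The sketch below uses only the standard library; the helper name is_importable is hypothetical and is not part of pdfminer2.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module named `module_name` is on sys.path."""
    # find_spec returns None when no top-level module with this name is found.
    return importlib.util.find_spec(module_name) is not None


# The distribution is installed as "pdfminer2" but imported as "pdfminer".
print("pdfminer importable:", is_importable("pdfminer"))
```

If this reports False, the package is missing from the active environment; if it reports True, the failing import path itself is the more likely culprit (see Imports).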
Warnings
deprecated pdfminer2 is deprecated. Use pdfminer.six for active maintenance and Python 3 support.
fix Replace pdfminer2 with pdfminer.six (pip install pdfminer.six). Where simple text extraction is enough, import from pdfminer.high_level.
breaking Import paths differ between pdfminer2 and newer forks like pdfminer.six. Code that works on one may break on the other.
fix If migrating from pdfminer2 to pdfminer.six, use 'from pdfminer.high_level import extract_text' for simpler extraction.
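gotcha The password is passed to the PDFDocument constructor; there is no separate doc.initialize() step (see Common errors). The constructor may not validate the password argument up front, so a wrong password can surface as an exception such as TypeError when the document is opened.
fix Wrap PDFDocument(parser, password=...) in try-except or provide the correct password.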
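The password gotcha above can be handled with a small guard around document construction. This is a minimal sketch, not the library's own method: open_pdf_document is a hypothetical helper, the password keyword on the constructor is an assumption for pdfminer2, and the broad except is used because the exact exception type is not confirmed here.

```python
import io


def open_pdf_document(fh, password=None):
    """Return a PDFDocument for the open binary file `fh`, or None on any failure."""
    try:
        # Imported lazily so a missing install is reported like any other failure.
        from pdfminer.pdfparser import PDFParser
        from pdfminer.pdfdocument import PDFDocument

        parser = PDFParser(fh)
        # Assumption: the constructor accepts a `password` keyword.
        # It is only passed when the caller supplies one.
        kwargs = {} if password is None else {"password": password}
        return PDFDocument(parser, **kwargs)
    except Exception as exc:  # exact exception type for a bad password is not confirmed
        print("Could not open document:", exc)
        return None


# Usage: input that is not a valid PDF yields None instead of an uncaught exception.
doc = open_pdf_document(io.BytesIO(b"not a pdf"))
```

Returning None keeps the caller's control flow simple; in production code it is usually better to catch the specific exception once it has been identified.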
Imports
- PDFParser
  wrong: from pdfminer.pdfinterp import PDFParser
  correct: from pdfminer.pdfparser import PDFParser
- PDFPageInterpreter
  wrong: from pdfminer.pdfinterpreter import PDFPageInterpreter
  correct: from pdfminer.pdfinterp import PDFPageInterpreter
Quickstart
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
# Open the PDF in binary mode; the file must stay open while pages are processed
with open('sample.pdf', 'rb') as fh:
    parser = PDFParser(fh)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # TextConverter writes the extracted text into retstr
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
print(text)
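For comparison with the low-level pipeline above, the pdfminer.six route mentioned under Warnings reduces to a single call. The sketch below checks at runtime whether pdfminer.high_level.extract_text is available instead of assuming which fork is installed; have_high_level_api and extract_text_from_pdf are hypothetical helper names.

```python
def have_high_level_api():
    """Return True if pdfminer.high_level.extract_text can be imported."""
    try:
        from pdfminer.high_level import extract_text  # noqa: F401
    except ImportError:
        # Covers both a missing package and a fork without this function.
        return False
    return True


def extract_text_from_pdf(path):
    """Extract all text from the PDF at `path` via the high-level API."""
    if not have_high_level_api():
        raise RuntimeError(
            "pdfminer.high_level.extract_text is unavailable; "
            "use the low-level quickstart pipeline instead"
        )
    from pdfminer.high_level import extract_text
    return extract_text(path)


print("high-level API available:", have_high_level_api())
```

With pdfminer.six installed, extract_text_from_pdf('sample.pdf') would return the document text as a string.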