PDFMiner
PDFMiner is a Python library for extracting and analyzing text from PDF documents, with a focus on precise text location and layout information. Version `20191125` is the last release of the original `euske/pdfminer` project. It supports Python 3.6 and above, but has not been actively maintained since 2020. For ongoing development and community support, the `pdfminer.six` fork is recommended.
Common errors
-
AttributeError: '_io.BytesIO' object has no attribute 'catalog'
cause This error typically occurs when a file-like object (such as `io.BytesIO`) is passed directly to `PDFPage.create_pages()` or a similar function, while the API expects a `PDFDocument` object that has already been parsed by a `PDFParser`. This indicates incorrect API usage.
fix Parse the file first with `PDFParser` to create a `PDFDocument` instance, then pass the `PDFDocument` object to `PDFPage.create_pages()`. Refer to the quickstart example for the correct API flow.
-
ModuleNotFoundError: No module named 'pdfminer.six'
cause `pdfminer.six` is the name of the pip distribution, not of a Python module: the fork installs its code under the top-level package `pdfminer`, so `import pdfminer.six` fails whether or not the fork is installed. A related failure occurs when the original `pdfminer` package is installed but the code uses modules or `high_level` functions specific to the `pdfminer.six` fork.
fix If you intend to use `pdfminer.six` (recommended), install it with `pip install pdfminer.six` and import from `pdfminer` (for example, `pdfminer.high_level`). If you're sticking to the original `pdfminer`, use its own import paths and API patterns; it does not expose a `pdfminer.high_level` module.
-
UnicodeEncodeError: 'charmap' codec can't encode character...
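cause Encoding issues are common when handling diverse text content in PDFs, especially across different operating systems or locales, or when writing to files without specifying the correct encoding. The `'charmap'` codec in the message typically points to a legacy code page such as the Windows default.
fix Always pass `encoding='utf-8'` to `open()` when creating output files if you expect Unicode characters. An `io.StringIO` buffer holds `str` and takes no encoding argument, so the error only appears once its contents are written to a file or console that uses a narrower codec. For `TextConverter`, ensure that a file-backed `outfp` (output file pointer) is opened with `encoding='utf-8'`, or handle character sets explicitly.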
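The fix for the `UnicodeEncodeError` above can be sketched as follows. This is a minimal illustration using only the standard library; the helper name `save_text_utf8` is hypothetical and not part of PDFMiner.

```python
import os
import tempfile


def save_text_utf8(text, out_path):
    """Write text to out_path as UTF-8, independent of the system locale."""
    # Without encoding="utf-8", open() falls back to the locale's default
    # codec, which may be unable to represent some characters.
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write(text)


if __name__ == "__main__":
    # Stand-in for text returned by an extraction call.
    sample = "R\u00e9sum\u00e9 \u2013 na\u00efve caf\u00e9 \u2713"
    path = os.path.join(tempfile.gettempdir(), "pdfminer_sample.txt")
    save_text_utf8(sample, path)
    with open(path, encoding="utf-8") as in_file:
        print(in_file.read() == sample)
```

Reading the file back with the same explicit encoding avoids the mirror-image `UnicodeDecodeError`.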
Warnings
- breaking The original `pdfminer` project (euske/pdfminer) has not been actively maintained since 2020. The latest version, `20191125`, supports Python 3, but new features, bug fixes, and community support are found primarily in its actively maintained fork, `pdfminer.six`.
- gotcha PDFMiner struggles with text extraction from PDFs with complex layouts (e.g., multi-column, nested tables) and cannot extract text from scanned PDFs (images) without external Optical Character Recognition (OCR) tools.
- gotcha Output may contain raw character IDs like `(cid:x)` instead of readable text, especially with non-standard or custom-encoded fonts. This happens when a font's character IDs cannot be mapped to Unicode.
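One way to notice the `(cid:x)` problem in extracted text is to count the placeholders with a regular expression. The sketch below assumes the markers have the form `(cid:<digits>)`; the names `CID_PATTERN`, `count_cid_markers`, and `strip_cid_markers` are hypothetical helpers, not part of PDFMiner.

```python
import re

# Assumed marker format: "(cid:" followed by decimal digits and ")".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def count_cid_markers(text):
    """Return how many (cid:N) placeholders appear in text."""
    return len(CID_PATTERN.findall(text))


def strip_cid_markers(text, replacement="\ufffd"):
    """Replace each (cid:N) placeholder with a replacement string."""
    return CID_PATTERN.sub(replacement, text)


if __name__ == "__main__":
    extracted = "Total: (cid:49)(cid:50) units"
    print(count_cid_markers(extracted))
    print(strip_cid_markers(extracted, "?"))
```

A high count relative to the text length suggests the PDF's fonts lack usable Unicode mappings, in which case OCR or another extraction tool may be worth trying.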
Install
-
pip install pdfminer
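Because the original project and the `pdfminer.six` fork both install a top-level package named `pdfminer`, it can be unclear which one an environment actually contains. A minimal sketch for checking, assuming Python 3.8 or later for `importlib.metadata`; the function name `installed_pdfminer_distributions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_pdfminer_distributions():
    """Return {distribution name: version} for each PDFMiner variant found."""
    found = {}
    for name in ("pdfminer", "pdfminer.six"):
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # This distribution is not installed in the current environment.
            pass
    return found


if __name__ == "__main__":
    print(installed_pdfminer_distributions())
```

If both names show up, the two distributions are sharing one package directory, and uninstalling one of them is usually the safer course.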
Imports
- PDFResourceManager
from pdfminer.pdfinterp import PDFResourceManager
- PDFPageInterpreter
from pdfminer.pdfinterp import PDFPageInterpreter
- PDFPage
from pdfminer.pdfpage import PDFPage
- PDFParser
from pdfminer.pdfparser import PDFParser
- PDFDocument
from pdfminer.pdfdocument import PDFDocument
- TextConverter
from pdfminer.converter import TextConverter
- LAParams
from pdfminer.layout import LAParams
Quickstart
import os
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser


def extract_text_from_pdf(pdf_path):
    # Ensure a dummy PDF exists for demonstration, or replace with a real path
    if not os.path.exists(pdf_path):
        print(f"Error: PDF file not found at {pdf_path}. Creating a dummy PDF for demonstration.")
        # In a real scenario, you'd handle the missing file appropriately.
        # For a runnable example, we'll create a simple dummy file.
        try:
            from reportlab.pdfgen import canvas
            c = canvas.Canvas(pdf_path)
            c.drawString(100, 750, "Hello, PDFMiner!")
            c.drawString(100, 730, "This is a dummy PDF for testing.")
            c.save()
            print(f"Dummy PDF created at {pdf_path}")
        except ImportError:
            print("Please install reportlab (`pip install reportlab`) to create dummy PDF, or provide a real PDF.")
            return ""

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    with open(pdf_path, 'rb') as fp:
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)

    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text


if __name__ == '__main__':
    pdf_file = 'dummy.pdf'
    extracted_content = extract_text_from_pdf(pdf_file)
    print("\n--- Extracted Text ---")
    print(extracted_content)