PDFMiner.six
PDFMiner.six is a community-maintained fork of the original PDFMiner, a Python library for parsing and analyzing PDF documents. It focuses on extracting text and layout information, can also extract other elements such as images, and supports various PDF specifications, CJK languages, and encryption. The library is actively maintained, with frequent releases that deliver bug fixes, new features, and security fixes.
Warnings
- breaking Arbitrary code execution vulnerabilities (CVE-2025-64512 and CVE-2025-70559) were caused by insecure deserialization of CMap cache files via Python's `pickle` module. An attacker could execute arbitrary code by supplying a malicious PDF file or pickle file. [cite: 2 (release notes 20251230), 11, 12]
- deprecated The third argument (generation number) to `PDFObjRef` was deprecated.
- gotcha Text output may contain raw character IDs (e.g., `(cid:x)` values) instead of readable characters for certain PDFs, especially those with non-standard font encodings or missing font data. This happens when a character cannot be mapped to Unicode.
- gotcha Processing very large PDF files can lead to significant memory consumption and performance issues.
Install
- pip install pdfminer.six
- pip install 'pdfminer.six[image]'
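Given the security warning above, it can be useful to confirm which release is actually installed. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, and the version to compare against should be taken from the project's release notes rather than from this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "pdfminer.six") -> Optional[str]:
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "pdfminer.six is not installed")
```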
Imports
- extract_text
from pdfminer.high_level import extract_text
- PDFParser, PDFDocument, PDFResourceManager, PDFPageInterpreter, TextConverter, LAParams, PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
Quickstart
import os
from pdfminer.high_level import extract_text

# For demonstration, create a dummy PDF at this path.
# In a real scenario, this would be the path to your .pdf file.
dummy_pdf_path = "example.pdf"

try:
    # Write a minimal hand-written PDF so the example can run on its own.
    # It is deliberately bare (no font resources are declared), so the
    # extracted text may be empty or imperfect; use a real PDF in practice.
    with open(dummy_pdf_path, 'wb') as f:
        f.write(b'%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R>>endobj 4 0 obj<</Length 41>>stream\nBT /F1 24 Tf 100 700 Td (Hello, PDFMiner.six!) Tj ET\nendstream\nxref\n0 5\n0000000000 65535 f\n0000000009 00000 n\n0000000056 00000 n\n0000000114 00000 n\n0000000213 00000 n\ntrailer<</Size 5/Root 1 0 R>>startxref\n296\n%%EOF')
    # Extract text from the PDF
    text = extract_text(dummy_pdf_path)
    print("Extracted Text:")
    print(text)
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up the dummy PDF file
    if os.path.exists(dummy_pdf_path):
        os.remove(dummy_pdf_path)