PDFMiner.six
PDFMiner.six is a community-maintained fork of the original PDFMiner, a Python library for parsing and analyzing PDF documents. It focuses on extracting text, layout information, and other elements such as images, and supports various PDF specifications, CJK languages, and encryption. The library is actively maintained, with frequent releases that bring bug fixes, new features, and security fixes.
Common errors
-
ModuleNotFoundError: No module named 'pdfminer.high_level'
cause The `pdfminer.six` library, which contains the `high_level` module, is either not installed or conflicts with the older, unmaintained `pdfminer` package.
fix Make sure `pdfminer.six` is installed and the old `pdfminer` is not. If both are present, uninstall the old one, then run `pip install pdfminer.six` or `pip install --upgrade pdfminer.six`. Both distributions install a top-level `pdfminer` package, so reinstalling `pdfminer.six` after removing the old one (`pip install --force-reinstall pdfminer.six`) restores any files the uninstall removed. If you use a virtual environment, activate it first.
-
pdfminer.pdfdocument.PDFTextExtractionNotAllowed: Text extraction is not allowed
cause The PDF document is encrypted or carries usage restrictions that prevent text extraction.
fix If the document has a password, pass it through the `password` argument of functions such as `extract_text`. If there is no password, you may be able to bypass the check by setting `check_extractable=False` in lower-level calls such as `PDFPage.get_pages`, though this overrides the document's restrictions and is not always advisable.
-
AttributeError: module 'pdfminer' has no attribute 'high_level'
cause This typically arises when code written for `pdfminer.six` attempts to use the `high_level` module, but an older `pdfminer` library (which does not have this module) is being imported or is shadowing the `pdfminer.six` installation.
fix Verify that `pdfminer.six` is the only PDFMiner-related package installed and is accessible in your Python environment. Uninstall any older `pdfminer` installations (e.g., `pip uninstall pdfminer`) and ensure `pdfminer.six` is properly installed (`pip install pdfminer.six`). Also, ensure you are importing `from pdfminer.high_level import extract_text` or similar.
-
(cid:x) values in textual output
cause This is a common issue where `pdfminer.six` cannot map a character ID (CID) to a Unicode character, often due to custom fonts, non-standard PDF encoding, or embedded fonts not providing sufficient information for proper decoding.
fix This is often a limitation of the PDF itself. A quick check is to copy-paste the text from a PDF viewer; if it's gibberish, `pdfminer.six` likely won't do better. For programmatic solutions, one might need to apply custom character mapping or use OCR for such PDFs.
-
KeyError: 'N'
cause This `KeyError` (or similar for keys like 'Type', 'Resources', 'MediaBox') often indicates that the PDF document is malformed or does not strictly adhere to the PDF specification, missing expected dictionary keys that `pdfminer.six` anticipates.
fix This typically points to an issue with a specific, non-standard PDF file. There isn't a universal code fix, but sometimes updating `pdfminer.six` to the latest version can resolve issues with certain malformed PDFs, as the library often adds robustness for such cases.
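The `(cid:x)` case above can be detected programmatically before deciding whether a PDF needs OCR or a custom mapping. The following is a minimal sketch, not part of `pdfminer.six`: the names `CID_PATTERN`, `unmapped_cids`, and `cid_ratio` are hypothetical helpers that only inspect text that has already been extracted.

```python
import re

# Matches the placeholder emitted for an unmapped glyph, e.g. "(cid:42)".
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def unmapped_cids(text):
    """Return the sorted, de-duplicated character IDs that appear as placeholders."""
    return sorted({int(match.group(1)) for match in CID_PATTERN.finditer(text)})


def cid_ratio(text):
    """Return the fraction of characters in `text` taken up by (cid:x) placeholders."""
    if not text:
        return 0.0
    placeholder_chars = sum(len(match.group(0)) for match in CID_PATTERN.finditer(text))
    return placeholder_chars / len(text)


if __name__ == "__main__":
    sample = "Invoice (cid:3)(cid:17) total (cid:3)"
    print(unmapped_cids(sample))
    print(round(cid_ratio(sample), 2))
```

A caller could compare `cid_ratio(text)` against a threshold of its own choosing and route heavily affected documents to OCR; the right threshold depends on the documents and is left as an assumption.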
Warnings
- breaking Arbitrary code execution vulnerabilities (CVE-2025-64512 and CVE-2025-70559) caused by insecure deserialization of CMap cache files through Python's `pickle` module. They allowed attackers to execute arbitrary code by supplying malicious PDF files or pickle files (see the 20251230 release notes).
- deprecated The third argument (generation number) to `PDFObjRef` was deprecated.
- gotcha Textual output may contain raw character IDs (e.g., `(cid:x)` values) instead of readable characters for certain PDFs, especially those with non-standard font encodings or missing font data. This happens when text cannot be converted to Unicode.
- gotcha Processing very large PDF files can lead to significant memory consumption and performance issues.
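For large files, one way to keep the extracted text from accumulating in a single string is to consume the document one page at a time. The sketch below assumes the `extract_pages` generator from `pdfminer.high_level` and the `LTTextContainer` layout class; `iter_page_text` is a hypothetical helper name, and how much memory this saves depends on the PDF, since the document structure is still parsed.

```python
def iter_page_text(pdf_path, password=""):
    """Yield (page_id, text) one page at a time instead of one big string."""
    # Imported lazily (an assumption of this sketch) so the helper can be
    # defined even in an environment where pdfminer.six is not installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages(pdf_path, password=password):
        # Keep only layout elements that carry text (boxes and lines).
        parts = [
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        ]
        yield page_layout.pageid, "".join(parts)


if __name__ == "__main__":
    import sys

    # Usage: python script.py some.pdf
    for page_id, text in iter_page_text(sys.argv[1]):
        print(f"--- page {page_id}: {len(text)} characters ---")
```

Because the function is a generator, each page's text can be written to disk or a database and then discarded before the next page is processed.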
Install
-
pip install pdfminer.six
-
pip install 'pdfminer.six[image]'
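After installing, it can help to confirm which PDFMiner distribution is actually present, since the legacy `pdfminer` package and `pdfminer.six` share the same import name. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8+); the function name `installed_pdfminer_distributions` is hypothetical.

```python
from importlib import metadata


def installed_pdfminer_distributions():
    """Return {distribution name: version or None} for both PDFMiner packages."""
    found = {}
    for name in ("pdfminer", "pdfminer.six"):
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None  # this distribution is not installed
    return found


if __name__ == "__main__":
    for name, version in installed_pdfminer_distributions().items():
        print(f"{name}: {version or 'not installed'}")
```

If the legacy `pdfminer` entry reports a version, that is a likely source of the import errors described under Common errors.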
Imports
- extract_text
from pdfminer.high_level import extract_text
- PDFParser, PDFDocument, PDFResourceManager, PDFPageInterpreter, TextConverter, LAParams, PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
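The lower-level classes listed above are typically wired together as parser, document, resource manager, converter, and interpreter. The sketch below shows one such arrangement; `extract_text_low_level` is a hypothetical wrapper name, and the lazy imports are an assumption made so the helper can be defined before `pdfminer.six` is confirmed to be installed.

```python
import io


def extract_text_low_level(pdf_path, password=""):
    """Extract text page by page with the lower-level pdfminer.six components."""
    # Lazy imports: an ImportError surfaces when the function is called,
    # not when this module is loaded.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output = io.StringIO()
    resource_manager = PDFResourceManager()
    with open(pdf_path, "rb") as fp:
        # The parser reads raw PDF syntax; the document resolves objects and decryption.
        document = PDFDocument(PDFParser(fp), password=password)
        # The converter receives rendered text and writes it to `output`.
        device = TextConverter(resource_manager, output, laparams=LAParams())
        interpreter = PDFPageInterpreter(resource_manager, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
        device.close()
    return output.getvalue()
```

For most cases `extract_text` from `pdfminer.high_level` is simpler; the explicit pipeline is mainly useful when you need to swap the converter or tune `LAParams` per document.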
Quickstart
import os
from pdfminer.high_level import extract_text

# Throwaway PDF written only so the example is self-contained.
# In a real application, point this at your own .pdf file instead.
dummy_pdf_path = "example.pdf"

# Object bodies in object-number order: catalog, page tree, page, content stream, font.
stream = b"BT /F1 24 Tf 100 700 Td (Hello, PDFMiner.six!) Tj ET"
objects = [
    b"<</Type/Catalog/Pages 2 0 R>>",
    b"<</Type/Pages/Count 1/Kids[3 0 R]>>",
    b"<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R"
    b"/Resources<</Font<</F1 5 0 R>>>>>>",
    b"<</Length %d>>\nstream\n%s\nendstream" % (len(stream), stream),
    b"<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>",
]

# Assemble the file, recording each object's byte offset so the xref table is valid.
pdf = b"%PDF-1.4\n"
offsets = []
for number, body in enumerate(objects, start=1):
    offsets.append(len(pdf))
    pdf += b"%d 0 obj\n%s\nendobj\n" % (number, body)
xref_pos = len(pdf)
pdf += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
pdf += b"".join(b"%010d 00000 n \n" % offset for offset in offsets)
pdf += b"trailer\n<</Size %d/Root 1 0 R>>\nstartxref\n%d\n%%%%EOF\n" % (len(objects) + 1, xref_pos)

try:
    with open(dummy_pdf_path, "wb") as f:
        f.write(pdf)
    # Extract text from the PDF
    text = extract_text(dummy_pdf_path)
    print("Extracted Text:")
    print(text)
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up the dummy PDF file
    if os.path.exists(dummy_pdf_path):
        os.remove(dummy_pdf_path)