PDFMiner.six

Version 20260107 | verified Tue May 12 | auth: no | python | install: verified

PDFMiner.six is a community-maintained fork of the original PDFMiner, a Python library for parsing and analyzing PDF documents. It extracts text, layout information, and other elements such as images, and it supports various PDF specifications, CJK languages, and encrypted documents. The library is actively maintained, with frequent releases that bring bug fixes, new features, and security fixes.

pip install pdfminer.six

Common errors

error ModuleNotFoundError: No module named 'pdfminer.high_level'

cause This error often occurs because the `pdfminer.six` library, which contains the `high_level` module, is either not installed, or there's a conflict with the older, unmaintained `pdfminer` package.

fix

Make sure `pdfminer.six` is installed, not the old `pdfminer`. If both are present, uninstall the old one with `pip uninstall pdfminer`. Then run `pip install pdfminer.six` or `pip install --upgrade pdfminer.six`. If you use a virtual environment, activate it first.
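To see which of the two distributions is actually installed in the active environment, a small check with the standard library's `importlib.metadata` can help. This is a minimal sketch; the helper name `installed_versions` is hypothetical and not part of pdfminer.six.

```python
from importlib import metadata


def installed_versions(names=("pdfminer.six", "pdfminer")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

If the old `pdfminer` entry reports a version, that distribution is a likely source of the conflict described above.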

error pdfminer.pdfdocument.PDFTextExtractionNotAllowed: Text extraction is not allowed

cause This error indicates that the PDF document is encrypted or has usage restrictions that prevent text extraction.

fix

If the document is password-protected, pass the correct password through the `password` argument of functions such as `extract_text`. If there is no password, lower-level functions such as `PDFPage.get_pages` accept `check_extractable=False`, which skips the permission check. Use this only when you are allowed to ignore the restrictions the document's author has set.
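The password route might look like the sketch below. The wrapper name `extract_with_password` is hypothetical; `extract_text`, `PDFPasswordIncorrect`, and `PDFTextExtractionNotAllowed` are the pdfminer.six names. The imports sit inside the function only so the sketch can be loaded where pdfminer.six is not installed yet.

```python
def extract_with_password(path, password=""):
    """Return the text of `path`, or None if pdfminer.six refuses to read it."""
    # Lazy imports: an assumption made for this sketch, not a library requirement.
    from pdfminer.high_level import extract_text
    from pdfminer.pdfdocument import (
        PDFPasswordIncorrect,
        PDFTextExtractionNotAllowed,
    )

    try:
        return extract_text(path, password=password)
    except PDFPasswordIncorrect:
        print("The password is wrong or missing.")
    except PDFTextExtractionNotAllowed:
        print("The document's permissions forbid text extraction.")
    return None
```

Other errors, such as a missing file, are deliberately left to propagate so that they are not mistaken for permission problems.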

error AttributeError: module 'pdfminer' has no attribute 'high_level'

cause This typically arises when code written for `pdfminer.six` attempts to use the `high_level` module, but an older `pdfminer` library (which does not have this module) is being imported or is shadowing the `pdfminer.six` installation.

fix

Verify that `pdfminer.six` is the only PDFMiner-related package installed and that it is reachable in your Python environment. Uninstall any older `pdfminer` installation (`pip uninstall pdfminer`) and install `pdfminer.six` (`pip install pdfminer.six`). Also import the submodule explicitly, for example `from pdfminer.high_level import extract_text`. A bare `import pdfminer` does not load `high_level`, so `pdfminer.high_level` can fail with this error even on a correct installation.

error (cid:x) values in textual output

cause This is a common issue where `pdfminer.six` cannot map a character ID (CID) to a Unicode character, often due to custom fonts, non-standard PDF encoding, or embedded fonts not providing sufficient information for proper decoding.

fix

This is often a limitation of the PDF itself. A quick check is to copy-paste the text from a PDF viewer; if it's gibberish, pdfminer.six likely won't do better. For programmatic solutions, one might need to apply custom character mapping or use OCR for such PDFs.
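To decide programmatically whether a document needs the OCR fallback, one option is to measure how much of the extracted text consists of `(cid:N)` tokens. The sketch below is an illustration only; the function name `cid_ratio` and any threshold you apply to its result are assumptions, not part of pdfminer.six.

```python
import re

# Matches the placeholder tokens described above, e.g. "(cid:42)".
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text):
    """Fraction of the text's units that are (cid:N) tokens.

    Each token counts as one unit; every other non-whitespace
    character also counts as one unit.
    """
    tokens = len(CID_TOKEN.findall(text))
    remainder = CID_TOKEN.sub("", text)
    readable = sum(1 for ch in remainder if not ch.isspace())
    total = tokens + readable
    return tokens / total if total else 0.0


if __name__ == "__main__":
    sample = "(cid:12)(cid:7) ok"
    print(cid_ratio(sample))  # 2 tokens, 2 readable characters -> 0.5
```

A high ratio suggests the font lacks a usable Unicode mapping and that the page is a candidate for OCR.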

error KeyError: 'N'

cause This `KeyError` (or similar for keys like 'Type', 'Resources', 'MediaBox') often indicates that the PDF document is malformed or does not strictly adhere to the PDF specification, missing expected dictionary keys that `pdfminer.six` anticipates.

fix

This typically points to an issue with a specific, non-standard PDF file. There isn't a universal code fix, but sometimes updating pdfminer.six to the latest version can resolve issues with certain malformed PDFs, as the library often adds robustness for such cases.

Warnings

breaking Arbitrary Code Execution Vulnerabilities (CVE-2025-64512 and CVE-2025-70559) due to insecure deserialization of CMap cache files via Python's `pickle` module. This allowed attackers to execute arbitrary code by providing malicious PDF files or pickle files. [cite: 2 (release notes 20251230), 11, 12]

fix Upgrade to version `20251230` or newer. This version replaces `pickle` with `json` for CMap storage. If you have custom `pickle` CMaps, you must convert them to JSON format using `tools/convert_cmaps_to_json.py` (included in the library). [cite: 2 (release notes 20251230), 11]
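A deployment script could check the installed release against the fixed version named above. This is a minimal sketch under one assumption: that pdfminer.six version strings begin with eight `YYYYMMDD` digits, as `20251230` does. The helper names are hypothetical.

```python
from importlib import metadata

PATCHED_RELEASE = "20251230"  # first fixed release, per the warning above


def is_patched(version, minimum=PATCHED_RELEASE):
    """Compare two date-style version strings by their leading YYYYMMDD digits."""
    # Assumption: both strings start with eight digits.
    return int(version[:8]) >= int(minimum[:8])


def installed_is_patched():
    """Return True or False for the installed pdfminer.six, or None if absent."""
    try:
        version = metadata.version("pdfminer.six")
    except metadata.PackageNotFoundError:
        return None
    return is_patched(version)


if __name__ == "__main__":
    print(installed_is_patched())
```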

deprecated The third argument (generation number) to `PDFObjRef` was deprecated.

fix Avoid using the third argument for `PDFObjRef` as it is no longer supported and can lead to `TypeError` with corrupt PDF object references.

gotcha Textual output may contain raw character IDs (e.g., `(cid:x)` values) instead of readable characters for certain PDFs, especially those with non-standard font encodings or missing font data. This happens when a character cannot be converted to Unicode.

fix Verify if text can be copy-pasted correctly from a PDF viewer; if it's gibberish there, `pdfminer.six` will also struggle. For scanned PDFs, combine `pdfminer.six` with an OCR library (e.g., `pytesseract`). Consider adjusting `LAParams` for layout analysis.

gotcha Processing very large PDF files can lead to significant memory consumption and performance issues.

fix For large PDFs, consider extracting text page by page or in chunks using the `page_numbers` argument in functions like `extract_text` to manage memory usage more effectively.
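Chunked extraction with `page_numbers` could be sketched as follows. The helper names `page_chunks` and `iter_text_chunks` and the default chunk size are assumptions for illustration; `extract_text` and `PDFPage.get_pages` are pdfminer.six calls, and `page_numbers` takes zero-based page indices. The imports sit inside the function only so the sketch can be loaded where pdfminer.six is not installed yet.

```python
def page_chunks(total_pages, chunk_size):
    """Yield lists of zero-based page indices, chunk_size pages at a time."""
    for start in range(0, total_pages, chunk_size):
        yield list(range(start, min(start + chunk_size, total_pages)))


def iter_text_chunks(path, chunk_size=10):
    """Yield the text of a PDF one chunk of pages at a time."""
    # Lazy imports: an assumption made for this sketch, not a library requirement.
    from pdfminer.high_level import extract_text
    from pdfminer.pdfpage import PDFPage

    # Count the pages first without running layout analysis on them.
    with open(path, "rb") as fp:
        total_pages = sum(1 for _ in PDFPage.get_pages(fp))

    for chunk in page_chunks(total_pages, chunk_size):
        # Only the text of the current chunk is held in memory.
        yield extract_text(path, page_numbers=chunk)
```

Each `extract_text` call reopens and reparses the file, so this trades some speed for a lower peak memory footprint.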

Install

pip install 'pdfminer.six[image]'

Install compatibility verified last tested: 2026-05-12

| python | os / libc | variant | status | wheel | install | import | disk |
|---|---|---|---|---|---|---|---|
| 3.10 | alpine (musl) | image | | wheel | - | 0.59s | 64.1M |
| 3.10 | alpine (musl) | image | | - | - | 0.55s | 63.0M |
| 3.10 | alpine (musl) | pdfminer.six | | wheel | - | 0.61s | 44.6M |
| 3.10 | alpine (musl) | pdfminer.six | | - | - | 0.57s | 43.5M |
| 3.10 | slim (glibc) | image | | wheel | 3.6s | 0.48s | 65M |
| 3.10 | slim (glibc) | image | | - | - | 0.51s | 64M |
| 3.10 | slim (glibc) | pdfminer.six | | wheel | 3.0s | 0.48s | 45M |
| 3.10 | slim (glibc) | pdfminer.six | | - | - | 0.53s | 44M |
| 3.11 | alpine (musl) | image | | wheel | - | 0.62s | 66.8M |
| 3.11 | alpine (musl) | image | | - | - | 0.68s | 65.7M |
| 3.11 | alpine (musl) | pdfminer.six | | wheel | - | 0.64s | 46.7M |
| 3.11 | alpine (musl) | pdfminer.six | | - | - | 0.68s | 45.7M |
| 3.11 | slim (glibc) | image | | wheel | 3.2s | 0.53s | 68M |
| 3.11 | slim (glibc) | image | | - | - | 0.53s | 67M |
| 3.11 | slim (glibc) | pdfminer.six | | wheel | 2.6s | 0.54s | 47M |
| 3.11 | slim (glibc) | pdfminer.six | | - | - | 0.54s | 46M |
| 3.12 | alpine (musl) | image | | wheel | - | 0.62s | 58.4M |
| 3.12 | alpine (musl) | image | | - | - | 0.62s | 57.4M |
| 3.12 | alpine (musl) | pdfminer.six | | wheel | - | 0.58s | 38.5M |
| 3.12 | alpine (musl) | pdfminer.six | | - | - | 0.70s | 37.4M |
| 3.12 | slim (glibc) | image | | wheel | 2.8s | 0.59s | 59M |
| 3.12 | slim (glibc) | image | | - | - | 0.60s | 58M |
| 3.12 | slim (glibc) | pdfminer.six | | wheel | 2.4s | 0.59s | 39M |
| 3.12 | slim (glibc) | pdfminer.six | | - | - | 0.63s | 38M |
| 3.13 | alpine (musl) | image | | wheel | - | 0.63s | 58.2M |
| 3.13 | alpine (musl) | image | | - | - | 0.59s | 57.0M |
| 3.13 | alpine (musl) | pdfminer.six | | wheel | - | 0.57s | 38.2M |
| 3.13 | alpine (musl) | pdfminer.six | | - | - | 0.63s | 37.0M |
| 3.13 | slim (glibc) | image | | wheel | 2.8s | 0.55s | 59M |
| 3.13 | slim (glibc) | image | | - | - | 0.59s | 58M |
| 3.13 | slim (glibc) | pdfminer.six | | wheel | 2.4s | 0.56s | 39M |
| 3.13 | slim (glibc) | pdfminer.six | | - | - | 0.63s | 37M |
| 3.9 | alpine (musl) | image | | wheel | - | 0.38s | 60.5M |
| 3.9 | alpine (musl) | image | | - | - | 0.41s | 59.5M |
| 3.9 | alpine (musl) | pdfminer.six | | wheel | - | 0.39s | 43.1M |
| 3.9 | alpine (musl) | pdfminer.six | | - | - | 0.42s | 42.1M |
| 3.9 | slim (glibc) | image | | wheel | 4.3s | 0.33s | 62M |
| 3.9 | slim (glibc) | image | | - | - | 0.33s | 60M |
| 3.9 | slim (glibc) | pdfminer.six | | wheel | 3.5s | 0.31s | 44M |
| 3.9 | slim (glibc) | pdfminer.six | | - | - | 0.35s | 42M |

Imports

extract_text
```
from pdfminer.high_level import extract_text
```
This is the recommended high-level API for simple text extraction.

PDFParser, PDFDocument, PDFResourceManager, PDFPageInterpreter, TextConverter, LAParams, PDFPage
```
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
```

These are components of the composable (lower-level) API for more granular control over PDF processing.
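As an illustration of how these components fit together, the sketch below wires them into a text-extraction pipeline, following the pattern of the pdfminer.six tutorial. The function name `composable_extract` is hypothetical, and the imports sit inside the function only so the sketch can be loaded where pdfminer.six is not installed yet.

```python
from io import StringIO


def composable_extract(path, password=""):
    """Extract all text from `path` with the lower-level API."""
    # Lazy imports: an assumption made for this sketch, not a library requirement.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output = StringIO()
    with open(path, "rb") as fp:
        parser = PDFParser(fp)                      # tokenizes the raw file
        document = PDFDocument(parser, password=password)
        resources = PDFResourceManager()            # caches fonts and other shared resources
        device = TextConverter(resources, output, laparams=LAParams())
        interpreter = PDFPageInterpreter(resources, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)          # writes the page's text to `output`
    return output.getvalue()
```

Compared with `extract_text`, this form lets you swap the converter, tune `LAParams`, or stop after a given page.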

Quickstart last tested: 2026-04-24

This quickstart demonstrates the simplest way to extract all text from a PDF file using the high-level `extract_text` function. The example includes creating a dummy PDF for demonstration purposes.

```
import os

from pdfminer.high_level import extract_text

# For demonstration, use a dummy PDF file path.
# In a real scenario, this would be the path to your .pdf file.
dummy_pdf_path = "example.pdf"

try:
    # Write a minimal hand-written PDF (it declares no font resources) so the
    # example has a file to read. Replace this with real PDF handling.
    with open(dummy_pdf_path, 'wb') as f:
        f.write(b'%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R>>endobj 4 0 obj<</Length 41>>stream\nBT /F1 24 Tf 100 700 Td (Hello, PDFMiner.six!) Tj ET\nendstream\nxref\n0 5\n0000000000 65535 f\n0000000009 00000 n\n0000000056 00000 n\n0000000114 00000 n\n0000000213 00000 n\ntrailer<</Size 5/Root 1 0 R>>startxref\n296\n%%EOF')

    # Extract text from the PDF
    text = extract_text(dummy_pdf_path)
    print("Extracted Text:")
    print(text)

except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up the dummy PDF file
    if os.path.exists(dummy_pdf_path):
        os.remove(dummy_pdf_path)
```