pdftotext
pdftotext is a Python package for extracting text from PDF documents. It is built on Poppler, the PDF rendering library that also provides the `pdftotext` command-line utility, and offers a simple, efficient interface for text extraction. The current version is 3.0.0. Release cadence is moderate: major versions are rare, and most releases are minor bug fixes.
Common errors
- FileNotFoundError: [Errno 2] No such file or directory: 'pdftotext'
cause The underlying `pdftotext` command-line utility from Poppler is not installed or not in your system's PATH.
fix Install `poppler-utils` (Linux) or `poppler` (macOS) and make sure the `pdftotext` executable is reachable through your system's PATH.
- AttributeError: 'list' object has no attribute 'pages'
cause You are accessing `pdf.pages` with `pdftotext` library version 3.0.0 or higher. The `pages` attribute was removed.
fix Drop the `.pages` access. The `pdftotext.PDF` object itself is now directly iterable and indexable. For example, use `for page in pdf:` instead of `for page in pdf.pages:` and `pdf[0]` instead of `pdf.pages[0]`.
- UnicodeEncodeError: 'charmap' codec can't encode character...
cause `pdftotext` (especially v3.0.0+) generally handles UTF-8, but a non-UTF-8 system default encoding or a malformed PDF can still cause encoding errors when you print or write the extracted text.
fix Specify UTF-8 explicitly when writing to files: `with open('output.txt', 'w', encoding='utf-8') as f: f.write(text)`. For printing, make sure your terminal is configured for UTF-8.
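The encoding fix above can be sketched as a small helper. This is an illustration only, not part of the library: `save_pages_utf8` is a hypothetical name, and a plain list of strings stands in for a `pdftotext.PDF` object so the snippet runs without a PDF on hand.

```python
import tempfile
from pathlib import Path


def save_pages_utf8(pages, path):
    """Join page texts with blank lines and write them to `path` as UTF-8.

    `pages` is any iterable of strings (hypothetical stand-in for a PDF object).
    Returns the number of characters written.
    """
    text = "\n\n".join(pages)
    # An explicit encoding avoids falling back to the platform default (e.g. cp1252).
    Path(path).write_text(text, encoding="utf-8")
    return len(text)


# Sample pages containing characters that a 'charmap' codec cannot encode.
sample_pages = ["Café, page one", "日本語 page two"]
out_path = Path(tempfile.gettempdir()) / "pdftotext_utf8_demo.txt"

written = save_pages_utf8(sample_pages, out_path)
round_trip = out_path.read_text(encoding="utf-8")
out_path.unlink()  # remove the demo file
```

Reading the file back with the same explicit encoding is what keeps the round trip lossless; relying on the default encoding on either side can reintroduce the error.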
Warnings
- breaking The `pdf.pages` attribute was removed in version 3.0.0. The `pdftotext.PDF` object now behaves like a list of strings, where each string is the text of a page. Old code referencing `pdf.pages` will break.
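- gotcha This library depends on the Poppler PDF rendering library, which also provides the `pdftotext` command-line utility. You must install Poppler on your system for `pdftotext` to function. Because `pip install pdftotext` compiles a C++ extension against Poppler, the Poppler C++ development package (e.g., `libpoppler-cpp-dev` on Debian/Ubuntu, `poppler` on macOS) and a compiler are needed at install time.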
- gotcha Processing very large or complex PDF documents can be memory-intensive, as the library often loads the entire document into memory before extraction. This can lead to `MemoryError` or slow performance.
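One way to soften the memory warning above is to avoid building a single joined string and write pages out one at a time. The sketch below assumes only that the PDF object is iterable, as described on this page. `write_pages_incrementally` is a hypothetical helper, and an iterator of strings stands in for the PDF. It does not reduce whatever Poppler itself keeps in memory for the document.

```python
import io


def write_pages_incrementally(pages, stream, separator="\n\n"):
    """Write each page to `stream` as it is produced; return the page count.

    Only one page's text is held by this function at a time, instead of
    the whole document as with "\n\n".join(pages).
    """
    count = 0
    for count, page_text in enumerate(pages, start=1):
        stream.write(page_text)
        stream.write(separator)
    return count


# Stand-in for iterating a PDF: any iterator of page strings works.
buffer = io.StringIO()
n = write_pages_incrementally(iter(["page one", "page two", "page three"]), buffer)
print(f"wrote {n} pages")
```

In real use `stream` would typically be a file opened with `encoding="utf-8"`, so the output never has to exist as one in-memory string.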
Install
- pip install pdftotext
- sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python3-dev # Debian/Ubuntu
  sudo dnf install gcc-c++ pkgconfig poppler-cpp-devel python3-devel # Fedora
  brew install pkg-config poppler # macOS (Homebrew)
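After installing, a quick sanity check can confirm what is actually present on the machine. The sketch below uses only the standard library. `check_environment` and its dictionary keys are hypothetical names chosen for illustration; it reports whether the Python package can be found and whether a `pdftotext` executable is on PATH.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether the Python package and the Poppler CLI can be located."""
    return {
        # True if `import pdftotext` would find a module (does not import it).
        "python_package": importlib.util.find_spec("pdftotext") is not None,
        # True if a `pdftotext` executable is reachable through PATH.
        "cli_on_path": shutil.which("pdftotext") is not None,
    }


status = check_environment()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" result for `python_package` usually means the `pip install` step failed or ran in a different environment than the one being checked.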
Imports
- pdftotext
import pdftotext
Quickstart
import pdftotext
import os
# Create a dummy PDF file for demonstration
dummy_pdf_content = b"%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R>>endobj 4 0 obj<</Length 44>>stream\nBT /F1 24 Tf 100 700 Td (Hello, pdftotext!) Tj ET\nendstream\nendobj\nxref\n0 5\n0000000000 65535 f\n0000000009 00000 n\n0000000055 00000 n\n0000000109 00000 n\n0000000216 00000 n\ntrailer<</Size 5/Root 1 0 R>>startxref 303\n%%EOF"
with open("dummy.pdf", "wb") as f:
    f.write(dummy_pdf_content)

# Load your PDF file
try:
    with open("dummy.pdf", "rb") as f:
        pdf = pdftotext.PDF(f)

    # Get all text from the document (each element is a page)
    full_text = "\n\n".join(pdf)
    print("--- Full PDF Text ---")
    print(full_text)

    # Get text from a specific page (e.g., the first page)
    if len(pdf) > 0:
        first_page_text = pdf[0]
        print("\n--- First Page Text ---")
        print(first_page_text)
    else:
        print("\nNo pages found in PDF.")
except pdftotext.Error as e:
    # Raised when Poppler cannot read the file (e.g., a corrupt or locked PDF)
    print(f"Error processing PDF: {e}")
finally:
    # Clean up the dummy file
    if os.path.exists("dummy.pdf"):
        os.remove("dummy.pdf")
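Because the quickstart treats the PDF object as a sequence of page strings, ordinary sequence code applies to it. As a small follow-on sketch, the hypothetical helper `pages_containing` below finds which pages mention a term. A list of strings stands in for the `pdftotext.PDF` object so the example runs without Poppler.

```python
def pages_containing(pages, needle):
    """Return 1-based page numbers whose text contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return [
        number
        for number, text in enumerate(pages, start=1)
        if needle in text.lower()
    ]


# Stand-in for a PDF object: one string per page.
sample = ["Hello, pdftotext!", "Nothing here", "hello again"]
hits = pages_containing(sample, "hello")
print(hits)
```

Page numbers are reported 1-based to match how readers refer to pages; subtract one before indexing back into the PDF object.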