pdftotext

3.0.0 · active · verified Fri Apr 17

pdftotext is a Python package for extracting text from PDF documents. It is a C++ extension built on the Poppler PDF rendering library (the same library behind the `pdftotext` command-line utility), so it calls Poppler directly instead of running the command-line tool. The current version is 3.0.0. Releases are infrequent, and major versions are rarer than bug-fix releases.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart loads a PDF, extracts all of its text by joining the pages, and reads an individual page with list-like indexing. It also catches `pdftotext.Error`, which the library raises when Poppler cannot read the document (for example, a corrupt file or a wrong password).

import os

import pdftotext

# Build a minimal hand-written one-page PDF for demonstration.
# The page declares a Helvetica font resource (/F1); without it the
# text-showing operator has no font to use.
dummy_pdf_content = (
    b"%PDF-1.4\n"
    b"1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj\n"
    b"2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj\n"
    b"3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]"
    b"/Contents 4 0 R/Resources<</Font<</F1 5 0 R>>>>>>endobj\n"
    b"4 0 obj<</Length 49>>stream\n"
    b"BT /F1 24 Tf 100 700 Td (Hello, pdftotext!) Tj ET"
    b"\nendstream\nendobj\n"
    b"5 0 obj<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>endobj\n"
    b"xref\n0 6\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"0000000052 00000 n \n"
    b"0000000101 00000 n \n"
    b"0000000211 00000 n \n"
    b"0000000306 00000 n \n"
    b"trailer<</Size 6/Root 1 0 R>>\nstartxref\n367\n%%EOF\n"
)
with open("dummy.pdf", "wb") as f:
    f.write(dummy_pdf_content)

# Load the PDF file (replace "dummy.pdf" with your own file)
try:
    with open("dummy.pdf", "rb") as f:
        pdf = pdftotext.PDF(f)

    # Get all text from the document (each element of pdf is one page's text)
    full_text = "\n\n".join(pdf)
    print("--- Full PDF Text ---")
    print(full_text)

    # Get text from a specific page (index 0 is the first page)
    if len(pdf) > 0:
        first_page_text = pdf[0]
        print("\n--- First Page Text ---")
        print(first_page_text)
    else:
        print("\nNo pages found in PDF.")
except pdftotext.Error as e:
    # Raised when the document cannot be read, e.g. a corrupt or encrypted file
    print(f"Error processing PDF: {e}")
finally:
    # Clean up the dummy file
    if os.path.exists("dummy.pdf"):
        os.remove("dummy.pdf")
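Because the loaded document behaves like a sequence of page strings, ordinary Python sequence code works on it. The following is a minimal sketch of searching for a term page by page. The helper name `pages_containing` is hypothetical and not part of pdftotext, and the sample list stands in for a loaded `pdftotext.PDF` object so the snippet runs without a PDF file.

```python
from typing import List, Sequence


def pages_containing(pages: Sequence[str], needle: str) -> List[int]:
    """Return 1-based page numbers whose text contains needle (case-insensitive).

    Hypothetical helper, not part of pdftotext. It only assumes that `pages`
    can be iterated to yield one string per page.
    """
    needle = needle.lower()
    return [
        number
        for number, text in enumerate(pages, start=1)
        if needle in text.lower()
    ]


# Stand-in for a pdftotext.PDF object: a plain list of page strings.
sample_pages = ["Introduction\nHello, pdftotext!", "Appendix", "hello again"]
print(pages_containing(sample_pages, "hello"))  # prints [1, 3]
```

With a real document, the `pdf` object from the quickstart can be passed in place of `sample_pages`, since the helper relies only on iteration.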
