PDFPlumber

0.11.9 · active · verified Sun Mar 29

PDFPlumber is a Python library for extracting text, tables, and detailed layout information from PDF documents. Built on `pdfminer.six`, it exposes the individual objects on each page (characters, lines, rectangles, and curves) and includes visual debugging tools. The library is currently at version 0.11.9 and is actively developed, with frequent releases and updates to its core dependencies.

Warnings

Install

Imports

Quickstart

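This quickstart opens a PDF, extracts the text of its first page, and looks for tables on that page. The optional `to_image()` visual debugging step renders pages through `pypdfium2`, which recent pdfplumber releases install as a dependency, so a separate Poppler install should not be needed. Replace 'dummy.pdf' with the path to a real PDF file, or set the `PDFPLUMBER_DEMO_PDF` environment variable.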

import pdfplumber
import os

# Path to the PDF to inspect. Set PDFPLUMBER_DEMO_PDF or replace 'dummy.pdf'
# with a real file; this script does not create one.
pdf_path = os.environ.get('PDFPLUMBER_DEMO_PDF', 'dummy.pdf')

try:

    # Example of how to use pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        print(f"Number of pages: {len(pdf.pages)}")
        first_page = pdf.pages[0]
        print(f"Text from first page:\n{first_page.extract_text()}")

        # Extract tables from the first page
        tables = first_page.extract_tables()
        if tables:
            print(f"\nTables found on first page (first table):\n{tables[0]}")
        else:
            print("\nNo tables found on the first page.")

        # Optional: visual debugging (rendering uses pypdfium2, installed with pdfplumber)
        # im = first_page.to_image()
        # im.draw_rects(first_page.chars)
        # im.save("first_page_debug.png")

except FileNotFoundError:
    print(f"Error: PDF file '{pdf_path}' not found. Please provide a valid PDF for the quickstart.")
except Exception as e:
    print(f"An error occurred: {e}")
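Each table returned by `extract_tables()` is a list of rows, and each row is a list of cell strings, with `None` for empty cells. A minimal sketch of turning one such table into records keyed by its header row follows; `table_to_records` and `first_table_records` are hypothetical helpers, and the sketch assumes the first row of the table is the header.

```python
def table_to_records(table):
    """Convert one extracted table (a list of rows) into a list of dicts.

    Assumes the first row holds the column names. Empty cells (None)
    become empty strings, and surrounding whitespace is stripped.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def first_table_records(pdf_path):
    """Hypothetical helper: records from the first table on page 1, if any."""
    import pdfplumber  # imported lazily so table_to_records works without it

    with pdfplumber.open(pdf_path) as pdf:
        tables = pdf.pages[0].extract_tables()
    return table_to_records(tables[0]) if tables else []
```

Tables without a header row, or with merged cells, would need different handling; this only covers the simple rectangular case.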
