PDFPlumber
PDFPlumber is a Python library for high-precision extraction of text, tables, and detailed layout information from PDF documents. Built on `pdfminer.six`, it offers fine-grained control over PDF elements such as characters, lines, rectangles, and curves, and includes visual debugging tools. The library is currently at version 0.11.9; it is actively developed, with frequent releases that often update its core dependencies.
Warnings
- breaking In `v0.11.7`, `stroking_pattern` and `non_stroking_pattern` object attributes were removed due to underlying changes in `pdfminer.six`. Code relying on these attributes will break.
- breaking Version `v0.11.0` introduced new `line_dir` and `char_dir` parameters for better control over text directionality (e.g., text that does not run left-to-right, top-to-bottom). These parameters improve support for complex PDFs, but code that relied on the previous implicit direction handling may see subtly different text extraction results for some documents.
- gotcha `pdfplumber` is built on `pdfminer.six`, and updates to `pdfminer.six` can introduce breaking changes or altered behavior in `pdfplumber`. `pdfplumber` often pins `pdfminer.six` versions, but upgrading across a large `pdfminer.six` version jump can still change extraction results.
- gotcha When processing large PDF files or many documents, cached page and object properties can consume significant memory. Leaving PDF objects open can leak memory, so close them explicitly or open them in a `with` block.
- gotcha Image-based features like `Page.to_image()` (used for visual debugging) depend on a rendering backend. Current releases render pages with the `pypdfium2` package, which is installed as a Python dependency of `pdfplumber`, rather than with Poppler; older releases relied on ImageMagick and Ghostscript. If the rendering backend is unavailable, these methods will raise an error.
- breaking The table extraction algorithm in `pdfplumber` underwent a radical redesign in `v0.5.0`. This introduced significant breaking changes to the table extraction API and configuration, meaning code written for versions prior to `v0.5.0` will likely not work with newer versions.
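The memory warning above can be sketched as a small generator that handles one page at a time and releases that page's cache before moving to the next. This is an illustrative helper, not part of `pdfplumber`: the name `iter_page_text` is hypothetical, and it assumes each page object offers `page_number`, `extract_text()`, and `close()` (if your installed version lacks `Page.close()`, `flush_cache()` is the likely substitute).

```python
from typing import Iterator, Tuple


def iter_page_text(pdf, **extract_kwargs) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, text) one page at a time, releasing caches as we go.

    `pdf` is assumed to behave like the object returned by pdfplumber.open():
    it exposes `.pages`, and each page offers `.page_number`,
    `.extract_text()` and `.close()`.
    """
    for page in pdf.pages:
        try:
            # Normalize a missing/empty result to an empty string.
            text = page.extract_text(**extract_kwargs) or ""
            yield page.page_number, text
        finally:
            # Drop this page's cached layout data before the next page.
            page.close()
```

A typical call wraps the loop in `with pdfplumber.open(path) as pdf:` and iterates over `iter_page_text(pdf)`, so the underlying file is closed as well once the block exits.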
Install
- pdfplumber
pip install pdfplumber
- Poppler (optional system package)
sudo apt-get install poppler-utils  # Debian/Ubuntu
brew install poppler  # macOS
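Several of the warnings above depend on which release is installed, so it can help to check the version after installing. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `version_tuple` are hypothetical, and the simple parser assumes plain dotted version strings such as `0.11.9`.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "pdfplumber"):
    """Return the installed version string for `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_tuple(version_string: str):
    """Turn '0.11.9' into (0, 11, 9); non-numeric suffixes are ignored."""
    parts = []
    for piece in version_string.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


found = installed_version()
if found is None:
    print("pdfplumber is not installed")
elif version_tuple(found) >= (0, 11, 7):
    print(f"pdfplumber {found}: stroking_pattern attributes are not available")
else:
    print(f"pdfplumber {found}: predates the v0.11.7 attribute removal")
```

Comparing tuples rather than raw strings avoids the trap where `"0.5.0"` sorts after `"0.11.0"` as text.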
Imports
- pdfplumber
import pdfplumber
- open
from pdfplumber import open  # note: shadows the built-in open(); calling pdfplumber.open(...) avoids this
Quickstart
import os

import pdfplumber

# Path to the PDF to inspect: set PDFPLUMBER_DEMO_PDF or place 'dummy.pdf'
# in the working directory. This script does not create a sample file.
pdf_path = os.environ.get('PDFPLUMBER_DEMO_PDF', 'dummy.pdf')

try:
    with pdfplumber.open(pdf_path) as pdf:
        print(f"Number of pages: {len(pdf.pages)}")
        first_page = pdf.pages[0]
        print(f"Text from first page:\n{first_page.extract_text()}")

        # Extract tables from the first page
        tables = first_page.extract_tables()
        if tables:
            print(f"\nTables found on first page (first table):\n{tables[0]}")
        else:
            print("\nNo tables found on the first page.")

        # Optional: visual debugging (see the Page.to_image() note under Warnings)
        # im = first_page.to_image()
        # im.draw_rects(first_page.chars)
        # im.save("first_page_debug.png")
except FileNotFoundError:
    print(f"Error: PDF file '{pdf_path}' not found. Please provide a valid PDF for the quickstart.")
except Exception as e:
    print(f"An error occurred: {e}")
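The quickstart prints the first table as a raw list of rows. As a follow-up, here is a sketch of turning one such table into dictionaries keyed by column name. It assumes the table is a list of rows whose cells are strings or `None`, and that the first row is a header; neither assumption holds for every PDF. The helper name `table_to_records` is hypothetical and not part of `pdfplumber`.

```python
from typing import Dict, List, Optional

Table = List[List[Optional[str]]]


def table_to_records(table: Table) -> List[Dict[str, str]]:
    """Convert one extracted table into a list of dicts keyed by header cell."""
    if not table:
        return []
    header, *rows = table
    # Empty header cells get positional names so no column is dropped.
    names = [(cell or "").strip() or f"column_{i}" for i, cell in enumerate(header)]
    records = []
    for row in rows:
        # Treat missing cells as empty strings and trim stray whitespace.
        cleaned = [(cell or "").strip() for cell in row]
        # Pad short rows so zip() does not silently drop trailing columns.
        cleaned += [""] * (len(names) - len(cleaned))
        records.append(dict(zip(names, cleaned)))
    return records


# Hand-written sample in the assumed list-of-rows shape.
sample = [["Name", "Qty", None], ["Widget", " 3 ", None], ["Gadget", None, "x"]]
for record in table_to_records(sample):
    print(record)
```

In the quickstart, the same helper could be applied to `tables[0]` when `tables` is non-empty.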