PDFText
pdftext is a Python library for fast, accurate extraction of structured text from PDF documents. It parses text efficiently, detects elements such as tables and links, and handles complex layouts. The current version is 0.6.3, and the project is actively maintained, with frequent releases that fix bugs and add features.
Common errors
-
ImportError: cannot import name 'PDFText' from 'pdftext'
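cause Python found a module named `pdftext`, but that module does not expose the name `PDFText`. Typical reasons are a typo or case mismatch in the import statement, an installed version that does not export this name, or a local file named `pdftext.py` shadowing the package. (If the library were not installed at all, Python would raise `ModuleNotFoundError` instead.)
fix Make sure the library is installed with `pip install pdftext` and that the import reads `from pdftext import PDFText`. If the error persists, check which names the installed version exports, for example with `dir(pdftext)`.
-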
pypdfium2.errors.PdfiumError: Failed to load PDF document
cause PDFium could not open the document. This can point to a problem with the underlying `pypdfium2` dependency: an incompatible version, a corrupt `pypdfium2` installation, or missing system dependencies required by `pypdfium2`. It can also occur when the file itself is damaged, password-protected, or not a PDF.
fix Confirm that the file opens in another PDF viewer. Verify that `pypdfium2` is installed and compatible, and try reinstalling it with `pip install --force-reinstall pypdfium2`. Also check the `pdftext` `pyproject.toml` for the exact `pypdfium2` version range it expects.
-
FileNotFoundError: [Errno 2] No such file or directory: '/path/to/your.pdf'
cause The PDF file specified in the `PDFText()` constructor does not exist at the given path.
fix Double-check the file path for typos. Ensure the file exists and that the path is either absolute or correct relative to the script's execution directory.
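The two file-related failures above can often be caught before any extraction library is called. The following is a minimal sketch using only the Python standard library. The helper name `preflight_pdf` is hypothetical and is not part of `pdftext`. It checks that the path exists and that a `%PDF-` header appears near the start of the file, which is a cheap signal and not a full validity check.

```python
from pathlib import Path

# Every PDF file begins with a "%PDF-" header; readers commonly tolerate
# a small amount of leading junk, so we search the first 1024 bytes.
PDF_MAGIC = b"%PDF-"


def preflight_pdf(path):
    """Return None if the file looks like a PDF, else a short reason string.

    Hypothetical helper for illustration; it does not parse the document.
    """
    p = Path(path)
    if not p.is_file():
        return f"not found: {p}"
    with p.open("rb") as fh:
        head = fh.read(1024)
    if PDF_MAGIC not in head:
        return f"no %PDF- header in first 1024 bytes: {p}"
    return None
```

A caller could run `preflight_pdf(pdf_path)` first and only proceed to extraction when it returns `None`. A file that passes this check can still be corrupt or encrypted, so the load error should still be handled.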
Warnings
- breaking Version 0.4.0 introduced a significant change in text segmentation, moving from a decision tree to a heuristic-based approach. This may result in different text output, especially regarding how spans, lines, and blocks are segmented compared to previous versions.
- gotcha The library pins specific versions of its core dependency, `pypdfium2`. For example, pdftext v0.4.1 pinned `pypdfium2` to an earlier release because of a bug. Installing an incompatible `pypdfium2` version in your environment can cause errors or incorrect text extraction.
- gotcha Patch releases such as v0.6.2 and v0.6.3 change how text spans are broken (for example, breaking more aggressively on newlines) and fix rotation issues. These changes are improvements, but they can slightly alter the structure or content of the extracted text for some PDFs.
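Because the warnings above concern version-dependent output, it can help to record which versions are installed and to compare extraction results before and after an upgrade. The sketch below uses only the standard library (`importlib.metadata` and `difflib`). The function names `installed_version` and `extraction_diff` are hypothetical, and the text being compared is assumed to be whatever string your extraction step produced.

```python
import difflib
from importlib import metadata


def installed_version(package):
    """Return the installed version string for a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def extraction_diff(old_text, new_text):
    """Return unified-diff lines between two extraction results.

    An empty list means the two results are identical line by line.
    """
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="before_upgrade",
            tofile="after_upgrade",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Report the versions that matter for the warnings above.
    for name in ("pdftext", "pypdfium2"):
        print(name, installed_version(name) or "not installed")
```

Saving the extracted text of a few representative PDFs and running `extraction_diff` after each upgrade gives an early signal when segmentation behavior has changed.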
Install
-
pip install pdftext
Imports
- PDFText
from pdftext import PDFText
Quickstart
import os
from pdftext import PDFText
# This example expects a real PDF named 'example.pdf' next to this script.
# In a real application, replace pdf_path with the path to your own file.
pdf_path = os.path.join(os.path.dirname(__file__), 'example.pdf')
# If you need a throwaway file and have FPDF installed, you can generate one:
# from fpdf import FPDF
# pdf = FPDF()
# pdf.add_page()
# pdf.set_font('Arial', 'B', 16)
# pdf.cell(40, 10, 'Hello, pdftext!')
# pdf.output(pdf_path)
# Without an existing file at pdf_path, the code below fails with FileNotFoundError.
try:
    # Initialize PDFText with the path to your PDF
    pdf_processor = PDFText(pdf_path)

    # Extract all text as a single string
    full_text = pdf_processor.as_text()
    print("--- Full Text ---")
    print(full_text)

    # Extract text as blocks
    text_blocks = pdf_processor.as_blocks()
    print("\n--- Text Blocks ---")
    for i, block in enumerate(text_blocks[:2]):  # Print first 2 blocks
        print(f"Block {i+1}: {block.text[:100]}...")

    # Extract text as lines (for detailed layout analysis)
    text_lines = pdf_processor.as_lines()
    print("\n--- Text Lines (first 5) ---")
    for i, line in enumerate(text_lines[:5]):
        print(f"Line {i+1}: {line.text}")

    # Extract tables (if any)
    tables = pdf_processor.as_tables()
    if tables:
        print("\n--- Tables (first) ---")
        print(tables[0].to_csv())
    else:
        print("\nNo tables found.")
except FileNotFoundError:
    print(f"Error: PDF file not found at {pdf_path}. Please create or specify a valid PDF.")
except Exception as e:
    print(f"An error occurred: {e}")