PDFText

0.6.3 · active · verified Thu Apr 16

pdftext is a Python library for fast, accurate extraction of structured text from PDF documents. It parses text efficiently, detects elements such as tables and links, and handles complex layouts. The current version is 0.6.3; the project is actively maintained, with frequent minor releases that fix bugs and add features.

Quickstart

This quickstart initializes `PDFText` with a PDF file, extracts the full text, retrieves the text as structured blocks and lines, and extracts tables. It assumes a file named 'example.pdf' exists in the same directory as the script; if it does not, the script prints a file-not-found message instead of extracting anything.

import os
from pdftext import PDFText

# Path to the input PDF. This example expects 'example.pdf' next to this script;
# replace it with the path to your own file in a real application.
pdf_path = os.path.join(os.path.dirname(__file__), 'example.pdf')

# No PDF at hand? One way to generate a small test file is the separate
# FPDF package (not part of pdftext). Uncomment to create example.pdf:
# from fpdf import FPDF
# pdf = FPDF()
# pdf.add_page()
# pdf.set_font('Arial', 'B', 16)
# pdf.cell(40, 10, 'Hello, pdftext!')
# pdf.output(pdf_path)

try:
    # Initialize PDFText with the path to your PDF
    pdf_processor = PDFText(pdf_path)

    # Extract all text as a single string
    full_text = pdf_processor.as_text()
    print("--- Full Text ---")
    print(full_text)

    # Extract text as blocks
    text_blocks = pdf_processor.as_blocks()
    print("\n--- Text Blocks ---")
    for i, block in enumerate(text_blocks[:2]): # Print first 2 blocks
        print(f"Block {i+1}: {block.text[:100]}...")

    # Extract text as lines (for detailed layout analysis)
    text_lines = pdf_processor.as_lines()
    print("\n--- Text Lines (first 5) ---")
    for i, line in enumerate(text_lines[:5]):
        print(f"Line {i+1}: {line.text}")

    # Extract tables (if any)
    tables = pdf_processor.as_tables()
    if tables:
        print("\n--- Tables (first) ---")
        print(tables[0].to_csv())
    else:
        print("\nNo tables found.")

except FileNotFoundError:
    print(f"Error: PDF file not found at {pdf_path}. Please create or specify a valid PDF.")
except Exception as e:
    print(f"An error occurred: {e}")
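
Because the quickstart depends on a real PDF being present at the given path, it can help to check the input before running any extraction. The following is a minimal sketch using only the standard library; `looks_like_pdf` is a hypothetical helper name, not part of pdftext. It relies on the fact that PDF files normally begin with the bytes `%PDF-`, so a text file that was merely renamed to `.pdf` is rejected. It does not prove the file is a well-formed PDF.

```python
from pathlib import Path

# Standard PDF files start with this header, followed by a version such as "1.7".
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if `path` is an existing file that starts with the PDF header.

    Hypothetical helper for illustration; it only inspects the first bytes
    and does not validate the rest of the file.
    """
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC
```

Calling `looks_like_pdf(pdf_path)` before the extraction step lets a script fail early with a clear message, rather than relying on the extractor's own error for a missing or non-PDF file.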
