Camelot

1.0.9 · active · verified Sat Apr 11

Camelot is a Python library for extracting tabular data from PDF files. It gives fine-grained control over extraction through two parsing methods: Lattice (for tables whose cells are separated by ruled lines) and Stream (for tables whose cells are separated by whitespace). Each extracted table is exposed as a pandas DataFrame, so it fits directly into data analysis workflows, and can be exported to CSV, JSON, Excel, HTML, Markdown, or SQLite. The library is actively maintained, with frequent patch releases; the current version is 1.0.9.
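The export formats named above correspond to per-table `to_*` methods. The following is a minimal sketch of picking one by file suffix, assuming a table object that exposes `to_csv`, `to_json`, `to_excel`, `to_html`, `to_markdown`, and `to_sqlite`, each taking an output path. The `EXPORTERS` mapping and the `export_table` helper are hypothetical names introduced here for illustration; they are not part of Camelot.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the name of a table export method.
EXPORTERS = {
    ".csv": "to_csv",
    ".json": "to_json",
    ".xlsx": "to_excel",
    ".html": "to_html",
    ".md": "to_markdown",
    ".sqlite": "to_sqlite",
}


def export_table(table, path):
    """Export `table` using the method that matches the suffix of `path`.

    Returns the name of the method that was called.
    """
    suffix = Path(path).suffix.lower()
    if suffix not in EXPORTERS:
        raise ValueError(f"Unsupported export suffix: {suffix!r}")
    method_name = EXPORTERS[suffix]
    # Look the method up on the table and call it with the output path.
    getattr(table, method_name)(str(path))
    return method_name
```

With a table from `camelot.read_pdf('foo.pdf')`, a call such as `export_table(tables[0], 'foo_table.json')` would write the first table as JSON.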

Quickstart

This quickstart demonstrates how to read a PDF file, extract tables using Camelot's default settings, inspect the parsing report, convert an extracted table to a pandas DataFrame, and export it to a CSV file. It assumes a 'foo.pdf' file with at least one table exists in the execution directory.

import camelot
import pandas as pd
import os

# NOTE: Replace 'foo.pdf' with the path to your actual PDF file.
# You can create a dummy PDF for testing or use an existing one.
# Example: A simple PDF with a table on page 1.

# Stop early with a clear message if the PDF is missing
if not os.path.exists('foo.pdf'):
    raise SystemExit("Please create a 'foo.pdf' with at least one table for this example.")

# Read tables from the PDF (defaults: flavor='lattice', pages='1')
tables = camelot.read_pdf('foo.pdf')

# Print the number of tables found
print(f"Found {tables.n} tables.\n")

if tables.n > 0:
    # Access the first extracted table
    first_table = tables[0]

    # Print parsing report for insights on accuracy and whitespace
    print("Parsing Report for the first table:")
    print(first_table.parsing_report)

    # Access the table's contents as a pandas DataFrame
    df = first_table.df
    print("\nExtracted DataFrame (first 5 rows):\n", df.head())

    # Export the table to CSV
    first_table.to_csv('foo_table.csv', index=False)
    print("\nTable exported to foo_table.csv")

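    # Alternatively, export all tables as CSV files bundled into one zip archive.
    # The path sets the per-table file names and extension; the archive is
    # named after the path's stem, so this writes all_tables.zip.
    tables.export('all_tables.csv', f='csv', compress=True)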
    print("All tables exported to all_tables.zip")
else:
    print("No tables found in 'foo.pdf'. You may need to adjust parameters like 'flavor' or 'pages'.")
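When no tables are found, one way to act on the advice about adjusting `flavor` is to retry with the other parsing method. The sketch below tries each flavor in turn and returns the first non-empty result. It assumes `camelot.read_pdf` accepts `pages` and `flavor` keyword arguments and that its result supports `len()`. The `read_with_fallback` helper and its `reader` parameter are hypothetical, added only so the logic can be exercised without a real PDF.

```python
def read_with_fallback(pdf_path, pages="1", flavors=("lattice", "stream"), reader=None):
    """Try each flavor in order and return (flavor, tables) for the first hit.

    Returns (None, []) when no flavor yields any table.
    """
    if reader is None:
        # Imported lazily so the helper can be tested with a stand-in reader.
        import camelot
        reader = camelot.read_pdf

    for flavor in flavors:
        tables = reader(pdf_path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    return None, []
```

A call such as `flavor, tables = read_with_fallback('foo.pdf', pages='1-3')` would then report which method produced the tables. Comparing `parsing_report` values across flavors is a possible refinement when both methods return results.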
