Camelot
Camelot is a Python library for extracting tabular data from PDF files. It provides fine-grained control over the extraction process through two parsing methods: Lattice (for tables with clearly drawn lines between cells) and Stream (for tables whose cells are separated by whitespace). Extracted tables are converted into pandas DataFrames, so they fit directly into data analysis workflows, and they can be exported to CSV, JSON, Excel, HTML, Markdown, and SQLite. The library is actively maintained; the current version is 1.0.9, and patch releases are frequent.
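As a small illustration of the export formats listed above, the sketch below derives a format name from a target file name. The helper name `export_format_for` and the extension-to-format mapping are assumptions made for this example, not part of Camelot; the format names are the ones the `f` argument of `export` (used in the Quickstart) is expected to accept.

```python
import os

# Hypothetical mapping from file extension to the `f` value passed to export().
_EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".xlsx": "excel",
    ".html": "html",
    ".md": "markdown",
    ".sqlite": "sqlite",
    ".db": "sqlite",
}


def export_format_for(path: str) -> str:
    """Return the export format name implied by the file extension of `path`."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return _EXTENSION_TO_FORMAT[ext]
    except KeyError:
        raise ValueError(f"No export format known for extension {ext!r}") from None


print(export_format_for("report.XLSX"))  # prints: excel
```

With a table list in hand, this could be used as `tables.export(path, f=export_format_for(path))`.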
Warnings
- gotcha: Camelot primarily works with text-based PDFs. It cannot reliably extract tables from scanned documents or image-based PDFs where text is not selectable. Before extracting, confirm in a PDF viewer that the text in your PDF can be selected.
- breaking: Ghostscript installation issues. Prior to v1.0.0, Ghostscript was a mandatory external dependency, which often complicated installation because of system-level setup and PATH configuration, especially on Windows and macOS. v1.0.0 introduced pypdfium2 as the default, pip-installable backend to mitigate this, but Ghostscript remains an optional backend, and the same problems can arise if it is chosen explicitly or required in a specific environment.
- gotcha: Choosing the correct parsing 'flavor' is crucial for accurate extraction. 'lattice' (default) is best for tables with clearly defined lines. 'stream' is better for tables where columns and rows are separated by whitespace rather than explicit lines. Using the wrong flavor can lead to no tables being found or to incorrect data extraction.
- gotcha: For PDFs with complex layouts, tables spanning multiple pages, or multiple tables on a single page, Camelot might fail to autodetect all tables or might merge unrelated data. The 'stream' flavor, in particular, may treat an entire page as a single table.
- gotcha: Complex tables with merged cells, multi-line text within cells, or inconsistent spacing can cause data to be grouped into a single row incorrectly or to contain unwanted newline characters.
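One way to cope with the flavor warning above is to try 'lattice' first and fall back to 'stream' when nothing is found. The following is a minimal sketch under stated assumptions: the helper `read_with_fallback` and its `reader` parameter are hypothetical (the latter exists only so the logic can be exercised without a PDF), and it treats "zero tables" as the only failure signal, which will not catch a flavor that finds tables but extracts them badly.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def read_with_fallback(
    path: str,
    pages: str = "1",
    flavors: Sequence[str] = ("lattice", "stream"),
    reader: Optional[Callable[..., Any]] = None,
) -> Tuple[Optional[str], Any]:
    """Try each flavor in order; return (flavor, tables) for the first that finds tables.

    `reader` is a hypothetical injection point for testing. When it is omitted,
    camelot.read_pdf is imported lazily and used.
    """
    if reader is None:
        import camelot  # lazy import so the helper can be tested without Camelot

        reader = camelot.read_pdf

    for flavor in flavors:
        tables = reader(path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables

    # No flavor produced any table.
    return None, []
```

A call such as `flavor, tables = read_with_fallback('foo.pdf', pages='1-3')` would then report which flavor succeeded. Checking each table's parsing report afterwards is still advisable, since a non-empty result is not necessarily a correct one.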
Install
- pip install "camelot-py[base]"
- pip install "camelot-py[cv]"
- conda install -c conda-forge camelot-py
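After installing, it can help to confirm that the relevant packages are importable before debugging anything else. This is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and `pypdfium2` is listed on the assumption that it is the import name of the default backend mentioned in the warnings.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every module was found.
print(missing_modules(["camelot", "pandas", "pypdfium2"]))
```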
Imports
- camelot
import camelot
Quickstart
import os

import camelot

# NOTE: Replace 'foo.pdf' with the path to your actual PDF file.
# The PDF must be text-based and contain at least one table (for example, on page 1).
if not os.path.exists('foo.pdf'):
    # Stop early with a clear message instead of failing inside Camelot.
    raise SystemExit("Please create a 'foo.pdf' with at least one table for this example.")

# Read tables from the PDF (defaults to the 'lattice' flavor and the first page)
tables = camelot.read_pdf('foo.pdf')

# Print the number of tables found
print(f"Found {tables.n} tables.\n")

if tables.n > 0:
    # Access the first extracted table
    first_table = tables[0]

    # Print the parsing report for insights on accuracy and whitespace
    print("Parsing Report for the first table:")
    print(first_table.parsing_report)

    # Convert the table to a pandas DataFrame
    df = first_table.df
    print("\nExtracted DataFrame (first 5 rows):\n", df.head())

    # Export the table to CSV
    first_table.to_csv('foo_table.csv', index=False)
    print("\nTable exported to foo_table.csv")

    # Alternatively, export all tables to a compressed zip file.
    # The path's base name and extension are used for the files inside the
    # archive; the archive itself is written next to it as all_tables.zip.
    tables.export('all_tables.csv', f='csv', compress=True)
    print("All tables exported to all_tables.zip")
else:
    print("No tables found in 'foo.pdf'. You may need to adjust parameters like 'flavor' or 'pages'.")
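Multi-line cell text, mentioned in the warnings, often shows up as stray newline characters in the extracted DataFrame. A minimal post-processing sketch follows; the helper `clean_cell` is hypothetical and simply collapses any run of whitespace, including newlines, into a single space.

```python
import re


def clean_cell(value):
    """Collapse embedded newlines and repeated whitespace in a cell into single spaces."""
    if not isinstance(value, str):
        return value  # leave non-string cells (numbers, None) untouched
    return re.sub(r"\s+", " ", value).strip()


print(clean_cell("Total\n amount "))  # prints: Total amount
```

It could be applied column by column, for example `df = df.apply(lambda col: col.map(clean_cell))`. Camelot's `strip_text` option to `read_pdf` is another route when the goal is to remove specific characters at extraction time rather than replace them with spaces.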