tabula-py

2.10.0 · active · verified Sat Apr 11

tabula-py is a Python wrapper for tabula-java, a tool that extracts tabular data from PDF files. It reads tables directly into pandas DataFrames or converts PDF tables into CSV, TSV, or JSON files. The current version is 2.10.0, and the project is actively maintained, with updates that include support for newer Python versions.
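As a rough illustration of the file-conversion path, the sketch below wraps `tabula.convert_into()` in a small helper that picks an output file name from the requested format. The names `output_path_for` and `convert_tables` and the file `report.pdf` are hypothetical, and the `tabula` import is deferred so that the path helper can be exercised without Java installed.

```python
from pathlib import Path

# Output formats named in the description above.
SUPPORTED_FORMATS = ("csv", "tsv", "json")


def output_path_for(pdf_path: str, output_format: str) -> Path:
    """Hypothetical helper: swap the .pdf suffix for the target format."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {output_format!r}")
    return Path(pdf_path).with_suffix("." + output_format)


def convert_tables(pdf_path: str, output_format: str = "csv", pages="all") -> Path:
    """Hypothetical wrapper: write the tables of one PDF to a single file."""
    import tabula  # deferred import; needs tabula-py and a Java runtime

    out = output_path_for(pdf_path, output_format)
    tabula.convert_into(pdf_path, str(out), output_format=output_format, pages=pages)
    return out


if __name__ == "__main__":
    # Only computes the target path; call convert_tables() to run the conversion.
    print(output_path_for("report.pdf", "json"))
```

Calling `convert_tables("report.pdf", "tsv")` would then write `report.tsv` next to the input, assuming the PDF exists and Java is available.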

Warnings

Install

Imports

Quickstart

This quickstart extracts tables from a remote PDF file into a list of pandas DataFrames with `tabula.read_pdf()`, then converts the same tables directly to a CSV file with `tabula.convert_into()`. It also handles the common `JavaNotFoundError`, because tabula-py requires a Java Runtime Environment.

import tabula
import pandas as pd # often used with tabula-py results

# Example PDF URL with tables
pdf_url = "https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf"

try:
    # Read tables from the PDF into a list of DataFrames
    # pages='all' extracts from every page; by default only page 1 is read.
    # multiple_tables=True has been the default since v2.0.0, so read_pdf()
    # returns a list even when only one table is found.
    dfs = tabula.read_pdf(pdf_url, pages='all', multiple_tables=True)

    if dfs:
        print(f"Successfully extracted {len(dfs)} tables.")
        for i, df in enumerate(dfs):
            print(f"\nTable {i+1}:")
            print(df.head()) # Print first few rows of each DataFrame
    else:
        print("No tables found in the PDF.")

    # You can also convert to CSV directly
    output_csv_path = "output.csv"
    tabula.convert_into(pdf_url, output_csv_path, output_format="csv", pages='all')
    print(f"\nTables converted and saved to {output_csv_path}")

except tabula.errors.JavaNotFoundError:
    print("Error: Java Runtime Environment (JRE) not found. Please install Java 8+ and ensure it's in your PATH.")
except Exception as e:
    print(f"An error occurred: {e}")
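Because a missing Java runtime is the failure the quickstart guards against, a preflight check can surface it before any PDF is read. This is a minimal sketch using only the standard library; the function name `java_on_path` is hypothetical, and the check looks at `PATH` only, so it is an approximation of whatever lookup tabula-py performs internally.

```python
import shutil


def java_on_path() -> bool:
    """Return True if a `java` executable can be found on PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_on_path():
        print("java found at", shutil.which("java"))
    else:
        print("java not found on PATH; install a JRE before calling tabula.read_pdf()")
```

A script can call `java_on_path()` once at startup and exit with a clear message instead of relying solely on the `except` clause.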
