tabula-py

2.10.0 · active · verified Sat Apr 11

tabula-py is a simple Python wrapper for tabula-java, a tool that extracts tabular data from PDF files. It reads tables directly into pandas DataFrames or converts PDF tables into CSV, TSV, or JSON files. The library is actively maintained; the current release, 2.10.0, adds support for Python 3.13.

Warnings

breaking tabula-py is a wrapper for tabula-java and requires a Java Runtime Environment (JRE 8+) to be installed on your system and accessible in your system's PATH. Without it, `tabula-py` functions will raise a `tabula.errors.JavaNotFoundError`.
Fix: Install a Java Runtime Environment (JRE) 8 or newer and ensure its `bin` directory is added to your system's PATH environment variable. You can verify Java availability using `tabula.environment_info()`.
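The check described in this fix can be sketched with the standard library alone. The helper names `find_java` and `java_version_banner` are illustrative and not part of tabula-py; the only tabula-py call used is `tabula.environment_info()`, and it is imported late so the script still runs when the package is missing.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Run `java -version` and return its banner text.

    Most JVMs print the banner to stderr, so both streams are captured.
    """
    result = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        print("java not found on PATH; install a JRE (8 or newer) first.")
    else:
        print(f"java found at {java_path}")
        print(java_version_banner(java_path))
        try:
            # Imported late so the PATH check above works without tabula-py.
            import tabula
        except ImportError:
            print("tabula-py is not installed in this environment.")
        else:
            tabula.environment_info()
```

If `find_java()` returns `None`, fixing PATH is the first step; tabula-py cannot locate Java in that situation either.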
breaking Python version compatibility has changed across recent releases. Version 2.10.0 dropped support for Python 3.8 and added support for Python 3.13. Version 2.9.0 introduced support for Python 3.12, making `jpype` optional due to its lack of 3.12 support at the time.
Fix: If you are on Python 3.8, pin `tabula-py<2.10.0`. On Python 3.12 or 3.13, install without the `[jpype]` extra (`pip install tabula-py`) if JPype1 does not yet support your Python version, and add the extra once it does to regain the faster execution path.
gotcha Since v2.9.0, JPype1 is an optional dependency. While `tabula-py` can function without it by falling back to subprocess mode, installing with `pip install tabula-py[jpype]` is recommended for significantly faster execution on compatible Python versions (up to 3.11, and newer once JPype1 adds support).
Fix: For performance-critical applications, install `tabula-py` with `pip install tabula-py[jpype]`. If you encounter `jpype` related issues, you can explicitly force subprocess mode by passing `force_subprocess=True` to `read_pdf()` and related functions.
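A minimal sketch of forcing subprocess mode, assuming a small wrapper of your own. `read_tables` and its `reader` parameter are hypothetical names added here so the wrapper can be exercised without Java; `force_subprocess` is the `read_pdf()` parameter named in the fix above.

```python
def read_tables(pdf_path, force_subprocess=False, reader=None, **options):
    """Read tables from a PDF, optionally forcing subprocess mode.

    `reader` is a hypothetical injection point for testing. When it is
    omitted, the real `tabula.read_pdf` is imported lazily and used.
    """
    if reader is None:
        from tabula import read_pdf as reader
    # Forward the flag and any other read_pdf options unchanged.
    return reader(pdf_path, force_subprocess=force_subprocess, **options)


# Example call (needs tabula-py, Java, and a real PDF path):
# dfs = read_tables("report.pdf", force_subprocess=True, pages="all")
```

Keeping the flag in one wrapper makes it easy to switch modes for a whole application when a JPype-related problem appears.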
gotcha Installing a separate Python package named `tabula` (instead of `tabula-py`) can lead to a namespace conflict, causing `AttributeError: module 'tabula' has no attribute 'read_pdf'` when trying to use `tabula-py` functions.
Fix: Make sure `tabula-py` is installed and the unrelated `tabula` package is not. If both are present, run `pip uninstall tabula` and then `pip install tabula-py`. The functions are imported from the top-level `tabula` package (e.g., `from tabula import read_pdf`).
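One way to spot the conflict is to ask the installed-package metadata which distributions are present. This is a sketch using the standard-library `importlib.metadata`; the helper names `installed_versions` and `has_conflict` are illustrative and not part of tabula-py.

```python
from importlib import metadata


def installed_versions(names=("tabula", "tabula-py")):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def has_conflict(versions):
    """Return True when the unrelated `tabula` distribution is installed."""
    return versions.get("tabula") is not None


versions = installed_versions()
print(versions)
if has_conflict(versions):
    print("Conflict: run `pip uninstall tabula`, then `pip install tabula-py`.")
```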
gotcha tabula-py (and its underlying tabula-java) cannot extract tables from image-based PDFs; the PDF must contain text-based table information. Additionally, by default, `read_pdf()` only extracts from page 1, and prior to v2.0.0, `multiple_tables` was `False` by default.
Fix: Verify that your PDF contains selectable text, not just images. For multi-page PDFs, always specify `pages='all'` or a list of desired page numbers (e.g., `pages=[1, 2, 5]`). Be aware that `read_pdf()` returns a list of DataFrames (or dicts for JSON output) when `multiple_tables=True` (default since v2.0.0). For complex PDFs, use the `area`, `stream`, or `lattice` options for more precise extraction, and consider trying the Tabula App (GUI tool) to debug extraction logic.
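The options mentioned in this fix can be combined as in the sketch below. `build_read_options` is a hypothetical helper, and the file name and area coordinates in the usage comment are placeholders; `pages`, `lattice`, `stream`, and `area` are `read_pdf()` keyword arguments, with `area` given as top, left, bottom, right.

```python
def build_read_options(pages="all", mode=None, area=None):
    """Assemble keyword arguments for `tabula.read_pdf`.

    mode: "lattice" for tables drawn with ruling lines, "stream" for tables
          separated by whitespace, or None to leave both flags unset.
    area: (top, left, bottom, right) of the region to analyze, or None
          for the whole page.
    """
    if mode not in (None, "lattice", "stream"):
        raise ValueError(f"unknown mode: {mode!r}")
    options = {"pages": pages}
    if mode is not None:
        options[mode] = True
    if area is not None:
        options["area"] = list(area)
    return options


# Example call (needs tabula-py, Java, and a real PDF path):
# import tabula
# options = build_read_options(pages=[1, 2, 5], mode="lattice",
#                              area=(50, 30, 500, 560))
# dfs = tabula.read_pdf("report.pdf", **options)  # list of DataFrames
```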

Install

`pip install tabula-py`: basic installation
`pip install "tabula-py[jpype]"`: installation with JPype1 for faster execution (the quotes keep shells such as zsh from interpreting the square brackets)
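
To see which of the two installs an environment actually has, it may help to check whether the `jpype` module (provided by the JPype1 distribution) is importable. This is only a sketch with illustrative helper names; tabula-py can still fall back to subprocess mode at runtime even when the module is present.

```python
import importlib.util


def jpype_available():
    """Return True when the `jpype` module (from JPype1) is importable."""
    return importlib.util.find_spec("jpype") is not None


def expected_backend():
    """Name the execution path this environment would be expected to use."""
    return "jpype" if jpype_available() else "subprocess"


print(f"Expected tabula-py backend: {expected_backend()}")
```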

Imports

read_pdf
```
from tabula import read_pdf
```
While `tabula.io.read_pdf` works, direct import from `tabula` is the common and documented pattern.
convert_into
```
from tabula import convert_into
```
environment_info
```
from tabula import environment_info
```
Useful for debugging Java environment issues.

Quickstart

This quickstart demonstrates how to extract tables from a remote PDF file into a list of pandas DataFrames using `tabula.read_pdf()`. It also shows how to directly convert PDF tables to a CSV file using `tabula.convert_into()`. Error handling for the common `JavaNotFoundError` is included, as a Java Runtime Environment is a prerequisite.

```
import tabula
import pandas as pd # often used with tabula-py results

# Example PDF URL with tables
pdf_url = "https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf"

try:
    # Read tables from the PDF into a list of DataFrames
    # pages='all' extracts from all pages. '1' is default.
    # multiple_tables=True is the default from v2.0.0, returning a list even if only one table.
    dfs = tabula.read_pdf(pdf_url, pages='all', multiple_tables=True)

    if dfs:
        print(f"Successfully extracted {len(dfs)} tables.")
        for i, df in enumerate(dfs):
            print(f"\nTable {i+1}:")
            print(df.head()) # Print first few rows of each DataFrame
    else:
        print("No tables found in the PDF.")

    # You can also convert to CSV directly
    output_csv_path = "output.csv"
    tabula.convert_into(pdf_url, output_csv_path, output_format="csv", pages='all')
    print(f"\nTables converted and saved to {output_csv_path}")

except tabula.errors.JavaNotFoundError:
    print("Error: Java Runtime Environment (JRE) not found. Please install Java 8+ and ensure it's in your PATH.")
except Exception as e:
    print(f"An error occurred: {e}")
```
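
The quickstart writes CSV, but `convert_into()` also accepts `"tsv"` and `"json"` as `output_format`. The sketch below picks the format from the output file's extension; `output_format_for`, `convert_by_extension`, and the `converter` parameter are hypothetical names introduced for illustration and are not part of tabula-py.

```python
from pathlib import Path

# File extensions mapped to the output formats named in this document.
SUPPORTED_FORMATS = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def output_format_for(output_path):
    """Return the `output_format` value matching the file extension."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return SUPPORTED_FORMATS[suffix]


def convert_by_extension(pdf_path, output_path, converter=None, **options):
    """Convert PDF tables to a file whose format follows its extension.

    `converter` is a hypothetical injection point for testing. When it is
    omitted, the real `tabula.convert_into` is imported lazily and used.
    """
    if converter is None:
        from tabula import convert_into as converter
    converter(
        pdf_path,
        output_path,
        output_format=output_format_for(output_path),
        **options,
    )


# Example call (needs tabula-py, Java, and a real PDF path):
# convert_by_extension("report.pdf", "tables.tsv", pages="all")
```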
