tabula-py
tabula-py is a simple Python wrapper for tabula-java, a tool that extracts tabular data from PDF files. It allows users to read tables directly into pandas DataFrames or convert PDF tables into CSV, TSV, or JSON files. The library is currently at version 2.10.0 and receives regular maintenance and updates, including support for newer Python versions.
Warnings
- breaking: tabula-py is a wrapper for tabula-java and requires a Java Runtime Environment (JRE 8+) that is installed and accessible on your system's PATH. Without it, `tabula-py` functions raise `tabula.errors.JavaNotFoundError`.
- breaking: Python version compatibility has changed across recent releases. Version 2.10.0 dropped support for Python 3.8 and added support for Python 3.13. Version 2.9.0 introduced support for Python 3.12 and made `jpype` optional, because JPype1 did not support 3.12 at the time.
- gotcha: Since v2.9.0, JPype1 is an optional dependency. `tabula-py` works without it by falling back to subprocess mode, but installing with `pip install tabula-py[jpype]` is recommended for significantly faster execution on compatible Python versions (up to 3.11, and newer once JPype1 adds support).
- gotcha: Installing a separate Python package named `tabula` (instead of `tabula-py`) can cause a namespace conflict, which surfaces as `AttributeError: module 'tabula' has no attribute 'read_pdf'` when calling `tabula-py` functions.
- gotcha: tabula-py (like the underlying tabula-java) cannot extract tables from image-based PDFs; the PDF must contain text-based table information.
- gotcha: By default, `read_pdf()` extracts only from page 1. Prior to v2.0.0, `multiple_tables` defaulted to `False`.
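Several of the warnings above can be checked before calling tabula-py at all. The following is a minimal sketch, not part of tabula-py: `is_installed` and `preflight` are hypothetical helper names, and the check assumes that the conflicting package is published under the distribution name `tabula` and that JPype1 is imported as `jpype`. It uses only the standard library.

```python
import importlib.util
import shutil
from importlib import metadata


def is_installed(dist_name: str) -> bool:
    """Return True if a distribution with this name is installed."""
    try:
        metadata.version(dist_name)
        return True
    except metadata.PackageNotFoundError:
        return False


def preflight() -> dict:
    """Collect the environment facts that the warnings above depend on."""
    return {
        # tabula-java needs a `java` executable reachable through PATH.
        "java_on_path": shutil.which("java") is not None,
        # The wrapper itself is distributed as `tabula-py`.
        "tabula_py_installed": is_installed("tabula-py"),
        # Assumption: the conflicting package is the distribution named `tabula`.
        "conflicting_tabula_installed": is_installed("tabula"),
        # Assumption: JPype1 exposes the import name `jpype`.
        "jpype_importable": importlib.util.find_spec("jpype") is not None,
    }


if __name__ == "__main__":
    for check, result in preflight().items():
        print(f"{check}: {result}")
```

If `conflicting_tabula_installed` is true, uninstalling that package and reinstalling `tabula-py` is the usual way to clear the `AttributeError` described above. This check only confirms that a `java` executable exists on PATH; it does not verify the Java version.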
Install
- pip install tabula-py
- pip install tabula-py[jpype]
Imports
- read_pdf
from tabula import read_pdf
- convert_into
from tabula import convert_into
- environment_info
from tabula import environment_info
Quickstart
import tabula
import pandas as pd  # often used with tabula-py results

# Example PDF URL with tables
pdf_url = "https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf"

try:
    # Read tables from the PDF into a list of DataFrames
    # pages='all' extracts from all pages. '1' is default.
    # multiple_tables=True is the default from v2.0.0, returning a list even if only one table.
    dfs = tabula.read_pdf(pdf_url, pages='all', multiple_tables=True)
    if dfs:
        print(f"Successfully extracted {len(dfs)} tables.")
        for i, df in enumerate(dfs):
            print(f"\nTable {i+1}:")
            print(df.head())  # Print first few rows of each DataFrame
    else:
        print("No tables found in the PDF.")

    # You can also convert to CSV directly
    output_csv_path = "output.csv"
    tabula.convert_into(pdf_url, output_csv_path, output_format="csv", pages='all')
    print(f"\nTables converted and saved to {output_csv_path}")
except tabula.errors.JavaNotFoundError:
    print("Error: Java Runtime Environment (JRE) not found. Please install Java 8+ and ensure it's in your PATH.")
except Exception as e:
    print(f"An error occurred: {e}")
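The quickstart hard-codes `output_format="csv"`. When the output path varies, the format can be derived from the file extension instead. The sketch below is an illustration only: `output_format_for` and `SUPPORTED_FORMATS` are hypothetical names, not part of tabula-py, and the mapping covers just the three formats this page names (CSV, TSV, JSON). The `convert_into` call is left commented out because it needs tabula-py, Java, and a real PDF; `input.pdf` is a placeholder.

```python
from pathlib import Path

# Formats named on this page for convert_into: CSV, TSV, JSON.
SUPPORTED_FORMATS = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def output_format_for(path: str) -> str:
    """Map an output file's extension to an output_format string (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported output extension: {suffix!r}") from None


# Intended use (requires tabula-py, Java, and an actual PDF file):
# import tabula
# out = "tables.json"
# tabula.convert_into("input.pdf", out, output_format=output_format_for(out), pages="all")
```

Raising `ValueError` for an unknown extension keeps a mistyped path from silently producing a file in the wrong format.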