{"id":3823,"library":"tabula-py","title":"tabula-py","description":"tabula-py is a simple Python wrapper for tabula-java, a tool that extracts tabular data from PDF files. It allows users to read tables directly into pandas DataFrames or convert PDF tables into CSV, TSV, or JSON files. The library is currently at version 2.10.0 and receives regular maintenance and updates, including support for newer Python versions.","status":"active","version":"2.10.0","language":"en","source_language":"en","source_url":"https://github.com/chezou/tabula-py","tags":["PDF","table extraction","data extraction","pandas","java","wrapper"],"install":[{"cmd":"pip install tabula-py","lang":"bash","label":"Basic installation"},{"cmd":"pip install tabula-py[jpype]","lang":"bash","label":"Installation with JPype for faster execution"}],"dependencies":[{"reason":"For outputting extracted tables as DataFrames.","package":"pandas"},{"reason":"Optional dependency for faster execution via direct JVM communication.","package":"JPype1","optional":true},{"reason":"External dependency required as tabula-py wraps tabula-java. Must be installed and accessible in system PATH.","package":"Java Runtime Environment (JRE) 8+"}],"imports":[{"note":"While `tabula.io.read_pdf` works, direct import from `tabula` is the common and documented pattern.","wrong":"import tabula; tabula.io.read_pdf()","symbol":"read_pdf","correct":"from tabula import read_pdf"},{"symbol":"convert_into","correct":"from tabula import convert_into"},{"note":"Useful for debugging Java environment issues.","symbol":"environment_info","correct":"from tabula import environment_info"}],"quickstart":{"code":"import tabula\nimport pandas as pd # often used with tabula-py results\n\n# Example PDF URL with tables\npdf_url = \"https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf\"\n\ntry:\n    # Read tables from the PDF into a list of DataFrames\n    # pages='all' extracts from all pages. 
'1' is default.\n    # multiple_tables=True is the default from v2.0.0, returning a list even if only one table.\n    dfs = tabula.read_pdf(pdf_url, pages='all', multiple_tables=True)\n\n    if dfs:\n        print(f\"Successfully extracted {len(dfs)} tables.\")\n        for i, df in enumerate(dfs):\n            print(f\"\\nTable {i+1}:\")\n            print(df.head()) # Print first few rows of each DataFrame\n    else:\n        print(\"No tables found in the PDF.\")\n\n    # You can also convert to CSV directly\n    output_csv_path = \"output.csv\"\n    tabula.convert_into(pdf_url, output_csv_path, output_format=\"csv\", pages='all')\n    print(f\"\\nTables converted and saved to {output_csv_path}\")\n\nexcept tabula.errors.JavaNotFoundError:\n    print(\"Error: Java Runtime Environment (JRE) not found. Please install Java 8+ and ensure it's in your PATH.\")\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")","lang":"python","description":"This quickstart demonstrates how to extract tables from a remote PDF file into a list of pandas DataFrames using `tabula.read_pdf()`. It also shows how to directly convert PDF tables to a CSV file using `tabula.convert_into()`. Error handling for the common `JavaNotFoundError` is included, as a Java Runtime Environment is a prerequisite."},"warnings":[{"fix":"Install a Java Runtime Environment (JRE) 8 or newer and ensure its `bin` directory is added to your system's PATH environment variable. You can verify Java availability using `tabula.environment_info()`.","message":"tabula-py is a wrapper for tabula-java and requires a Java Runtime Environment (JRE 8+) to be installed on your system and accessible in your system's PATH. Without it, `tabula-py` functions will raise a `tabula.errors.JavaNotFoundError`.","severity":"breaking","affected_versions":"All versions"},{"fix":"If using Python 3.8, restrict `tabula-py` to `<2.10.0`. 
For Python 3.12 or 3.13, install `tabula-py` without the `[jpype]` extra if `jpype` does not yet support your specific Python version (e.g., `pip install tabula-py`). Once `jpype` supports your Python version, reinstall with the `[jpype]` extra for better performance.","message":"Python version compatibility has changed across recent releases. Version 2.10.0 dropped support for Python 3.8 and added support for Python 3.13. Version 2.9.0 introduced support for Python 3.12, making `jpype` optional due to its lack of 3.12 support at the time.","severity":"breaking","affected_versions":">=2.9.0 (Python 3.12+ users), >=2.10.0 (Python 3.8 users)"},{"fix":"For performance-critical applications, install `tabula-py` with `pip install tabula-py[jpype]`. If you encounter `jpype`-related issues, you can explicitly force subprocess mode by passing `force_subprocess=True` to `read_pdf()` and related functions.","message":"Since v2.9.0, JPype1 is an optional dependency. While `tabula-py` can function without it by falling back to subprocess mode, installing with `pip install tabula-py[jpype]` is recommended for significantly faster execution on compatible Python versions (up to 3.11, and newer once JPype1 adds support).","severity":"gotcha","affected_versions":">=2.9.0"},{"fix":"Ensure you have `tabula-py` installed and not a conflicting `tabula` package. If a conflict exists, run `pip uninstall tabula` before `pip install tabula-py`. `tabula-py`'s functions are imported directly from the top-level `tabula` package (e.g., `from tabula import read_pdf`).","message":"Installing a separate Python package named `tabula` (instead of `tabula-py`) can lead to a namespace conflict, causing `AttributeError: module 'tabula' has no attribute 'read_pdf'` when trying to use `tabula-py` functions.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Verify that your PDF contains selectable text, not just images. 
For multi-page PDFs, always specify `pages='all'` or a list of desired page numbers (e.g., `pages=[1, 2, 5]`). Be aware that `read_pdf()` returns a list of DataFrames (or dicts for JSON output) when `multiple_tables=True` (default since v2.0.0). For complex PDFs, use the `area`, `stream`, or `lattice` options for more precise extraction, and consider trying the Tabula App (GUI tool) to debug extraction logic.","message":"tabula-py (and its underlying tabula-java) cannot extract tables from image-based PDFs; the PDF must contain text-based table information. Additionally, by default, `read_pdf()` only extracts from page 1, and prior to v2.0.0, `multiple_tables` was `False` by default.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}