img2table

1.4.2 verified Mon Apr 27 auth: no python

img2table is a library for identifying and extracting tables from PDFs and images, built on OpenCV image processing. Current version: 1.4.2. Supports Python 3.9-3.13. Published on PyPI at a moderate release cadence.

pip install img2table
error ModuleNotFoundError: No module named 'paddleocr'
cause PaddleOCR is an extra dependency, not installed by default with img2table.
fix
pip install paddleocr
error ImportError: cannot import name 'PaddleOCR' from 'img2table.ocr'
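cause The class name is case-sensitive. This error typically appears when the import is written with different casing (for example PaddleOcr); the exported name is PaddleOCR, with capital O, C, and R.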
fix
Use: from img2table.ocr import PaddleOCR
error AttributeError: 'Image' object has no attribute 'extract_tables'
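cause The Image name in scope is not img2table's document class, for example because it was imported from another library such as PIL. img2table does not expose Image in its top-level package; it lives in img2table.document.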
fix
Use: from img2table.document import Image
breaking In v1.4.0, the PDF backend was migrated from PyMuPDF/fitz to pypdfium2 for license compliance. Existing code expecting fitz will break.
fix No action needed if using Document classes; only affects direct use of PDF library internals.
deprecated The old TesseractOCR class used Tesseract 4.x; future versions may remove support. Recommended to migrate to PaddleOCR or SuryaOCR.
fix Switch to PaddleOCR or SuryaOCR. Each backend is installed separately: pip install paddleocr or pip install surya-ocr.
gotcha OCR initialization is heavy; avoid recreating the OCR instance for each image inside a loop. Reuse the same OCR object across multiple documents.
fix Create one OCR object and pass it to multiple extract_tables calls.
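
As a rough illustration of the reuse advice above, the sketch below builds one OCR object and shares it across several images. The helper name extract_from_many, its load parameter, and the file names page1.png and page2.png are hypothetical and not part of img2table; only Image, PaddleOCR, and extract_tables(ocr=...) come from the library.

```python
from typing import Any, Callable, Dict, Iterable, List


def _load_image(path: str) -> Any:
    # Imported lazily so this module can be imported without img2table installed.
    from img2table.document import Image

    return Image(src=path)


def extract_from_many(
    paths: Iterable[str],
    ocr: Any,
    load: Callable[[str], Any] = _load_image,
) -> Dict[str, List[Any]]:
    """Extract tables from each path, sharing a single OCR instance.

    `load` is a hypothetical hook that turns a path into a document object;
    by default it builds an img2table Image.
    """
    results: Dict[str, List[Any]] = {}
    for path in paths:
        doc = load(path)
        # The same `ocr` object is passed on every iteration.
        results[path] = doc.extract_tables(ocr=ocr)
    return results


def main() -> None:
    from img2table.ocr import PaddleOCR

    ocr = PaddleOCR(lang="en")  # built once, outside the loop
    # Hypothetical input files.
    tables_by_file = extract_from_many(["page1.png", "page2.png"], ocr)
    for name, tables in tables_by_file.items():
        print(name, len(tables))
```

Call main() from a script once img2table and paddleocr are installed. Passing the OCR object in as an argument, instead of constructing it inside the helper, keeps the expensive initialization in one place.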

Extract tables from an image using PaddleOCR.

from img2table.document import Image
from img2table.ocr import PaddleOCR

# PaddleOCR runs locally, so no API key is needed
ocr = PaddleOCR(lang='en')

img = Image(src='table.png')
tables = img.extract_tables(ocr=ocr)
print(tables)
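
The returned list holds one object per detected table. As a minimal sketch, assuming each item exposes its content as a pandas DataFrame through a df attribute (as described in the img2table README), a small helper can report the size of each table. describe_tables is a hypothetical name used for illustration only.

```python
from typing import Any, Iterable, List, Tuple


def describe_tables(tables: Iterable[Any]) -> List[Tuple[int, int, int]]:
    """Return (index, n_rows, n_cols) for each extracted table.

    Assumes every item has a `df` attribute with a pandas-style `shape`.
    """
    summary: List[Tuple[int, int, int]] = []
    for idx, table in enumerate(tables):
        n_rows, n_cols = table.df.shape
        summary.append((idx, n_rows, n_cols))
    return summary
```

With the quickstart above, describe_tables(tables) would yield one tuple per table found in table.png, and an empty list when no table is detected.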