Document Text Recognition (docTR)
docTR (Document Text Recognition) is an open-source Python library leveraging deep learning for high-performance Optical Character Recognition (OCR) on documents. It provides state-of-the-art text detection and recognition for scanned documents, images, and PDFs. Actively maintained by Mindee, it supports multi-language recognition, handwriting, and GPU acceleration, currently at version 1.0.1.
Common errors
- OSError: cannot load library 'gobject-2.0-0'
  - cause: Missing system-level dependencies for `weasyprint`, which is used by docTR's `html` and `viz` extras for PDF/HTML processing.
  - fix: Install the required system packages. On Debian/Ubuntu: `sudo apt-get install -y libglib2.0-0 libpango-1.0-0 libpangoft2-1.0-0`. Other Linux distributions, macOS, and Windows have different prerequisites for `weasyprint`.
- ModuleNotFoundError: No module named 'doctr.io'
  - cause: The `python-doctr` package is not installed, or the Python interpreter in use cannot see it (e.g., the wrong virtual environment is active).
  - fix: Install it into the active environment: `pip install python-doctr`. In an IDE such as PyCharm, verify that the correct Python interpreter is selected for the project.
- `git clone ...` followed by `pip install -e doctr/` fails with SSL certificate verification errors.
  - cause: Corporate proxies or misconfigured Git installations can block secure (SSL/TLS) connections when cloning repositories or fetching packages.
  - fix: Temporarily disable SSL verification for Git *before* cloning: `git config --global http.sslVerify false`. Re-enable it afterwards with `git config --global http.sslVerify true`, since leaving verification off is a security risk.
Warnings
- breaking: docTR v1.0.0 removed TensorFlow as a supported backend; the library now uses PyTorch exclusively. The old `python-doctr[tf]` install option is no longer valid, and the training scripts have been updated accordingly.
- gotcha: Processing PDFs or HTML documents with `DocumentFile.from_pdf` or `DocumentFile.from_url` (via the `html` extra) relies on `weasyprint`, which itself has system-level dependencies (e.g., `libglib2.0-0`, `libpango-1.0-0` on Linux) that `pip` does not install automatically.
- gotcha: GPU acceleration requires manually installing `torch` and `torchvision` with CUDA support; `pip install python-doctr` does not handle this automatically, to keep the base package lightweight.
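Following the GPU gotcha above, it can be useful to probe for a CUDA-enabled PyTorch build before constructing a predictor. A minimal sketch (the helper name `cuda_available` is our own; it degrades gracefully when `torch` is not installed at all):

```python
import importlib.util

def cuda_available() -> bool:
    """Return True only if torch is installed AND its CUDA runtime is usable."""
    if importlib.util.find_spec("torch") is None:
        return False  # torch not installed at all
    import torch
    return torch.cuda.is_available()

print("CUDA available:", cuda_available())
```

If this prints `False`, reinstall PyTorch from a CUDA index URL (see the Install section) before expecting GPU inference.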
Install
- `pip install python-doctr`
- `pip install "python-doctr[viz,html,contrib]"`
- With GPU support (install a CUDA-enabled PyTorch first):
  - `pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118`
  - `pip install python-doctr`
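After installing, a quick check confirms the package is importable and reports its version. This sketch uses only the standard library and returns `None` instead of raising when `python-doctr` is absent (the helper name `doctr_version` is our own):

```python
import importlib.metadata
import importlib.util

def doctr_version():
    """Return the installed python-doctr version string, or None if not installed."""
    if importlib.util.find_spec("doctr") is None:
        return None
    return importlib.metadata.version("python-doctr")

print("python-doctr version:", doctr_version())
```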
Imports
- DocumentFile
from doctr.io import DocumentFile
- ocr_predictor
from doctr.models import ocr_predictor
- from_hub
from doctr.models import from_hub
Quickstart
```python
import os

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# For demonstration, create a dummy image file if one doesn't exist.
# In a real scenario, you'd point at an actual image or PDF path.
dummy_image_path = "sample.png"
if not os.path.exists(dummy_image_path):
    try:
        from PIL import Image, ImageDraw, ImageFont

        # Create a simple white image with some text
        img = Image.new("RGB", (200, 100), color=(255, 255, 255))
        d = ImageDraw.Draw(img)
        try:
            # Try a common font, fall back to the default bitmap font
            font = ImageFont.truetype("arial.ttf", 20)
        except IOError:
            font = ImageFont.load_default()
        d.text((10, 10), "Hello docTR!", fill=(0, 0, 0), font=font)
        img.save(dummy_image_path)
        print(f"Created dummy image: {dummy_image_path}")
    except ImportError:
        print("Pillow not installed, cannot create dummy image. Please provide a real image file.")
        dummy_image_path = None

if dummy_image_path and os.path.exists(dummy_image_path):
    # Load your document (image or PDF)
    # For a PDF: doc = DocumentFile.from_pdf("path/to/your/document.pdf")
    # For multiple images: doc = DocumentFile.from_images(["path/to/img1.jpg", "path/to/img2.png"])
    doc = DocumentFile.from_images(dummy_image_path)

    # Load a pre-trained OCR model
    # Since v1.0.0, PyTorch is the default and only backend.
    model = ocr_predictor(pretrained=True)

    # Analyze the document
    result = model(doc)

    # The result object contains detailed information about words, lines, blocks, and pages.
    print("\n--- OCR Result ---")
    for page in result.pages:
        for block in page.blocks:
            for line in block.lines:
                print(" ".join(word.value for word in line.words))

    # You can also export the full structured output as a dict:
    # print(result.export())
else:
    print("Quickstart skipped due to missing image.")
```
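The `result.export()` call mentioned above returns a plain nested dict. As a sketch of post-processing that structure, the sample dict below is hand-written to mirror the pages/blocks/lines/words nesting (field names like `value` match the word-level output; the `extract_text` helper is our own):

```python
# Hand-crafted sample mimicking the shape of result.export() — not real docTR output
sample_export = {
    "pages": [
        {
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {"value": "Hello", "confidence": 0.99},
                                {"value": "docTR!", "confidence": 0.98},
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

def extract_text(export: dict) -> str:
    """Flatten an export-style dict into plain text, one line per recognized line."""
    lines = []
    for page in export["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                lines.append(" ".join(w["value"] for w in line["words"]))
    return "\n".join(lines)

print(extract_text(sample_export))  # Hello docTR!
```

Working on the exported dict rather than the result object makes it easy to serialize OCR output to JSON and process it in a separate pipeline stage.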