Pytesseract
Pytesseract is a Python wrapper for Google's Tesseract-OCR Engine, providing an optical character recognition (OCR) tool for Python. It enables users to recognize and extract text embedded in images, supporting various image types through the Pillow library. The project is actively maintained with regular updates to support newer Python versions and improve functionality. Its current version is 0.3.13.
Warnings
- breaking Support for Python 2 and Python 3.5 was dropped in `v0.3.7`, and support for Python 3.6 was dropped in `v0.3.9` after it reached End of Life. Users on older Python versions must upgrade to Python 3.7 or later (preferably 3.8 or later).
- gotcha Pytesseract is a wrapper; the Tesseract-OCR engine must be installed separately on your operating system (e.g., via apt, brew, or Windows installer). Failing to install the Tesseract engine is the most common reason for errors like `TesseractNotFoundError`.
- gotcha OCR accuracy is highly dependent on image quality, resolution, contrast, and text style. Pytesseract may struggle with low-quality or noisy images, complex layouts, and handwritten text, often returning gibberish or incorrect results.
- gotcha Explicitly setting the `tesseract_cmd` path can be necessary, especially on Windows or if the Tesseract executable is not automatically found in your system's PATH environment variable. Forgetting this can lead to `TesseractNotFoundError`.
- deprecated The caching of `get_tesseract_version` was made optional and disabled by default in `v0.3.11`. If you relied on this caching behavior, you might notice a performance difference or need to re-enable it manually.
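The two `TesseractNotFoundError` warnings above can be handled up front by locating the executable before any OCR call. The following is a minimal sketch, not part of pytesseract itself: the helper names `find_tesseract` and `configure_pytesseract` are hypothetical, and the fallback paths are assumptions about common install locations that you should adjust for your system.

```python
import shutil
from pathlib import Path

# Assumed install locations (hypothetical defaults; adjust for your machine).
FALLBACK_PATHS = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/usr/bin/tesseract",
    "/opt/homebrew/bin/tesseract",
)


def find_tesseract(fallbacks=FALLBACK_PATHS):
    """Return a path to the tesseract executable, or None if none is found."""
    on_path = shutil.which("tesseract")  # honours the PATH environment variable
    if on_path:
        return on_path
    for candidate in fallbacks:
        if Path(candidate).is_file():
            return candidate
    return None


def configure_pytesseract():
    """Point pytesseract at the located executable and return the path used."""
    import pytesseract  # imported lazily so find_tesseract() works without it

    path = find_tesseract()
    if path is None:
        raise RuntimeError("Tesseract executable not found; install the engine first.")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling `configure_pytesseract()` once at startup turns a missing engine into an early, explicit error rather than a `TesseractNotFoundError` raised later from inside an OCR call.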
Install
- pip install pytesseract Pillow
- sudo apt install tesseract-ocr  # For specific languages: sudo apt install tesseract-ocr-eng
- brew install tesseract
- Follow installer at https://github.com/UB-Mannheim/tesseract/wiki
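Because language data is installed separately from the engine, it can help to check which languages Tesseract can see before requesting them. The sketch below assumes `pytesseract.get_languages` and the `lang` argument of `image_to_string` behave as described in the pytesseract README. The helper names `missing_languages` and `ocr_with_languages` are hypothetical.

```python
def missing_languages(wanted, installed):
    """Return the language codes in `wanted` that are absent from `installed`."""
    installed_set = set(installed)
    return [code for code in wanted if code not in installed_set]


def ocr_with_languages(image, wanted=("eng",)):
    """Run OCR with the given languages after checking they are installed."""
    import pytesseract  # lazy import: the helper above needs no OCR engine

    installed = pytesseract.get_languages(config="")
    missing = missing_languages(wanted, installed)
    if missing:
        raise RuntimeError(f"Missing Tesseract language data: {', '.join(missing)}")
    # Tesseract accepts several languages joined with '+', e.g. "eng+deu".
    return pytesseract.image_to_string(image, lang="+".join(wanted))
```

For example, `ocr_with_languages(img, wanted=("eng", "deu"))` would raise a clear error if only the English data package had been installed.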
Imports
- pytesseract
import pytesseract
- Image
from PIL import Image
Quickstart
from PIL import Image, ImageDraw, ImageFont
import pytesseract
# NOTE: Ensure Tesseract-OCR is installed on your system and its executable is in your PATH.
# If not, you might need to specify the path to tesseract.exe:
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Create a dummy image for demonstration
img_width, img_height = 400, 100
img = Image.new('RGB', (img_width, img_height), color='white')
draw = ImageDraw.Draw(img)
try:
    # Try to use a common system font
    font = ImageFont.truetype("arial.ttf", 24)
except IOError:
    # Fallback if Arial is not found (e.g., on some Linux systems without it)
    font = ImageFont.load_default()
text = "Hello, Pytesseract OCR!"
draw.text((50, 30), text, fill='black', font=font)
# Perform OCR on the image
extracted_text = pytesseract.image_to_string(img)
print(f"Extracted Text: {extracted_text.strip()}")
# Example of getting Tesseract version
tesseract_version = pytesseract.get_tesseract_version()
print(f"Tesseract Version: {tesseract_version}")
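Since accuracy depends heavily on image quality, a common mitigation is to clean the image before passing it to Tesseract. The following is a rough sketch of one such pipeline using Pillow only. The function names, the default `scale` and `threshold` values, and the choice of page segmentation mode 6 are illustrative assumptions, not recommendations from the pytesseract project, and real images usually need tuning.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image, scale=2, threshold=160):
    """Return a grayscale, contrast-stretched, upscaled, binarized copy of `image`."""
    gray = image.convert("L")              # drop color information
    stretched = ImageOps.autocontrast(gray)  # spread the histogram over 0..255
    width, height = stretched.size
    enlarged = stretched.resize((width * scale, height * scale), Image.LANCZOS)
    # Binarize: pixels brighter than `threshold` become white, the rest black.
    return enlarged.point(lambda p: 255 if p > threshold else 0)


def ocr_preprocessed(image, psm=6):
    """Run OCR on the cleaned image (requires pytesseract and the Tesseract engine)."""
    import pytesseract  # lazy import so preprocess_for_ocr() needs only Pillow

    cleaned = preprocess_for_ocr(image)
    # --psm 6 asks Tesseract to assume a single uniform block of text.
    return pytesseract.image_to_string(cleaned, config=f"--psm {psm}")
```

With the Quickstart image, `ocr_preprocessed(img)` could be compared against the plain `pytesseract.image_to_string(img)` result to see whether preprocessing helps for a given input.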