Tesseract.js - Pure JavaScript OCR

7.0.0 · active · verified Sun Apr 19

Tesseract.js is a JavaScript library that provides Optical Character Recognition (OCR) in both browser and Node.js environments. It wraps a WebAssembly port of the Tesseract OCR engine and can extract text from images in more than 100 languages. The current stable version is `7.0.0`, which recognizes text 15-35% faster thanks to WebAssembly optimizations and better use of hardware capabilities. Releases follow a fairly regular cadence: major versions tend to bring performance improvements, while minor versions fix bugs or add small features. Its key differentiator is that it runs Tesseract entirely in JavaScript and WebAssembly, with no native system dependencies. The project explicitly notes two limits: it does not support PDF files directly, and it does not modify the core Tesseract recognition model to improve accuracy.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to create a Tesseract.js worker, load the English language model, recognize text from a remote image, log the output, and properly terminate the worker.

import { createWorker } from 'tesseract.js';

(async () => {
  // Arguments: language ('eng'), OCR engine mode (1 = LSTM only), worker options
  const worker = await createWorker('eng', 1, {
    logger: m => console.log(m) // Optional: log progress messages to the console
  });

  // Example image URL
  const imageUrl = 'https://tesseract.projectnaptha.com/img/eng_bw.png';

  console.log('Recognizing text from:', imageUrl);
  const ret = await worker.recognize(imageUrl);

  console.log('Detected text:\n', ret.data.text);

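  // Since v6, output formats other than text are disabled by default. To get
  // structured results, request them per call, for example:
  // const detailed = await worker.recognize(imageUrl, {}, { blocks: true });
  // console.log('Blocks:', detailed.data.blocks);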

  await worker.terminate();
  console.log('Worker terminated.');
})();
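
The quickstart terminates the worker only when recognition succeeds. The sketch below shows one possible way to guarantee cleanup and to reuse a single worker across several images. The helper names `withWorker` and `recognizeAll` are hypothetical and not part of Tesseract.js. The worker factory is passed in as a parameter, on the assumption that it behaves like `createWorker` (it resolves to an object with `recognize` and `terminate` methods).

```javascript
// Hypothetical helper (not part of Tesseract.js): runs `task` with a fresh
// worker and always terminates that worker, even when `task` throws.
// `createWorkerFn` is assumed to behave like Tesseract.js's `createWorker`.
async function withWorker(createWorkerFn, lang, task) {
  const worker = await createWorkerFn(lang);
  try {
    return await task(worker);
  } finally {
    // Runs on success and on failure, so the worker is never leaked.
    await worker.terminate();
  }
}

// Hypothetical helper: recognizes several images one after another with a
// single worker and returns the extracted text of each image, in order.
async function recognizeAll(createWorkerFn, lang, images) {
  return withWorker(createWorkerFn, lang, async (worker) => {
    const texts = [];
    for (const image of images) {
      const { data } = await worker.recognize(image);
      texts.push(data.text);
    }
    return texts;
  });
}
```

With the real library, this could be called as `recognizeAll(createWorker, 'eng', [imageUrl])`, where `createWorker` is the function imported from 'tesseract.js' in the quickstart. Passing the factory in as an argument also makes the helpers easy to exercise with a stub worker.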
