Tesseract.js - Pure JavaScript OCR

7.0.0 · active · verified Sun Apr 19

Tesseract.js is a JavaScript library that provides Optical Character Recognition (OCR) in both browser and Node.js environments. It wraps a WebAssembly port of the Tesseract OCR engine and can extract text from images in more than 100 languages. The current stable version is `7.0.0`, which improves recognition speed by 15-35% through a new `relaxedsimd` WebAssembly build. Major versions often bring performance enhancements, while minor versions address bugs or add small features. A key differentiator is that it runs Tesseract entirely as JavaScript and WebAssembly, without native system dependencies. The project states explicitly that it does not support PDF files directly and does not modify the core Tesseract recognition model to improve accuracy.

Common errors

```
TypeError: fetch is not a function
```
cause Attempting to run Tesseract.js v7 on Node.js v14 or older environments that lack a native `fetch` implementation.

fix Upgrade your Node.js environment to v16 or newer. If you must use Node.js v14, you would need to use an older Tesseract.js version (e.g., v5) or provide a global `fetch` polyfill.
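
One way to surface this problem early is to check the Node.js version at startup instead of waiting for a failure inside the library. The sketch below is illustrative only: `isSupportedNode` and `assertSupportedNode` are hypothetical helper names, and the minimum of 16 is taken from the fix described above.

```javascript
// Minimum Node.js major version for Tesseract.js v7, per the fix above.
const MIN_NODE_MAJOR = 16;

// Returns true when a version string such as "18.19.0" (or "v18.19.0")
// meets the minimum major version.
function isSupportedNode(version, minMajor = MIN_NODE_MAJOR) {
  const major = Number.parseInt(String(version).replace(/^v/, '').split('.')[0], 10);
  return Number.isInteger(major) && major >= minMajor;
}

// Throws a readable error up front instead of failing later inside the library.
function assertSupportedNode(version = process.versions.node) {
  if (!isSupportedNode(version)) {
    throw new Error(
      `Node.js ${version} is too old for Tesseract.js v7; need v${MIN_NODE_MAJOR}+`
    );
  }
}
```

Calling `assertSupportedNode()` with no arguments before creating any worker checks the running process.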
```
Error: 'eng' language data not found
```
cause The specified language data for the Tesseract worker could not be loaded, possibly due to an incorrect language code, network issues preventing download from the CDN, or an incorrect `langPath` configuration for local files.

fix Verify the language code is correct (e.g., 'eng' for English). Ensure there is network access to the Tesseract.js CDN. If loading local data, confirm the `langPath` option is correctly set when calling `createWorker` and that the language files exist at that location.
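
As a rough illustration of the `langPath` fix, the helper below passes a caller-supplied data location to `createWorker`. It is a sketch under assumptions: `createWorkerFromLocalData` is a hypothetical name, the `createWorker` function is injected so the snippet does not depend on a particular import style, and any path you pass (such as `./lang-data`) is only an example.

```javascript
// Creates a worker that reads language data from a caller-supplied location.
// `createWorker` is the function imported from 'tesseract.js'.
async function createWorkerFromLocalData(createWorker, lang, langPath) {
  if (typeof lang !== 'string' || lang.length === 0) {
    throw new TypeError('lang must be a non-empty language code such as "eng"');
  }
  // The second argument (1) mirrors the quickstart's engine mode argument.
  return createWorker(lang, 1, { langPath });
}
```

A call might look like `await createWorkerFromLocalData(createWorker, 'eng', './lang-data')`, where the directory is assumed to contain the data file for the requested language.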
```
(node:...) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 recognize listeners added to [Worker]. Use emitter.setMaxListeners() to increase limit
```
cause This warning typically occurs when multiple `worker.recognize` calls are initiated on the same worker without awaiting previous calls, leading to too many event listeners being attached.

fix Use `await` for each `worker.recognize` call or, for parallel image processing, utilize `Tesseract.createScheduler()` to manage jobs across multiple workers efficiently.
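
The scheduler approach mentioned in the fix could look roughly like the following. This is a minimal sketch, not the library's prescribed pattern: `recognizeAll` is a hypothetical helper, the `tesseract` parameter stands for the imported module (assumed to expose `createWorker` and `createScheduler`), and the default of two workers is arbitrary.

```javascript
// Recognizes many images with a small pool of workers.
// `tesseract` is the imported module, e.g. `import * as tesseract from 'tesseract.js'`.
async function recognizeAll(tesseract, images, { lang = 'eng', workerCount = 2 } = {}) {
  const scheduler = tesseract.createScheduler();
  try {
    // Add one worker per slot; the scheduler hands each job to an idle worker.
    for (let i = 0; i < workerCount; i += 1) {
      scheduler.addWorker(await tesseract.createWorker(lang));
    }
    // Queue every image; no single worker receives overlapping recognize calls.
    const results = await Promise.all(
      images.map((image) => scheduler.addJob('recognize', image))
    );
    return results.map((result) => result.data.text);
  } finally {
    // Terminates the scheduler together with the workers added to it.
    await scheduler.terminate();
  }
}
```

Because jobs go through the scheduler's queue, each worker handles one image at a time, which avoids stacking listeners on a single worker.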
```
ReferenceError: require is not defined in ES module scope, you can use import instead
```
cause You are calling CommonJS `require()` in a file that Node.js treats as an ES module (for example, a `.mjs` file or a project with `"type": "module"` in `package.json`), where `require` is not defined. Native browser modules have no `require` either.

fix Switch to ES Module `import` syntax: `import { createWorker } from 'tesseract.js';`. Ensure your environment supports ES Modules.

Warnings

breaking Tesseract.js v7.0.0 dropped support for Node.js v14.
Fix: Upgrade your Node.js environment to v16 or newer to use Tesseract.js v7.
breaking Starting with v6.0.0, all Tesseract output formats other than `text` are disabled by default to reduce runtime and memory usage.
Fix: To enable other output formats (e.g., `blocks`, `hocr`), request them explicitly through the output argument of `worker.recognize`, for example `worker.recognize(image, {}, { blocks: true })`. Code that only reads `text` needs no changes.
gotcha Running multiple `worker.recognize` calls concurrently on the same worker is not recommended and can lead to unexpected behavior or resource exhaustion, even though a bug related to this was fixed in v5.0.5.
Fix: For parallel processing of multiple images, use `createScheduler` to manage jobs across multiple workers, or ensure each `worker.recognize` call completes before initiating another on the same worker.
gotcha Tesseract.js does not provide direct support for PDF files; it operates on images. Additionally, the project focuses on bringing the Tesseract engine to JavaScript and does not modify the core Tesseract recognition model to improve accuracy.
Fix: For PDF processing, pre-convert PDF pages into image formats (e.g., PNG, JPEG) before passing them to Tesseract.js. For advanced model tuning or features beyond core Tesseract, consider other OCR solutions or pre-process images for better Tesseract results.
gotcha Tesseract.js v7 introduces a new `relaxedsimd` build that significantly improves recognition speed (15-35%) by leveraging the latest WebAssembly and hardware capabilities, especially on newer Intel processors.
Fix: Upgrade to Tesseract.js v7 to benefit from the performance enhancements and optimize OCR processing times.
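
To illustrate the output-format warning above, the sketch below requests the `blocks` output for a single call and flattens it into a list of words. It rests on assumptions that should be checked against the installed version: that extra outputs are selected through the third argument of `worker.recognize`, and that `blocks` nests as blocks, paragraphs, lines, then words. `recognizeWords` is a hypothetical helper name.

```javascript
// Requests the `blocks` output in addition to `text` and flattens it to words.
async function recognizeWords(worker, image) {
  // Third argument selects output formats; `text` is assumed to stay enabled.
  const { data } = await worker.recognize(image, {}, { blocks: true });
  const words = [];
  // Assumed nesting: blocks -> paragraphs -> lines -> words.
  // Missing levels are treated as empty so the helper never throws on them.
  for (const block of data.blocks ?? []) {
    for (const paragraph of block.paragraphs ?? []) {
      for (const line of paragraph.lines ?? []) {
        for (const word of line.words ?? []) {
          words.push(word.text);
        }
      }
    }
  }
  return { text: data.text, words };
}
```

If `blocks` was not produced, the helper returns an empty `words` array alongside the plain text.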

Install

npm install tesseract.js
yarn add tesseract.js
pnpm add tesseract.js

Imports

createWorker
wrong:
```
const { createWorker } = require('tesseract.js');
```
correct:
```
import { createWorker } from 'tesseract.js';
```
Use the named import in ES module contexts (Node.js v16+, modern browsers, and bundlers). Calling `require()` from a file that Node.js treats as an ES module throws a `ReferenceError`.
Tesseract (global)
```
Tesseract.createWorker('eng');
```
When Tesseract.js is included via a `<script>` tag from a CDN, the `Tesseract` object becomes globally available.
RecognizeResult (type)
```
import type { RecognizeResult } from 'tesseract.js';
```
For TypeScript users, import types with `import type` so the import is erased at compile time and never emitted as a runtime import.

Quickstart

This quickstart demonstrates how to create a Tesseract.js worker, load the English language model, recognize text from a remote image, log the output, and properly terminate the worker.

import { createWorker } from 'tesseract.js';

(async () => {
  const worker = await createWorker('eng', 1, {
    logger: m => console.log(m) // Optional: Log progress to console
  });

  // Example image URL
  const imageUrl = 'https://tesseract.projectnaptha.com/img/eng_bw.png';

  console.log('Recognizing text from:', imageUrl);
  const ret = await worker.recognize(imageUrl);

  console.log('Detected text:\n', ret.data.text);

  // Other output formats are disabled by default since v6; request them via the
  // third argument, e.g. worker.recognize(imageUrl, {}, { blocks: true })

  await worker.terminate();
  console.log('Worker terminated.');
})();
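
The quickstart terminates the worker only when recognition succeeds; if `recognize` throws, `terminate` is never reached. A possible variation, shown here as a sketch with a hypothetical `withWorker` helper and an injected `createWorker`, wraps the work in `try`/`finally` so the worker is always released.

```javascript
// Runs `task` with a fresh worker and always terminates the worker afterwards.
// `createWorker` is the function imported from 'tesseract.js'.
async function withWorker(createWorker, lang, task) {
  const worker = await createWorker(lang);
  try {
    return await task(worker);
  } finally {
    // Runs on success and on error, so the worker is released either way.
    await worker.terminate();
  }
}
```

For example, `await withWorker(createWorker, 'eng', async (w) => (await w.recognize(imageUrl)).data.text)` returns the recognized text and cleans up even when recognition fails.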
