Tesseract.js - Pure JavaScript OCR
Tesseract.js is a JavaScript library that provides Optical Character Recognition (OCR) in both browser and Node.js environments. It wraps a WebAssembly port of the Tesseract OCR engine and can extract text from images in almost any written language. The current stable version is `7.0.0`, which improves recognition speed by 15-35% through a build that uses newer WebAssembly and hardware capabilities. Releases are fairly regular: major versions often bring performance improvements, while minor versions fix bugs or add small features. A key differentiator is that it runs Tesseract entirely in JavaScript and WebAssembly, with no native system dependencies. The project documentation states that Tesseract.js does not support PDF files directly and does not modify the core Tesseract recognition model to improve accuracy.
Common errors
- TypeError: fetch is not a function
  cause: Running Tesseract.js v7 on Node.js v14 or older, which lacks a native `fetch` implementation.
  fix: Upgrade Node.js to v16 or newer. If you must stay on Node.js v14, use an older Tesseract.js version (e.g., v5) or provide a global `fetch` polyfill. Note that Node.js exposes a global `fetch` by default only from v18, so a polyfill may still be needed on v16.
- Error: 'eng' language data not found
  cause: The language data for the Tesseract worker could not be loaded. Typical reasons are an incorrect language code, network issues that block the download from the CDN, or an incorrect `langPath` setting for local files.
  fix: Verify the language code (e.g., 'eng' for English). Ensure the Tesseract.js CDN is reachable. If loading local data, confirm that the `langPath` option passed to `createWorker` is correct and that the language files exist at that location.
- (node:...) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 recognize listeners added to [Worker]. Use emitter.setMaxListeners() to increase limit
  cause: Multiple `worker.recognize` calls were started on the same worker without awaiting the previous ones, so too many event listeners were attached.
  fix: `await` each `worker.recognize` call, or, for parallel image processing, use `Tesseract.createScheduler()` to distribute jobs across multiple workers.
- ReferenceError: require is not defined in ES module scope
  cause: CommonJS `require()` is being called in a file that is treated as an ES module (e.g., Node.js with `"type": "module"` in `package.json`, or a module script in a modern browser), where `require` is not defined.
  fix: Switch to ES Module `import` syntax: `import { createWorker } from 'tesseract.js';`. Ensure your environment supports ES Modules.
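The scheduler-based fix for the `MaxListenersExceededWarning` can be sketched roughly as follows. This is a minimal illustration, not the project's reference code: the function name `recognizeInParallel` and the choice to pass the imported module in as the `tesseract` parameter are assumptions made here so the sketch stays self-contained, and the file names in the usage comment are hypothetical.

```javascript
// Sketch: distribute recognition jobs across several workers with a scheduler.
// `tesseract` is the imported tesseract.js module, passed in as a parameter
// (an illustrative choice, not something the library requires).
async function recognizeInParallel(tesseract, images, workerCount = 2) {
  const scheduler = tesseract.createScheduler();
  try {
    // Create the workers one after another and register each with the scheduler.
    for (let i = 0; i < workerCount; i += 1) {
      const worker = await tesseract.createWorker('eng');
      scheduler.addWorker(worker);
    }
    // Each job is queued by the scheduler and handed to an idle worker.
    const results = await Promise.all(
      images.map((image) => scheduler.addJob('recognize', image))
    );
    return results.map((result) => result.data.text);
  } finally {
    // Terminating the scheduler also terminates the workers added to it.
    await scheduler.terminate();
  }
}

// Usage (hypothetical file names):
//   const Tesseract = require('tesseract.js');
//   recognizeInParallel(Tesseract, ['page1.png', 'page2.png']).then(console.log);
```

Because jobs go through the scheduler rather than directly to a worker, no single worker receives overlapping `recognize` calls.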
Warnings
- breaking Tesseract.js v7.0.0 dropped support for Node.js v14.
- breaking Starting with v6.0.0, all Tesseract output formats other than `text` are disabled by default to reduce runtime and memory usage.
- gotcha Running multiple `worker.recognize` calls concurrently on the same worker is not recommended and can lead to unexpected behavior or resource exhaustion, even though a bug related to this was fixed in v5.0.5.
- gotcha Tesseract.js does not provide direct support for PDF files; it operates on images. Additionally, the project focuses on bringing the Tesseract engine to JavaScript and does not modify the core Tesseract recognition model to improve accuracy.
- gotcha Tesseract.js v7 introduces a new `relaxedsimd` build that significantly improves recognition speed (15-35%) by leveraging the latest WebAssembly and hardware capabilities, especially on newer Intel processors.
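As a rough illustration of the v6 output-format change listed above, the sketch below requests the `blocks` output through the third argument of `worker.recognize` and flattens it into a list of words. The helper names `collectWords` and `recognizeWords` are hypothetical, and the nesting assumed here (block, then paragraphs, lines, and words, each word carrying a `text` field) should be checked against the output of the version you are running.

```javascript
// Sketch: flatten the `blocks` output into a list of word strings.
// Assumption: blocks nest as block -> paragraphs -> lines -> words,
// and each word object has a `text` field.
function collectWords(blocks) {
  return (blocks || []).flatMap((block) =>
    (block.paragraphs || []).flatMap((paragraph) =>
      (paragraph.lines || []).flatMap((line) =>
        (line.words || []).map((word) => word.text)
      )
    )
  );
}

// `worker` is a Tesseract.js worker created elsewhere with createWorker.
async function recognizeWords(worker, image) {
  // Second argument: recognition options (none here).
  // Third argument: output formats to enable in addition to the default text.
  const { data } = await worker.recognize(image, {}, { blocks: true });
  return { text: data.text, words: collectWords(data.blocks) };
}
```

Enabling only the formats you need keeps the runtime and memory savings that motivated the change.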
Install
- npm install tesseract.js
- yarn add tesseract.js
- pnpm add tesseract.js
Imports
- createWorker
  CommonJS: `const { createWorker } = require('tesseract.js');`
  ES Modules: `import { createWorker } from 'tesseract.js';`
- Tesseract (global)
  `Tesseract.createWorker('eng');`
- RecognizeResult (type)
  `import type { RecognizeResult } from 'tesseract.js';`
Quickstart
import { createWorker } from 'tesseract.js';

(async () => {
  // Create a worker for English; the second argument (1) selects the OCR engine mode.
  const worker = await createWorker('eng', 1, {
    logger: m => console.log(m) // Optional: log progress to the console
  });

  // Example image URL
  const imageUrl = 'https://tesseract.projectnaptha.com/img/eng_bw.png';
  console.log('Recognizing text from:', imageUrl);

  const ret = await worker.recognize(imageUrl);
  console.log('Detected text:\n', ret.data.text);

  // Output formats other than text are disabled by default since v6. To use them,
  // request them in the third argument: worker.recognize(imageUrl, {}, { blocks: true })

  // Release the worker's resources once recognition is finished.
  await worker.terminate();
  console.log('Worker terminated.');
})();
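The quickstart handles a single image. When one worker is reused for several images, a variant along these lines awaits each call before starting the next and terminates the worker even if a call fails. The helper name `recognizeSequentially` is hypothetical, and `worker` is assumed to come from `createWorker` as in the quickstart.

```javascript
// Sketch: reuse one worker for several images, one call at a time.
async function recognizeSequentially(worker, images) {
  const texts = [];
  try {
    for (const image of images) {
      // Awaiting here ensures only one recognize call is in flight per worker.
      const { data } = await worker.recognize(image);
      texts.push(data.text);
    }
    return texts;
  } finally {
    // Runs on success and on error, so the worker is always released.
    await worker.terminate();
  }
}

// Usage (hypothetical file names):
//   const worker = await createWorker('eng');
//   const texts = await recognizeSequentially(worker, ['a.png', 'b.png']);
```

For larger batches where throughput matters, a scheduler with several workers is usually the better fit than a single sequential worker.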