Tesseract.js Core: WebAssembly OCR Engine

6.1.2 · active · verified Sun Apr 19

tesseract.js-core is the foundational WebAssembly (WASM) module that powers the higher-level tesseract.js OCR library. It compiles the original Tesseract C++ engine to JavaScript and WASM using Emscripten, enabling Optical Character Recognition directly in browser and Node.js environments. The current stable version is `7.0.0` as of December 2025. This package provides the low-level API for interacting with the Tesseract engine, offering optimized builds like 'Relaxed SIMD' for performance on supported hardware and 'LSTM-only' builds for reduced size when only the modern LSTM OCR engine is needed. It typically releases new major versions to incorporate updates from the upstream Tesseract C++ project and Emscripten. Key differentiators include its pure JavaScript/WASM implementation, enabling client-side OCR, and direct access to Tesseract's core functionality, which is crucial for custom integrations or highly performance-sensitive applications that bypass the abstractions of `tesseract.js`.

Common errors

Warnings

Install

Imports

Quickstart

Demonstrates how to load the `tesseract.js-core` WebAssembly module, initialize the Tesseract API, set an image buffer, perform OCR, and retrieve the recognized text using low-level methods. This example simulates a Node.js environment.

import { readFileSync } from 'fs';
import TesseractCoreInit from 'tesseract.js-core/tesseract-core';

const runOcr = async (imagePath) => {
  console.log('Loading TesseractCore module...');
  // The factory function returns a Promise that resolves to the Emscripten module
  const TesseractCore = await TesseractCoreInit();
  console.log('TesseractCore module loaded. Initializing...');

  // This is a minimal example showing low-level API interaction.
  // In a real application, consider using tesseract.js for a higher-level API.
  const core = new TesseractCore.TesseractApi();
  core.Init(process.env.LANG_PATH || '/usr/share/tessdata', 'eng'); // Initialize with language data

  const imageBuffer = readFileSync(imagePath);
  // Assuming imageBuffer is a PNG or JPG and Tesseract can handle its format directly
  // In a browser, you might pass a Canvas/ImageData object
  core.SetImage(imageBuffer, imageBuffer.width, imageBuffer.height, 4, imageBuffer.width * 4);

  core.Recognize();
  const text = core.GetUTF8Text();
  console.log('Recognized Text:', text);

  core.End();
  TesseractCore.destroy(core); // Clean up Emscripten instance
};

// Example usage: Assumes 'image.png' exists in the same directory
// For a real scenario, replace 'image.png' with your actual image path.
// process.env.LANG_PATH should point to directory containing .traineddata files.
runOcr('image.png').catch(console.error);

view raw JSON →