Tesseract Document Loader

The Tesseract loader uses the Tesseract OCR engine to extract text from images. It supports multiple languages and exposes Tesseract's page segmentation mode, OCR engine mode, and other engine parameters for tuning recognition.

Supported Formats

  • jpeg/jpg
  • png
  • tiff
  • bmp
  • gif

Usage

Basic Usage

from extract_thinker import DocumentLoaderTesseract

# Initialize with default settings
loader = DocumentLoaderTesseract()

# Load document
pages = loader.load("path/to/your/image.png")

# Process extracted content
for page in pages:
    # Each page is a dict; the OCR text is stored under "content"
    text = page["content"]
    print(text)
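When the text of every page is needed as a single string, the per-page loop above can be wrapped in a small helper. This is a sketch that assumes only what the example shows: each page is a dict with a "content" key. The helper name join_page_text and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def join_page_text(pages: List[Dict[str, Any]], separator: str = "\n\n") -> str:
    """Concatenate the non-empty "content" of each page into one string."""
    parts = []
    for page in pages:
        text = (page.get("content") or "").strip()
        if text:
            parts.append(text)
    return separator.join(parts)


# Stand-in data shaped like the loader output, used here instead of a real OCR run.
sample_pages = [{"content": "Invoice 42\n"}, {"content": ""}, {"content": "Total: 10.00"}]
print(join_page_text(sample_pages))
```

Pages with empty or missing content are skipped, so a blank scan does not leave stray separators in the result.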

Configuration-based Usage

from extract_thinker import DocumentLoaderTesseract, TesseractConfig

# Create configuration
config = TesseractConfig(
    lang="eng+fra",                # Use English and French
    psm=6,                         # Assume uniform block of text
    oem=3,                         # Default engine mode (based on what is available)
    config_params={                # Additional Tesseract parameters
        "tessedit_char_whitelist": "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    },
    timeout=30,                    # OCR timeout in seconds
    cache_ttl=600                  # Cache results for 10 minutes
)

# Initialize loader with configuration
loader = DocumentLoaderTesseract(config)

# Load and process document
pages = loader.load("path/to/your/image.png")
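For orientation, the psm, oem, and config_params options correspond to the Tesseract command-line flags --psm, --oem, and -c name=value. The sketch below renders a set of options in that form. Whether DocumentLoaderTesseract builds exactly this string internally is an assumption, and build_tesseract_flags is a hypothetical helper, not a library function.

```python
from typing import Dict, Optional


def build_tesseract_flags(
    psm: int = 3,
    oem: int = 3,
    config_params: Optional[Dict[str, str]] = None,
) -> str:
    """Render OCR options as Tesseract command-line flags."""
    flags = [f"--psm {psm}", f"--oem {oem}"]
    for key, value in (config_params or {}).items():
        # Tesseract sets individual config variables with "-c name=value".
        flags.append(f"-c {key}={value}")
    return " ".join(flags)


print(build_tesseract_flags(psm=6, oem=3, config_params={"tessedit_char_whitelist": "0123456789"}))
# --psm 6 --oem 3 -c tessedit_char_whitelist=0123456789
```

Seeing the flags written out can help when reproducing a result with the tesseract command directly while debugging a configuration.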

Configuration Options

The TesseractConfig class supports the following options:

| Option        | Type | Default | Description                     |
|---------------|------|---------|---------------------------------|
| content       | Any  | None    | Initial content to process      |
| cache_ttl     | int  | 300     | Cache time-to-live in seconds   |
| lang          | str  | "eng"   | Language(s) for OCR             |
| psm           | int  | 3       | Page segmentation mode          |
| oem           | int  | 3       | OCR Engine Mode                 |
| config_params | Dict | None    | Additional Tesseract parameters |
| timeout       | int  | 0       | OCR timeout in seconds          |
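Tesseract itself accepts page segmentation modes 0 through 13 and engine modes 0 through 3, and combines languages with a plus sign, as in "eng+fra". A caller can check values against those ranges before building a configuration. The function below is a sketch of such a check. It is hypothetical, and whether TesseractConfig performs equivalent validation is not stated here.

```python
def validate_ocr_options(lang: str = "eng", psm: int = 3, oem: int = 3, timeout: int = 0) -> None:
    """Raise ValueError if an option is outside the range Tesseract accepts."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    # Languages are joined with "+", so an empty segment means a stray "+".
    if not lang or any(not code for code in lang.split("+")):
        raise ValueError(f"lang must be one or more codes joined by '+', got {lang!r}")
    if timeout < 0:
        raise ValueError(f"timeout must not be negative, got {timeout}")


# The values from the configuration example pass the check silently.
validate_ocr_options(lang="eng+fra", psm=6, oem=3, timeout=30)
```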

Features

  • Text extraction from images
  • Multi-language support
  • Configurable page segmentation
  • Multiple OCR engine modes
  • Custom Tesseract parameters
  • Timeout control
  • Caching support
  • No cloud service required
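To illustrate what a cache time-to-live means in practice, here is a minimal in-memory cache whose entries expire after a fixed number of seconds. It is an illustration of the concept only and not the loader's actual caching code. The class name TTLCache and the injectable clock are assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float = 300, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        """Store a value together with the time it was stored."""
        self._entries[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl:
            del self._entries[key]  # drop the stale entry
            return None
        return value
```

With a cache like this, a second request for the same image within the time-to-live window could reuse the earlier OCR result instead of running Tesseract again.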

Notes

  • Vision mode is always enabled
  • Requires Tesseract installation
  • Performance depends on image quality
  • Local processing with no external API calls
  • Language data files must be installed separately
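Because the Tesseract binary and its language data are installed separately, it can be useful to check for both before running OCR. The sketch below uses only the Python standard library and the tesseract --list-langs command. The helper names are hypothetical, and the parsing assumes the usual output layout of one header line followed by one language code per line.

```python
import shutil
import subprocess
from typing import Iterable, List


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the Tesseract executable can be found on PATH."""
    return shutil.which(binary) is not None


def installed_languages(binary: str = "tesseract") -> List[str]:
    """Return the language codes reported by `tesseract --list-langs`."""
    result = subprocess.run(
        [binary, "--list-langs"], capture_output=True, text=True, check=True
    )
    # Assumed layout: a header line, then one code per line.
    # Some older builds write the list to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    return [line.strip() for line in lines[1:] if line.strip()]


def missing_languages(lang: str, installed: Iterable[str]) -> List[str]:
    """Return the codes in a 'eng+fra' style string that are not installed."""
    available = set(installed)
    return [code for code in lang.split("+") if code not in available]
```

For example, missing_languages("eng+fra", installed_languages()) returns the codes whose data files still need to be installed, provided tesseract_available() is True.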