Vision Classification¶

A document is not only text but also structure, color, and other numerous features that disappear when OCR is used. Vision classification leverages these visual elements to improve accuracy, particularly important for specific document types.

Basic Usage¶

from extract_thinker import Process, Classification
from extract_thinker.document_loader import DocumentLoaderTesseract

# Define classifications with example images
classifications = [
    Classification(
        name="Driver License",
        description="This is a driver license",
        contract=DriverLicense,
        image="path/to/example_license.png"  # Example image helps model understand
    ),
    Classification(
        name="Invoice",
        description="This is an invoice",
        contract=InvoiceContract,
        image="path/to/example_invoice.png"
    )
]

# Initialize process with vision-capable model
process = Process()
process.add_classify_extractor([[
    Extractor(DocumentLoaderTesseract(tesseract_path))
    .load_llm("gpt-4o")  # Vision-capable model
]])

# Classify with vision enabled
result = process.classify(
    "document.pdf",
    classifications,
    image=True  # Enable vision processing
)

Benefits and Tradeoffs¶

Benefits¶

Better handling of document layouts
Recognition of visual patterns and structures
Improved accuracy for visually distinct documents
Ability to understand non-textual elements

Tradeoffs¶

Higher cost due to image processing
Larger context window requirements
Longer processing times
Higher token usage

Model Selection¶

Different models offer varying capabilities for vision tasks:

GPT-4 Vision: Supports low/high/auto quality settings (85 tokens for low)
Claude 3 Sonnet: Full vision capabilities without quality options
Azure Phi-3 Vision: Cost-effective alternative

Best Practices¶

Use compressed images when possible to reduce costs
Provide high-quality example images for each classification
Consider using a mix of vision and text-based classification
Use appropriate image quality settings based on needs
Cache vision results to avoid reprocessing

Example with Multiple Models¶

# Initialize extractors with different vision models
gpt4_vision = Extractor(document_loader)
gpt4_vision.load_llm("gpt-4-vision")

claude_vision = Extractor(document_loader)
claude_vision.load_llm("claude-3-sonnet")

phi3_vision = Extractor(document_loader)
phi3_vision.load_llm("phi-3-vision")

# Create process with vision models
process = Process()
process.add_classify_extractor([
    [phi3_vision],           # Cost-effective first attempt
    [claude_vision, gpt4_vision]  # More capable models if needed
])

# Classify with vision and consensus
result = process.classify(
    "document.pdf",
    classifications,
    strategy=ClassificationStrategy.CONSENSUS_WITH_THRESHOLD,
    threshold=9,
    image=True
)