Document Classification

In document intelligence, classification is often the crucial first step: it sets the stage for subsequent processes such as data extraction and analysis. Before the rise of LLMs, this was accomplished (and often still is) with AI models trained in-house for specific use cases. Services such as Azure Document Intelligence offer this feature, but they are not dynamic and can lead to vendor lock-in.

LLMs may not be the most efficient tool for this task, but they are vendor-agnostic and handle it with near-perfect accuracy.

Classification Techniques

Basic Classification

Simple yet powerful classification using a single LLM with contract mapping.

Mixture of Models (MoM)

Enhance accuracy by combining multiple models with different strategies.

Tree-Based Classification

Handle complex hierarchies and similar document types efficiently.

Vision Classification

Leverage visual features for better accuracy.

Classification Response

All classification methods return a standardized response:

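from typing import Optional
from pydantic import BaseModel, Field

class ClassificationResponse(BaseModel):
    # Name of the classification that matched the document
    name: str
    # Optional confidence score; the explicit default=None is an assumption
    # so that the field may be omitted under Pydantic v2
    confidence: Optional[int] = Field(
        default=None,
        description="From 1 to 10. 10 being the highest confidence",
        ge=1,
        le=10,
    )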
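
Because the response is a Pydantic model, a raw LLM reply can be checked against it before it is used. The following is a minimal sketch, not the library's own parsing code: `parse_reply` is a hypothetical helper, the model is repeated so the block runs on its own, and `default=None` is an assumption that lets the confidence be omitted.

```python
import json
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ClassificationResponse(BaseModel):
    name: str
    # default=None is an assumption: it makes the field truly optional
    confidence: Optional[int] = Field(
        default=None,
        description="From 1 to 10. 10 being the highest confidence",
        ge=1,
        le=10,
    )


def parse_reply(raw: str) -> Optional[ClassificationResponse]:
    """Hypothetical helper: return a validated response, or None when the
    reply is not valid JSON or violates the model's constraints."""
    try:
        return ClassificationResponse(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError, TypeError):
        return None


ok = parse_reply('{"name": "Invoice", "confidence": 9}')
out_of_range = parse_reply('{"name": "Invoice", "confidence": 11}')
```

Here `ok` is a populated `ClassificationResponse`, while `out_of_range` is `None` because 11 violates the `le=10` bound.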

Available Strategies

ExtractThinker supports three main classification strategies:

CONSENSUS: All models must agree on the classification
HIGHER_ORDER: Uses the result with the highest confidence
CONSENSUS_WITH_THRESHOLD: Requires consensus and a minimum confidence
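
To make the difference between the three strategies concrete, the sketch below combines several per-model results in plain Python. It is an illustration under stated assumptions rather than ExtractThinker's implementation: `Strategy`, `Result`, `combine`, and the default threshold of 8 are hypothetical, and returning `None` when no result qualifies is a choice made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Strategy(Enum):  # hypothetical stand-in for the library's strategy names
    CONSENSUS = "consensus"
    HIGHER_ORDER = "higher_order"
    CONSENSUS_WITH_THRESHOLD = "consensus_with_threshold"


@dataclass
class Result:  # simplified stand-in for a classification response
    name: str
    confidence: int  # 1 to 10


def combine(
    results: List[Result], strategy: Strategy, threshold: int = 8
) -> Optional[Result]:
    """Merge per-model results; None means no acceptable classification."""
    if not results:
        return None
    if strategy is Strategy.HIGHER_ORDER:
        # Pick the single most confident answer, ignoring disagreement
        return max(results, key=lambda r: r.confidence)
    # Both consensus variants require every model to return the same name
    if len({r.name for r in results}) != 1:
        return None
    # Report the least confident vote (an assumption: the conservative choice)
    weakest = min(results, key=lambda r: r.confidence)
    if (
        strategy is Strategy.CONSENSUS_WITH_THRESHOLD
        and weakest.confidence < threshold
    ):
        return None
    return weakest


agreeing = [Result("Invoice", 9), Result("Invoice", 7)]
mixed = [Result("Invoice", 6), Result("Receipt", 9)]
```

With these inputs, `combine(agreeing, Strategy.CONSENSUS)` accepts "Invoice", the thresholded variant rejects it because one vote falls below 8, and `combine(mixed, Strategy.HIGHER_ORDER)` picks "Receipt".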

For implementation details on each technique, see its respective page.