Document Classification
In document intelligence, classification is often the crucial first step: it sets the stage for subsequent processes like data extraction and analysis. Before the rise of LLMs, this was accomplished (and often still is) with AI models trained in-house for specific use cases. Services such as Azure Document Intelligence offer this feature, but they are not dynamic and can lead to vendor lock-in.
LLMs may not be the most efficient option for this task, but they are vendor-agnostic and handle it with near-perfect accuracy.

Classification Techniques
- Basic Classification: Simple yet powerful classification using a single LLM with contract mapping.
- Mixture of Models (MoM): Enhance accuracy by combining multiple models with different strategies.
- Tree-Based Classification: Handle complex hierarchies and similar document types efficiently.
- Vision Classification: Leverage visual features for better accuracy.
Classification Response
All classification methods return a standardized response:
from typing import Optional
from pydantic import BaseModel, Field

class ClassificationResponse(BaseModel):
    name: str
    confidence: Optional[int] = Field(
        description="From 1 to 10. 10 being the highest confidence",
        ge=1,
        le=10
    )
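To show how the `ge`/`le` bounds on `confidence` behave, here is a minimal sketch. It repeats the model so the snippet runs on its own, and it adds `default=None` as an assumption (not confirmed by the text above) so the field can be omitted. The helper `is_valid_confidence` is hypothetical and exists only for illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ClassificationResponse(BaseModel):
    name: str
    # default=None is an assumption made for this sketch, so that
    # confidence may be left out entirely.
    confidence: Optional[int] = Field(
        default=None,
        description="From 1 to 10. 10 being the highest confidence",
        ge=1,
        le=10,
    )


def is_valid_confidence(value: int) -> bool:
    """Hypothetical helper: True if `value` passes the 1..10 bounds."""
    try:
        ClassificationResponse(name="Invoice", confidence=value)
        return True
    except ValidationError:
        return False


# A response inside the allowed range is accepted as-is.
ok = ClassificationResponse(name="Invoice", confidence=9)
```

Values outside the 1 to 10 range raise a `ValidationError` when the model is constructed, so a malformed confidence is rejected before it reaches downstream code.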
Available Strategies
ExtractThinker supports three main classification strategies:
- CONSENSUS: All models must agree on the classification
- HIGHER_ORDER: Uses the result with highest confidence
- CONSENSUS_WITH_THRESHOLD: Requires consensus and a minimum confidence level
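The three strategies can be sketched in plain Python as follows. This is an illustration of the decision rules only, not ExtractThinker's implementation: the names `Strategy`, `Result`, and `combine` are hypothetical, and several behaviors are assumptions (returning `None` when there is no agreement, applying the threshold to every model, and reporting the lowest confidence on consensus).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Strategy(Enum):
    CONSENSUS = "consensus"
    HIGHER_ORDER = "higher_order"
    CONSENSUS_WITH_THRESHOLD = "consensus_with_threshold"


@dataclass
class Result:
    """Hypothetical stand-in for one model's classification output."""
    name: str
    confidence: int  # 1 to 10


def combine(
    results: List[Result],
    strategy: Strategy,
    threshold: int = 8,
) -> Optional[Result]:
    """Merge per-model results; None means no acceptable answer (assumption)."""
    if not results:
        return None

    if strategy is Strategy.HIGHER_ORDER:
        # Take whichever model reported the highest confidence.
        return max(results, key=lambda r: r.confidence)

    # Both consensus variants require every model to pick the same name.
    if len({r.name for r in results}) != 1:
        return None

    weakest = min(results, key=lambda r: r.confidence)
    if (
        strategy is Strategy.CONSENSUS_WITH_THRESHOLD
        and weakest.confidence < threshold
    ):
        # Assumption: every model must meet the threshold.
        return None

    # Report the most conservative confidence among the agreeing models.
    return weakest


agree = [Result("Invoice", 9), Result("Invoice", 7)]
mixed = [Result("Invoice", 6), Result("Receipt", 9)]
```

With these inputs, `combine(agree, Strategy.CONSENSUS)` accepts "Invoice", while `combine(mixed, Strategy.CONSENSUS)` yields `None` because the models disagree. `HIGHER_ORDER` always returns an answer, which makes it the most permissive of the three.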
For detailed implementation of each technique, visit their respective pages.