Basic Classification

Classifying a document involves extracting its content and adding it to a prompt alongside the candidate classifications. ExtractThinker simplifies this process using Pydantic models and instructor.

Simple Classification

The most straightforward way to classify documents:

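import os

from extract_thinker import Classification, Extractor
from extract_thinker.document_loader import DocumentLoaderTesseract

# Define classifications. DriverLicense and InvoiceContract are the
# contracts shown under "Type Mapping with Contract".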
classifications = [
    Classification(
        name="Driver License",
        description="This is a driver license",
        contract=DriverLicense,  # optional. Will be added to the prompt
    ),
    Classification(
        name="Invoice",
        description="This is an invoice",
        contract=InvoiceContract,  # optional. Will be added to the prompt
    ),
]

# Initialize extractor
tesseract_path = os.getenv("TESSERACT_PATH")
document_loader = DocumentLoaderTesseract(tesseract_path)
extractor = Extractor(document_loader)
extractor.load_llm("gpt-4o")

# Classify document (INVOICE_FILE_PATH is the path to the file to classify)
result = extractor.classify(INVOICE_FILE_PATH, classifications)
print(f"Document type: {result.name}, Confidence: {result.confidence}")

Type Mapping with Contract

Attaching a contract to a classification can improve accuracy, because the model sees which fields each document type is expected to contain:

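from typing import List
from extract_thinker.models.contract import Contract

# LineItem is referenced below but not defined in the original example.
# The fields here are illustrative assumptions; adapt them to your invoices.
class LineItem(Contract):
    description: str
    quantity: int
    amount: float

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    lines: List[LineItem]
    total_amount: float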

class DriverLicense(Contract):
    name: str
    age: int
    license_number: str

The contract structure is automatically added to the prompt, helping the model understand the expected document structure.
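
To make the idea concrete, here is a minimal sketch of how a contract's fields could be rendered into the text for one candidate classification. It is not ExtractThinker's actual prompt builder: `describe_fields`, `build_prompt_entry`, and the output format are hypothetical, and a plain annotated class stands in for a `Contract` subclass so the example runs with the standard library alone.

```python
from typing import Optional, get_type_hints


class DriverLicense:
    """Plain stand-in for the DriverLicense contract (no Contract base class)."""

    name: str
    age: int
    license_number: str


def describe_fields(contract: type) -> str:
    """Render each annotated field as a '- name: type' line."""
    hints = get_type_hints(contract)
    return "\n".join(
        f"- {field}: {getattr(tp, '__name__', str(tp))}"
        for field, tp in hints.items()
    )


def build_prompt_entry(name: str, description: str, contract: Optional[type] = None) -> str:
    """Build the prompt text for one candidate classification (hypothetical format)."""
    lines = [f"Classification: {name}", f"Description: {description}"]
    if contract is not None:
        # The contract is optional; only add field details when one is given.
        lines.append("Fields:")
        lines.append(describe_fields(contract))
    return "\n".join(lines)


entry = build_prompt_entry("Driver License", "This is a driver license", DriverLicense)
print(entry)
```

With the field list included, the model has more than a one-line description to distinguish a driver license from an invoice.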

Classification Response

All classifications return a standardized response:

from typing import Optional
from pydantic import BaseModel, Field

class ClassificationResponse(BaseModel):
    name: str
    confidence: Optional[int] = Field(
        description="From 1 to 10. 10 being the highest confidence",
        ge=1, 
        le=10
    )

Best Practices

  • Provide clear, distinctive descriptions for each classification
  • Use contract structures when possible
  • Consider using image classification for visual documents
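  • Monitor confidence scores (1 to 10) over time to spot classifications that are frequently uncertain
  • Handle low-confidence cases explicitly, for example by routing them to manual review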
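
One way to act on the confidence score is a simple threshold check. The sketch below is only an illustration: `route_classification`, the `"manual_review"` label, and the threshold of 7 are assumptions, not part of ExtractThinker. It takes the `name` and `confidence` values from a classification result and treats a missing confidence as low.

```python
from typing import Optional

REVIEW_THRESHOLD = 7  # hypothetical cut-off on the 1-10 confidence scale


def route_classification(
    name: str,
    confidence: Optional[int],
    threshold: int = REVIEW_THRESHOLD,
) -> str:
    """Return the predicted name, or 'manual_review' when confidence is missing or low."""
    if confidence is None or confidence < threshold:
        return "manual_review"
    return name


print(route_classification("Invoice", 9))
print(route_classification("Invoice", 4))
print(route_classification("Driver License", None))
```

In practice the call would look like `route_classification(result.name, result.confidence)`, and the threshold would be tuned against documents whose correct type is known.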

For more advanced classification techniques, see:

  • Mixture of Models (MoM)
  • Tree-Based Classification
  • Vision Classification