Basic Classification
To classify a document, its content is first extracted and then added to a prompt alongside the list of possible classifications. ExtractThinker simplifies this process using Pydantic models and instructor.
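To make the mechanism concrete, here is a minimal sketch of how extracted text and candidate classifications might be combined into one prompt. The `Candidate` dataclass and the `build_classification_prompt` helper are hypothetical stand-ins for illustration only; the prompt wording ExtractThinker actually uses is not shown on this page and may differ.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a classification option (not the library class)."""
    name: str
    description: str


def build_classification_prompt(document_text: str, candidates: list[Candidate]) -> str:
    """Combine the extracted document text with the candidate labels."""
    # One bullet per candidate: "- <name>: <description>"
    options = "\n".join(f"- {c.name}: {c.description}" for c in candidates)
    return (
        "Classify the document below into exactly one of these types:\n"
        f"{options}\n\n"
        "Document content:\n"
        f"{document_text}"
    )


prompt = build_classification_prompt(
    "Invoice #123\nTotal: 42.00",
    [
        Candidate("Driver License", "This is a driver license"),
        Candidate("Invoice", "This is an invoice"),
    ],
)
print(prompt)
```

The model then only has to pick one of the listed names, which is why clear and distinctive descriptions matter.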
Simple Classification
The most straightforward way to classify documents:
import os

from extract_thinker import Classification, Extractor
from extract_thinker.document_loader import DocumentLoaderTesseract

# Define classifications
# (DriverLicense and InvoiceContract are the contracts shown under "Type Mapping with Contract")
classifications = [
    Classification(
        name="Driver License",
        description="This is a driver license",
        contract=DriverLicense,  # optional; will be added to the prompt
    ),
    Classification(
        name="Invoice",
        description="This is an invoice",
        contract=InvoiceContract,  # optional; will be added to the prompt
    ),
]

# Initialize extractor
tesseract_path = os.getenv("TESSERACT_PATH")
document_loader = DocumentLoaderTesseract(tesseract_path)
extractor = Extractor(document_loader)
extractor.load_llm("gpt-4o")

# Classify document (INVOICE_FILE_PATH is the path to the file being classified)
result = extractor.classify(INVOICE_FILE_PATH, classifications)
print(f"Document type: {result.name}, Confidence: {result.confidence}")
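After classification, a natural next step is to map the returned name back to the entry that produced it, for example to pick the contract for a follow-up extraction. The sketch below uses plain dataclasses as stand-ins for `Classification` and its contract; `find_classification` is a hypothetical helper and not part of ExtractThinker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassificationStub:
    """Stand-in for Classification, holding only the fields used here."""
    name: str
    description: str
    contract: Optional[type] = None


class InvoiceStub:
    """Placeholder for a contract class such as InvoiceContract."""


def find_classification(name: str, options: list) -> Optional[ClassificationStub]:
    """Return the option whose name matches, or None for an unknown label."""
    for option in options:
        if option.name == name:
            return option
    return None


options = [
    ClassificationStub("Driver License", "This is a driver license"),
    ClassificationStub("Invoice", "This is an invoice", contract=InvoiceStub),
]

match = find_classification("Invoice", options)
if match is not None and match.contract is not None:
    print(f"Matched {match.name}; contract: {match.contract.__name__}")
```

Returning `None` for an unknown label keeps the caller responsible for deciding what to do when the model answers with a name that was not offered.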
Type Mapping with Contract
Adding the contract structure to a classification can improve accuracy:
from typing import List

from extract_thinker.models.contract import Contract


class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    lines: List[LineItem]  # LineItem is a line-item contract assumed to be defined elsewhere
    total_amount: float


class DriverLicense(Contract):
    name: str
    age: int
    license_number: str
The contract structure is automatically added to the prompt, helping the model understand the expected document structure.
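One way to picture what adding the contract structure to the prompt means is to render a contract's field names and types as text. The sketch below does this for a plain annotated class using `typing.get_type_hints`. The class is a simplified stand-in (no `Contract` base class) and `describe_fields` is a hypothetical helper; how ExtractThinker serializes contracts internally is not specified here.

```python
from typing import get_type_hints


class DriverLicenseFields:
    """Simplified stand-in mirroring the DriverLicense contract's fields."""
    name: str
    age: int
    license_number: str


def describe_fields(cls: type) -> str:
    """Render each annotated field as 'field: type', one per line."""
    hints = get_type_hints(cls)
    return "\n".join(
        f"{field}: {getattr(tp, '__name__', str(tp))}" for field, tp in hints.items()
    )


print(describe_fields(DriverLicenseFields))
```

A field list like this gives the model concrete cues (a license number, an age) to look for when deciding between document types.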
Classification Response
All classifications return a standardized response:
from typing import Optional

from pydantic import BaseModel, Field


class ClassificationResponse(BaseModel):
    name: str
    confidence: Optional[int] = Field(
        description="From 1 to 10. 10 being the highest confidence",
        ge=1,
        le=10,
    )
Best Practices
- Provide clear, distinctive descriptions for each classification
- Use contract structures when possible
- Consider using image classification for visual documents
- Monitor confidence scores
- Handle low-confidence cases appropriately
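The last two points can be sketched as a simple routing rule on the 1 to 10 confidence scale. The threshold value, the `ResultStub` dataclass, and the `route` helper are all assumptions made for illustration; pick a cutoff based on your own data.

```python
from dataclasses import dataclass
from typing import Optional

REVIEW_THRESHOLD = 7  # assumed cutoff on the 1-10 scale; tune for your data


@dataclass
class ResultStub:
    """Stand-in for the classification response (name plus optional confidence)."""
    name: str
    confidence: Optional[int] = None


def route(result: ResultStub, threshold: int = REVIEW_THRESHOLD) -> str:
    """Return 'accept' for confident results, otherwise 'manual_review'."""
    # A missing confidence is treated like a low one, since it cannot be trusted.
    if result.confidence is None or result.confidence < threshold:
        return "manual_review"
    return "accept"


for result in (ResultStub("Invoice", 9), ResultStub("Invoice", 4), ResultStub("Invoice")):
    print(result.name, result.confidence, "->", route(result))
```

Sending uncertain results to manual review rather than discarding them also produces examples that help refine the classification descriptions over time.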
For more advanced classification techniques, see:
- Mixture of Models (MoM)
- Tree-Based Classification
- Vision Classification