Evaluation Framework 🧪 (In Beta)

The evaluation framework helps measure the performance and reliability of your extraction models across different document types.

Overview

ExtractThinker's evaluation system provides comprehensive metrics to:

  • Measure extraction accuracy at both field and document levels
  • Track schema validation success rates
  • Monitor execution times
  • Detect potential hallucinations in extracted data
  • Track token usage and associated costs
  • Compare performance across different models or datasets

Required Components

To use the evaluation framework, you'll need:

  • An initialized Extractor instance
  • A Contract class that defines your extraction schema
  • A dataset containing documents and their expected outputs (an illustrative labels file is sketched below)
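
The labels file maps each document to its expected field values. The exact schema that FileSystemDataset expects is defined by the library, so treat the following labels.json sketch as purely illustrative:

{
  "invoice_001.pdf": {
    "invoice_number": "INV-001",
    "date": "2023-08-15",
    "total_amount": 1250.50,
    "line_items": [
      {"description": "Consulting services", "amount": 1250.50}
    ]
  }
}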

Basic Usage

Here's how to set up and run a basic evaluation:

from extract_thinker import Extractor, Contract
from extract_thinker.eval import Evaluator, FileSystemDataset
from typing import List

# 1. Define your contract class
class InvoiceContract(Contract):
    invoice_number: str
    date: str
    total_amount: float
    line_items: List[dict]

# 2. Initialize your extractor
extractor = Extractor()
extractor.load_llm("gpt-4o")

# 3. Create a dataset
dataset = FileSystemDataset(
    documents_dir="./test_invoices/",
    labels_path="./test_invoices/labels.json",
    name="Invoice Test Set"
)

# 4. Set up evaluator
evaluator = Evaluator(
    extractor=extractor,
    response_model=InvoiceContract
)

# 5. Run evaluation
report = evaluator.evaluate(dataset)

# 6. Print summary and save detailed report
report.print_summary()
evaluator.save_report(report, "evaluation_results.json")
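
Since the saved report is plain JSON, it can be loaded back for downstream analysis. The report's key names are defined by ExtractThinker rather than shown here, so inspect the file's structure before relying on any particular field; a minimal sketch:

import json

# Load the report saved above; top-level key names are
# library-defined, so print the structure before using it.
with open("evaluation_results.json") as f:
    results = json.load(f)

print(list(results.keys()))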

💡 Model Temperature

For consistent evaluations, use a temperature of 0.0 in your model configuration to ensure deterministic outputs.

Command Line Interface

ExtractThinker includes a CLI for running evaluations from configuration files:

extract_thinker-eval --config eval_config.json --output results.json

Example configuration file:

{
  "evaluation_name": "Invoice Extraction Test",
  "dataset_name": "Invoice Dataset",
  "contract_path": "./contracts/invoice_contract.py",
  "documents_dir": "./test_invoices/",
  "labels_path": "./test_invoices/labels.json",
  "file_pattern": "*.pdf",
  "llm": {
    "model": "gpt-4o",
    "params": {
      "temperature": 0.0
    }
  },
  "vision": false,
  "skip_failures": false
}

Available Metrics

ExtractThinker captures several key metrics during evaluation:

Metric Type         Description                                     Use Case
Field-level         Precision, recall, F1 scores for each field    Identify problematic fields
Document-level      Overall accuracy across all documents          General model performance
Schema validation   Success rate of schema validation              Data structure correctness
Execution time      Average and per-document processing time       Performance optimization
Hallucination       Detection of fabricated information            Trust and reliability
Cost                Token usage and associated costs               Budget optimization
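
For reference, the field-level F1 score is the harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall). If a field's precision and recall are both 98%, its F1 is likewise 98%, as in the sample output below.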

Sample Output

=== Invoice Extraction Evaluation ===
Dataset: Invoice Test Set
Model: gpt-4o
Timestamp: 2023-08-15T14:30:45

=== Overall Metrics ===
Documents tested: 50
Document accuracy: 92.00%
Schema validation rate: 96.00%
Average precision: 95.40%
Average recall: 94.80%
Average F1 score: 95.10%
Average execution time: 2.34s

=== Field-Level Metrics ===
invoice_number (comparison: exact):
  Precision: 98.00%
  Recall: 98.00%
  F1 Score: 98.00%
  Accuracy: 98.00%
date:
  Precision: 94.00%
  Recall: 94.00%
  F1 Score: 94.00%
  Accuracy: 94.00%
...

Evaluation Features

ExtractThinker offers several specialized evaluation capabilities:

Field Comparison Types

Different fields may require different comparison methods:

from extract_thinker.eval import ComparisonType

evaluator = Evaluator(
    extractor=extractor,
    response_model=InvoiceContract,
    field_comparisons={
        "invoice_number": ComparisonType.EXACT,  # Exact match required
        "description": ComparisonType.SEMANTIC,  # Semantic similarity
        "total_amount": ComparisonType.NUMERIC   # Allows percentage tolerance
    }
)

Learn more about field comparison types →

Teacher-Student Evaluation

Benchmark your model against a more capable "teacher" model:

from extract_thinker import Extractor
from extract_thinker.eval import TeacherStudentEvaluator

# Student: the model under test; teacher: a stronger reference (model names are examples)
student_extractor = Extractor()
student_extractor.load_llm("gpt-4o-mini")
teacher_extractor = Extractor()
teacher_extractor.load_llm("gpt-4o")

evaluator = TeacherStudentEvaluator(
    student_extractor=student_extractor,
    teacher_extractor=teacher_extractor,
    response_model=InvoiceContract
)

Learn more about teacher-student evaluation →

Hallucination Detection

Identify potentially hallucinated content:

evaluator = Evaluator(
    extractor=extractor,
    response_model=InvoiceContract,
    detect_hallucinations=True
)

Learn more about hallucination detection →

Cost Tracking

Monitor token usage and costs:

evaluator = Evaluator(
    extractor=extractor,
    response_model=InvoiceContract,
    track_costs=True
)

Learn more about cost tracking →

Best Practices

  • Dataset diversity: Include a wide range of document variations in your test set
  • Consistent formatting: Use consistent file formats and naming conventions
  • Benchmark different models: Run evaluations on different model configurations (see the sketch after this list)
  • Field-level analysis: Monitor field-level metrics to identify specific areas for improvement
  • Specialized test sets: Create separate test sets for different document types
  • Hallucination checks: Enable hallucination detection for critical applications
  • Cost optimization: Track costs to optimize the performance/price ratio
  • Version control: Keep evaluation datasets under version control to track improvements over time
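
To put the "benchmark different models" practice into action, you can reuse one dataset across several Extractor configurations with the API shown earlier. A minimal sketch (the model names are examples, not recommendations):

from extract_thinker import Extractor
from extract_thinker.eval import Evaluator

# Evaluate the same dataset against each candidate model
for model_name in ["gpt-4o", "gpt-4o-mini"]:
    extractor = Extractor()
    extractor.load_llm(model_name)

    evaluator = Evaluator(
        extractor=extractor,
        response_model=InvoiceContract,
        track_costs=True  # include cost metrics in the comparison
    )

    report = evaluator.evaluate(dataset, evaluation_name=f"Invoice eval: {model_name}")
    report.print_summary()
    evaluator.save_report(report, f"results_{model_name}.json")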

Advanced Configuration

For more complex evaluation needs:

# Advanced evaluator setup with multiple features
evaluator = Evaluator(
    extractor=extractor,
    response_model=InvoiceContract,
    vision=True,  # Enable vision mode for image-based documents
    content="Focus on the header section for invoice number and date.",
    field_comparisons={
        "invoice_number": ComparisonType.EXACT,
        "description": ComparisonType.SEMANTIC
    },
    detect_hallucinations=True,
    track_costs=True
)

# Run evaluation with special options
report = evaluator.evaluate(
    dataset=dataset,
    evaluation_name="Comprehensive Invoice Evaluation",
    skip_failures=True  # Continue even when schema validation fails
)