Extract Thinker

The first framework for Intelligent Document Processing (IDP) for LLMs

Extract Thinker is a flexible document intelligence framework that helps you extract and classify structured data from various documents, acting like an ORM for document processing workflows. It can be summed up as “Document Intelligence for LLMs” or “LangChain for Intelligent Document Processing.” The motivation is to provide the niche features that document processing requires, such as splitting large documents and advanced classification.


Installation

Install using pip:

pip install extract_thinker

Quick Start

Here's a simple example that extracts invoice data from a PDF:

from extract_thinker import Extractor, DocumentLoaderPyPdf, Contract

# Define what data you want to extract
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    total_amount: float

# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4")  # or any other supported model

# Extract data from your document
result = extractor.extract("invoice.pdf", InvoiceContract)

print(f"Invoice #{result.invoice_number}")
print(f"Date: {result.invoice_date}")
print(f"Total: ${result.total_amount}")
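
The contract above declares invoice_date as a plain string and total_amount as a float, so downstream code may want typed values. The sketch below is one possible post-processing step using only the standard library. It is not part of Extract Thinker: NormalizedInvoice, normalize_invoice, and the assumed ISO date format (YYYY-MM-DD) are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal


@dataclass(frozen=True)
class NormalizedInvoice:
    """Typed view of the extracted fields (hypothetical helper type)."""
    invoice_number: str
    invoice_date: date
    total_amount: Decimal


def normalize_invoice(
    invoice_number: str,
    invoice_date: str,
    total_amount: float,
    date_format: str = "%Y-%m-%d",  # assumption: dates arrive in ISO format
) -> NormalizedInvoice:
    """Convert raw extracted values into stricter Python types.

    Raises ValueError if the date does not match date_format or the
    amount is negative.
    """
    parsed_date = datetime.strptime(invoice_date.strip(), date_format).date()
    # Go through str() so the Decimal reflects the printed float value,
    # then fix the amount to two decimal places.
    amount = Decimal(str(total_amount)).quantize(Decimal("0.01"))
    if amount < 0:
        raise ValueError("total_amount must not be negative")
    return NormalizedInvoice(invoice_number.strip(), parsed_date, amount)
```

With the Quick Start result, this could be called as normalize_invoice(result.invoice_number, result.invoice_date, result.total_amount), assuming the model returned the date in the expected format.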

Native Features that you want

  • Extraction with Pydantic

    Extract structured data from any document type using Pydantic models for validation, custom features, and prompt engineering capabilities.

  • Classification & Split

    Intelligent document classification and splitting with support for consensus strategies, eager/lazy splitting, and confidence thresholds.

  • PII Detection

    Automatically detect and handle sensitive personal information in documents with a privacy-first approach and advanced validation.

  • LLM and OCR Agnostic

    Freedom to choose and switch between different LLM providers and OCR engines based on your needs and cost requirements.
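
To make the "consensus strategies" and "confidence thresholds" mentioned in the feature list more concrete, here is a minimal, library-independent sketch of the general idea: several classifiers vote on a document, and a label is accepted only when they all agree and each is confident enough. The names Vote and consensus_label, and the default threshold of 0.8, are assumptions for illustration. This is not Extract Thinker's API, and its actual strategies may work differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Vote:
    """One classifier's answer for a document (hypothetical structure)."""
    label: str
    confidence: float  # expected in the range 0.0 to 1.0


def consensus_label(votes: Sequence[Vote], threshold: float = 0.8) -> Optional[str]:
    """Return the agreed label, or None when the votes should not be trusted.

    A label is accepted only if every vote names the same label and
    every vote meets the confidence threshold.
    """
    if not votes:
        return None  # nothing to agree on
    if len({vote.label for vote in votes}) != 1:
        return None  # the classifiers disagree
    if any(vote.confidence < threshold for vote in votes):
        return None  # at least one classifier is unsure
    return votes[0].label


if __name__ == "__main__":
    agreed = consensus_label([Vote("invoice", 0.93), Vote("invoice", 0.88)])
    disputed = consensus_label([Vote("invoice", 0.93), Vote("contract", 0.91)])
    print(agreed, disputed)  # prints: invoice None
```

Returning None instead of a best guess leaves the caller free to route uncertain documents to a fallback, such as a stronger model or manual review.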