Extract Thinker

The first framework for Intelligent Document Processing (IDP) for LLMs


Extract Thinker is a flexible document intelligence framework that helps you extract and classify structured data from various documents, acting like an ORM for document processing workflows. You can think of it as "Document Intelligence for LLMs" or "LangChain for Intelligent Document Processing." The motivation is to provide the niche features that document processing requires, such as splitting large documents and advanced classification.

Installation

Install using pip:

pip install extract_thinker

Quick Start

Here's a simple example that extracts invoice data from a PDF:

from extract_thinker import Extractor, DocumentLoaderPyPdf, Contract

# Define what data you want to extract
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str
    total_amount: float

# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4")  # or any other supported model

# Extract data from your document
result = extractor.extract("invoice.pdf", InvoiceContract)

print(f"Invoice #{result.invoice_number}")
print(f"Date: {result.invoice_date}")
print(f"Total: ${result.total_amount}")

Native Features that you want

Extraction with Pydantic

Extract structured data from any document type using Pydantic models for validation, custom features, and prompt engineering capabilities.
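
To illustrate what Pydantic-based validation buys you, here is a minimal sketch that uses plain Pydantic rather than the framework itself. The `InvoiceFields` model and the `parse_invoice` helper are hypothetical names introduced for this example. The raw dictionary stands in for whatever an LLM might return. Field constraints reject bad values, and field descriptions are one place where prompt hints could live.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class InvoiceFields(BaseModel):
    """Hypothetical schema; descriptions double as hints about each field."""
    invoice_number: str = Field(min_length=1, description="Identifier printed on the invoice")
    invoice_date: str = Field(description="Invoice date as written in the document")
    total_amount: float = Field(ge=0, description="Grand total, must not be negative")


def parse_invoice(raw: dict) -> Optional[InvoiceFields]:
    """Validate raw extracted values; return None when they break the schema."""
    try:
        return InvoiceFields(**raw)
    except ValidationError:
        return None


good = parse_invoice(
    {"invoice_number": "INV-001", "invoice_date": "2024-01-15", "total_amount": "1250.50"}
)
bad = parse_invoice(
    {"invoice_number": "INV-002", "invoice_date": "2024-01-16", "total_amount": "-5"}
)
```

In this sketch the numeric string is coerced to a float, while the negative total fails validation and yields `None`, so downstream code never sees a malformed record.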
Classification & Split

Intelligent document classification and splitting with support for consensus strategies, eager/lazy splitting, and confidence thresholds.
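
As a rough illustration of how a consensus strategy and a confidence threshold can combine, the sketch below accepts a label only when every classifier agrees and each one is sufficiently confident. This is one possible reading of the idea, written with the standard library only. The `Vote` type and the `consensus_label` function are assumptions made for this example and are not the framework's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vote:
    """One classifier's answer (hypothetical structure for this sketch)."""
    label: str
    confidence: float  # expected range: 0.0 to 1.0


def consensus_label(votes: List[Vote], threshold: float = 0.8) -> Optional[str]:
    """Return the agreed label, or None if classifiers disagree or are unsure."""
    if not votes:
        return None
    # Consensus: every classifier must propose the same label.
    if len({vote.label for vote in votes}) != 1:
        return None
    # Threshold: the least confident vote must still clear the bar.
    if min(vote.confidence for vote in votes) < threshold:
        return None
    return votes[0].label


agreed = consensus_label([Vote("invoice", 0.9), Vote("invoice", 0.95)])
split = consensus_label([Vote("invoice", 0.9), Vote("receipt", 0.9)])
unsure = consensus_label([Vote("invoice", 0.9), Vote("invoice", 0.5)])
```

Returning `None` instead of a best guess lets the caller route ambiguous documents to a fallback, such as a stronger model or manual review.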
PII Detection

Automatically detect and handle sensitive personal information in documents with a privacy-first approach and advanced validation.
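
To make "detect and handle" concrete, here is a deliberately simple sketch that swaps in regular expressions for detection. It is not how the framework detects PII, and the two patterns cover only email addresses and US-style Social Security numbers. The `redact_pii` function and the `PATTERNS` table are hypothetical names for this example.

```python
import re
from typing import Dict, List, Tuple

# Illustrative patterns only; real PII detection needs far broader coverage.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_pii(text: str) -> Tuple[str, Dict[str, List[str]]]:
    """Replace each match with a placeholder and report what was found."""
    found: Dict[str, List[str]] = {}
    redacted = text
    for kind, pattern in PATTERNS.items():
        matches = pattern.findall(redacted)
        if matches:
            found[kind] = matches
            redacted = pattern.sub(f"[{kind.upper()}]", redacted)
    return redacted, found


clean_text, findings = redact_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
```

Redacting before text leaves your environment is one way to keep a privacy-first posture, since the placeholders preserve sentence structure without exposing the underlying values.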
LLM and OCR Agnostic

Freedom to choose and switch between different LLM providers and OCR engines based on your needs and cost requirements.