Skip to content

Azure Document Intelligence Loader

The Azure Document Intelligence loader (formerly Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and structured information from documents.

Supported Formats

Supports PDF, JPEG/JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX and HTML.

Usage

Basic Usage

from extract_thinker import DocumentLoaderAzureForm

# Initialize with Azure credentials
loader = DocumentLoaderAzureForm(
    subscription_key="your_subscription_key",
    endpoint="your_endpoint",
    model_id="prebuilt-document"  # Use prebuilt document model
)

# Load document
pages = loader.load("path/to/your/document.pdf")

# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]
    # Access tables if available
    tables = page.get("tables", [])
    # Access form fields if available
    forms = page.get("forms", {})

Configuration-based Usage

from extract_thinker import DocumentLoaderAzureForm, AzureConfig

# Create configuration
config = AzureConfig(
    subscription_key="your_subscription_key",
    endpoint="your_endpoint",
    model_id="prebuilt-layout",     # Use layout model for enhanced layout analysis
    cache_ttl=600,                  # Cache results for 10 minutes
    features=["ocrHighResolution", "barcodes"]  # Enable advanced features
)

# Initialize loader with configuration
loader = DocumentLoaderAzureForm(config)

# Load and process document
pages = loader.load("path/to/your/document.pdf")

Advanced Features Usage

from extract_thinker import DocumentLoaderAzureForm, AzureConfig

# Configuration with multiple advanced features
config = AzureConfig(
    subscription_key="your_subscription_key",
    endpoint="your_endpoint",
    model_id="prebuilt-layout",
    features=[
        "ocrHighResolution",    # High resolution OCR for small text
        "formulas",             # Extract mathematical formulas in LaTeX
        "styleFont",            # Extract font properties
        "barcodes",             # Extract barcodes and QR codes
        "languages",            # Detect document languages
        "keyValuePairs"         # Extract key-value pairs from forms
    ]
)

loader = DocumentLoaderAzureForm(config)
pages = loader.load("document_with_advanced_content.pdf")

for page in pages:
    # Standard content
    print(f"Text content: {page['content']}")
    print(f"Tables: {page['tables']}")
    print(f"Forms: {page['forms']}")

    # Advanced features (if detected in document)
    if 'formulas' in page:
        print(f"Mathematical formulas: {page['formulas']}")

    if 'fonts' in page:
        print(f"Font information: {page['fonts']}")

    if 'barcodes' in page:
        print(f"Barcodes found: {page['barcodes']}")

    if 'languages' in page:
        print(f"Detected languages: {page['languages']}")

Specialized Models Usage

# Use specialized invoice model
config = AzureConfig(
    subscription_key="your_subscription_key",
    endpoint="your_endpoint",
    model_id="prebuilt-invoice"
)

loader = DocumentLoaderAzureForm(config)
pages = loader.load("invoice.pdf")

# Access extracted invoice fields
for page in pages:
    forms = page["forms"]
    vendor_name = forms.get("VendorName", "")
    invoice_total = forms.get("InvoiceTotal", "")
    print(f"Vendor: {vendor_name}, Total: {invoice_total}")

Configuration Options

The AzureConfig class supports the following options:

Option Type Default Description
subscription_key str Required Azure subscription key
endpoint str Required Azure endpoint URL
content Any None Initial content to process
cache_ttl int 300 Cache time-to-live in seconds
model_id str "prebuilt-layout" Model ID to use
max_retries int 3 Maximum retries for failed requests
features List[str] None Advanced features to enable

Available Models

General Purpose Models

Model ID Description Best For
prebuilt-read OCR/Read model Text extraction from printed and handwritten documents
prebuilt-layout Layout analysis Documents with tables, selection marks, and complex layouts
prebuilt-document General document Key-value pairs, tables, and general document structure

Specialized Models

Model ID Description
prebuilt-invoice Invoice processing
prebuilt-receipt Receipt processing
prebuilt-idDocument Identity documents
prebuilt-businessCard Business cards
prebuilt-tax.us.w2 US W2 tax forms
prebuilt-tax.us.1040 US 1040 tax forms
prebuilt-contract Contracts
prebuilt-healthInsurance US health insurance cards
prebuilt-bankStatement Bank statements
prebuilt-payStub Pay stubs

Advanced Features

The loader supports advanced extraction features that can be enabled via the features parameter:

Feature Description Output Field
ocrHighResolution High resolution OCR for better small text recognition Enhanced text in content
formulas Extract mathematical formulas in LaTeX format formulas array
styleFont Extract font properties (family, style, weight, color) fonts array
barcodes Extract barcodes and QR codes barcodes array
languages Detect document languages languages array
keyValuePairs Extract key-value pairs from forms Enhanced forms dict
queryFields Enable custom field extraction Enhanced extraction
searchablePDF Convert scanned PDFs to searchable format Enhanced OCR

Features

  • Text extraction with layout preservation
  • Table detection and extraction
  • Form field recognition with specialized models
  • Advanced OCR with high resolution support
  • Mathematical formula extraction (LaTeX format)
  • Font property extraction
  • Barcode and QR code detection
  • Multi-language document support
  • Caching support with configurable TTL
  • Vision mode support for image formats
  • Retry logic for robust processing

Notes

  • Azure subscription key and endpoint are required
  • Advanced features may increase processing time and costs
  • Specialized models are optimized for specific document types
  • Rate limits and quotas apply based on your Azure subscription
  • Vision mode is supported for image formats
  • High resolution OCR is recommended for documents with small text
  • Formula extraction works best with clear mathematical notation