Azure Document Intelligence Loader¶
The Azure Document Intelligence loader (formerly Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and structured information from documents.
Supported Formats¶
Supports PDF
, JPEG/JPG
, PNG
, BMP
, TIFF
, HEIF
, DOCX
, XLSX
, PPTX
and HTML
.
Usage¶
Basic Usage¶
from extract_thinker import DocumentLoaderAzureForm
# Initialize with Azure credentials
loader = DocumentLoaderAzureForm(
subscription_key="your_subscription_key",
endpoint="your_endpoint",
model_id="prebuilt-document" # Use prebuilt document model
)
# Load document
pages = loader.load("path/to/your/document.pdf")
# Process extracted content
for page in pages:
# Access text content
text = page["content"]
# Access tables if available
tables = page.get("tables", [])
# Access form fields if available
forms = page.get("forms", {})
Configuration-based Usage¶
from extract_thinker import DocumentLoaderAzureForm, AzureConfig
# Create configuration
config = AzureConfig(
subscription_key="your_subscription_key",
endpoint="your_endpoint",
model_id="prebuilt-layout", # Use layout model for enhanced layout analysis
cache_ttl=600, # Cache results for 10 minutes
features=["ocrHighResolution", "barcodes"] # Enable advanced features
)
# Initialize loader with configuration
loader = DocumentLoaderAzureForm(config)
# Load and process document
pages = loader.load("path/to/your/document.pdf")
Advanced Features Usage¶
from extract_thinker import DocumentLoaderAzureForm, AzureConfig
# Configuration with multiple advanced features
config = AzureConfig(
subscription_key="your_subscription_key",
endpoint="your_endpoint",
model_id="prebuilt-layout",
features=[
"ocrHighResolution", # High resolution OCR for small text
"formulas", # Extract mathematical formulas in LaTeX
"styleFont", # Extract font properties
"barcodes", # Extract barcodes and QR codes
"languages", # Detect document languages
"keyValuePairs" # Extract key-value pairs from forms
]
)
loader = DocumentLoaderAzureForm(config)
pages = loader.load("document_with_advanced_content.pdf")
for page in pages:
# Standard content
print(f"Text content: {page['content']}")
print(f"Tables: {page['tables']}")
print(f"Forms: {page['forms']}")
# Advanced features (if detected in document)
if 'formulas' in page:
print(f"Mathematical formulas: {page['formulas']}")
if 'fonts' in page:
print(f"Font information: {page['fonts']}")
if 'barcodes' in page:
print(f"Barcodes found: {page['barcodes']}")
if 'languages' in page:
print(f"Detected languages: {page['languages']}")
Specialized Models Usage¶
# Use specialized invoice model
config = AzureConfig(
subscription_key="your_subscription_key",
endpoint="your_endpoint",
model_id="prebuilt-invoice"
)
loader = DocumentLoaderAzureForm(config)
pages = loader.load("invoice.pdf")
# Access extracted invoice fields
for page in pages:
forms = page["forms"]
vendor_name = forms.get("VendorName", "")
invoice_total = forms.get("InvoiceTotal", "")
print(f"Vendor: {vendor_name}, Total: {invoice_total}")
Configuration Options¶
The AzureConfig
class supports the following options:
Option | Type | Default | Description |
---|---|---|---|
subscription_key |
str | Required | Azure subscription key |
endpoint |
str | Required | Azure endpoint URL |
content |
Any | None | Initial content to process |
cache_ttl |
int | 300 | Cache time-to-live in seconds |
model_id |
str | "prebuilt-layout" | Model ID to use |
max_retries |
int | 3 | Maximum retries for failed requests |
features |
List[str] | None | Advanced features to enable |
Available Models¶
General Purpose Models¶
Model ID | Description | Best For |
---|---|---|
prebuilt-read |
OCR/Read model | Text extraction from printed and handwritten documents |
prebuilt-layout |
Layout analysis | Documents with tables, selection marks, and complex layouts |
prebuilt-document |
General document | Key-value pairs, tables, and general document structure |
Specialized Models¶
Model ID | Description |
---|---|
prebuilt-invoice |
Invoice processing |
prebuilt-receipt |
Receipt processing |
prebuilt-idDocument |
Identity documents |
prebuilt-businessCard |
Business cards |
prebuilt-tax.us.w2 |
US W2 tax forms |
prebuilt-tax.us.1040 |
US 1040 tax forms |
prebuilt-contract |
Contracts |
prebuilt-healthInsurance |
US health insurance cards |
prebuilt-bankStatement |
Bank statements |
prebuilt-payStub |
Pay stubs |
Advanced Features¶
The loader supports advanced extraction features that can be enabled via the features
parameter:
Feature | Description | Output Field |
---|---|---|
ocrHighResolution |
High resolution OCR for better small text recognition | Enhanced text in content |
formulas |
Extract mathematical formulas in LaTeX format | formulas array |
styleFont |
Extract font properties (family, style, weight, color) | fonts array |
barcodes |
Extract barcodes and QR codes | barcodes array |
languages |
Detect document languages | languages array |
keyValuePairs |
Extract key-value pairs from forms | Enhanced forms dict |
queryFields |
Enable custom field extraction | Enhanced extraction |
searchablePDF |
Convert scanned PDFs to searchable format | Enhanced OCR |
Features¶
- Text extraction with layout preservation
- Table detection and extraction
- Form field recognition with specialized models
- Advanced OCR with high resolution support
- Mathematical formula extraction (LaTeX format)
- Font property extraction
- Barcode and QR code detection
- Multi-language document support
- Caching support with configurable TTL
- Vision mode support for image formats
- Retry logic for robust processing
Notes¶
- Azure subscription key and endpoint are required
- Advanced features may increase processing time and costs
- Specialized models are optimized for specific document types
- Rate limits and quotas apply based on your Azure subscription
- Vision mode is supported for image formats
- High resolution OCR is recommended for documents with small text
- Formula extraction works best with clear mathematical notation