Azure Document Intelligence Loader
The Azure Document Intelligence loader (formerly Form Recognizer) uses Azure's Document Intelligence service to extract text, tables, and structured information from documents.
Supported Formats
Supports PDF, JPEG/JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX and HTML.
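A quick pre-flight check can catch unsupported files before any request is sent to Azure. The sketch below is a hypothetical helper, not part of extract_thinker: `SUPPORTED_EXTENSIONS` and `is_supported` are names introduced here, and the extension set is copied from the list above.

```python
from pathlib import Path

# Extensions copied from the supported-formats list above.
# This set and the helper below are illustrative, not part of extract_thinker.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpeg", ".jpg", ".png", ".bmp", ".tiff",
    ".heif", ".docx", ".xlsx", ".pptx", ".html",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the supported list."""
    # Compare case-insensitively so "Report.PDF" is treated like "report.pdf".
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

Calling `is_supported(path)` before `loader.load(path)` lets a batch job skip or report unsupported files early.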
Usage
Basic Usage
from extract_thinker import DocumentLoaderAzureForm

# Initialize with Azure credentials
loader = DocumentLoaderAzureForm(
    subscription_key="your_subscription_key",
    endpoint="your_endpoint",
    model_id="prebuilt-document"  # Use prebuilt document model
)

# Load document
pages = loader.load("path/to/your/document.pdf")

# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]
    # Access tables if available
    tables = page.get("tables", [])
    # Access form fields if available
    forms = page.get("forms", {})
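Once the pages are loaded, it is often convenient to collapse them into a single summary. The following is a minimal sketch that assumes each page is a dict with the `content`, `tables`, and `forms` keys used in the example above; `summarize_pages` is a hypothetical helper, not a library function.

```python
from typing import Any, Dict, List


def summarize_pages(pages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-page results into one text blob, a table count, and a forms dict."""
    texts: List[str] = []
    table_count = 0
    form_fields: Dict[str, Any] = {}
    for page in pages:
        texts.append(page.get("content", ""))
        # "tables" and "forms" may be missing or empty on some pages.
        table_count += len(page.get("tables") or [])
        # Later pages overwrite earlier ones when a field name repeats.
        form_fields.update(page.get("forms") or {})
    return {
        "text": "\n\n".join(texts),
        "table_count": table_count,
        "forms": form_fields,
    }
```

With the loader above, this would be called as `summary = summarize_pages(pages)`.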
Configuration-based Usage
from extract_thinker import DocumentLoaderAzureForm, AzureConfig

# Create configuration
config = AzureConfig(
    subscription_key="your_subscription_key",
    endpoint="your_endpoint",
    model_id="prebuilt-layout",  # Use layout model for enhanced layout analysis
    cache_ttl=600,  # Cache results for 10 minutes
    features=["ocrHighResolution", "barcodes"]  # Enable advanced features
)

# Initialize loader with configuration
loader = DocumentLoaderAzureForm(config)

# Load and process document
pages = loader.load("path/to/your/document.pdf")
Advanced Features Usage
from extract_thinker import DocumentLoaderAzureForm, AzureConfig

# Configuration with multiple advanced features
config = AzureConfig(
    subscription_key="your_subscription_key",
    endpoint="your_endpoint",
    model_id="prebuilt-layout",
    features=[
        "ocrHighResolution",  # High resolution OCR for small text
        "formulas",           # Extract mathematical formulas in LaTeX
        "styleFont",          # Extract font properties
        "barcodes",           # Extract barcodes and QR codes
        "languages",          # Detect document languages
        "keyValuePairs"       # Extract key-value pairs from forms
    ]
)

loader = DocumentLoaderAzureForm(config)
pages = loader.load("document_with_advanced_content.pdf")

for page in pages:
    # Standard content (tables and forms may be absent, so use .get)
    print(f"Text content: {page['content']}")
    print(f"Tables: {page.get('tables', [])}")
    print(f"Forms: {page.get('forms', {})}")

    # Advanced features (if detected in document)
    if 'formulas' in page:
        print(f"Mathematical formulas: {page['formulas']}")
    if 'fonts' in page:
        print(f"Font information: {page['fonts']}")
    if 'barcodes' in page:
        print(f"Barcodes found: {page['barcodes']}")
    if 'languages' in page:
        print(f"Detected languages: {page['languages']}")
Specialized Models Usage
from extract_thinker import DocumentLoaderAzureForm, AzureConfig

# Use specialized invoice model
config = AzureConfig(
    subscription_key="your_subscription_key",
    endpoint="your_endpoint",
    model_id="prebuilt-invoice"
)

loader = DocumentLoaderAzureForm(config)
pages = loader.load("invoice.pdf")

# Access extracted invoice fields
for page in pages:
    forms = page.get("forms", {})
    vendor_name = forms.get("VendorName", "")
    invoice_total = forms.get("InvoiceTotal", "")
    print(f"Vendor: {vendor_name}, Total: {invoice_total}")
Configuration Options
The AzureConfig class supports the following options:
| Option | Type | Default | Description |
|---|---|---|---|
| subscription_key | str | Required | Azure subscription key |
| endpoint | str | Required | Azure endpoint URL |
| content | Any | None | Initial content to process |
| cache_ttl | int | 300 | Cache time-to-live in seconds |
| model_id | str | "prebuilt-layout" | Model ID to use |
| max_retries | int | 3 | Maximum retries for failed requests |
| features | List[str] | None | Advanced features to enable |
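Hard-coding credentials is rarely desirable, so the options above can be assembled from environment variables instead. This is a sketch under stated assumptions: the `AZURE_DI_*` variable names and the `azure_settings_from_env` helper are hypothetical, and only the option names and defaults come from the table.

```python
import os
from typing import Any, Dict, Mapping, Optional


def azure_settings_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build keyword arguments for AzureConfig from environment variables.

    The AZURE_DI_* names are illustrative; pick whatever fits your deployment.
    """
    env = os.environ if env is None else env
    required = ("AZURE_DI_KEY", "AZURE_DI_ENDPOINT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing required environment variables: {', '.join(missing)}")
    return {
        "subscription_key": env["AZURE_DI_KEY"],
        "endpoint": env["AZURE_DI_ENDPOINT"],
        # Defaults below mirror the configuration table.
        "model_id": env.get("AZURE_DI_MODEL_ID", "prebuilt-layout"),
        "cache_ttl": int(env.get("AZURE_DI_CACHE_TTL", "300")),
        "max_retries": int(env.get("AZURE_DI_MAX_RETRIES", "3")),
    }
```

Assuming AzureConfig accepts these options as keyword arguments, as in the usage examples, the result could be passed as `AzureConfig(**azure_settings_from_env())`.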
Available Models
General Purpose Models
| Model ID | Description | Best For |
|---|---|---|
| prebuilt-read | OCR/Read model | Text extraction from printed and handwritten documents |
| prebuilt-layout | Layout analysis | Documents with tables, selection marks, and complex layouts |
| prebuilt-document | General document | Key-value pairs, tables, and general document structure |
Specialized Models
| Model ID | Description |
|---|---|
| prebuilt-invoice | Invoice processing |
| prebuilt-receipt | Receipt processing |
| prebuilt-idDocument | Identity documents |
| prebuilt-businessCard | Business cards |
| prebuilt-tax.us.w2 | US W2 tax forms |
| prebuilt-tax.us.1040 | US 1040 tax forms |
| prebuilt-contract | Contracts |
| prebuilt-healthInsurance | US health insurance cards |
| prebuilt-bankStatement | Bank statements |
| prebuilt-payStub | Pay stubs |
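When an application already knows what kind of document it is handling, the model ID can be chosen from a lookup table. The sketch below copies the model IDs from the tables above; the document-type keys and the `select_model_id` helper are hypothetical names chosen for illustration.

```python
# Model IDs are taken from the tables above; the keys are illustrative labels.
DEFAULT_MODEL_ID = "prebuilt-layout"

MODEL_BY_DOCUMENT_TYPE = {
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
    "id_document": "prebuilt-idDocument",
    "business_card": "prebuilt-businessCard",
    "w2": "prebuilt-tax.us.w2",
    "1040": "prebuilt-tax.us.1040",
    "contract": "prebuilt-contract",
    "health_insurance": "prebuilt-healthInsurance",
    "bank_statement": "prebuilt-bankStatement",
    "pay_stub": "prebuilt-payStub",
}


def select_model_id(document_type: str) -> str:
    """Return the specialized model for a document type, or the layout model."""
    key = document_type.strip().lower()
    return MODEL_BY_DOCUMENT_TYPE.get(key, DEFAULT_MODEL_ID)
```

Falling back to the layout model keeps unknown document types working, at the cost of not extracting type-specific fields.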
Advanced Features
The loader supports advanced extraction features that can be enabled via the features parameter:
| Feature | Description | Output Field |
|---|---|---|
| ocrHighResolution | High resolution OCR for better small text recognition | Enhanced text in content |
| formulas | Extract mathematical formulas in LaTeX format | formulas array |
| styleFont | Extract font properties (family, style, weight, color) | fonts array |
| barcodes | Extract barcodes and QR codes | barcodes array |
| languages | Detect document languages | languages array |
| keyValuePairs | Extract key-value pairs from forms | Enhanced forms dict |
| queryFields | Enable custom field extraction | Enhanced extraction |
| searchablePDF | Convert scanned PDFs to searchable format | Enhanced OCR |
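Feature names are case-sensitive strings, so a typo such as `"barcode"` is easy to miss. A small check against the names in the table above can surface mistakes before a request is made. This is an illustrative sketch: `KNOWN_FEATURES` and `validate_features` are hypothetical and not part of the loader, which may perform its own validation.

```python
from typing import Iterable, List

# Feature names copied from the table above.
KNOWN_FEATURES = (
    "ocrHighResolution",
    "formulas",
    "styleFont",
    "barcodes",
    "languages",
    "keyValuePairs",
    "queryFields",
    "searchablePDF",
)


def validate_features(features: Iterable[str]) -> List[str]:
    """Reject unknown feature names and drop duplicates, preserving order."""
    requested = list(features)
    unknown = [name for name in requested if name not in KNOWN_FEATURES]
    if unknown:
        raise ValueError(f"Unknown features: {', '.join(unknown)}")
    # dict.fromkeys keeps the first occurrence of each name in order.
    return list(dict.fromkeys(requested))
```

The validated list can then be passed as the `features` option.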
Features
- Text extraction with layout preservation
- Table detection and extraction
- Form field recognition with specialized models
- Advanced OCR with high resolution support
- Mathematical formula extraction (LaTeX format)
- Font property extraction
- Barcode and QR code detection
- Multi-language document support
- Caching support with configurable TTL
- Vision mode support for image formats
- Retry logic for robust processing
Notes
- Azure subscription key and endpoint are required
- Advanced features may increase processing time and costs
- Specialized models are optimized for specific document types
- Rate limits and quotas apply based on your Azure subscription
- Vision mode is supported for image formats
- High resolution OCR is recommended for documents with small text
- Formula extraction works best with clear mathematical notation
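Because rate limits and quotas apply, batch jobs that load many documents may want to space out retries in addition to the loader's own `max_retries` setting. The following is a generic exponential-backoff sketch, not the loader's internal retry logic; `call_with_backoff` and its parameters are hypothetical.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on failure with delays of base_delay * 2**n seconds."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            # Re-raise once the final attempt has failed.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A call might look like `pages = call_with_backoff(lambda: loader.load(path))`. In practice, `retry_on` should be narrowed to the throttling or transient errors you actually observe, so that configuration mistakes fail immediately.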