Mistral OCR Document Loader¶
The Mistral OCR document loader leverages the Mistral OCR API to extract text and images from various document formats. It provides high-quality OCR capabilities with advanced machine learning models.
About Mistral OCR¶
Mistral OCR is an Optical Character Recognition API for document understanding. It is designed to interpret each element of a document (media, text, tables, and equations) with high accuracy. It takes images and PDFs as input and returns their content as ordered, interleaved text and images.
Key capabilities:
- State-of-the-art understanding of complex documents (tables, equations, layouts)
- Native multilingual support across thousands of scripts and languages
- Superior performance on benchmarks
- Fast processing (up to 2000 pages per minute on a single node)
- Structured output in markdown format
Performance Benchmarks¶
Mistral OCR consistently outperforms other leading OCR models in benchmark tests:
Model | Overall | Math | Multilingual | Scanned | Tables |
---|---|---|---|---|---|
Google Document AI | 83.42 | 80.29 | 86.42 | 92.77 | 78.16 |
Azure OCR | 89.52 | 85.72 | 87.52 | 94.65 | 89.52 |
Gemini-1.5-Flash-002 | 90.23 | 89.11 | 86.76 | 94.87 | 90.48 |
Gemini-1.5-Pro-002 | 89.92 | 88.48 | 86.33 | 96.15 | 89.71 |
Gemini-2.0-Flash-001 | 88.69 | 84.18 | 85.80 | 95.11 | 91.46 |
GPT-4o-2024-11-20 | 89.77 | 87.55 | 86.00 | 94.58 | 91.70 |
Mistral OCR 2503 | 94.89 | 94.29 | 89.55 | 98.96 | 96.12 |
Multilingual Performance¶
Mistral OCR excels at processing documents in multiple languages:
Language | Azure OCR | Google Doc AI | Gemini-2.0-Flash-001 | Mistral OCR 2503 |
---|---|---|---|---|
ru | 97.35 | 95.56 | 96.58 | 99.09 |
fr | 97.50 | 96.36 | 97.06 | 99.20 |
hi | 96.45 | 95.65 | 94.99 | 97.55 |
zh | 91.40 | 90.89 | 91.85 | 97.11 |
pt | 97.96 | 96.24 | 97.25 | 99.42 |
de | 98.39 | 97.09 | 97.19 | 99.51 |
es | 98.54 | 97.52 | 97.75 | 99.54 |
tr | 95.91 | 93.85 | 94.66 | 97.00 |
uk | 97.81 | 96.24 | 96.70 | 99.29 |
it | 98.31 | 97.69 | 97.68 | 99.42 |
ro | 96.45 | 95.14 | 95.88 | 98.79 |
Supported Formats¶
- PDF documents
- Image files:
  - JPG/JPEG
  - PNG
  - TIFF
  - BMP
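Before handing a file to the loader, it can be useful to check its extension against the list above. The following is a minimal sketch, not part of extract_thinker: `is_supported_format` is a hypothetical helper, and treating `.tif` as an alias for TIFF is an assumption.

```python
from pathlib import Path

# Extensions derived from the supported-formats list above.
# Assumption: ".tif" is accepted as an alias for ".tiff".
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".tiff", ".tif", ".bmp"}


def is_supported_format(path: str) -> bool:
    """Return True if the file extension matches a listed format (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

This only inspects the file name; it does not verify that the file contents actually match the extension.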
Usage¶
Basic Usage¶
from extract_thinker import DocumentLoaderMistralOCR, MistralOCRConfig

# Create configuration
config = MistralOCRConfig(
    api_key="your_mistral_api_key",
    model="mistral-ocr-latest"
)

# Initialize loader
loader = DocumentLoaderMistralOCR(config)

# Load from URL
pages = loader.load("https://example.com/document.pdf")

# Load from file path
pages = loader.load("path/to/your/document.pdf")

# Process extracted content
for page in pages:
    # Access text content (in markdown format)
    markdown_text = page["content"]

    # Access images if available
    if "images" in page:
        for image in page["images"]:
            image_id = image["id"]
            image_base64 = image["image_base64"]  # If include_image_base64=True
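Once the pages are loaded, a common next step is to merge the markdown and decode any embedded images. The sketch below assumes the page structure shown above (a list of dicts with a `"content"` key and an optional `"images"` list whose entries carry `"id"` and `"image_base64"`). The helper names `join_markdown` and `decode_images` are hypothetical, and the handling of a `data:` URI prefix is an assumption about how the base64 payload may be encoded.

```python
import base64
from typing import Dict, List


def join_markdown(pages: List[dict], separator: str = "\n\n") -> str:
    """Concatenate the markdown content of all pages in order."""
    return separator.join(page.get("content", "") for page in pages)


def decode_images(pages: List[dict]) -> Dict[str, bytes]:
    """Map each image id to its decoded bytes, skipping images without base64 data."""
    decoded: Dict[str, bytes] = {}
    for page in pages:
        for image in page.get("images") or []:
            data = image.get("image_base64")
            if not data:
                continue  # no payload, e.g. include_image_base64 was left at False
            # Assumption: the payload may be a data URI ("data:image/png;base64,...").
            if data.startswith("data:") and "," in data:
                data = data.split(",", 1)[1]
            decoded[image["id"]] = base64.b64decode(data)
    return decoded
```

The decoded bytes can then be written to disk or passed to an image library, depending on the application.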
Configuration Options¶
The MistralOCRConfig class supports the following options:
Option | Type | Default | Description |
---|---|---|---|
api_key | str | Required | Mistral API key |
model | str | "mistral-ocr-latest" | OCR model to use |
content | Any | None | Initial content to process |
cache_ttl | int | 300 | Cache time-to-live in seconds |
include_image_base64 | bool | False | Include image base64 in response |
pages | List[int] | None | Specific pages to process (PDF only) |
image_limit | int | None | Maximum number of images to extract |
image_min_size | int | None | Minimum image size to extract |
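To keep the API key out of source code, the options can be assembled from the environment first. This is a sketch under stated assumptions: `build_config_kwargs` is a hypothetical helper, the environment variable name `MISTRAL_API_KEY` is an assumed convention, and it is assumed that `MistralOCRConfig` accepts every option in the table as a keyword argument, as it does for `api_key` and `model` in the basic example.

```python
import os
from typing import List, Optional


def build_config_kwargs(
    pages: Optional[List[int]] = None,
    include_image_base64: bool = False,
    cache_ttl: int = 300,
    env_var: str = "MISTRAL_API_KEY",  # assumed variable name
) -> dict:
    """Assemble keyword arguments that mirror the option table."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} before creating the loader")
    kwargs = {
        "api_key": api_key,
        "model": "mistral-ocr-latest",
        "cache_ttl": cache_ttl,
        "include_image_base64": include_image_base64,
    }
    if pages is not None:
        kwargs["pages"] = list(pages)  # per the table, only applies to PDFs
    return kwargs
```

The result could then be passed along as `MistralOCRConfig(**build_config_kwargs(pages=[0, 1]))`. The table does not say whether page numbers start at 0 or 1, so that detail should be checked against the library.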
Features¶
- High-quality OCR with Mistral AI's models
- Support for PDF and image formats
- Text extraction in markdown format
- Image extraction with positioning information
- Support for pagination in PDF documents
- Caching for improved performance
- URL, file path, and BytesIO input support
- Processing speed up to 2000 pages per minute
- Superior handling of complex elements like tables, math equations, and diagrams
- Native support for thousands of languages and scripts
How It Works¶
When processing a document with the Mistral OCR loader:
- For URLs: The URL is sent directly to the Mistral OCR API
- For file paths or BytesIO objects:
  - The file is first uploaded to Mistral's file storage system
  - A signed URL is generated for the uploaded file
  - The OCR API processes the document using the signed URL
This approach follows Mistral's recommended workflow for document processing and complies with their API requirements.
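The routing decision described above can be sketched as a small predicate. This is an illustration, not the loader's actual code: `needs_upload` is a hypothetical name, and treating only `http` and `https` sources as remote URLs is an assumption.

```python
from io import BytesIO
from typing import Union
from urllib.parse import urlparse


def needs_upload(source: Union[str, BytesIO]) -> bool:
    """Return True when the source must be uploaded before OCR (file path or BytesIO)."""
    if isinstance(source, BytesIO):
        return True
    scheme = urlparse(source).scheme.lower()
    # Assumption: only http(s) URLs are sent directly to the OCR API.
    return scheme not in ("http", "https")
```

Sources for which this returns `True` would go through the upload and signed-URL steps; the rest would be passed to the OCR API as-is.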
Common Use Cases¶
Mistral OCR can be used for a variety of document processing tasks:
- Scientific research: Convert scientific papers with complex equations and diagrams into AI-ready formats
- Historical document preservation: Digitize historical documents and artifacts
- Customer service enhancement: Transform documentation and manuals into indexed knowledge
- Educational content processing: Extract information from lecture notes, presentations, and educational materials
- Legal document analysis: Process regulatory filings and legal documents with high accuracy
- Multilingual document handling: Process documents in multiple languages with superior accuracy
API Usage Notes¶
- The Mistral OCR API requires authentication with an API key
- API usage is subject to Mistral AI's terms and pricing (approximately 1,000 pages per dollar)
- Response time depends on document size and complexity
- Extracted text is returned in markdown format
- Image positions and dimensions are provided for visual context
- Pagination is only supported for PDF documents
- Maximum document size: 50 MB
- Maximum page limit: 1,000 pages
- Local files are uploaded to Mistral's file storage with a purpose of "ocr"
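The limits and the approximate price listed above lend themselves to a quick pre-flight check. The sketch below is illustrative only: the helper names are hypothetical, 50 MB is assumed to mean 50 × 1024 × 1024 bytes, and the cost figure is a rough estimate based on the approximate rate quoted above rather than actual billing.

```python
MAX_DOCUMENT_BYTES = 50 * 1024 * 1024  # 50 MB, assuming 1 MB = 1024 * 1024 bytes
MAX_PAGES = 1000                       # maximum page limit from the notes above
PAGES_PER_DOLLAR = 1000                # approximate rate from the notes above


def within_limits(size_bytes: int, page_count: int) -> bool:
    """Return True if the document respects both the size and the page limit."""
    return size_bytes <= MAX_DOCUMENT_BYTES and page_count <= MAX_PAGES


def estimate_cost_dollars(page_count: int) -> float:
    """Rough cost estimate in dollars at the approximate per-page rate."""
    return page_count / PAGES_PER_DOLLAR
```

A document that fails `within_limits` would need to be split, for example by page ranges, before processing.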
Requirements¶
- An active Mistral AI API key
- requests library for API communication
- Internet connectivity for API access