MarkItDown Document Loader¶
MarkItDown is a document processing library from Microsoft that converts many file formats to text. The MarkItDown loader, DocumentLoaderMarkItDown, wraps it for text extraction and offers an optional vision mode for image inputs.
Supported Formats¶
Documents¶
- doc/docx
- ppt/pptx
- xls/xlsx
Text¶
- txt
- html
- xml
- json
Images¶
- jpg/jpeg
- png
- bmp
- gif
Audio¶
- wav
- mp3
- m4a
Others¶
- csv
- tsv
- zip
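Before handing a file to the loader, it can be convenient to check its extension against the list above. The following is a minimal sketch using only the standard library. `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names that mirror this page's list; they are not part of the library's API.

```python
from pathlib import Path

# Hypothetical helper: mirrors the formats listed on this page,
# not a constant exported by extract_thinker.
SUPPORTED_EXTENSIONS = {
    "doc", "docx", "ppt", "pptx", "xls", "xlsx",  # documents
    "txt", "html", "xml", "json",                 # text
    "jpg", "jpeg", "png", "bmp", "gif",           # images
    "wav", "mp3", "m4a",                          # audio
    "csv", "tsv", "zip",                          # others
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the list above."""
    suffix = Path(path).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_EXTENSIONS
```

The check is case-insensitive and looks only at the final suffix, so a name such as `archive.tar.gz` is treated as `gz`.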
Usage¶
Basic Usage¶
from extract_thinker import DocumentLoaderMarkItDown
# Initialize with default settings
loader = DocumentLoaderMarkItDown()
# Load document
pages = loader.load("path/to/your/document.pdf")
# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]
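Since each page is a dictionary with a "content" entry, the text of a whole document can be collected with a small helper. This is a minimal sketch: `join_pages` is a hypothetical helper, and `sample_pages` is stand-in data that assumes the page shape shown above rather than real loader output.

```python
from typing import Any, Dict, List


def join_pages(pages: List[Dict[str, Any]], separator: str = "\n\n") -> str:
    """Concatenate the "content" of each page, skipping empty pages."""
    texts = [str(page.get("content") or "").strip() for page in pages]
    return separator.join(text for text in texts if text)


# Stand-in data shaped like the loader output described above.
sample_pages = [
    {"content": "First page"},
    {"content": ""},
    {"content": "Second page"},
]
full_text = join_pages(sample_pages)
```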
Configuration-based Usage¶
from extract_thinker import DocumentLoaderMarkItDown, MarkItDownConfig
# Create configuration
config = MarkItDownConfig(
    page_separator="---",       # Custom page separator
    preserve_whitespace=True,   # Preserve original whitespace
    mime_type_detection=True,   # Enable MIME type detection
    default_extension=".md",    # Default file extension
    llm_client="gpt-4",         # LLM client for enhanced parsing
    cache_ttl=600               # Cache results for 10 minutes
)
# Initialize loader with configuration
loader = DocumentLoaderMarkItDown(config)
# Load and process document
pages = loader.load("path/to/your/document.md")
Configuration Options¶
The MarkItDownConfig class supports the following options:
Option | Type | Default | Description |
---|---|---|---|
content | Any | None | Initial content to process |
cache_ttl | int | 300 | Cache time-to-live in seconds |
page_separator | str | "\n\n" | Text to use as page separator |
preserve_whitespace | bool | False | Whether to preserve whitespace |
mime_type_detection | bool | True | Enable MIME type detection |
default_extension | str | ".txt" | Default file extension |
llm_client | str | None | LLM client for enhanced parsing |
llm_model | str | None | LLM model for enhanced parsing |
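To make the page_separator and preserve_whitespace options more concrete, here is a sketch of separator-based page splitting. It is an assumption about the behaviour the option descriptions suggest, not the loader's actual code, and `split_into_pages` is a hypothetical function.

```python
from typing import Dict, List


def split_into_pages(
    text: str,
    page_separator: str = "\n\n",
    preserve_whitespace: bool = False,
) -> List[Dict[str, str]]:
    """Split text on page_separator and wrap each piece as a page dict."""
    pages = []
    for chunk in text.split(page_separator):
        if not preserve_whitespace:
            chunk = chunk.strip()
        if chunk.strip():  # drop pages that contain only whitespace
            pages.append({"content": chunk})
    return pages
```

With the default settings each page is trimmed; with `preserve_whitespace=True` the leading and trailing whitespace of each page is kept as found.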
Features¶
- Multi-format document processing
- Text and layout preservation
- MIME type detection
- Custom page separation
- Whitespace preservation
- LLM-enhanced parsing
- Caching support
- Stream-based loading
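The cache_ttl option is described as a time-to-live in seconds. The sketch below shows the general idea of TTL caching with the standard library only. `TTLCache` is a hypothetical class written for illustration and does not reflect how the loader implements its cache.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny time-to-live cache: entries expire ttl seconds after being stored."""

    def __init__(self, ttl: float = 300, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self._clock = clock  # injectable so expiry can be checked without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock(), value)
```

A lookup within the TTL window returns the stored value, and a lookup at or past the window evicts the entry and returns None, which is the point at which a loader would reprocess the file.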
Notes¶
- Vision mode is supported for image formats
- LLM enhancement is optional
- Processing runs locally, with no external service required unless LLM enhancement is enabled
- Preserves document structure
- Handles a wide variety of file formats