MarkItDown Document Loader

MarkItDown is a Microsoft library that converts many file formats to Markdown text. The MarkItDown loader (DocumentLoaderMarkItDown) wraps it for text extraction and offers optional vision mode support.

Supported Formats

Documents

  • pdf
  • doc/docx
  • ppt/pptx
  • xls/xlsx

Text

  • txt
  • html
  • xml
  • json

Images

  • jpg/jpeg
  • png
  • bmp
  • gif

Audio

  • wav
  • mp3
  • m4a

Others

  • csv
  • tsv
  • zip
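
To check whether a file falls into one of the groups above before loading it, a small lookup can help. The sketch below only mirrors the list on this page; `SUPPORTED_FORMATS` and `format_category` are hypothetical helper names, not part of extract_thinker or MarkItDown.

```python
from pathlib import Path
from typing import Optional

# Extension groups copied from the "Supported Formats" list on this page.
# This table is illustrative; the library does not expose it under this name.
SUPPORTED_FORMATS = {
    "Documents": {"pdf", "doc", "docx", "ppt", "pptx", "xls", "xlsx"},
    "Text": {"txt", "html", "xml", "json"},
    "Images": {"jpg", "jpeg", "png", "bmp", "gif"},
    "Audio": {"wav", "mp3", "m4a"},
    "Others": {"csv", "tsv", "zip"},
}


def format_category(path: str) -> Optional[str]:
    """Return the category for a path's extension, or None if it is not listed."""
    extension = Path(path).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED_FORMATS.items():
        if extension in extensions:
            return category
    return None
```

For example, `format_category("scan.jpeg")` returns `"Images"`, while a path with an unlisted or missing extension returns `None`.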

Usage

Basic Usage

from extract_thinker import DocumentLoaderMarkItDown

# Initialize with default settings
loader = DocumentLoaderMarkItDown()

# Load document
pages = loader.load("path/to/your/document.pdf")

# Process extracted content
for page in pages:
    # Each page is a dict; the extracted text is stored under "content"
    text = page["content"]
    print(text)
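
When the whole document is needed as one string, the per-page text can be joined. This is a minimal sketch that assumes only what the example above shows (each page is a dict with a "content" key); `join_page_text` is a hypothetical helper and `sample_pages` stands in for the result of `loader.load(...)`.

```python
from typing import Any, Dict, List


def join_page_text(pages: List[Dict[str, Any]], separator: str = "\n\n") -> str:
    """Concatenate the "content" field of each page, skipping empty pages."""
    texts = [page.get("content", "") for page in pages]
    return separator.join(text for text in texts if text.strip())


# Stand-in for the list returned by loader.load(...); not real loader output.
sample_pages = [
    {"content": "First page"},
    {"content": ""},
    {"content": "Second page"},
]
combined = join_page_text(sample_pages)
```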

Configuration-based Usage

from extract_thinker import DocumentLoaderMarkItDown, MarkItDownConfig

# Create configuration
config = MarkItDownConfig(
    page_separator="---",          # Custom page separator
    preserve_whitespace=True,      # Preserve original whitespace
    mime_type_detection=True,      # Enable MIME type detection
    default_extension=".md",       # Default file extension
    llm_client="gpt-4",           # LLM client for enhanced parsing
    cache_ttl=600                  # Cache results for 10 minutes
)

# Initialize loader with configuration
loader = DocumentLoaderMarkItDown(config)

# Load and process document
pages = loader.load("path/to/your/document.md")
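
One way to picture how `page_separator` and `preserve_whitespace` could interact is shown below. This is an illustrative sketch, not the loader's implementation: it assumes the extracted text is split on the separator and that each chunk is stripped unless whitespace is preserved. `split_into_pages` is a hypothetical function.

```python
from typing import Dict, List


def split_into_pages(
    text: str,
    page_separator: str = "\n\n",
    preserve_whitespace: bool = False,
) -> List[Dict[str, str]]:
    """Split text on page_separator and wrap each non-empty chunk as a page dict."""
    pages = []
    for chunk in text.split(page_separator):
        # Keep the chunk untouched only when whitespace must be preserved.
        content = chunk if preserve_whitespace else chunk.strip()
        if content.strip():  # drop chunks that contain only whitespace
            pages.append({"content": content})
    return pages
```

With `page_separator="---"`, the text `"Intro  \n---\n  Body\n---\n\n"` yields two pages; the trailing whitespace-only chunk is dropped in both modes.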

Configuration Options

The MarkItDownConfig class supports the following options:

| Option              | Type | Default | Description                    |
|---------------------|------|---------|--------------------------------|
| content             | Any  | None    | Initial content to process     |
| cache_ttl           | int  | 300     | Cache time-to-live in seconds  |
| page_separator      | str  | "\n\n"  | Text to use as page separator  |
| preserve_whitespace | bool | False   | Whether to preserve whitespace |
| mime_type_detection | bool | True    | Enable MIME type detection     |
| default_extension   | str  | ".txt"  | Default file extension         |
| llm_client          | str  | None    | LLM client for enhanced parsing |
| llm_model           | str  | None    | LLM model for enhanced parsing |
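
The `cache_ttl` option is a time-to-live in seconds. To make that semantics concrete, here is a minimal sketch of a TTL cache; it is not the cache the loader uses, and `TTLCache` and its `clock` parameter are hypothetical names introduced for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Store values that expire ttl_seconds after they were set."""

    def __init__(
        self,
        ttl_seconds: int = 300,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Record the value together with the time it was stored.
        self._entries[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl_seconds:
            # The entry has outlived its TTL: evict it and report a miss.
            del self._entries[key]
            return None
        return value
```

Under this model, the default of 300 means a document loaded twice within five minutes could be served from the cache, while a later request would be processed again.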

Features

  • Multi-format document processing
  • Text and layout preservation
  • MIME type detection
  • Custom page separation
  • Whitespace preservation
  • LLM-enhanced parsing
  • Caching support
  • Stream-based loading
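
As a rough illustration of MIME type detection with a fallback extension, the sketch below uses Python's standard `mimetypes` module. It does not reproduce MarkItDown's own detection logic; `guess_mime_type` is a hypothetical helper, and using `default_extension` as the fallback when a path has no recognizable extension is an assumption.

```python
import mimetypes
from typing import Optional


def guess_mime_type(path: str, default_extension: str = ".txt") -> Optional[str]:
    """Guess a MIME type from the file name, falling back to a default extension."""
    mime_type, _encoding = mimetypes.guess_type(path)
    if mime_type is None:
        # No usable extension: assume the type implied by the default extension.
        mime_type, _encoding = mimetypes.guess_type("file" + default_extension)
    return mime_type
```

Name-based guessing never inspects file contents, so a mislabeled file would still be classified by its extension.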

Notes

  • Vision mode is supported for image formats
  • LLM enhancement is optional
  • Processing runs locally; no external service is needed unless the optional LLM enhancement is enabled
  • Preserves document structure
  • Handles a wide variety of file formats