Doc2txt Document Loader

The Doc2txt loader extracts text from Microsoft Word documents. It supports both legacy (.doc) and modern (.docx) file formats.

Supported Formats

doc
docx
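
Before handing a path to the loader, it can help to reject files whose extension is not in the list above. The helper below is a minimal sketch, not part of extract_thinker: the name `is_supported_word_file` and the `SUPPORTED_EXTENSIONS` constant are hypothetical, and the check looks only at the file extension, not at the file contents.

```python
from pathlib import Path

# Extensions taken from the "Supported Formats" list above.
SUPPORTED_EXTENSIONS = {".doc", ".docx"}


def is_supported_word_file(path: str) -> bool:
    """Return True if the path ends in a listed extension (case-insensitive).

    Hypothetical helper for illustration; it does not inspect file contents.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A caller might use it as a guard, for example `if is_supported_word_file(path): pages = loader.load(path)`.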

Usage

Basic Usage

from extract_thinker import DocumentLoaderDoc2txt

# Initialize with default settings
loader = DocumentLoaderDoc2txt()

# Load document
pages = loader.load("path/to/your/document.docx")

# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]
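
If downstream code needs the whole document as a single string, the per-page text can be concatenated. The sketch below assumes only what the loop above shows, namely that each page is a dict with a "content" key; `join_page_text` is a hypothetical helper, and `sample_pages` is stand-in data used in place of a real `loader.load(...)` result.

```python
from typing import Dict, List


def join_page_text(pages: List[Dict[str, str]], separator: str = "\n\n") -> str:
    """Concatenate the "content" field of each page, skipping empty pages."""
    parts = [page["content"] for page in pages if page.get("content")]
    return separator.join(parts)


# Stand-in data shaped like the loader output shown above; in real use,
# pass the list returned by loader.load(...) instead.
sample_pages = [
    {"content": "First page."},
    {"content": ""},
    {"content": "Second page."},
]
full_text = join_page_text(sample_pages)
```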

Configuration-based Usage

from extract_thinker import DocumentLoaderDoc2txt, Doc2txtConfig

# Create configuration
config = Doc2txtConfig(
    page_separator="\n\n---\n\n",  # Custom page separator
    preserve_whitespace=True,      # Preserve original whitespace
    extract_images=True,           # Extract embedded images
    cache_ttl=600                  # Cache results for 10 minutes
)

# Initialize loader with configuration
loader = DocumentLoaderDoc2txt(config)

# Load and process document
pages = loader.load("path/to/your/document.docx")

Configuration Options

The Doc2txtConfig class supports the following options:

Option	Type	Default	Description
`content`	Any	None	Initial content to process
`cache_ttl`	int	300	Cache time-to-live in seconds
`page_separator`	str	"\n\n"	Text to use as page separator
`preserve_whitespace`	bool	False	Whether to preserve the document's original whitespace
`extract_images`	bool	False	Whether to extract embedded images
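
To make the meaning of a time-to-live in seconds concrete, here is a generic TTL cache sketch. It is an illustration of the general idea only and is not the caching code used by extract_thinker, whose internals are not described here; the class name `TTLCache` and its `clock` parameter are assumptions introduced for the example.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Illustrative cache whose entries expire after ttl_seconds.

    Hypothetical sketch; not the library's implementation.
    """

    def __init__(
        self,
        ttl_seconds: int = 300,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Store the value together with the time at which it expires.
        self._entries[key] = (self._clock() + self.ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[key]
            return None
        return value
```

With `ttl_seconds=600`, as in the configuration example, a value stored at time 0 would be returned until 600 seconds have elapsed and treated as a miss afterwards.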

Features

Text extraction from Word documents
Support for both .doc and .docx
Custom page separation
Whitespace preservation
Image extraction (optional)
Caching support
No cloud service required

Notes

Vision mode is not supported
Image extraction requires additional memory
Processing runs locally and does not depend on external services
May not preserve complex formatting
Handles both legacy and modern Word formats