PyPDF Document Loader¶

The PyPDF loader uses the PyPDF library to extract text and images from PDF documents. It provides basic text extraction and supports password-protected PDFs.

Supported Formats¶

pdf

Usage¶

Basic Usage¶

from extract_thinker import DocumentLoaderPyPdf

# Initialize with default settings
loader = DocumentLoaderPyPdf()

# Load document
pages = loader.load("path/to/your/document.pdf")

# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]

Configuration-based Usage¶

from extract_thinker import DocumentLoaderPyPdf, PyPDFConfig

# Create configuration
config = PyPDFConfig(
    password="your_password",      # For password-protected PDFs
    vision_enabled=True,           # Enable vision mode for images
    extract_text=True,             # Enable text extraction
    cache_ttl=600                  # Cache results for 10 minutes
)

# Initialize loader with configuration
loader = DocumentLoaderPyPdf(config)

# Load and process document
pages = loader.load("path/to/your/document.pdf")

Configuration Options¶

The PyPDFConfig class supports the following options:

Option	Type	Default	Description
`content`	Any	None	Initial content to process
`cache_ttl`	int	300	Cache time-to-live in seconds
`password`	str	None	Password for protected PDFs
`vision_enabled`	bool	False	Enable vision mode for images
`extract_text`	bool	True	Enable text extraction

Features¶

Basic text extraction
Password-protected PDF support
Image extraction (with vision mode)
Caching support
No cloud service required
Lightweight and fast processing

Notes¶

Vision mode can be enabled for image extraction
Text extraction can be disabled for better performance
Supports encrypted/password-protected PDFs
Local processing with no external dependencies
May not preserve complex layouts or tables