PyPDF Document Loader¶
The PyPDF loader uses the PyPDF library to extract text and images from PDF documents. It provides basic text extraction and supports password-protected PDFs.
Supported Formats¶
- PDF
Usage¶
Basic Usage¶
from extract_thinker import DocumentLoaderPyPdf
# Initialize with default settings
loader = DocumentLoaderPyPdf()
# Load document
pages = loader.load("path/to/your/document.pdf")
# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]
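Once the pages are loaded, the per-page text often needs to be combined into a single string. The following is a minimal sketch, assuming each page is a dict with a "content" key as in the loop above. `join_page_text` is a hypothetical helper, not part of extract_thinker, and the sample data only imitates the shape of the loader output.

```python
def join_page_text(pages, separator="\n\n"):
    """Concatenate the text of every page, skipping empty pages.

    Assumes each page is a dict with a "content" key, as in the loop above.
    join_page_text is a hypothetical helper, not part of extract_thinker.
    """
    texts = []
    for page in pages:
        # Treat a missing or None value as an empty page
        text = (page.get("content") or "").strip()
        if text:
            texts.append(text)
    return separator.join(texts)


# Stand-in data with the same shape as the loader output
sample_pages = [{"content": "First page"}, {"content": ""}, {"content": "Second page"}]
print(join_page_text(sample_pages))
```

With a real document, the result of `loader.load(...)` could be passed in place of `sample_pages`.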
Configuration-based Usage¶
from extract_thinker import DocumentLoaderPyPdf, PyPDFConfig
# Create configuration
config = PyPDFConfig(
    password="your_password",  # For password-protected PDFs
    vision_enabled=True,       # Enable vision mode for images
    extract_text=True,         # Enable text extraction
    cache_ttl=600              # Cache results for 10 minutes
)
# Initialize loader with configuration
loader = DocumentLoaderPyPdf(config)
# Load and process document
pages = loader.load("path/to/your/document.pdf")
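Hardcoding a PDF password in source code is rarely desirable. One possible approach, sketched below, reads it from an environment variable and builds the keyword arguments for `PyPDFConfig`. The variable name `PDF_PASSWORD` and the helper `config_kwargs_from_env` are assumptions made for illustration; only the option names come from the example above.

```python
import os


def config_kwargs_from_env(env_var="PDF_PASSWORD"):
    """Build keyword arguments for PyPDFConfig without hardcoding the password.

    env_var is a hypothetical variable name; the option names are taken
    from the configuration example above.
    """
    kwargs = {"extract_text": True, "cache_ttl": 600}
    password = os.environ.get(env_var)
    if password:  # only pass a password when one is actually set
        kwargs["password"] = password
    return kwargs
```

The result could then be unpacked into the configuration, for example `config = PyPDFConfig(**config_kwargs_from_env())`.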
Configuration Options¶
The PyPDFConfig class supports the following options:
Option | Type | Default | Description |
---|---|---|---|
content | Any | None | Initial content to process |
cache_ttl | int | 300 | Cache time-to-live in seconds |
password | str | None | Password for protected PDFs |
vision_enabled | bool | False | Enable vision mode for images |
extract_text | bool | True | Enable text extraction |
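To make the meaning of a time-to-live concrete, the sketch below shows a toy cache in which an entry is served until `ttl_seconds` have elapsed and is treated as a miss afterwards. It is illustrative only: `TTLCache` is a hypothetical class, and this is not a description of how extract_thinker implements its cache.

```python
import time


class TTLCache:
    """Toy time-to-live cache; illustrative only, not extract_thinker's cache."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable clock makes expiry easy to test
        self._entries = {}   # key -> (expiry_time, value)

    def set(self, key, value):
        self._entries[key] = (self._clock() + self.ttl_seconds, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop the entry and report a miss
            return default
        return value
```

Under this model, a `cache_ttl` of 300 would mean that reloading the same document within five minutes can reuse the earlier result, while a later reload triggers fresh processing.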
Features¶
- Basic text extraction
- Password-protected PDF support
- Image extraction (with vision mode)
- Caching support
- No cloud service required
- Lightweight and fast processing
Notes¶
- Set vision_enabled=True to enable vision mode for image extraction
- Set extract_text=False to skip text extraction, which can improve performance
- Supports encrypted/password-protected PDFs
- Runs locally; no cloud service is required
- May not preserve complex layouts or tables
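Since text extraction may return little or nothing for image-heavy pages, a simple heuristic can help decide whether vision mode is worth enabling. The sketch below flags pages whose extracted text is very short. It assumes each page is a dict with a "content" key, as in the usage examples; `pages_with_little_text` is a hypothetical helper and the `min_chars` threshold is an arbitrary choice.

```python
def pages_with_little_text(pages, min_chars=20):
    """Return 0-based indexes of pages whose extracted text is very short.

    Hypothetical helper: assumes each page is a dict with a "content" key.
    min_chars is an arbitrary threshold, not a library setting.
    """
    sparse = []
    for index, page in enumerate(pages):
        # Treat a missing or None value as an empty page
        text = (page.get("content") or "").strip()
        if len(text) < min_chars:
            sparse.append(index)
    return sparse
```

If many pages are flagged, reloading the document with vision_enabled=True may be worth trying.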