PDFPlumber Document Loader
The PDFPlumber loader uses the pdfplumber library to extract text and tables from PDF documents with precise layout preservation.
Supported Formats
- pdf
Usage
Basic Usage
from extract_thinker import DocumentLoaderPdfPlumber
# Initialize with default settings
loader = DocumentLoaderPdfPlumber()
# Load document
pages = loader.load("path/to/your/document.pdf")
# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]
    # Access tables if extracted
    tables = page.get("tables", [])
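The processing loop above can be grown into a small helper. The sketch below assumes each page is a dict with a `content` string and an optional `tables` list, as in the basic example, and further assumes each table is a list of rows whose cells may be `None` (the shape pdfplumber's own `extract_tables` returns). The names `table_to_records` and `summarize_pages` are illustrative and not part of extract_thinker; the sample data is hand-written and stands in for real loader output.

```python
from typing import Any, Dict, List, Optional

# Assumed table shape: rows of cells, where an empty cell may be None.
Table = List[List[Optional[str]]]


def table_to_records(table: Table) -> List[Dict[str, str]]:
    """Turn a table (first row = header) into a list of dicts."""
    if not table:
        return []
    # Replace missing cells with empty strings and trim whitespace.
    rows = [[(cell or "").strip() for cell in row] for row in table]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def summarize_pages(pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect the text length and table records for each page."""
    summary = []
    for number, page in enumerate(pages, start=1):
        tables = page.get("tables") or []
        summary.append({
            "page": number,
            "characters": len(page.get("content") or ""),
            "tables": [table_to_records(t) for t in tables],
        })
    return summary


# Hand-written sample data standing in for loader.load(...) output.
sample_pages = [
    {
        "content": "Invoice 001",
        "tables": [[["Item", "Qty"], ["Widget", "2"], ["Bolt", None]]],
    },
    {"content": "Thank you"},  # page without a "tables" key
]
result = summarize_pages(sample_pages)
```

Using `page.get("tables") or []` keeps the helper tolerant of pages where table extraction was disabled or found nothing.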
Configuration-based Usage
from extract_thinker import DocumentLoaderPdfPlumber, PDFPlumberConfig
# Create configuration
config = PDFPlumberConfig(
    table_settings={  # Custom table extraction settings
        "vertical_strategy": "text",
        "horizontal_strategy": "lines",
        "intersection_y_tolerance": 10,
    },
    vision_enabled=True,  # Enable vision mode for images
    extract_tables=True,  # Enable table extraction
    cache_ttl=600,        # Cache results for 10 minutes
)
# Initialize loader with configuration
loader = DocumentLoaderPdfPlumber(config)
# Load and process document
pages = loader.load("path/to/your/document.pdf")
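A typo in `table_settings` can be hard to spot, so it may help to check the dict before building the configuration. The sketch below is a hypothetical helper, not part of extract_thinker or pdfplumber. It relies on the strategy names documented by pdfplumber (`lines`, `lines_strict`, `text`, `explicit`) and on the assumption that every `*_tolerance` value should be a non-negative number.

```python
from typing import Any, Dict, List

# Strategy names documented by pdfplumber for table finding.
KNOWN_STRATEGIES = {"lines", "lines_strict", "text", "explicit"}


def check_table_settings(settings: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a table_settings dict."""
    problems = []
    for key in ("vertical_strategy", "horizontal_strategy"):
        value = settings.get(key)
        if value is not None and value not in KNOWN_STRATEGIES:
            problems.append(f"{key}: unknown strategy {value!r}")
    for key, value in settings.items():
        # Assumption: tolerances are distances, so they must be numbers >= 0.
        if key.endswith("_tolerance"):
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or value < 0:
                problems.append(f"{key}: expected a non-negative number, got {value!r}")
    return problems


# The settings from the example above pass the check.
good = check_table_settings({
    "vertical_strategy": "text",
    "horizontal_strategy": "lines",
    "intersection_y_tolerance": 10,
})

# A misspelled strategy and a negative tolerance are both reported.
bad = check_table_settings({"vertical_strategy": "line", "snap_tolerance": -1})
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue at once.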
Configuration Options
The PDFPlumberConfig class supports the following options:

| Option | Type | Default | Description |
|---|---|---|---|
| content | Any | None | Initial content to process |
| cache_ttl | int | 300 | Cache time-to-live in seconds |
| table_settings | Dict | None | Custom table extraction settings |
| vision_enabled | bool | False | Enable vision mode for images |
| extract_tables | bool | True | Enable table extraction |
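To make the meaning of `cache_ttl` concrete, here is a minimal sketch of a generic time-to-live cache. It does not show how the loader caches internally, which this page does not describe. `TTLCache`, `get_or_compute`, and `parse` are hypothetical names, and a fake clock is injected so the example runs instantly.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Keep values for ttl seconds, then recompute them on the next request."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # entry is still fresh
        value = compute()
        self._entries[key] = (now, value)
        return value


# Fake clock so the example is deterministic.
fake_now = [0.0]
cache = TTLCache(ttl=300, clock=lambda: fake_now[0])
calls = []


def parse() -> str:
    """Stand-in for an expensive PDF parse."""
    calls.append(1)
    return "parsed pages"


first = cache.get_or_compute("document.pdf", parse)   # parsed
fake_now[0] = 299.0
second = cache.get_or_compute("document.pdf", parse)  # served from cache
fake_now[0] = 301.0
third = cache.get_or_compute("document.pdf", parse)   # expired, parsed again
```

With the default of 300 seconds, a repeated request within five minutes would reuse the stored result under this model; a larger value trades freshness for fewer re-parses.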
Features
- Text extraction with layout preservation
- Table detection and extraction
- Custom table extraction settings
- Vision mode support
- Precise positioning information
- Caching support
- No cloud service required
Notes
- Vision mode can be enabled for image extraction
- Table extraction can be disabled for better performance
- Custom table settings can improve extraction accuracy
- Processing runs locally through the pdfplumber library; no external services are called