Skip to content

PDFPlumber Document Loader

The PDFPlumber loader uses the pdfplumber library to extract text and tables from PDF documents with precise layout preservation.

Supported Formats

  • pdf

Usage

Basic Usage

from extract_thinker import DocumentLoaderPdfPlumber

# Initialize with default settings
loader = DocumentLoaderPdfPlumber()

# Load document
pages = loader.load("path/to/your/document.pdf")

# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]
    # Access tables if extracted
    tables = page.get("tables", [])

Configuration-based Usage

from extract_thinker import DocumentLoaderPdfPlumber, PDFPlumberConfig

# Create configuration
config = PDFPlumberConfig(
    table_settings={                # Custom table extraction settings
        "vertical_strategy": "text",
        "horizontal_strategy": "lines",
        "intersection_y_tolerance": 10
    },
    vision_enabled=True,           # Enable vision mode for images
    extract_tables=True,           # Enable table extraction
    cache_ttl=600                  # Cache results for 10 minutes
)

# Initialize loader with configuration
loader = DocumentLoaderPdfPlumber(config)

# Load and process document
pages = loader.load("path/to/your/document.pdf")

Configuration Options

The PDFPlumberConfig class supports the following options:

Option Type Default Description
content Any None Initial content to process
cache_ttl int 300 Cache time-to-live in seconds
table_settings Dict None Custom table extraction settings
vision_enabled bool False Enable vision mode for images
extract_tables bool True Enable table extraction

Features

  • Text extraction with layout preservation
  • Table detection and extraction
  • Custom table extraction settings
  • Vision mode support
  • Precise positioning information
  • Caching support
  • No cloud service required

Notes

  • Vision mode can be enabled for image extraction
  • Table extraction can be disabled for better performance
  • Custom table settings can improve extraction accuracy
  • Local processing with no external dependencies