
Web Document Loader

The Web loader (DocumentLoaderBeautifulSoup) extracts content from web pages using BeautifulSoup. It supports HTML parsing, content cleaning, and custom element handling.

Supported Formats

  • html
  • htm
  • xhtml
  • url

Usage

Basic Usage

from extract_thinker import DocumentLoaderBeautifulSoup

# Initialize with default settings
loader = DocumentLoaderBeautifulSoup()

# Load document
pages = loader.load("https://example.com")

# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]
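
By default the loader uses Python's built-in html.parser, ignores page headers, and applies a 10-second request timeout; see Configuration Options below for how to override these defaults.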

Configuration-based Usage

from extract_thinker import DocumentLoaderBeautifulSoup, BeautifulSoupConfig

# Create configuration
config = BeautifulSoupConfig(
    header_handling="extract",     # Extract headers as separate content
    parser="lxml",                # Use lxml parser
    remove_elements=[             # Elements to remove
        "script", "style", "nav", "footer"
    ],
    max_tokens=8192,             # Maximum tokens per page
    request_timeout=30,          # Request timeout in seconds
    cache_ttl=600               # Cache results for 10 minutes
)

# Initialize loader with configuration
loader = DocumentLoaderBeautifulSoup(config)

# Load and process document
pages = loader.load("https://example.com")
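
The configured loader can then be handed to an Extractor in the usual extract_thinker fashion. A hedged sketch, assuming the standard Extractor API (load_document_loader, load_llm, extract); the contract fields and model name are illustrative:

from extract_thinker import Extractor, Contract

class PageInfo(Contract):
    # Illustrative fields; define whatever structured data you want from the page
    title: str
    summary: str

extractor = Extractor()
extractor.load_document_loader(loader)   # the DocumentLoaderBeautifulSoup configured above
extractor.load_llm("gpt-4o-mini")        # placeholder model name
result = extractor.extract("https://example.com", PageInfo)
print(result.title)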

Configuration Options

The BeautifulSoupConfig class supports the following options:

Option             Type        Default          Description
content            Any         None             Initial content to process
cache_ttl          int         300              Cache time-to-live in seconds
header_handling    str         "ignore"         How to handle headers
parser             str         "html.parser"    HTML parser to use
remove_elements    List[str]   None             Elements to remove
max_tokens         int         None             Maximum tokens per page
request_timeout    int         10               Request timeout in seconds
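
Options left unset fall back to the defaults above, so a configuration can be as small as the one setting you need. A minimal sketch that only overrides remove_elements:

from extract_thinker import DocumentLoaderBeautifulSoup, BeautifulSoupConfig

# Only strip scripts and styles; parser, timeout, caching and the rest keep their defaults
loader = DocumentLoaderBeautifulSoup(BeautifulSoupConfig(remove_elements=["script", "style"]))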

Features

  • Web page content extraction
  • Header handling options
  • Custom element removal
  • Multiple parser support
  • Token limit control
  • Request timeout control
  • Caching support
  • Stream-based loading (see the sketch below)
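
A sketch of stream-based loading, assuming load() also accepts a file-like object as the feature list suggests (the HTML snippet is illustrative):

from io import BytesIO
from extract_thinker import DocumentLoaderBeautifulSoup

loader = DocumentLoaderBeautifulSoup()

# Assumption: load() can consume an in-memory stream of HTML bytes
html = b"<html><body><h1>Title</h1><p>Hello, world.</p></body></html>"
pages = loader.load(BytesIO(html))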

Notes

  • Vision mode is not supported
  • Requires internet connection for URLs
  • Local HTML files are supported
  • Respects robots.txt
  • May require custom headers for some sites (see the workaround sketch below)
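
One way to handle sites that reject the default request is to fetch the page yourself with custom headers, save it locally, and load the saved file (local HTML files are supported). A sketch using the requests library; the URL and header values are placeholders:

import requests
from extract_thinker import DocumentLoaderBeautifulSoup

# Fetch the page with whatever headers the site expects (placeholder values)
response = requests.get(
    "https://example.com",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=30,
)

# Save the HTML locally and load it as a regular file
with open("page.html", "w", encoding="utf-8") as f:
    f.write(response.text)

pages = DocumentLoaderBeautifulSoup().load("page.html")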