# Web Document Loader
The Web loader extracts content from web pages using BeautifulSoup. It supports HTML parsing, content cleaning, and custom element handling.
## Supported Formats
- html
- htm
- xhtml
- url
## Usage

### Basic Usage
```python
from extract_thinker import DocumentLoaderBeautifulSoup

# Initialize with default settings
loader = DocumentLoaderBeautifulSoup()

# Load document
pages = loader.load("https://example.com")

# Process extracted content
for page in pages:
    # Access text content
    text = page["content"]
```
### Configuration-based Usage
```python
from extract_thinker import DocumentLoaderBeautifulSoup, BeautifulSoupConfig

# Create configuration
config = BeautifulSoupConfig(
    header_handling="extract",  # Extract headers as separate content
    parser="lxml",              # Use the lxml parser
    remove_elements=[           # Elements to remove before extraction
        "script", "style", "nav", "footer"
    ],
    max_tokens=8192,            # Maximum tokens per page
    request_timeout=30,         # Request timeout in seconds
    cache_ttl=600               # Cache results for 10 minutes
)

# Initialize loader with configuration
loader = DocumentLoaderBeautifulSoup(config)

# Load and process document
pages = loader.load("https://example.com")
```
## Configuration Options

The `BeautifulSoupConfig` class supports the following options:
| Option | Type | Default | Description |
|---|---|---|---|
| `content` | `Any` | `None` | Initial content to process |
| `cache_ttl` | `int` | `300` | Cache time-to-live in seconds |
| `header_handling` | `str` | `"ignore"` | How to handle headers |
| `parser` | `str` | `"html.parser"` | HTML parser to use |
| `remove_elements` | `List[str]` | `None` | Elements to remove |
| `max_tokens` | `int` | `None` | Maximum tokens per page |
| `request_timeout` | `int` | `10` | Request timeout in seconds |
## Features
- Web page content extraction
- Header handling options
- Custom element removal
- Multiple parser support
- Token limit control
- Request timeout control
- Caching support
- Stream-based loading (see the sketch below)
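Stream-based loading implies that `load()` can also accept a file-like object instead of a URL or path. A minimal sketch, assuming a `BytesIO` stream of HTML is accepted the same way other sources are:

```python
from io import BytesIO

from extract_thinker import DocumentLoaderBeautifulSoup

# In-memory HTML; no network request involved.
html = b"<html><body><h1>Title</h1><p>Hello, world.</p></body></html>"
stream = BytesIO(html)

loader = DocumentLoaderBeautifulSoup()
pages = loader.load(stream)  # assumption: file-like objects are accepted
for page in pages:
    print(page["content"])
```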
## Notes
- Vision mode is not supported
- Requires an internet connection when loading URLs
- Local HTML files are supported (see the example below)
- Respects robots.txt
- May require custom headers for some sites
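Since local HTML files are supported, loading one looks the same as loading a URL. A quick sketch; the path is hypothetical:

```python
from extract_thinker import DocumentLoaderBeautifulSoup

loader = DocumentLoaderBeautifulSoup()
pages = loader.load("saved_page.html")  # hypothetical local file
text = pages[0]["content"]
```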