Splitters

In document processing, splitting separates the individual documents or sections contained in a combined file. This task is especially important when handling batches of documents where different parts need distinct processing. Splitting can be done with two strategies: Eager and Lazy.

Splitter Flow

Page-Level Processing

Splitters work at the page level, determining which pages belong together as a single document. For example:

  • A 10-page PDF might contain three separate invoices

  • A scanned document might contain multiple forms

  • A batch of documents might need to be separated by document type

The challenge is determining where one document ends and another begins, which is where our splitting strategies come in.

Eager vs. Lazy Approaches

Eager and Lazy splitting have distinct use cases based on document size and the complexity of relationships between pages.

Eager Splitting

Eager splitting processes all pages in a single pass, identifying and dividing all sections at once. It's efficient for small to medium-sized documents where context size does not limit performance.

from extract_thinker import Splitter, SplittingStrategy

splitter = Splitter()
result = splitter.split(
    document,
    strategy=SplittingStrategy.EAGER
)

Benefits of Eager Splitting:

  • Speed: faster processing, since all split points are determined upfront

  • Simplicity: ideal for documents that fit entirely within the model's context window

  • Consistency: better for documents where relationships between pages are important

Lazy Splitting

Lazy splitting processes pages incrementally, assessing smaller groups at a time to decide whether they belong together. Pages are processed in groups of two and checked for continuity, which allows the approach to scale efficiently to larger documents.

result = splitter.split(
    document,
    strategy=SplittingStrategy.LAZY
)

Benefits of Lazy Splitting:

  • Scalability: well-suited for documents that exceed the model's context window

  • Memory Efficiency: processes only what's needed, when it's needed

  • Flexibility: better for streaming or real-time processing
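The incremental pairing described above can be sketched in isolation. This is a minimal illustration of the overlapping two-page windows that the base class builds (the `pairwise_chunks` name is illustrative, not part of the library):

```python
def pairwise_chunks(pages):
    """Yield overlapping two-page windows over a list of pages.

    For pages [p1, p2, p3, p4] this yields [p1, p2], [p2, p3], [p3, p4],
    so every adjacent pair can be checked for continuity exactly once.
    """
    if len(pages) == 1:
        # A single page is its own (trivial) window.
        yield pages
        return
    for i in range(len(pages) - 1):
        yield pages[i:i + 2]


print(list(pairwise_chunks(["p1", "p2", "p3", "p4"])))
# [['p1', 'p2'], ['p2', 'p3'], ['p3', 'p4']]
```

Each window is then passed to a continuity check; because the windows overlap by one page, a decision is made at every page boundary.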

Base Splitter Implementation

The base Splitter class provides both eager and lazy implementations:

import asyncio
from typing import Any, List
from abc import ABC, abstractmethod

from extract_thinker.models.classification import Classification
from extract_thinker.models.doc_group import DocGroups, DocGroup
from extract_thinker.models.doc_groups2 import DocGroups2
from extract_thinker.models.eager_doc_group import EagerDocGroup


class Splitter(ABC):
    @abstractmethod
    def belongs_to_same_document(self, page1: Any, page2: Any, contract: str) -> DocGroups2:
        pass

    @abstractmethod
    def split_lazy_doc_group(self, lazy_doc_group: List[Any], classifications: List[Classification]) -> DocGroups:
        pass

    @abstractmethod
    def split_eager_doc_group(self, eager_doc_group: List[Any], classifications: List[Classification]) -> DocGroups:
        pass

    def split_document_into_groups(self, document: List[Any]) -> List[List[Any]]:
        page_per_split = 2
        split = []
        if len(document) == 1:
            return [document]
        for i in range(0, len(document) - 1):
            group = document[i: i + page_per_split]
            split.append(group)
        return split

    async def process_split_groups(self, split: List[List[Any]], contract: str) -> List[DocGroups2]:
        # Create asynchronous tasks for processing each group
        tasks = [self.process_group(x, contract) for x in split]
        try:
            # Execute all tasks concurrently and wait for all to complete
            doc_groups = await asyncio.gather(*tasks)
            return doc_groups
        except Exception as e:
            # Handle possible exceptions that might occur during task execution
            print(f"An error occurred: {e}")
            raise

    async def process_group(self, group: List[Any], contract: str) -> DocGroups2:
        page2 = group[1] if len(group) > 1 else None
        return self.belongs_to_same_document(group[0], page2, contract)

    def aggregate_doc_groups(self, doc_groups_tasks: List[DocGroups2]) -> DocGroups:
        """
        Aggregate the results from belongs_to_same_document comparisons into final document groups.
        This is the base implementation that can be used by all splitter implementations.
        """
        doc_groups = DocGroups()
        current_group = DocGroup(pages=[], classification="")
        page_number = 1

        if not doc_groups_tasks:
            return doc_groups

        # Handle the first group
        doc_group = doc_groups_tasks[0]
        if doc_group.belongs_to_same_document:
            current_group.pages = [1, 2]
            current_group.classification = doc_group.classification_page1
        else:
            # First page is its own document
            current_group.pages = [1]
            current_group.classification = doc_group.classification_page1
            doc_groups.doc_groups.append(current_group)

            # Start new group with second page
            current_group = DocGroup(pages=[2], classification=doc_group.classification_page2)

        page_number += 1

        # Process remaining groups
        for doc_group in doc_groups_tasks[1:]:
            if doc_group.belongs_to_same_document:
                current_group.pages.append(page_number + 1)
            else:
                doc_groups.doc_groups.append(current_group)
                current_group = DocGroup(
                    pages=[page_number + 1],
                    classification=doc_group.classification_page2
                )
            page_number += 1

        # Add the last group
        doc_groups.doc_groups.append(current_group)

        return doc_groups
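To see how `aggregate_doc_groups` turns pairwise comparisons into final page groups, here is a self-contained walkthrough of the same logic. `PairResult` and `aggregate` are illustrative stand-ins for the `DocGroups2` model and the method above, not part of the library:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairResult:
    """Stand-in for DocGroups2: the outcome of comparing two adjacent pages."""
    belongs_to_same_document: bool
    classification_page1: str
    classification_page2: str


def aggregate(pairs: List[PairResult]) -> List[dict]:
    """Mirror of the grouping logic in Splitter.aggregate_doc_groups."""
    groups: List[dict] = []
    if not pairs:
        return groups

    # Handle the first comparison: it decides whether pages 1 and 2 merge.
    first = pairs[0]
    if first.belongs_to_same_document:
        current = {"pages": [1, 2], "classification": first.classification_page1}
    else:
        groups.append({"pages": [1], "classification": first.classification_page1})
        current = {"pages": [2], "classification": first.classification_page2}
    page_number = 2

    # Each remaining comparison either extends the current group or starts a new one.
    for pair in pairs[1:]:
        if pair.belongs_to_same_document:
            current["pages"].append(page_number + 1)
        else:
            groups.append(current)
            current = {"pages": [page_number + 1],
                       "classification": pair.classification_page2}
        page_number += 1

    groups.append(current)
    return groups


# Three pairwise comparisons over a 4-page file: pages 1-2 belong together,
# page 3 starts a new document, and page 4 continues it.
pairs = [
    PairResult(True, "invoice", "invoice"),
    PairResult(False, "invoice", "invoice"),
    PairResult(True, "invoice", "invoice"),
]
print(aggregate(pairs))
# [{'pages': [1, 2], 'classification': 'invoice'},
#  {'pages': [3, 4], 'classification': 'invoice'}]
```

Note that the grouping only ever looks at adjacent-pair decisions, which is what allows the lazy strategy to work on one small window at a time.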

Available Splitters

ExtractThinker provides two main splitter implementations: ImageSplitter and TextSplitter.

For most IDP use cases, Eager Splitting is appropriate since it offers:

  • Simpler implementation

  • Better handling of page relationships

  • Faster processing for typical document sizes (under 50 pages)

However, consider Lazy Splitting when:

  • Processing very large documents (50+ pages)

  • Working with limited memory

  • Handling streaming document inputs

Best Practices

  • Choose strategy based on document size and page count
  • Consider context window limitations of your LLM