Splitters

In document processing, splitting separates the individual documents or sections contained in a combined file. This task is especially important when handling batches of documents where different parts need distinct processing. Splitting can be done with two strategies: Eager and Lazy.

Splitter Flow

Page-Level Processing

Splitters work at the page level, determining which pages belong together as a single document. For example:

  • A 10-page PDF might contain three separate invoices

  • A scanned document might contain multiple forms

  • A batch of documents might need to be separated by document type

The challenge is determining where one document ends and another begins, which is where our splitting strategies come in.

Eager vs. Lazy Approaches

Eager and Lazy splitting have distinct use cases based on document size and the complexity of relationships between pages.

Eager Splitting

Eager splitting processes all pages in a single pass, identifying and dividing all sections at once. It's efficient for small to medium-sized documents where context size does not limit performance.

from extract_thinker import Splitter, SplittingStrategy

splitter = Splitter()
result = splitter.split(
    document,
    strategy=SplittingStrategy.EAGER
)

Benefits of Eager Splitting:

  • Speed: faster processing, since all split points are determined upfront

  • Simplicity: ideal for documents that fit entirely within the model's context window

  • Consistency: better for documents where relationships between pages are important

Lazy Splitting

Lazy splitting processes pages incrementally, assessing smaller groups at a time to decide whether they belong together. Pages are processed in groups of two and checked for continuity, which allows the approach to scale efficiently to larger documents.

result = splitter.split(
    document,
    strategy=SplittingStrategy.LAZY
)

Benefits of Lazy Splitting:

  • Scalability: well-suited for documents that exceed the model's context window

  • Memory Efficiency: processes only what's needed, when it's needed

  • Flexibility: better for streaming or real-time processing
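The incremental pairing described above can be sketched in isolation. This is a minimal illustration of the overlapping two-page windows that the base class builds (the `pairwise_chunks` name is illustrative, not part of the library):

```python
def pairwise_chunks(pages):
    """Yield overlapping two-page windows over a list of pages.

    For pages [p1, p2, p3, p4] this yields [p1, p2], [p2, p3], [p3, p4],
    so every adjacent pair can be checked for continuity exactly once.
    """
    if len(pages) == 1:
        # A single page is its own (trivial) window.
        yield pages
        return
    for i in range(len(pages) - 1):
        yield pages[i:i + 2]


print(list(pairwise_chunks(["p1", "p2", "p3", "p4"])))
# [['p1', 'p2'], ['p2', 'p3'], ['p3', 'p4']]
```

Each window is then passed to a continuity check; because the windows overlap by one page, a decision is made at every page boundary.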

Base Splitter Implementation

The base Splitter class provides both eager and lazy implementations:

import asyncio
from typing import Any, List
from abc import ABC, abstractmethod

from extract_thinker.models.classification import Classification
from extract_thinker.models.doc_group import DocGroups, DocGroup
from extract_thinker.models.doc_groups2 import DocGroups2
from extract_thinker.models.eager_doc_group import EagerDocGroup


class Splitter(ABC):
    @abstractmethod
    def belongs_to_same_document(self, page1: Any, page2: Any, contract: str) -> DocGroups2:
        pass

    @abstractmethod
    def split_lazy_doc_group(self, lazy_doc_group: List[Any], classifications: List[Classification]) -> DocGroups:
        pass

    @abstractmethod
    def split_eager_doc_group(self, eager_doc_group: List[Any], classifications: List[Classification]) -> DocGroups:
        pass

    def split_document_into_groups(self, document: List[Any]) -> List[List[Any]]:
        page_per_split = 2
        split = []
        if len(document) == 1:
            return [document]
        for i in range(0, len(document) - 1):
            group = document[i: i + page_per_split]
            split.append(group)
        return split

    async def process_split_groups(self, split: List[List[Any]], contract: str) -> List[DocGroups2]:
        # Create asynchronous tasks for processing each group
        tasks = [self.process_group(x, contract) for x in split]
        try:
            # Execute all tasks concurrently and wait for all to complete
            doc_groups = await asyncio.gather(*tasks)
            return doc_groups
        except Exception as e:
            # Handle possible exceptions that might occur during task execution
            print(f"An error occurred: {e}")
            raise

    async def process_group(self, group: List[Any], contract: str) -> DocGroups2:
        page2 = group[1] if len(group) > 1 else None
        return self.belongs_to_same_document(group[0], page2, contract)

    def aggregate_doc_groups(self, doc_groups_tasks: List[DocGroups2]) -> DocGroups:
        """
        Aggregate the results from belongs_to_same_document comparisons into final document groups.
        This is the base implementation that can be used by all splitter implementations.
        """
        doc_groups = DocGroups()
        current_group = DocGroup(pages=[], classification="")
        page_number = 1

        if not doc_groups_tasks:
            return doc_groups

        # Handle the first group
        doc_group = doc_groups_tasks[0]
        if doc_group.belongs_to_same_document:
            current_group.pages = [1, 2]
            current_group.classification = doc_group.classification_page1
        else:
            # First page is its own document
            current_group.pages = [1]
            current_group.classification = doc_group.classification_page1
            doc_groups.doc_groups.append(current_group)

            # Start new group with second page
            current_group = DocGroup(pages=[2], classification=doc_group.classification_page2)

        page_number += 1

        # Process remaining groups
        for doc_group in doc_groups_tasks[1:]:
            if doc_group.belongs_to_same_document:
                current_group.pages.append(page_number + 1)
            else:
                doc_groups.doc_groups.append(current_group)
                current_group = DocGroup(
                    pages=[page_number + 1],
                    classification=doc_group.classification_page2
                )
            page_number += 1

        # Add the last group
        doc_groups.doc_groups.append(current_group)

        return doc_groups
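To see how `aggregate_doc_groups` turns pairwise comparisons into final page groups, here is a self-contained walkthrough of the same logic. `PairResult` and `aggregate` are illustrative stand-ins for the `DocGroups2` model and the method above, not part of the library:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairResult:
    """Stand-in for DocGroups2: the outcome of comparing two adjacent pages."""
    belongs_to_same_document: bool
    classification_page1: str
    classification_page2: str


def aggregate(pairs: List[PairResult]) -> List[dict]:
    """Mirror of the grouping logic in Splitter.aggregate_doc_groups."""
    groups: List[dict] = []
    if not pairs:
        return groups

    # Handle the first comparison: it decides whether pages 1 and 2 merge.
    first = pairs[0]
    if first.belongs_to_same_document:
        current = {"pages": [1, 2], "classification": first.classification_page1}
    else:
        groups.append({"pages": [1], "classification": first.classification_page1})
        current = {"pages": [2], "classification": first.classification_page2}
    page_number = 2

    # Each remaining comparison either extends the current group or starts a new one.
    for pair in pairs[1:]:
        if pair.belongs_to_same_document:
            current["pages"].append(page_number + 1)
        else:
            groups.append(current)
            current = {"pages": [page_number + 1],
                       "classification": pair.classification_page2}
        page_number += 1

    groups.append(current)
    return groups


# Three pairwise comparisons over a 4-page file: pages 1-2 belong together,
# page 3 starts a new document, and page 4 continues it.
pairs = [
    PairResult(True, "invoice", "invoice"),
    PairResult(False, "invoice", "invoice"),
    PairResult(True, "invoice", "invoice"),
]
print(aggregate(pairs))
# [{'pages': [1, 2], 'classification': 'invoice'},
#  {'pages': [3, 4], 'classification': 'invoice'}]
```

Note that the grouping only ever looks at adjacent-pair decisions, which is what allows the lazy strategy to work on one small window at a time.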

Available Splitters

ExtractThinker provides two main splitter implementations: ImageSplitter and TextSplitter.

For most IDP use cases, Eager Splitting is appropriate since it offers:

  • Simpler implementation

  • Better handling of page relationships

  • Faster processing for typical document sizes (under 50 pages)

However, consider Lazy Splitting when:

  • Processing very large documents (50+ pages)

  • Working with limited memory

  • Handling streaming document inputs

Best Practices

  • Choose strategy based on document size and page count
  • Consider context window limitations of your LLM