Skip to content

Markdown Conversion

Process Flow

The MarkdownConverter class provides functionality to convert documents (including text and images) into Markdown format. It leverages a configured Language Model (LLM) for sophisticated conversion, especially when dealing with images or requiring structured output.

Core Concepts

  • LLM Integration: The converter requires a configured LLM (extract_thinker.llm.LLM) to interpret document content and generate well-formatted Markdown. This is essential for both text and vision-based tasks (processing images) and for generating structured JSON output alongside Markdown.
  • Document Loader: It relies on a DocumentLoader (extract_thinker.document_loader.DocumentLoader) to load the source document(s) and potentially extract text and images. The behavior might vary depending on the specific loader used.
  • Vision Support: The to_markdown and to_markdown_structured methods have a vision parameter or operate in vision mode by default. When enabled, the converter attempts to process images within the document using the LLM's vision capabilities (if the LLM supports it).
  • Structured Output: The to_markdown_structured method specifically instructs the LLM to provide not only the Markdown content but also a JSON structure breaking down the content with certainty scores. This method inherently requires vision capabilities in the LLM.
  • Page Selection: Both to_markdown and to_markdown_structured methods support selective page processing through the pages parameter, allowing you to convert specific pages from multi-page documents.

Initialization

from extract_thinker.markdown import MarkdownConverter
from extract_thinker.document_loader import DocumentLoaderPyPdf # Example loader
from extract_thinker.llm import LLM
from extract_thinker.global_models import get_lite_model, get_big_model # Helpers for model config

# Initialize with or without components
markdown_converter = MarkdownConverter()

# Load components later
loader = DocumentLoaderPyPdf() # Configure as needed
# Use helper functions to get model configurations
# Replace with your actual logic for selecting/configuring models if needed
llm = LLM(get_lite_model()) 

markdown_converter.load_document_loader(loader)
markdown_converter.load_llm(llm)

# Or initialize directly
markdown_converter = MarkdownConverter(document_loader=loader, llm=llm)

Usage

Simple Markdown Conversion (LLM Required)

This method uses the configured LLM to generate Markdown. If vision=True, it processes images (requires an LLM with vision capabilities). Note: An LLM must be configured via load_llm() or during initialization for this method to work.

# Assuming markdown_converter is initialized with loader and LLM
source_path = "path/to/your/document.pdf" # Or image file like .png, .jpg

# Convert with vision disabled (processes text only using LLM)
markdown_pages_text = markdown_converter.to_markdown(source_path, vision=False) 
# Returns List[str]

# Convert with vision enabled (processes text and images using LLM)
markdown_pages_vision = markdown_converter.to_markdown(source_path, vision=True) 
# Returns List[str]

# Convert specific pages (1-indexed)
markdown_specific_pages = markdown_converter.to_markdown(source_path, vision=True, pages=[1, 3, 5])
# Returns List[str] with only the specified pages

for i, page_md in enumerate(markdown_pages_vision):
    print(f"--- Page {i+1} ---")
    print(page_md)

# Async version
markdown_pages_vision_async = await markdown_converter.to_markdown_async(source_path, vision=True)
# Async version with page selection
markdown_specific_pages_async = await markdown_converter.to_markdown_async(source_path, vision=True, pages=[1, 3, 5])

Structured Markdown Conversion (LLM Vision Required)

This method requires an LLM with vision capabilities and a document containing images. It returns structured data including Markdown and a JSON breakdown. Note: An LLM must be configured via load_llm() or during initialization for this method to work.

from extract_thinker.markdown import PageContent

# Assuming markdown_converter is initialized with loader and LLM (with vision)
image_path = "path/to/your/image.png" 

try:
    # This method inherently uses vision
    structured_output: List[PageContent] = markdown_converter.to_markdown_structured(image_path)
    # Returns List[PageContent]

    # Process specific pages (1-indexed)
    structured_output_specific: List[PageContent] = markdown_converter.to_markdown_structured(
        image_path, 
        pages=[1, 3, 5]
    )
    # Returns List[PageContent] with only the specified pages

    for i, page_content in enumerate(structured_output):
        print(f"--- Page {i+1} ---")
        # Access structured items
        for item in page_content.items:
            print(f"Certainty: {item.certainty}, Content: {item.content[:50]}...") # Print snippet

except ValueError as e:
    print(f"Error: {e}") # e.g., if no images found or LLM not set

# Async version
structured_output_async: List[PageContent] = await markdown_converter.to_markdown_structured_async(image_path)
# Async version with page selection
structured_output_specific_async: List[PageContent] = await markdown_converter.to_markdown_structured_async(
    image_path, 
    pages=[1, 3, 5]
)
Note: The to_markdown_structured method expects the LLM to return both Markdown and a specific JSON format. The extract_thinking_json utility is used internally to parse this.

Prompts

The converter uses specific system prompts depending on the method called: - DEFAULT_PAGE_PROMPT: Used by to_markdown_structured. Instructs the LLM to output Markdown and a JSON structure. - DEFAULT_MARKDOWN_PROMPT: Used by to_markdown (when using LLM). Instructs the LLM to output only well-formatted Markdown. - MARKDOWN_VERIFICATION_PROMPT: Potentially used for refining existing text (internal flag allow_verification).

These prompts guide the LLM's output format.