OpenContracts Pipeline Architecture¶

The OpenContracts pipeline system is a modular and extensible architecture for processing documents through various stages: parsing, thumbnail generation, and embedding. This document provides an overview of the system architecture and guides you through creating new pipeline components.

Architecture Overview¶

The pipeline system consists of three main component types:

Parsers: Extract text and structure from documents
Thumbnailers: Generate visual previews of documents
Embedders: Create vector embeddings for semantic search

Each component type has a base abstract class that defines the interface and common functionality:

```mermaid graph TD A[Document Upload] → B[Parser] B → C[Thumbnailer] B → D[Embedder] B → PP[Post-Processor]

subgraph "Pipeline Components"
    B --> B1[DoclingParser REST]
    B --> B2[NLMIngestParser]
    B --> B3[TxtParser]

    C --> C1[PdfThumbnailGenerator]
    C --> C2[TextThumbnailGenerator]

    D --> D1[MicroserviceEmbedder]
    D --> D2[ModernBERTEmbedder]
    D --> D3[MinnModernBERTEmbedder]

    PP --> PP1[PDFRedactor]
end

C1 --> E[Document Preview]
C2 --> E
D1 --> F[Vector Database]
D2 --> F
D3 --> F
PP1 --> G[Processed Document]

```

Component Registration¶

Components are registered in settings/base.py through configuration dictionaries:

PREFERRED_PARSERS = {
    "application/pdf": "opencontractserver.pipeline.parsers.docling_parser_rest.DoclingParser",
    "text/plain": "opencontractserver.pipeline.parsers.oc_text_parser.TxtParser",
    # ... other mime types
}

THUMBNAIL_TASKS = {
    "application/pdf": "opencontractserver.tasks.doc_tasks.extract_pdf_thumbnail",
    "text/plain": "opencontractserver.tasks.doc_tasks.extract_txt_thumbnail",
    # ... other mime types
}

PREFERRED_EMBEDDERS = {
    "application/pdf": "opencontractserver.pipeline.embedders.sent_transformer_microservice.MicroserviceEmbedder",
    # ... other mime types
}

Component Types¶

Parsers¶

Parsers inherit from BaseParser and implement the parse_document method:

class BaseParser(ABC):
    title: str = ""
    description: str = ""
    author: str = ""
    dependencies: list[str] = []
    supported_file_types: list[FileTypeEnum] = []

    @abstractmethod
    def parse_document(
        self, user_id: int, doc_id: int, **kwargs
    ) -> Optional[OpenContractDocExport]:
        pass

Current implementations: - DoclingParser: Advanced PDF parser using machine learning (REST microservice) - NLMIngestParser: Alternative PDF parser using NLM Ingest library - TxtParser: Simple text file parser

Thumbnailers¶

Thumbnailers inherit from BaseThumbnailGenerator and implement the _generate_thumbnail method:

class BaseThumbnailGenerator(ABC):
    title: str = ""
    description: str = ""
    author: str = ""
    dependencies: list[str] = []
    supported_file_types: list[FileTypeEnum] = []

    @abstractmethod
    def _generate_thumbnail(
        self,
        txt_content: Optional[str],
        pdf_bytes: Optional[bytes],
        height: int = 300,
        width: int = 300,
    ) -> Optional[tuple[bytes, str]]:
        pass

Current implementations: - PdfThumbnailGenerator: Generates thumbnails from PDF first pages - TextThumbnailGenerator: Creates text-based preview images

Embedders¶

Embedders inherit from BaseEmbedder and implement the embed_text method:

class BaseEmbedder(ABC):
    title: str = ""
    description: str = ""
    author: str = ""
    dependencies: list[str] = []
    vector_size: int = 0
    supported_file_types: list[FileTypeEnum] = []

    @abstractmethod
    def embed_text(self, text: str) -> Optional[list[float]]:
        pass

Current implementations: - MicroserviceEmbedder: Generates embeddings using a remote service - ModernBERTEmbedder: Local ModernBERT embeddings generation - MinnModernBERTEmbedder: Minnesota Case Law specialized ModernBERT embedder - CloudMinnModernBERTEmbedder: Cloud-based Minnesota ModernBERT embedder

Creating New Components¶

To create a new pipeline component:

Choose the appropriate base class (BaseParser, BaseThumbnailGenerator, or BaseEmbedder)
Create a new class inheriting from the base class
Implement required abstract methods
Set component metadata (title, description, author, etc.)
Register the component in the appropriate settings dictionary

Example of a new parser:

from opencontractserver.pipeline.base.parser import BaseParser
from opencontractserver.pipeline.base.file_types import FileTypeEnum

class MyCustomParser(BaseParser):
    title = "My Custom Parser"
    description = "Parses documents in a custom way"
    author = "Your Name"
    dependencies = ["custom-lib>=1.0.0"]
    supported_file_types = [FileTypeEnum.PDF]

    def parse_document(
        self, user_id: int, doc_id: int, **kwargs
    ) -> Optional[OpenContractDocExport]:
        # Implementation here
        pass

Then register it in settings:

PREFERRED_PARSERS = {
    "application/pdf": "path.to.your.MyCustomParser",
    # ... other parsers
}

Best Practices¶

Error Handling: Always handle exceptions gracefully and return None on failure
Dependencies: List all required dependencies in the component's dependencies list
Documentation: Provide clear docstrings and type hints
Testing: Create unit tests for your component in the tests directory
Metadata: Fill out all metadata fields (title, description, author)

Advanced Topics¶

Parallel Processing¶

The pipeline system supports parallel processing through Celery tasks. Each component can be executed asynchronously:

from opencontractserver.tasks.doc_tasks import process_document

# Async document processing
process_document.delay(user_id, doc_id)

Custom File Types¶

To add support for new file types:

Add the MIME type to ALLOWED_DOCUMENT_MIMETYPES in settings
Update FileTypeEnum in base/file_types.py
Create appropriate parser/thumbnailer/embedder implementations
Register the implementations in settings

Error Handling¶

Components should implement robust error handling:

def parse_document(self, user_id: int, doc_id: int, **kwargs):
    try:
        # Implementation
        return result
    except Exception as e:
        logger.error(f"Error parsing document {doc_id}: {e}")
        return None

Contributing¶

When contributing new pipeline components:

Follow the project's coding style
Add comprehensive tests
Update this documentation
Submit a pull request with a clear description

For questions or support, please open an issue on the GitHub repository.