OpenContracts Pipeline Architecture¶
The OpenContracts pipeline system is a modular and extensible architecture for processing documents through various stages: parsing, thumbnail generation, and embedding. This document provides an overview of the system architecture and guides you through creating new pipeline components.
Architecture Overview¶
The pipeline system consists of three main component types:
- Parsers: Extract text and structure from documents
- Thumbnailers: Generate visual previews of documents
- Embedders: Create vector embeddings for semantic search
Each component type has an abstract base class that defines its interface and common functionality. The diagram below shows how an uploaded document flows from the parser to the other components:
```mermaid
graph TD
    A[Document Upload] --> B[Parser]
    B --> C[Thumbnailer]
    B --> D[Embedder]
    B --> PP[Post-Processor]

    subgraph "Pipeline Components"
        B --> B1[DoclingParser REST]
        B --> B2[NLMIngestParser]
        B --> B3[TxtParser]
        C --> C1[PdfThumbnailGenerator]
        C --> C2[TextThumbnailGenerator]
        D --> D1[MicroserviceEmbedder]
        D --> D2[ModernBERTEmbedder]
        D --> D3[MinnModernBERTEmbedder]
        PP --> PP1[PDFRedactor]
    end

    C1 --> E[Document Preview]
    C2 --> E
    D1 --> F[Vector Database]
    D2 --> F
    D3 --> F
    PP1 --> G[Processed Document]
```
Component Registration¶
Components are registered in settings/base.py through configuration dictionaries. Each dictionary maps a MIME type to the dotted Python path of the class or task that handles it:
PREFERRED_PARSERS = {
    "application/pdf": "opencontractserver.pipeline.parsers.docling_parser_rest.DoclingParser",
    "text/plain": "opencontractserver.pipeline.parsers.oc_text_parser.TxtParser",
    # ... other mime types
}

THUMBNAIL_TASKS = {
    "application/pdf": "opencontractserver.tasks.doc_tasks.extract_pdf_thumbnail",
    "text/plain": "opencontractserver.tasks.doc_tasks.extract_txt_thumbnail",
    # ... other mime types
}

PREFERRED_EMBEDDERS = {
    "application/pdf": "opencontractserver.pipeline.embedders.sent_transformer_microservice.MicroserviceEmbedder",
    # ... other mime types
}
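To illustrate how a dotted path from one of these dictionaries can be turned into a class at runtime, here is a minimal sketch. The helper name `resolve_component` and the `EXAMPLE_REGISTRY` dictionary are hypothetical and use standard-library classes so the snippet runs anywhere; this is not necessarily how OpenContracts performs the lookup internally.

```python
from importlib import import_module

# Stand-in registry: stdlib classes are used so the example runs anywhere.
# In OpenContracts the values would be paths like those in PREFERRED_PARSERS.
EXAMPLE_REGISTRY = {
    "application/json": "json.JSONDecoder",
    "text/csv": "csv.DictReader",
}


def resolve_component(registry: dict[str, str], mimetype: str):
    """Return the class registered for `mimetype`, or None if none is registered."""
    dotted_path = registry.get(mimetype)
    if dotted_path is None:
        return None
    # Split "package.module.ClassName" into the module path and the class name.
    module_path, _, class_name = dotted_path.rpartition(".")
    module = import_module(module_path)
    return getattr(module, class_name)


decoder_cls = resolve_component(EXAMPLE_REGISTRY, "application/json")
print(decoder_cls.__name__)  # JSONDecoder
```

Django also ships `django.utils.module_loading.import_string`, which performs the same kind of dotted-path import; whether the pipeline uses it is not stated here.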
Component Types¶
Parsers¶
Parsers inherit from BaseParser and implement the parse_document method:
class BaseParser(ABC):
    title: str = ""
    description: str = ""
    author: str = ""
    dependencies: list[str] = []
    supported_file_types: list[FileTypeEnum] = []

    @abstractmethod
    def parse_document(
        self, user_id: int, doc_id: int, **kwargs
    ) -> Optional[OpenContractDocExport]:
        pass
Current implementations:
- DoclingParser: Advanced PDF parser using machine learning (REST microservice)
- NLMIngestParser: Alternative PDF parser using NLM Ingest library
- TxtParser: Simple text file parser
Thumbnailers¶
Thumbnailers inherit from BaseThumbnailGenerator and implement the _generate_thumbnail method:
class BaseThumbnailGenerator(ABC):
    title: str = ""
    description: str = ""
    author: str = ""
    dependencies: list[str] = []
    supported_file_types: list[FileTypeEnum] = []

    @abstractmethod
    def _generate_thumbnail(
        self,
        txt_content: Optional[str],
        pdf_bytes: Optional[bytes],
        height: int = 300,
        width: int = 300,
    ) -> Optional[tuple[bytes, str]]:
        pass
Current implementations:
- PdfThumbnailGenerator: Generates thumbnails from PDF first pages
- TextThumbnailGenerator: Creates text-based preview images
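As a rough illustration of the thumbnailer interface, the sketch below implements `_generate_thumbnail` against a local stand-in base class. `SvgTextThumbnailGenerator` is hypothetical and is not one of the shipped generators, and the snippet assumes the second item of the returned tuple is the image file extension, which this document does not confirm.

```python
from abc import ABC, abstractmethod
from typing import Optional
from xml.sax.saxutils import escape


class BaseThumbnailGenerator(ABC):
    """Local stand-in mirroring the interface shown above (not the real class)."""

    @abstractmethod
    def _generate_thumbnail(
        self,
        txt_content: Optional[str],
        pdf_bytes: Optional[bytes],
        height: int = 300,
        width: int = 300,
    ) -> Optional[tuple[bytes, str]]:
        pass


class SvgTextThumbnailGenerator(BaseThumbnailGenerator):
    """Hypothetical generator that renders the first line of text as an SVG."""

    def _generate_thumbnail(self, txt_content, pdf_bytes, height=300, width=300):
        stripped = (txt_content or "").strip()
        if not stripped:
            return None  # nothing to preview
        # Escape XML special characters and cap the preview at 40 characters.
        first_line = escape(stripped.splitlines()[0][:40])
        svg = (
            f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">'
            f'<text x="10" y="20">{first_line}</text></svg>'
        )
        # Assumption: the second tuple item is the image file extension.
        return svg.encode("utf-8"), "svg"


result = SvgTextThumbnailGenerator()._generate_thumbnail(
    "Master Services Agreement\nThis agreement is made...", None
)
```

A real generator would more likely rasterize to PNG or JPEG; SVG is used here only to avoid third-party imaging dependencies.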
Embedders¶
Embedders inherit from BaseEmbedder and implement the embed_text method:
class BaseEmbedder(ABC):
    title: str = ""
    description: str = ""
    author: str = ""
    dependencies: list[str] = []
    vector_size: int = 0
    supported_file_types: list[FileTypeEnum] = []

    @abstractmethod
    def embed_text(self, text: str) -> Optional[list[float]]:
        pass
Current implementations:
- MicroserviceEmbedder: Generates embeddings using a remote service
- ModernBERTEmbedder: Local ModernBERT embeddings generation
- MinnModernBERTEmbedder: Minnesota Case Law specialized ModernBERT embedder
- CloudMinnModernBERTEmbedder: Cloud-based Minnesota ModernBERT embedder
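The embedder contract is small enough to sketch end to end. The example below uses a local stand-in for BaseEmbedder so that it runs without the project installed, and `HashingEmbedder` is a hypothetical toy: it derives a fixed-size vector from a SHA-256 digest, which demonstrates the interface and the role of `vector_size` but carries no semantic meaning.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Optional


class BaseEmbedder(ABC):
    """Local stand-in mirroring the interface shown above (not the real class)."""

    title: str = ""
    vector_size: int = 0

    @abstractmethod
    def embed_text(self, text: str) -> Optional[list[float]]:
        pass


class HashingEmbedder(BaseEmbedder):
    """Hypothetical toy embedder: maps text to a fixed-size vector via SHA-256."""

    title = "Hashing Embedder (illustration only)"
    vector_size = 8

    def embed_text(self, text: str) -> Optional[list[float]]:
        if not text:
            return None  # nothing to embed
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first `vector_size` bytes into the range [0.0, 1.0].
        return [byte / 255 for byte in digest[: self.vector_size]]


embedder = HashingEmbedder()
vector = embedder.embed_text("indemnification clause")
print(len(vector))  # 8
```

The length of the returned list matches `vector_size`, which is presumably what downstream vector storage relies on.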
Creating New Components¶
To create a new pipeline component:
- Choose the appropriate base class (BaseParser, BaseThumbnailGenerator, or BaseEmbedder)
- Create a new class inheriting from the base class
- Implement required abstract methods
- Set component metadata (title, description, author, etc.)
- Register the component in the appropriate settings dictionary
Example of a new parser:
from typing import Optional

from opencontractserver.pipeline.base.file_types import FileTypeEnum
from opencontractserver.pipeline.base.parser import BaseParser

# OpenContractDocExport must also be imported from the module that defines it;
# its import path is not shown in this document.


class MyCustomParser(BaseParser):
    title = "My Custom Parser"
    description = "Parses documents in a custom way"
    author = "Your Name"
    dependencies = ["custom-lib>=1.0.0"]
    supported_file_types = [FileTypeEnum.PDF]

    def parse_document(
        self, user_id: int, doc_id: int, **kwargs
    ) -> Optional[OpenContractDocExport]:
        # Implementation here
        pass
Then register it in settings:
PREFERRED_PARSERS = {
    "application/pdf": "path.to.your.MyCustomParser",
    # ... other parsers
}
Best Practices¶
- Error Handling: Always handle exceptions gracefully and return None on failure
- Dependencies: List all required dependencies in the component's dependencies list
- Documentation: Provide clear docstrings and type hints
- Testing: Create unit tests for your component in the tests directory
- Metadata: Fill out all metadata fields (title, description, author)
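The testing and error-handling practices above can be combined in a small unit test. This is a sketch using the standard-library `unittest` module; `EchoParser` is a hypothetical stand-in for a real component, and the project's own test suite may rely on different base classes or fixtures.

```python
import unittest
from typing import Optional


class EchoParser:
    """Hypothetical stand-in for a parser component; not part of OpenContracts."""

    def parse_document(self, user_id: int, doc_id: int, **kwargs) -> Optional[dict]:
        try:
            text = kwargs["text"]  # raises KeyError when no text is supplied
            return {"title": f"doc-{doc_id}", "content": text}
        except KeyError:
            return None


class EchoParserTests(unittest.TestCase):
    def test_returns_export_on_success(self):
        result = EchoParser().parse_document(user_id=1, doc_id=7, text="hello")
        self.assertEqual(result, {"title": "doc-7", "content": "hello"})

    def test_returns_none_on_failure(self):
        self.assertIsNone(EchoParser().parse_document(user_id=1, doc_id=7))


# Run the suite programmatically so the snippet also works outside a test runner.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(EchoParserTests)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Covering both the success path and the return-None path keeps the component's failure behavior from regressing silently.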
Advanced Topics¶
Parallel Processing¶
The pipeline system supports parallel processing through Celery tasks. Each component can be executed asynchronously:
from opencontractserver.tasks.doc_tasks import process_document
# Async document processing
process_document.delay(user_id, doc_id)
Custom File Types¶
To add support for new file types:
- Add the MIME type to ALLOWED_DOCUMENT_MIMETYPES in settings
- Update FileTypeEnum in base/file_types.py
- Create appropriate parser/thumbnailer/embedder implementations
- Register the implementations in settings
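The steps above might look roughly like the following for a hypothetical DOCX type. Everything here is illustrative: the enum is a local stand-in whose member values are assumed, `ALLOWED_DOCUMENT_MIMETYPES` is assumed to be a list, `path.to.your.DocxParser` is a placeholder, and `missing_registrations` is a hypothetical helper for checking that every allowed MIME type has a registered component.

```python
from enum import Enum


class FileTypeEnum(str, Enum):
    """Local stand-in; the real enum lives in base/file_types.py."""

    PDF = "pdf"
    TXT = "txt"
    DOCX = "docx"  # hypothetical new member


DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

# Step 1: allow the new MIME type (assumed here to be a plain list).
ALLOWED_DOCUMENT_MIMETYPES = ["application/pdf", "text/plain", DOCX_MIME]

# Step 4: register an implementation for it (placeholder path).
PREFERRED_PARSERS = {
    "application/pdf": "opencontractserver.pipeline.parsers.docling_parser_rest.DoclingParser",
    "text/plain": "opencontractserver.pipeline.parsers.oc_text_parser.TxtParser",
    DOCX_MIME: "path.to.your.DocxParser",
}


def missing_registrations(allowed: list[str], registry: dict[str, str]) -> list[str]:
    """Return the allowed MIME types that have no entry in `registry`."""
    return [mime for mime in allowed if mime not in registry]


print(missing_registrations(ALLOWED_DOCUMENT_MIMETYPES, PREFERRED_PARSERS))  # []
```

A check like this could run in a unit test to catch a MIME type that was allowed but never given a parser, thumbnailer, or embedder.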
Error Handling¶
Components should implement robust error handling:
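import logging

logger = logging.getLogger(__name__)


def parse_document(self, user_id: int, doc_id: int, **kwargs):
    try:
        # Implementation: build and return the parsed result here
        return result
    except Exception as e:
        # Log the failure and signal it to the caller by returning None
        logger.error(f"Error parsing document {doc_id}: {e}")
        return None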
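When several components share this log-and-return-None pattern, it can be factored into a decorator. The sketch below is one possible approach, not something the project is known to provide; `return_none_on_error` and `FragileParser` are hypothetical names.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def return_none_on_error(func):
    """Log any exception raised by `func` and return None instead of propagating it."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # logger.exception records the message together with the traceback.
            logger.exception("Error in %s", func.__qualname__)
            return None

    return wrapper


class FragileParser:
    """Hypothetical component used only to exercise the decorator."""

    @return_none_on_error
    def parse_document(self, user_id: int, doc_id: int, **kwargs):
        if doc_id < 0:
            raise ValueError("doc_id must be non-negative")
        return {"doc_id": doc_id}


parser = FragileParser()
print(parser.parse_document(1, 5))   # {'doc_id': 5}
print(parser.parse_document(1, -1))  # None (the error is logged)
```

Using `logger.exception` rather than `logger.error` preserves the traceback, which tends to make failures in asynchronous tasks easier to diagnose.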
Contributing¶
When contributing new pipeline components:
- Follow the project's coding style
- Add comprehensive tests
- Update this documentation
- Submit a pull request with a clear description
For questions or support, please open an issue on the GitHub repository.