PDF Data Processing Architecture
Overview
OpenContracts uses a pluggable document processing pipeline. It has evolved from the original PAWLs/Grobid approach to support multiple parsing backends, including ML-based parsers, while remaining backward compatible with the PAWLs data format.
Current Architecture
Parser Pipeline System
OpenContracts implements a modular pipeline architecture with three main parser options:
- Docling Parser (Primary) - IBM's advanced ML-based parser running as a REST microservice
    - Superior layout understanding and table extraction
    - Intelligent OCR with automatic detection
    - Hierarchical document structure extraction
    - Group relationship detection for contract clauses
- NLM Ingest Parser - Alternative parser using the NLM Ingest library
    - Faster processing for standard PDFs
    - Good layout analysis without ML overhead
    - Suitable for documents not requiring OCR
- Text Parser - Simple parser for plain text and markdown files
    - Direct text extraction
    - Minimal processing overhead
    - Preserves original formatting
Data Layers
OpenContracts maintains a multi-layered data architecture that provides a consistent interface regardless of the parsing backend used:
1. PAWLs Layer (JSON)
The PAWLs (PDF Annotation With Labels and Structure) layer remains the core data format. It stores:

- Individual tokens (words) with precise bounding box coordinates
- Page dimensions and layout information
- Token-level positional data enabling pixel-perfect annotation overlay
- Hierarchical structure information (headers, paragraphs, lists)
    {
      "pawls_file_content": [
        {
          "page": {"width": 612, "height": 792, "index": 0},
          "tokens": [
            {
              "text": "Contract",
              "bbox": {"x": 100, "y": 50, "width": 80, "height": 15}
            }
          ]
        }
      ]
    }
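To illustrate how token-level bounding boxes support annotation overlay, here is a minimal sketch that scales a token's box from page units to rendered pixel coordinates. It assumes the JSON shape shown above and a uniform scale factor; the function name `scale_bbox` and the rendered width of 1224 pixels are hypothetical and not part of OpenContracts.

```python
from typing import TypedDict


class BBox(TypedDict):
    x: float
    y: float
    width: float
    height: float


def scale_bbox(bbox: BBox, page_width: float, rendered_width: float) -> BBox:
    """Scale a token bbox from page units to rendered pixels.

    Assumes the page is rendered with the same scale on both axes.
    """
    scale = rendered_width / page_width
    return {
        "x": bbox["x"] * scale,
        "y": bbox["y"] * scale,
        "width": bbox["width"] * scale,
        "height": bbox["height"] * scale,
    }


# Values taken from the example PAWLs page above.
page = {"width": 612, "height": 792, "index": 0}
token = {"text": "Contract", "bbox": {"x": 100, "y": 50, "width": 80, "height": 15}}

# Hypothetical viewer that renders the page 1224 pixels wide (2x zoom).
scaled = scale_bbox(token["bbox"], page["width"], rendered_width=1224)
```

Because the box is stored in page units rather than pixels, the same token data can be overlaid at any zoom level.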
2. Text Layer
A pure text extraction built from the PAWLs layer that:

- Preserves reading order
- Maintains paragraph and section boundaries
- Enables full-text search and NLP processing
- Provides character-level position mapping back to PAWLs tokens
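The character-to-token mapping can be sketched as follows. This is an illustrative simplification, not the OpenContracts implementation: it joins tokens with a single space and ignores paragraph boundaries, and the names `TokenSpan`, `build_text_layer`, and `token_at` are hypothetical.

```python
from typing import NamedTuple, Optional


class TokenSpan(NamedTuple):
    page_index: int
    token_index: int
    start: int  # inclusive character offset in the text layer
    end: int    # exclusive character offset in the text layer


def build_text_layer(pages: list[dict]) -> tuple[str, list[TokenSpan]]:
    """Concatenate tokens in reading order and record each token's span."""
    parts: list[str] = []
    spans: list[TokenSpan] = []
    cursor = 0
    for page in pages:
        page_index = page["page"]["index"]
        for token_index, token in enumerate(page["tokens"]):
            if parts:
                # Simplification: separate every token with one space.
                parts.append(" ")
                cursor += 1
            text = token["text"]
            spans.append(TokenSpan(page_index, token_index, cursor, cursor + len(text)))
            parts.append(text)
            cursor += len(text)
    return "".join(parts), spans


def token_at(spans: list[TokenSpan], offset: int) -> Optional[TokenSpan]:
    """Return the token span covering a character offset, if any."""
    for span in spans:
        if span.start <= offset < span.end:
            return span
    return None


pages = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [
            {"text": "Contract", "bbox": {"x": 100, "y": 50, "width": 80, "height": 15}},
            {"text": "Agreement", "bbox": {"x": 185, "y": 50, "width": 90, "height": 15}},
        ],
    }
]
text, spans = build_text_layer(pages)
```

A search hit at a given character offset can then be resolved to a page and token, and from there to a bounding box for highlighting.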
3. Annotation Layer
Structural and semantic annotations including:

- Document structure (headers, sections, paragraphs)
- Detected entities and labels
- User-created annotations
- Relationships between document elements
4. Relationship Layer (New)
Advanced parsers like Docling can detect relationships between document elements:

- Parent-child hierarchies (section → subsection)
- Cross-references between clauses
- Grouped elements (related paragraphs, list items)
- Table cell relationships
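One way to picture these relationships is as labelled edges between annotation identifiers. The sketch below is an assumption made for illustration: the `Relationship` class, the labels `parent_of` and `references`, and the identifiers are hypothetical and do not reflect the actual OpenContracts schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    label: str                    # hypothetical labels: "parent_of", "references"
    source_ids: tuple[str, ...]   # annotation ids the relationship starts from
    target_ids: tuple[str, ...]   # annotation ids the relationship points to


def children_of(relationships: list[Relationship], parent_id: str) -> list[str]:
    """Collect the direct children of an element from parent-child edges."""
    children: list[str] = []
    for rel in relationships:
        if rel.label == "parent_of" and parent_id in rel.source_ids:
            children.extend(rel.target_ids)
    return children


relationships = [
    # Section 1 contains two subsections.
    Relationship("parent_of", ("sec-1",), ("sec-1.1", "sec-1.2")),
    # Subsection 1.1 cross-references section 3.
    Relationship("references", ("sec-1.1",), ("sec-3",)),
]
subsections = children_of(relationships, "sec-1")
```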
Processing Pipeline
```mermaid
graph LR
    A[PDF Upload] --> B{Parser Selection}
    B --> C[Docling REST API]
    B --> D[NLM Ingest]
    B --> E[Text Parser]
    C --> F[PAWLs Generation]
    D --> F
    E --> F
    F --> G[Text Extraction]
    F --> H[Annotation Creation]
    F --> I[Relationship Mapping]
    G --> J[Searchable Text Layer]
    H --> K[Visual Annotations]
    I --> L[Document Graph]
```
Evolution from Original PAWLs
The original OpenContracts implementation used:

- Grobid for layout analysis
- Tesseract for OCR
- Re-OCR of every document for consistency

The current system has evolved to:

- Use modern ML models for better accuracy
- Support multiple parsing backends
- Preserve embedded text when appropriate
- Only apply OCR when needed (configurable)
- Extract richer structural information
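The idea of applying OCR only when needed could be approximated as follows. This is a hypothetical heuristic written for illustration, not how Docling or OpenContracts makes the decision: the function `needs_ocr`, the `min_chars_per_page` threshold, and the `force_ocr` flag are all assumptions.

```python
def needs_ocr(
    page_texts: list[str],
    min_chars_per_page: int = 20,   # assumed threshold, not a real setting
    force_ocr: bool = False,        # assumed stand-in for a configuration switch
) -> bool:
    """Decide whether a document should be sent through OCR.

    page_texts holds the embedded text already extracted from each page.
    OCR is requested when it is forced, when there are no pages of text,
    or when any page has too little embedded text to be trusted.
    """
    if force_ocr:
        return True
    if not page_texts:
        return True
    return any(len(text.strip()) < min_chars_per_page for text in page_texts)


born_digital = ["This page has plenty of embedded text."]
partly_scanned = ["This page has plenty of embedded text.", ""]
```

Under this sketch a born-digital PDF keeps its embedded text, while a document with a scanned page is routed to OCR.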
Key Improvements
- Better Accuracy: ML-based parsers provide superior layout understanding
- Flexibility: Choose the right parser for your document types
- Performance: Microservice architecture enables better scaling
- Rich Structure: Extract hierarchies and relationships, not just text
- Selective OCR: Only OCR when needed, preserving original text quality
Maintaining Compatibility
Despite the architectural evolution, OpenContracts maintains full compatibility:

- PAWLs format remains the standard interface
- All parsers output to the same data structure
- Existing annotations and tools continue to work
- Text-to-position mapping preserved
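The point that all parsers emit the same structure can be sketched with a shared interface. The `DocumentParser` protocol, the toy `WhitespaceTextParser`, and the zero-sized placeholder bounding boxes below are hypothetical and only illustrate the pattern; the real OpenContracts parser classes may look different.

```python
from typing import Protocol


class DocumentParser(Protocol):
    """Anything with a parse() method returning PAWLs-shaped data."""

    def parse(self, data: bytes) -> dict: ...


class WhitespaceTextParser:
    """Toy parser: one page, one token per whitespace-separated word."""

    def parse(self, data: bytes) -> dict:
        words = data.decode("utf-8").split()
        tokens = [
            # Placeholder boxes: plain text has no layout information.
            {"text": word, "bbox": {"x": 0, "y": 0, "width": 0, "height": 0}}
            for word in words
        ]
        page = {"page": {"width": 0, "height": 0, "index": 0}, "tokens": tokens}
        return {"pawls_file_content": [page]}


def count_tokens(parser: DocumentParser, data: bytes) -> int:
    """Downstream code depends only on the shared output shape."""
    result = parser.parse(data)
    return sum(len(page["tokens"]) for page in result["pawls_file_content"])


token_count = count_tokens(WhitespaceTextParser(), b"This Agreement is binding")
```

Because `count_tokens` only reads `pawls_file_content`, swapping in a different parser would not require changes to the consuming code.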
Configuration
Parsers are configured in Django settings:
    # Map each MIME type to the dotted path of its parser class
    PREFERRED_PARSERS = {
        "application/pdf": "opencontractserver.pipeline.parsers.docling_parser_rest.DoclingParser",
        "text/plain": "opencontractserver.pipeline.parsers.oc_text_parser.TxtParser",
    }

    # Parser-specific settings
    DOCLING_PARSER_SERVICE_URL = "http://docling-parser:8000/parse/"
    DOCLING_PARSER_TIMEOUT = 300
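A dotted-path setting like `PREFERRED_PARSERS` is typically resolved to a class at runtime. The sketch below shows one way to do that with the standard library; it is not the OpenContracts loader, and `load_class`, `parser_path_for`, and the demo registry (which points at a standard-library class so the example runs without OpenContracts installed) are hypothetical. In a Django project, `django.utils.module_loading.import_string` performs the same kind of lookup.

```python
import importlib
from typing import Optional


def load_class(dotted_path: str) -> type:
    """Import 'package.module.ClassName' and return the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


def parser_path_for(mimetype: str, preferred: dict[str, str]) -> Optional[str]:
    """Look up the configured dotted path for a MIME type, if any."""
    return preferred.get(mimetype)


# Demo registry using a stdlib class in place of a real parser.
demo_registry = {"application/json": "json.JSONDecoder"}

path = parser_path_for("application/json", demo_registry)
loaded = load_class(path) if path is not None else None
missing = parser_path_for("image/png", demo_registry)
```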
Limitations and Trade-offs
Current Limitations
- OCR Quality: While improved, OCR can still make errors (O vs 0, I vs 1)
- Processing Time: ML-based parsers are slower than simple text extraction
- Resource Usage: Advanced parsers require more memory and CPU
- Format Support: Currently limited to PDF and text formats
Design Trade-offs
- Accuracy vs Speed: ML parsers are more accurate but slower
- Flexibility vs Complexity: Multiple parsers add configuration complexity
- Consistency vs Fidelity: Standardizing to PAWLs format may lose some format-specific features
Future Directions
- Additional Format Support: DOCX, XLSX, HTML parsing
- Streaming Processing: Handle very large documents efficiently
- Custom Parser Plugins: Easy integration of domain-specific parsers
- Enhanced Relationships: More sophisticated document graph analysis
- Hybrid Processing: Combine multiple parsers for optimal results