Docling Parser (REST API)

Intro

The Docling Parser is an advanced PDF document parser based on IBM's Docling document processing pipeline. As of 3.0.0-alpha1, it is the primary parser for PDF documents in OpenContracts. The parser runs as a microservice and is accessed through a REST API, which provides better dependency isolation and scalability.

Perhaps its coolest feature, besides its support for multiple cutting-edge OCR engines and numerous formats, is its ability to collect related document features into groups. We've found this to be particularly useful for contract layouts, so we set up our Docling integration to import these groups as OpenContracts "relationships", which, if you're not familiar, map N source annotations to N target annotations. In the case of the Docling parser, these look like this (AWESOME):

Docling Group Relationships

Architecture

```mermaid
sequenceDiagram
participant U as User
participant DP as DoclingParser (REST Client)
participant DS as Docling Service (Microservice)
participant DC as DocumentConverter
participant OCR as Tesseract OCR
participant DB as Database

U->>DP: parse_document(user_id, doc_id)
DP->>DB: Load document
DP->>DS: HTTP POST with document + settings
DS->>DC: Convert PDF

alt PDF needs OCR
    DC->>OCR: Process PDF
    OCR-->>DC: OCR results
end

DC-->>DS: DoclingDocument
DS->>DS: Process structure
DS->>DS: Generate PAWLS tokens
DS->>DS: Build relationships
DS-->>DP: JSON response
DP->>DB: Store parsed data
DP-->>U: OpenContractDocExport

```

Features

  • Microservice Architecture: Runs Docling in an isolated container for better dependency management
  • Intelligent OCR: Automatically detects when OCR is needed
  • Hierarchical Structure: Extracts document structure (headings, paragraphs, lists)
  • Token-based Annotations: Creates precise token-level annotations
  • Relationship Detection: Builds relationships between document elements
  • PAWLS Integration: Generates PAWLS-compatible token data
  • Async Processing: Non-blocking REST API calls with configurable timeouts

Configuration

The Docling Parser is configured through Django settings:

# Configure the parser in settings
INSTALLED_PARSERS = [
    "opencontractserver.pipeline.parsers.docling_parser_rest.DoclingParser",
]

# Configure the Docling microservice URL
DOCLING_PARSER_SERVICE_URL = "http://docling-parser:8000/parse/"

# Configure request timeout (in seconds)
DOCLING_PARSER_TIMEOUT = 300  # 5 minutes default

# Optional: Enable OCR for scanned documents
DOCLING_ENABLE_OCR = True

# The microservice itself needs models path configured via environment
# DOCLING_MODELS_PATH = "/models/docling"
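
If you prefer to drive these settings from the environment, something like the following could work. This is a minimal sketch, not the project's actual settings code: the helper name `docling_settings_from_env` and the choice of environment variable names are assumptions, and the defaults simply mirror the values shown above.

```python
import os
from typing import Mapping, Optional


def docling_settings_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the Docling-related settings from environment variables.

    Hypothetical helper: the variable names reuse the setting names above,
    which is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    ocr_flag = env.get("DOCLING_ENABLE_OCR", "true").strip().lower()
    return {
        "DOCLING_PARSER_SERVICE_URL": env.get(
            "DOCLING_PARSER_SERVICE_URL", "http://docling-parser:8000/parse/"
        ),
        # Timeout is expressed in seconds; 300 is the documented default.
        "DOCLING_PARSER_TIMEOUT": int(env.get("DOCLING_PARSER_TIMEOUT", "300")),
        # Treat "1", "true" and "yes" (any case) as enabled.
        "DOCLING_ENABLE_OCR": ocr_flag in {"1", "true", "yes"},
    }
```

In a settings module you could then call `globals().update(docling_settings_from_env())`, which keeps the defaults in one place.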

Microservice Setup

The Docling microservice runs in a separate Docker container:

# docker-compose.yml
services:
  docling-parser:
    image: opencontracts/docling-parser:latest
    ports:
      - "8000:8000"
    environment:
      - DOCLING_MODELS_PATH=/models/docling
      - ENABLE_OCR=true
      - MAX_WORKERS=4
    volumes:
      - docling_models:/models/docling

# Named volumes must also be declared at the top level
volumes:
  docling_models:

Usage

Basic usage:

from opencontractserver.pipeline.parsers.docling_parser_rest import DoclingParser

parser = DoclingParser()
result = parser.parse_document(user_id=1, doc_id=123)

With options:

result = parser.parse_document(
    user_id=1,
    doc_id=123,
    force_ocr=True,  # Force OCR processing
    roll_up_groups=True,  # Combine related items into groups
)

Input

The parser expects:

  • A PDF document stored in Django's storage system
  • A valid user ID and document ID
  • Optional configuration parameters passed to the microservice

Output

The parser returns an OpenContractDocExport dictionary containing:

{
    "title": str,  # Extracted document title
    "description": str,  # Generated description
    "content": str,  # Full text content
    "page_count": int,  # Number of pages
    "pawls_file_content": List[dict],  # PAWLS token data
    "labelled_text": List[dict],  # Structural annotations
    "relationships": List[dict],  # Relationships between annotations
    "doc_labels": List[dict],  # Document-level labels
}
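
Because the result crosses a service boundary, it can be worth checking its shape before using it. The sketch below checks only the top-level fields and types listed above; the helper name `find_export_problems` is hypothetical and is not part of OpenContracts.

```python
from typing import Any, Dict, List

# Field names and types are taken from the listing above.
EXPECTED_FIELDS: Dict[str, type] = {
    "title": str,
    "description": str,
    "content": str,
    "page_count": int,
    "pawls_file_content": list,
    "labelled_text": list,
    "relationships": list,
    "doc_labels": list,
}


def find_export_problems(export: Any) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks fine."""
    if not isinstance(export, dict):
        return [f"expected a dict, got {type(export).__name__}"]
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in export:
            problems.append(f"missing field: {name}")
        elif not isinstance(export[name], expected):
            actual = type(export[name]).__name__
            problems.append(f"{name} should be {expected.__name__}, got {actual}")
    return problems
```

Passing `None` (what the client returns on failure) yields a single problem entry rather than raising, so the same check covers both failed and malformed results.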

Processing Steps

  1. Document Loading
     • Loads PDF from Django storage
     • Encodes document as base64 for transmission

  2. REST API Call
     • Sends document to Docling microservice
     • Includes processing parameters (OCR, grouping, etc.)
     • Handles timeout and retry logic

  3. Microservice Processing
     • Converts PDF using Docling's DocumentConverter
     • Applies OCR if needed
     • Extracts document structure
     • Creates PAWLS-compatible tokens
     • Builds spatial indices for token lookup
     • Transforms coordinates to screen space

  4. Response Processing
     • Validates JSON response structure
     • Converts to OpenContractDocExport format
     • Handles errors gracefully

  5. Metadata Extraction
     • Extracts document title
     • Generates description
     • Counts pages
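
The loading and request steps can be sketched as follows. The JSON field name `pdf_base64` and the helper name are assumptions made for illustration; the real request schema is defined by the microservice. `force_ocr` and `roll_up_groups` reuse the option names from the usage examples.

```python
import base64
import json


def build_parse_request(
    pdf_bytes: bytes, force_ocr: bool = False, roll_up_groups: bool = False
) -> dict:
    """Build a JSON-serializable request body for the parse endpoint.

    Hypothetical payload layout: only the base64 encoding step and the two
    option names come from the documentation; the field names are assumed.
    """
    return {
        # Raw bytes cannot go into JSON, so encode them as base64 text.
        "pdf_base64": base64.b64encode(pdf_bytes).decode("ascii"),
        "force_ocr": force_ocr,
        "roll_up_groups": roll_up_groups,
    }


if __name__ == "__main__":
    body = build_parse_request(b"%PDF-1.7 example bytes", force_ocr=True)
    print(json.dumps(body, indent=2))
```

Base64 increases the payload size by roughly a third, which is worth keeping in mind when choosing timeouts for large documents.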

Advanced Features

OCR Processing

The parser can use Tesseract OCR when needed:

# Force OCR processing
result = parser.parse_document(user_id=1, doc_id=123, force_ocr=True)

Group Relationships

Enable group relationship detection:

# Enable group rollup
result = parser.parse_document(user_id=1, doc_id=123, roll_up_groups=True)

Spatial Processing

The microservice uses Shapely for spatial operations:

  • Creates STRtrees for efficient spatial queries
  • Handles coordinate transformations
  • Manages token-annotation mapping
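
To make the coordinate transformation and token-annotation mapping concrete, here is a dependency-free sketch. It assumes PDF boxes use a bottom-left origin and screen boxes use a top-left origin, which is common but not confirmed by this page, and it uses a plain linear scan where the service uses a Shapely STRtree. All function names are hypothetical.

```python
from typing import List, Tuple

# Screen-space box: (left, top, right, bottom), origin at the top-left corner.
Box = Tuple[float, float, float, float]


def pdf_box_to_screen(
    x0: float, y0: float, x1: float, y1: float, page_height: float
) -> Box:
    """Flip the y-axis of a PDF box (bottom-left origin, y0 < y1)."""
    return (x0, page_height - y1, x1, page_height - y0)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True when two screen-space boxes share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def tokens_in_annotation(tokens: List[Box], annotation: Box) -> List[int]:
    """Return indices of tokens overlapping the annotation.

    This is a linear scan standing in for an STRtree query; a spatial index
    returns the same candidates without checking every token.
    """
    return [i for i, token in enumerate(tokens) if boxes_overlap(token, annotation)]
```

For a US Letter page (792 points high), `pdf_box_to_screen(10, 700, 110, 720, 792)` gives `(10, 72, 110, 92)`: the box that sat near the top of the PDF page ends up near the top of the screen.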

Error Handling

The REST client includes robust error handling:

  • Connection Errors: Logs error and returns None
  • Timeout Errors: Configurable timeout prevents hanging requests
  • Service Unavailable: Gracefully degrades to fallback parsers
  • Invalid Responses: Validates JSON structure before processing
  • Document Processing Errors: Detailed error messages from service

Example error handling:

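import logging

import requests

logger = logging.getLogger(__name__)


# Illustrative wrapper around the client's request logic; the function
# name and signature are hypothetical.
def post_to_docling(service_url, request_data, request_timeout, doc_id):
    try:
        response = requests.post(
            service_url,
            json=request_data,
            timeout=request_timeout,
        )
        # Turn 4xx/5xx responses into exceptions handled below
        response.raise_for_status()
    except requests.exceptions.Timeout:
        logger.error("Timeout parsing document %s", doc_id)
        return None
    except requests.exceptions.RequestException as e:
        logger.error("Error calling Docling service: %s", e)
        return None
    return response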
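
Since failures surface as `None`, a caller can layer retries on top. The sketch below is one possible approach, assuming exponential backoff is acceptable; this page does not describe how the client's own retry logic works, and `call_with_retries` is a hypothetical helper.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], Optional[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `call` until it returns a non-None value or attempts run out.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so on
    between attempts. `sleep` is injectable so tests do not have to wait.
    """
    for attempt in range(attempts):
        result = call()
        if result is not None:
            return result
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

A call such as `call_with_retries(lambda: parser.parse_document(user_id=1, doc_id=123))` would then retry up to three times before giving up. Retrying only helps with transient failures; a document that always times out needs a longer timeout instead.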

Performance Considerations

  • Network Latency: REST API adds minimal overhead (~100ms)
  • Service Scaling: Can run multiple Docling service instances
  • Memory Isolation: Service memory usage doesn't affect main application
  • Processing Time: Typically 5-10 seconds per page for complex documents
  • Timeout Handling: Configurable timeout prevents hanging requests
  • Large Documents: May require increased timeout settings
  • Concurrent Processing: Service can handle multiple requests
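
As a rough guide for large documents, the per-page figure above can be turned into a timeout estimate. This is a back-of-the-envelope sketch: the 30-second overhead is an assumed allowance for upload and startup, and the helper name is hypothetical.

```python
def estimate_timeout(
    page_count: int,
    seconds_per_page: float = 10.0,
    overhead_seconds: float = 30.0,
    minimum_seconds: float = 300.0,
) -> float:
    """Estimate a request timeout in seconds from the page count.

    Uses the upper end of the 5-10 seconds per page range and never goes
    below the default 300-second timeout.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return max(minimum_seconds, overhead_seconds + page_count * seconds_per_page)
```

With these defaults a 10-page document stays at the 300-second default, while a 100-page document works out to 30 + 100 × 10 = 1030 seconds.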

Best Practices

  1. OCR Usage
     • Let the parser auto-detect OCR needs
     • Only use force_ocr=True when necessary

  2. Group Relationships
     • Start with roll_up_groups=False
     • Enable if hierarchical grouping is needed

  3. Error Handling
     • Always check return values
     • Monitor service logs for issues
     • Implement fallback parsers

  4. Memory Management
     • Monitor service container memory
     • Scale horizontally for large workloads
     • Use appropriate timeout values

  5. Service Health
     • Implement health checks for the service
     • Monitor service availability
     • Use container orchestration for resilience
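
The advice to check return values and implement fallback parsers can be combined in a small helper. This is a sketch under the assumption that every parser exposes a callable taking `(user_id, doc_id)` and returning a result or `None`; `parse_with_fallback` is a hypothetical name and not part of OpenContracts.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

logger = logging.getLogger(__name__)

# A parse function takes (user_id, doc_id) and returns an export dict or None.
ParseFn = Callable[[int, int], Optional[dict]]


def parse_with_fallback(
    parsers: Sequence[Tuple[str, ParseFn]], user_id: int, doc_id: int
) -> Tuple[Optional[str], Optional[dict]]:
    """Try each parser in order; return (name, result) for the first success."""
    for name, parse in parsers:
        try:
            result = parse(user_id, doc_id)
        except Exception:
            # A crashing parser should not stop the next one from running.
            logger.exception("Parser %s raised for document %s", name, doc_id)
            continue
        if result is not None:
            return name, result
        logger.warning("Parser %s returned None for document %s", name, doc_id)
    return None, None
```

Returning the parser name alongside the result makes it easy to log or record which parser actually produced the stored data.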

Troubleshooting

Common issues and solutions:

  1. Connection Refused

    ConnectionError: Cannot connect to Docling service

     • Check that Docling service is running
     • Verify DOCLING_PARSER_SERVICE_URL is correct
     • Check Docker network configuration

  2. Timeouts

    Timeout error after 300 seconds

     • Increase DOCLING_PARSER_TIMEOUT for large documents
     • Check service resource allocation
     • Monitor service logs for errors

  3. Missing Models (Service)

    FileNotFoundError: Docling models path does not exist

     • Verify service DOCLING_MODELS_PATH environment
     • Check model volume mount
     • Ensure models are downloaded

  4. OCR Failures

    Error: OCR processing failed

     • Verify OCR is enabled in service
     • Check Tesseract installation in container
     • Review service logs for OCR errors

  5. Memory Issues (Service)

    Container killed: Out of memory

     • Increase Docker container memory limits
     • Reduce concurrent processing workers
     • Process smaller batches
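
When diagnosing a refused connection, it helps to separate network problems from application problems. The sketch below only tests whether a TCP connection to the configured host and port can be opened; it does not call any service endpoint, and both helper names are hypothetical.

```python
import socket
from typing import Tuple
from urllib.parse import urlsplit


def service_host_port(url: str) -> Tuple[str, int]:
    """Extract (host, port) from a service URL, defaulting the port by scheme."""
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError(f"no host in URL: {url!r}")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def can_connect(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = service_host_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures and timeouts.
        return False
```

Run `can_connect("http://docling-parser:8000/parse/")` from inside the Django container: a `False` result points at the service not running or at Docker networking, while `True` suggests the problem lies in the request or the service itself.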

Dependencies

Client (Django app):

  • requests: HTTP client for REST API calls
  • Django storage system for document access

Service (Microservice container):

  • docling: Core document processing
  • pytesseract: OCR support
  • pdf2image: PDF rendering
  • shapely: Spatial operations
  • numpy: Numerical operations
  • fastapi: REST API framework

See Also