NLM Ingest Parser

Overview

The NLM Ingest Parser is an alternative PDF document parser that leverages the NLM Ingest library for document processing. It provides robust PDF parsing capabilities with a focus on layout analysis and text extraction.

Architecture

```mermaid
sequenceDiagram
    participant U as User
    participant NP as NLMIngestParser
    participant NI as NLM Ingest Library
    participant PB as Poppler/PDFBox
    participant DB as Database

    U->>NP: parse_document(user_id, doc_id)
    NP->>DB: Load document
    NP->>NI: Process PDF
    NI->>PB: Extract layout/text
    PB-->>NI: Layout blocks
    NI-->>NP: Parsed document
    NP->>NP: Generate PAWLS tokens
    NP->>DB: Store parsed data
    NP-->>U: OpenContractDocExport
```

Features

  • Layout Analysis: Extracts document structure including headers, paragraphs, and tables
  • Text Extraction: Reliable text extraction from PDF documents
  • PAWLS Compatibility: Generates PAWLS-format token data for annotation
  • Metadata Extraction: Extracts document metadata and properties
  • Page-level Processing: Processes documents page by page for memory efficiency

Configuration

The NLM Ingest Parser is configured through Django settings:

# Configure the parser in settings
INSTALLED_PARSERS = [
    "opencontractserver.pipeline.parsers.nlm_ingest_parser.NLMIngestParser",
]

# Optional: Configure NLM Ingest settings
NLM_INGEST_CONFIG = {
    "parse_method": "auto",  # "auto", "pdfplumber", or "pypdf2"
    "extract_tables": True,
    "extract_images": False,
}
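
One way to consume these optional settings is to merge them over a set of defaults and validate `parse_method` early. The sketch below is an assumption about how that could be done, not the parser's actual code: `DEFAULT_CONFIG`, `ALLOWED_PARSE_METHODS`, and `resolve_config` are hypothetical names, and the defaults simply mirror the example above.

```python
from typing import Any, Dict, Optional

# Hypothetical defaults mirroring the NLM_INGEST_CONFIG example above.
DEFAULT_CONFIG: Dict[str, Any] = {
    "parse_method": "auto",
    "extract_tables": True,
    "extract_images": False,
}
ALLOWED_PARSE_METHODS = {"auto", "pdfplumber", "pypdf2"}


def resolve_config(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge user overrides over the defaults and validate the result."""
    config = {**DEFAULT_CONFIG, **(overrides or {})}
    unknown = set(config) - set(DEFAULT_CONFIG)
    if unknown:
        raise ValueError(f"Unknown NLM_INGEST_CONFIG keys: {sorted(unknown)}")
    if config["parse_method"] not in ALLOWED_PARSE_METHODS:
        raise ValueError(f"Unsupported parse_method: {config['parse_method']!r}")
    return config
```

In a Django project the overrides could be read with `getattr(settings, "NLM_INGEST_CONFIG", {})`, so that a missing setting falls back to the defaults.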

Usage

Basic usage:

from opencontractserver.pipeline.parsers.nlm_ingest_parser import NLMIngestParser

parser = NLMIngestParser()
result = parser.parse_document(user_id=1, doc_id=123)

With options:

result = parser.parse_document(
    user_id=1,
    doc_id=123,
    extract_tables=True,  # Extract table structures
    parse_method="pdfplumber",  # Specify parsing backend
)

Input

The parser expects:

  • A PDF document stored in Django's storage system
  • A valid user ID and document ID
  • Optional configuration parameters

Output

The parser returns an OpenContractDocExport dictionary containing:

{
    "title": str,  # Document title if available
    "description": str,  # Generated description
    "content": str,  # Full text content
    "page_count": int,  # Number of pages
    "pawls_file_content": List[dict],  # PAWLS token data
    "labelled_text": List[dict],  # Structural annotations
    "doc_labels": List[dict],  # Document-level labels
}
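
Because callers depend on this shape, it can help to check a result before storing it. The following is a minimal sketch under stated assumptions: `EXPECTED_FIELDS` and `find_export_problems` are hypothetical helpers, and the rule that `pawls_file_content` holds one entry per page is an assumption drawn from the per-page token format, not a documented guarantee.

```python
from typing import Any, Dict, List

# Field names and types taken from the output description above.
EXPECTED_FIELDS: Dict[str, type] = {
    "title": str,
    "description": str,
    "content": str,
    "page_count": int,
    "pawls_file_content": list,
    "labelled_text": list,
    "doc_labels": list,
}


def find_export_problems(export: Dict[str, Any]) -> List[str]:
    """Return readable problems; an empty list means the shape looks right."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in export:
            problems.append(f"missing field: {name}")
        elif not isinstance(export[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # Assumption: one PAWLS entry per page.
    if not problems and len(export["pawls_file_content"]) != export["page_count"]:
        problems.append("pawls_file_content length does not match page_count")
    return problems
```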

PAWLS Token Format

Each page in pawls_file_content contains:

{
  "page": {
    "width": 612,
    "height": 792,
    "index": 0
  },
  "tokens": [
    {
      "text": "Example",
      "bbox": {
        "x": 100,
        "y": 100,
        "width": 50,
        "height": 12
      }
    }
  ]
}
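
To illustrate how this structure can be consumed, the sketch below converts a token box to corner coordinates, rebuilds a page's text, and checks that tokens stay within the page. The helper names are hypothetical, and joining tokens with single spaces is a simplification that ignores line breaks.

```python
from typing import Any, Dict, Tuple


def token_corners(token: Dict[str, Any]) -> Tuple[float, float, float, float]:
    """Convert an x/y/width/height box to (left, top, right, bottom)."""
    box = token["bbox"]
    return (box["x"], box["y"], box["x"] + box["width"], box["y"] + box["height"])


def page_text(page: Dict[str, Any]) -> str:
    """Join token text in stored order, separated by single spaces."""
    return " ".join(token["text"] for token in page["tokens"])


def tokens_inside_page(page: Dict[str, Any]) -> bool:
    """Check that every token box lies within the page dimensions."""
    width, height = page["page"]["width"], page["page"]["height"]
    for token in page["tokens"]:
        left, top, right, bottom = token_corners(token)
        if left < 0 or top < 0 or right > width or bottom > height:
            return False
    return True
```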

Processing Steps

  1. Document Loading
     • Loads PDF from Django storage
     • Creates temporary file for processing

  2. NLM Ingest Processing
     • Parses PDF using NLM Ingest library
     • Extracts text blocks and layout information
     • Identifies document structure

  3. Token Generation
     • Converts text blocks to PAWLS tokens
     • Calculates bounding boxes
     • Preserves layout information

  4. Annotation Creation
     • Creates structural annotations
     • Labels sections, headers, paragraphs
     • Preserves reading order

  5. Cleanup
     • Removes temporary files
     • Returns parsed data
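
The Document Loading and Cleanup steps bracket everything else, so the temporary file should be removed even when parsing fails. The sketch below shows one way to structure that with the standard library. It is an illustration rather than the parser's implementation: `process_via_temp_file` is a hypothetical helper, and the `process` callback stands in for the NLM Ingest call.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def process_via_temp_file(pdf_bytes: bytes, process: Callable[[str], T]) -> T:
    """Write the PDF to a temporary file, process it, and always clean up."""
    # delete=False so the file can be reopened by path on every platform.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
        handle.write(pdf_bytes)
        temp_path = handle.name
    try:
        return process(temp_path)
    finally:
        # Runs on success and on failure, matching the Cleanup step.
        os.remove(temp_path)
```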

Implementation Details

The parser extends the BaseParser class:

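from typing import Optional

# Import paths below are assumed from the OpenContracts package layout.
from opencontractserver.pipeline.base.file_types import FileTypeEnum
from opencontractserver.pipeline.base.parser import BaseParser
from opencontractserver.types.dicts import OpenContractDocExport


class NLMIngestParser(BaseParser):
    title = "NLM Ingest Parser"
    description = "Parses PDF documents using NLM Ingest library"
    supported_file_types = [FileTypeEnum.PDF]

    def _parse_document_impl(
        self, user_id: int, doc_id: int, **kwargs
    ) -> Optional[OpenContractDocExport]:
        # Implementation using NLM Ingest; returns None when parsing fails
        pass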

Performance Considerations

  • Memory Usage: Processes pages sequentially to minimize memory
  • Processing Time: Typically 2-5 seconds per page
  • File Size: Can handle large PDF files efficiently
  • Concurrent Processing: Thread-safe for parallel processing
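
If the parser is used from several threads, a small thread pool is one way to bound both concurrency and memory. The sketch below assumes a `parse_one` callable that wraps a single `parse_document` call; `parse_batch` is a hypothetical helper, not part of the parser.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def parse_batch(
    parse_one: Callable[[int], Optional[dict]],
    doc_ids: Iterable[int],
    max_workers: int = 4,
) -> Dict[int, Optional[dict]]:
    """Parse several documents in parallel and map each doc_id to its result."""
    ids = list(doc_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with ids.
        results = list(pool.map(parse_one, ids))
    return dict(zip(ids, results))
```

A call might look like `parse_batch(lambda doc_id: parser.parse_document(user_id=1, doc_id=doc_id), [123, 124])`. At the 2-5 seconds per page quoted above, a 100-page document would take roughly 200 to 500 seconds on its own, so running documents in parallel mainly helps when a batch contains many of them.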

Comparison with Docling Parser

| Feature | NLM Ingest | Docling |
| --- | --- | --- |
| Speed | Faster | Slower |
| Accuracy | Good | Excellent |
| OCR Support | Limited | Full |
| Table Extraction | Good | Excellent |
| Memory Usage | Lower | Higher |
| Dependencies | Simpler | Complex |
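
The trade-offs in this comparison can be reduced to a simple selection rule. The function below is a hedged sketch of such a rule: `choose_parser` and the returned labels `"docling"` and `"nlm_ingest"` are hypothetical names used for illustration, not identifiers from the codebase.

```python
def choose_parser(needs_ocr: bool, complex_layout: bool = False) -> str:
    """Return "docling" for scanned or complex documents, else "nlm_ingest"."""
    # Docling is rated higher for OCR and accuracy; NLM Ingest is faster and lighter.
    if needs_ocr or complex_layout:
        return "docling"
    return "nlm_ingest"
```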

Best Practices

  1. Parser Selection
     • Use NLM Ingest for standard PDFs without OCR needs
     • Use Docling for complex layouts or scanned documents

  2. Configuration
     • Start with default settings
     • Enable table extraction only when needed

  3. Error Handling
     • Always check return values
     • Monitor logs for parsing errors

  4. Performance
     • Process large batches asynchronously
     • Monitor memory usage
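
The error-handling advice above can be wrapped in a small helper. Since `_parse_document_impl` is typed as returning an optional value, a `None` result is treated here as a failure worth logging. This is a sketch: `parse_or_none` and the `SupportsParse` protocol are hypothetical names, and swallowing exceptions is a design choice that suits batch jobs more than interactive use.

```python
import logging
from typing import Any, Optional, Protocol

logger = logging.getLogger(__name__)


class SupportsParse(Protocol):
    def parse_document(self, user_id: int, doc_id: int, **kwargs: Any) -> Optional[dict]: ...


def parse_or_none(
    parser: SupportsParse, user_id: int, doc_id: int, **kwargs: Any
) -> Optional[dict]:
    """Parse a document, logging failures instead of raising."""
    try:
        result = parser.parse_document(user_id=user_id, doc_id=doc_id, **kwargs)
    except Exception:
        logger.exception("Parsing raised for doc_id=%s", doc_id)
        return None
    if result is None:
        logger.warning("Parser returned no data for doc_id=%s", doc_id)
    return result
```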

Troubleshooting

Common Issues

  1. Import Errors

    ImportError: Cannot import nlm_ingestor

     • Install NLM Ingest: pip install nlm-ingestor
     • Check Python version compatibility

  2. Memory Issues

    MemoryError during parsing

     • Reduce batch size
     • Increase available memory
     • Use page-by-page processing

  3. Layout Detection Failures

    Warning: Could not detect layout

     • Try different parse_method settings
     • Check PDF structure/format
     • Consider using Docling parser

  4. Text Extraction Issues

    Error: No text extracted

     • Check if PDF is scanned (needs OCR)
     • Verify PDF is not corrupted
     • Try force text extraction mode
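
Import errors are easier to diagnose when missing packages are detected up front rather than mid-parse. The sketch below uses the standard library's `importlib.util.find_spec` to report which top-level modules are unavailable; `missing_modules` is a hypothetical helper.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [
        name for name in module_names if importlib.util.find_spec(name) is None
    ]
```

For example, `missing_modules(["nlm_ingestor", "pdfplumber", "PIL"])` lists whichever of those modules still need installing. Note that import names can differ from package names: the `pillow` package is imported as `PIL`.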

Dependencies

Required Python packages:

  • nlm-ingestor: Core parsing library
  • pdfplumber: PDF processing backend
  • pypdf2: Alternative PDF backend
  • pillow: Image processing support

Limitations

  • Limited OCR support (use Docling for OCR)
  • May struggle with complex layouts
  • Table extraction less sophisticated than Docling
  • No support for non-PDF formats

See Also