NLM Ingest Parser¶
Overview¶
The NLM Ingest Parser is an alternative PDF document parser that leverages the NLM Ingest library for document processing. It provides robust PDF parsing capabilities with a focus on layout analysis and text extraction.
Architecture¶
```mermaid
sequenceDiagram
    participant U as User
    participant NP as NLMIngestParser
    participant NI as NLM Ingest Library
    participant PB as Poppler/PDFBox
    participant DB as Database

    U->>NP: parse_document(user_id, doc_id)
    NP->>DB: Load document
    NP->>NI: Process PDF
    NI->>PB: Extract layout/text
    PB-->>NI: Layout blocks
    NI-->>NP: Parsed document
    NP->>NP: Generate PAWLS tokens
    NP->>DB: Store parsed data
    NP-->>U: OpenContractDocExport
```
Features¶
- Layout Analysis: Extracts document structure including headers, paragraphs, and tables
- Text Extraction: Reliable text extraction from PDF documents
- PAWLS Compatibility: Generates PAWLS-format token data for annotation
- Metadata Extraction: Extracts document metadata and properties
- Page-level Processing: Processes documents page by page for memory efficiency
Configuration¶
The NLM Ingest Parser is configured through Django settings:
```python
# Configure the parser in settings
INSTALLED_PARSERS = [
    "opencontractserver.pipeline.parsers.nlm_ingest_parser.NLMIngestParser",
]

# Optional: Configure NLM Ingest settings
NLM_INGEST_CONFIG = {
    "parse_method": "auto",  # "auto", "pdfplumber", or "pypdf2"
    "extract_tables": True,
    "extract_images": False,
}
```
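One way to consume these settings defensively is to merge them over a set of defaults and reject unknown backends. The sketch below is an assumption about how calling code might do that; `resolve_nlm_config`, `DEFAULT_NLM_INGEST_CONFIG`, and `ALLOWED_PARSE_METHODS` are hypothetical names and are not part of the parser.

```python
from typing import Any, Dict, Mapping, Optional

# Defaults mirror the example settings; hypothetical helper, not parser API.
DEFAULT_NLM_INGEST_CONFIG: Dict[str, Any] = {
    "parse_method": "auto",
    "extract_tables": True,
    "extract_images": False,
}
ALLOWED_PARSE_METHODS = ("auto", "pdfplumber", "pypdf2")


def resolve_nlm_config(overrides: Optional[Mapping[str, Any]] = None) -> Dict[str, Any]:
    """Merge overrides on top of the defaults and validate parse_method."""
    config = {**DEFAULT_NLM_INGEST_CONFIG, **(overrides or {})}
    if config["parse_method"] not in ALLOWED_PARSE_METHODS:
        raise ValueError(f"Unsupported parse_method: {config['parse_method']!r}")
    return config
```

In a Django project this could be called as `resolve_nlm_config(getattr(settings, "NLM_INGEST_CONFIG", None))`, so a missing setting falls back to the defaults.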
Usage¶
Basic usage:
```python
from opencontractserver.pipeline.parsers.nlm_ingest_parser import NLMIngestParser

parser = NLMIngestParser()
result = parser.parse_document(user_id=1, doc_id=123)
```
With options:
```python
result = parser.parse_document(
    user_id=1,
    doc_id=123,
    extract_tables=True,  # Extract table structures
    parse_method="pdfplumber",  # Specify parsing backend
)
```
Input¶
The parser expects:

- A PDF document stored in Django's storage system
- A valid user ID and document ID
- Optional configuration parameters
Output¶
The parser returns an `OpenContractDocExport` dictionary containing:

```python
{
    "title": str,  # Document title if available
    "description": str,  # Generated description
    "content": str,  # Full text content
    "page_count": int,  # Number of pages
    "pawls_file_content": List[dict],  # PAWLS token data
    "labelled_text": List[dict],  # Structural annotations
    "doc_labels": List[dict],  # Document-level labels
}
```
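Since best practice is to check return values, it can help to verify that a result carries every documented key before using it. The following is a minimal sketch under that assumption; `DocExportSketch` is an illustrative mirror of the fields listed here, not the real `OpenContractDocExport` type, and `missing_keys` is a hypothetical helper.

```python
from typing import Any, Dict, List, TypedDict


class DocExportSketch(TypedDict):
    """Illustrative mirror of the documented output fields (hypothetical name)."""

    title: str
    description: str
    content: str
    page_count: int
    pawls_file_content: List[dict]
    labelled_text: List[dict]
    doc_labels: List[dict]


# Field names in declaration order, taken from the sketch type above.
REQUIRED_KEYS = tuple(DocExportSketch.__annotations__)


def missing_keys(export: Dict[str, Any]) -> List[str]:
    """Return the documented keys that are absent from a parser result."""
    return [key for key in REQUIRED_KEYS if key not in export]
```

An empty list from `missing_keys(result)` means the result has the documented shape; it does not validate the values themselves.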
PAWLS Token Format¶
Each page in `pawls_file_content` contains:

```json
{
    "page": {
        "width": 612,
        "height": 792,
        "index": 0
    },
    "tokens": [
        {
            "text": "Example",
            "bbox": {
                "x": 100,
                "y": 100,
                "width": 50,
                "height": 12
            }
        }
    ]
}
```
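To illustrate how this structure can be consumed, here is a small sketch that converts a token's bounding box to corner coordinates and joins a page's tokens back into text. It assumes `x` and `y` give the top-left corner of the box and that tokens are stored in reading order; `bbox_corners` and `page_text` are hypothetical helpers.

```python
from typing import Any, Dict, Tuple


def bbox_corners(bbox: Dict[str, float]) -> Tuple[float, float, float, float]:
    """Convert an x/y/width/height box to (x0, y0, x1, y1) corners."""
    return (
        bbox["x"],
        bbox["y"],
        bbox["x"] + bbox["width"],
        bbox["y"] + bbox["height"],
    )


def page_text(page: Dict[str, Any]) -> str:
    """Join the token texts of one page with single spaces, in stored order."""
    return " ".join(token["text"] for token in page["tokens"])
```

For the example page shown here, `bbox_corners` on the single token's box gives `(100, 100, 150, 112)`.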
Processing Steps¶
- Document Loading
    - Loads PDF from Django storage
    - Creates temporary file for processing
- NLM Ingest Processing
    - Parses PDF using NLM Ingest library
    - Extracts text blocks and layout information
    - Identifies document structure
- Token Generation
    - Converts text blocks to PAWLS tokens
    - Calculates bounding boxes
    - Preserves layout information
- Annotation Creation
    - Creates structural annotations
    - Labels sections, headers, paragraphs
    - Preserves reading order
- Cleanup
    - Removes temporary files
    - Returns parsed data
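The load, process, and cleanup steps can be sketched with the standard library alone. This is not the parser's actual code: `parse_with_temp_file` is a hypothetical helper, and `parse_fn` stands in for whatever call hands the file path to the NLM Ingest library.

```python
import os
import tempfile
from typing import Any, Callable, Dict


def parse_with_temp_file(
    pdf_bytes: bytes, parse_fn: Callable[[str], Dict[str, Any]]
) -> Dict[str, Any]:
    """Write the PDF to a temporary file, parse it, and always remove the file."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        # Take ownership of the descriptor so it is closed after writing.
        with os.fdopen(fd, "wb") as handle:
            handle.write(pdf_bytes)
        return parse_fn(path)
    finally:
        # Cleanup runs whether parsing succeeded or raised.
        os.remove(path)
```

Putting the removal in `finally` keeps temporary files from accumulating when parsing fails partway through.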
Implementation Details¶
The parser extends the `BaseParser` class:

```python
class NLMIngestParser(BaseParser):
    title = "NLM Ingest Parser"
    description = "Parses PDF documents using NLM Ingest library"
    supported_file_types = [FileTypeEnum.PDF]

    def _parse_document_impl(
        self, user_id: int, doc_id: int, **kwargs
    ) -> Optional[OpenContractDocExport]:
        # Implementation using NLM Ingest
        pass
```
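The split between a public `parse_document` and a private `_parse_document_impl` suggests a template-method pattern. The sketch below illustrates that pattern with a stand-in base class; it assumes `parse_document` simply delegates to the hook, which the real `BaseParser` may not do exactly. `SketchBaseParser` and `EchoParser` are hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class SketchBaseParser(ABC):
    """Stand-in for BaseParser, used only to illustrate the pattern."""

    def parse_document(
        self, user_id: int, doc_id: int, **kwargs: Any
    ) -> Optional[Dict[str, Any]]:
        # Public entry point: delegate to the subclass hook (an assumption).
        return self._parse_document_impl(user_id, doc_id, **kwargs)

    @abstractmethod
    def _parse_document_impl(
        self, user_id: int, doc_id: int, **kwargs: Any
    ) -> Optional[Dict[str, Any]]:
        """Subclasses implement the actual parsing here."""


class EchoParser(SketchBaseParser):
    """Toy subclass that returns its inputs instead of parsing a PDF."""

    def _parse_document_impl(
        self, user_id: int, doc_id: int, **kwargs: Any
    ) -> Optional[Dict[str, Any]]:
        return {"title": f"doc-{doc_id}", "page_count": 0, "options": kwargs}
```

With this layout, callers only ever use `parse_document`, and each parser supplies its own `_parse_document_impl`.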
Performance Considerations¶
- Memory Usage: Processes pages sequentially to keep peak memory use low
- Processing Time: Typically 2-5 seconds per page
- File Size: Can handle large PDF files efficiently
- Concurrent Processing: Thread-safe for parallel processing
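As a worked example of the per-page figure, the sketch below turns the 2-5 seconds-per-page range into a rough time budget for a batch processed sequentially. `estimate_seconds` is a hypothetical helper, and actual timings will depend on hardware and document content.

```python
from typing import Iterable, Tuple


def estimate_seconds(
    page_counts: Iterable[int], low: float = 2.0, high: float = 5.0
) -> Tuple[float, float]:
    """Return (best, worst) total seconds for documents parsed one after another."""
    total_pages = sum(page_counts)
    return total_pages * low, total_pages * high
```

For two documents of 10 and 30 pages, `estimate_seconds([10, 30])` gives `(80.0, 200.0)`, or roughly one and a half to three and a half minutes.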
Comparison with Docling Parser¶
| Feature | NLM Ingest | Docling |
|---|---|---|
| Speed | Faster | Slower |
| Accuracy | Good | Excellent |
| OCR Support | Limited | Full |
| Table Extraction | Good | Excellent |
| Memory Usage | Lower | Higher |
| Dependencies | Simpler | Complex |
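The comparison can be reduced to a simple selection rule: prefer Docling when OCR or complex layouts are involved, and NLM Ingest otherwise. The sketch below encodes that rule; `choose_parser` and the returned labels are hypothetical and are not identifiers used by the pipeline.

```python
def choose_parser(needs_ocr: bool, complex_layout: bool) -> str:
    """Pick a parser label from two document traits (illustrative rule only)."""
    if needs_ocr or complex_layout:
        return "docling"
    return "nlm_ingest"
```

A real deployment would likely weigh speed and memory limits as well, so treat this as a starting point.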
Best Practices¶
- Parser Selection
    - Use NLM Ingest for standard PDFs without OCR needs
    - Use Docling for complex layouts or scanned documents
- Configuration
    - Start with default settings
    - Enable table extraction only when needed
- Error Handling
    - Always check return values
    - Monitor logs for parsing errors
- Performance
    - Process large batches asynchronously
    - Monitor memory usage
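The error-handling and batching advice can be combined in one helper that parses documents on a thread pool, logs failures, and treats a `None` result as a failure. This is a sketch only: `parse_batch` is a hypothetical name, and `parse_fn` is assumed to accept `user_id` and `doc_id` keyword arguments as `parse_document` does.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

logger = logging.getLogger(__name__)


def parse_batch(
    parse_fn: Callable[..., Optional[Dict[str, Any]]],
    user_id: int,
    doc_ids: Iterable[int],
    max_workers: int = 4,
) -> Tuple[Dict[int, Dict[str, Any]], List[int]]:
    """Parse documents in parallel; return (results by doc_id, failed doc_ids)."""

    def safe_parse(doc_id: int) -> Optional[Dict[str, Any]]:
        try:
            return parse_fn(user_id=user_id, doc_id=doc_id)
        except Exception:
            # Log the traceback and report the document as failed.
            logger.exception("Parsing failed for document %s", doc_id)
            return None

    ids = list(doc_ids)
    results: Dict[int, Dict[str, Any]] = {}
    failed: List[int] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input ids.
        for doc_id, result in zip(ids, pool.map(safe_parse, ids)):
            if result is None:
                failed.append(doc_id)
            else:
                results[doc_id] = result
    return results, failed
```

Keeping `max_workers` small limits how many documents are held in memory at once, which matters for large PDFs.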
Troubleshooting¶
Common Issues¶
- Import Errors: `ImportError: Cannot import nlm_ingestor`
    - Install NLM Ingest: `pip install nlm-ingestor`
    - Check Python version compatibility
- Memory Issues: `MemoryError during parsing`
    - Reduce batch size
    - Increase available memory
    - Use page-by-page processing
- Layout Detection Failures: `Warning: Could not detect layout`
    - Try different `parse_method` settings
    - Check PDF structure/format
    - Consider using the Docling parser
- Text Extraction Issues: `Error: No text extracted`
    - Check if the PDF is scanned (needs OCR)
    - Verify the PDF is not corrupted
    - Try forcing text extraction mode
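For the "no text extracted" case, a quick check on the parser output can flag documents that probably need OCR. The heuristic below is an assumption, not parser behaviour: it treats a result as empty when `content` is blank and no page carries tokens. `has_extractable_text` is a hypothetical helper.

```python
from typing import Any, Dict


def has_extractable_text(export: Dict[str, Any]) -> bool:
    """Return True if the result has non-blank content or at least one token."""
    if (export.get("content") or "").strip():
        return True
    pages = export.get("pawls_file_content") or []
    return any(page.get("tokens") for page in pages)
```

Documents for which this returns `False` are candidates for the Docling parser, which the comparison lists as having full OCR support.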
Dependencies¶
Required Python packages:

- `nlm-ingestor`: Core parsing library
- `pdfplumber`: PDF processing backend
- `pypdf2`: Alternative PDF backend
- `pillow`: Image processing support
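To confirm which of these packages are present in an environment, the standard library's `importlib.metadata` can report installed versions. This is a small diagnostic sketch; `installed_versions` and `REQUIRED_PACKAGES` are hypothetical names, and the package list is taken from this section.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

REQUIRED_PACKAGES = ("nlm-ingestor", "pdfplumber", "pypdf2", "pillow")


def installed_versions(
    packages: Iterable[str] = REQUIRED_PACKAGES,
) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Any entry mapped to `None` points to a missing dependency, which is the usual cause of the import error described under Troubleshooting.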
Limitations¶
- Limited OCR support (use Docling for OCR)
- May struggle with complex layouts
- Table extraction less sophisticated than Docling
- No support for non-PDF formats