# Comprehensive Guide to Document Analyzers

## Overview
OpenContracts supports document analyzers that run as Celery tasks within the main application. These analyzers can automatically process documents and create annotations.
## 1. Database Structure
The `Analyzer` model has these key fields:

- `id`: CharField (primary key, max length 1024)
- `manifest`: JSON field for analyzer configuration
- `description`: Text field
- `disabled`: Boolean flag
- `is_public`: Boolean for visibility
- `icon`: File field
- `task_name`: CharField pointing to the Celery task (required)
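For orientation, here is a minimal sketch of what a Django model with these fields could look like. This is illustrative only, not the actual OpenContracts model definition; field options such as defaults and `blank`/`null` settings are assumptions.

```python
from django.db import models


class Analyzer(models.Model):
    # Illustrative sketch; the real OpenContracts model carries
    # additional fields (creator, timestamps, etc.) and options.
    id = models.CharField(primary_key=True, max_length=1024)
    manifest = models.JSONField(null=True, blank=True)
    description = models.TextField(blank=True)
    disabled = models.BooleanField(default=False)
    is_public = models.BooleanField(default=False)
    icon = models.FileField(blank=True)
    task_name = models.CharField(max_length=1024)  # dotted path to the Celery task
```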
## 2. How Analyzers Work

### Task-based Analyzers
- Run within the main application as Celery tasks
- Defined by the `task_name` field, which points to a Python callable
- Integrated with the main application environment
- Have access to all application models and utilities, as sketched below
- Full implementation documentation is available in register-doc-analyzer.md
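Because a task-based analyzer runs in-process, it can use the Django ORM directly. A minimal sketch (the `Document` import path, its `title` field, and the decorator's import are assumptions; check your codebase):

```python
from opencontractserver.documents.models import Document  # assumed import path


@doc_analyzer_task()  # imported from the OpenContracts codebase
def title_reporter(doc_id, analysis_id, corpus_id=None, **kwargs):
    # Direct ORM access works because the task runs inside the application.
    doc = Document.objects.get(id=doc_id)
    return [], [], [{"data": {"title": doc.title}}], True
```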
## 3. Analysis Process

### Analysis Flow
1. System creates an Analysis record
2. Celery task is dispatched using the `task_name`
3. Analysis runs in-process within the application
4. Results are stored directly in the database
5. Annotations and metadata are created
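Dispatching a task by its stored dotted name might look roughly like the following. This is a simplified sketch using Celery's standard `send_task` API, not the actual OpenContracts orchestration code; the helper name and argument plumbing are hypothetical.

```python
from celery import current_app


def dispatch_analysis(analyzer, analysis, document, corpus=None):
    # Hypothetical helper: route the registered task by its dotted name.
    # Keyword arguments mirror the required task parameters (see section 6).
    current_app.send_task(
        analyzer.task_name,
        kwargs={
            "doc_id": str(document.id),
            "analysis_id": str(analysis.id),
            "corpus_id": str(corpus.id) if corpus else None,
        },
    )
```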
## 4. Permissions & Security

Granular permissions are available:

- `permission_analyzer`
- `publish_analyzer`
- `create_analyzer`
- `read_analyzer`
- `update_analyzer`
- `remove_analyzer`

Each analysis tracks:

- Creator
- Public/private status
- Document access permissions
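If object-level permissions are handled with django-guardian (an assumption here; it is a common choice for per-object permissions in Django apps), granting and checking one of these permissions would look like this:

```python
from guardian.shortcuts import assign_perm

# Assumption: object-level permissions via django-guardian.
assign_perm("read_analyzer", user, analyzer)

if user.has_perm("read_analyzer", analyzer):
    ...  # this user may read this specific analyzer
```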
## 5. Implementation Requirements

### Task-based Analyzer Requirements

- Valid Python import path in `task_name`
- Task must exist in the codebase
- Must use the `@doc_analyzer_task` decorator
- Must return valid analysis results
- See register-doc-analyzer.md for a detailed implementation guide
## 6. Analyzer Registration

### Database Creation
```python
analyzer = Analyzer.objects.create(
    id="task.analyzer.unique.id",  # Required unique identifier
    description="Document Analyzer Description",
    task_name="opencontractserver.tasks.module.task_name",  # Python import path
    creator=user,
    manifest={},  # Optional configuration
    is_public=True,  # Optional visibility setting
)
```
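Because `id` is the primary key, registering the same analyzer twice raises an integrity error. For idempotent setup (e.g., in a data migration), Django's `get_or_create` avoids duplicate registration; a sketch:

```python
analyzer, created = Analyzer.objects.get_or_create(
    id="task.analyzer.unique.id",
    defaults={
        "description": "Document Analyzer Description",
        "task_name": "opencontractserver.tasks.module.task_name",
        "creator": user,
        "manifest": {},
        "is_public": True,
    },
)
```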
### Implementation Requirements

- Must be decorated with `@doc_analyzer_task()`
- Must accept these parameters:

  ```python
  doc_id: str       # Document ID to analyze
  analysis_id: str  # Analysis record ID
  corpus_id: str    # Optional corpus ID
  ```

- Must return a tuple of four elements:

  ```python
  (
      doc_annotations: List[str],                    # Document-level labels
      span_label_pairs: List[Tuple[TextSpan, str]],  # Text annotations with labels
      metadata: List[Dict[str, Any]],                # Must include a 'data' key
      task_pass: bool                                # Success indicator
  )
  ```
### Example Implementation

```python
@doc_analyzer_task()
def my_analyzer_task(doc_id, analysis_id, corpus_id=None, **kwargs):
    # Task implementation goes here
    results = {"processed": True}  # placeholder payload
    return [], [], [{"data": results}], True
```
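A slightly fuller sketch that also returns a span annotation. The span dictionary shape shown here (`id`, `start`, `end`, `text`) is an assumption about `TextSpan`; confirm it against the type definitions in your OpenContracts version.

```python
@doc_analyzer_task()
def keyword_flagger(doc_id, analysis_id, corpus_id=None, **kwargs):
    # Hypothetical analyzer: flag a fixed keyword occurrence.
    # The span shape (id/start/end/text) is an assumed TextSpan layout.
    span = {"id": "1", "start": 0, "end": 9, "text": "Agreement"}
    span_label_pairs = [(span, "KEYWORD")]
    metadata = [{"data": {"keywords_found": 1}}]
    return ["contract"], span_label_pairs, metadata, True
```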
### Validation Rules

- Task name must be unique
- Task must exist at the specified path
- Must use the `@doc_analyzer_task` decorator
- Return values must match the expected schema
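One way to check that a `task_name` resolves to a callable (a sketch using Django's `import_string`, not the actual OpenContracts validation code):

```python
from django.utils.module_loading import import_string


def task_name_resolves(task_name: str) -> bool:
    # Returns True if the dotted path imports and points at a callable.
    try:
        task = import_string(task_name)
    except ImportError:
        return False
    return callable(task)
```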
### Execution Flow

1. Analysis is created referencing a task-based analyzer
2. System loads the task by name
3. Task is executed through Celery
4. Results are processed and stored
5. Analysis is marked complete
### Available Features
- Access to document content (PDF, text extracts, PAWLS tokens)
- Annotation and label creation
- Corpus-wide analysis integration
- Automatic result storage
- Error handling and retries