# Metadata Fields

## Overview

OpenContracts provides a unified data model for both extracted data and manual metadata entry at the corpus level. This feature allows you to define structured metadata schemas for documents within a corpus, ensuring consistent data collection and validation across your document sets.
## Architecture

The metadata system is built on the same foundation as the data extraction system, providing a DRY (Don't Repeat Yourself) approach to structured data:

### Core Components
- Fieldset: Container for columns, with different uses for metadata vs extraction
  - Metadata Fieldsets: Linked to a corpus via `OneToOneField(corpus)`, automatically created per corpus
  - Extraction Fieldsets: Independent fieldsets for data extraction tasks, used in Extract operations
- Columns: Define individual fields with data types, validation rules, and display configuration
  - Metadata Columns: Have `is_manual_entry=True` and `data_type` set
  - Extraction Columns: Have `is_manual_entry=False` and use `query`, `match_text`, and `output_type`
- Datacells: Store the actual values for both metadata and extracted data
  - Metadata Datacells: Have `extract=NULL` and are manually created and edited
  - Extraction Datacells: Have `extract` set and are created by extraction tasks
- Unified Validation: Consistent validation logic shared between extracted data and manual metadata
### Key Benefits

- Unified Architecture: The same models handle both extracted data and manual metadata
- Type Safety: Strong typing with comprehensive validation
- Corpus-Level Schema: Consistent metadata structure across all documents in a corpus
- Automatic Management: Metadata fieldsets are created automatically when the first metadata column is added to a corpus
## Fieldset Relationships: Metadata vs Extraction

OpenContracts uses the same `Fieldset` model for both metadata schemas and extraction schemas, but the two serve different purposes:
### Metadata Fieldsets

- Relationship: `OneToOneField` between Corpus and Fieldset, accessed via `corpus.metadata_schema`
- Creation: Created automatically when the first metadata column is added to a corpus
- Naming: Typically named "{Corpus Title} Metadata"
- Purpose: Define the manual metadata entry schema for all documents in the corpus
- Columns: All columns have `is_manual_entry=True` and a `data_type` value
- Lifecycle: Tied to the corpus lifecycle; deleted when the corpus is deleted
### Extraction Fieldsets

- Relationship: No direct corpus relationship; used via Extract objects
- Creation: Created manually or via extraction workflows
- Naming: User-defined or auto-generated from the extraction name
- Purpose: Define what data to extract from documents using AI/NLP
- Columns: All columns have `is_manual_entry=False` and extraction configuration (`query`, `match_text`, etc.)
- Lifecycle: Independent; can be reused across multiple extractions
### Key Differences

| Aspect | Metadata Fieldsets | Extraction Fieldsets |
|---|---|---|
| Corpus Link | `corpus.metadata_schema` | None (linked via Extract) |
| Column Type | `is_manual_entry=True` | `is_manual_entry=False` |
| Data Source | Manual user input | AI/NLP extraction |
| Schema Fields | `data_type`, `validation_config` | `query`, `match_text`, `output_type` |
| Datacell Link | `extract=NULL` | `extract` references an Extract |
| Use Case | Document metadata | Document data extraction |
## Supported Data Types

The system supports 12 data types for metadata fields:

| Data Type | Description | Example Values | Validation |
|---|---|---|---|
| STRING | Single-line text | "John Doe", "Contract-123" | Length, regex patterns |
| TEXT | Multi-line text | Long descriptions, notes | Length constraints |
| BOOLEAN | True/false values | true, false | Type validation |
| INTEGER | Whole numbers | 42, -10, 0 | Range validation |
| FLOAT | Decimal numbers | 3.14, -0.5, 100.0 | Range validation |
| DATE | Date values (YYYY-MM-DD) | "2024-01-15" | Date format, range |
| DATETIME | Date and time values (ISO 8601) | "2024-01-15T10:30:00Z" | DateTime format, range |
| URL | Web addresses | "https://example.com" | URL format validation |
| EMAIL | Email addresses | "user@example.com" | Email format validation |
| CHOICE | Single selection from predefined options | "Draft", "Final" | Choice validation |
| MULTI_CHOICE | Multiple selections from predefined options | ["Legal", "Financial"] | Multi-choice validation |
| JSON | Arbitrary JSON data | {"key": "value"} | JSON format validation |
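The per-type checks in the table above can be sketched in plain Python. This is an illustrative mapping, not the OpenContracts implementation; the `VALIDATORS` registry and `is_valid` helper are hypothetical names.

```python
import re
from datetime import datetime

def _is_date(value: object) -> bool:
    """Return True if value is a YYYY-MM-DD date string."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical per-type checks mirroring the table above.
VALIDATORS = {
    "STRING": lambda v: isinstance(v, str),
    "BOOLEAN": lambda v: isinstance(v, bool),
    # bool is a subclass of int in Python, so exclude it explicitly
    "INTEGER": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "FLOAT": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "DATE": _is_date,
    "EMAIL": lambda v: isinstance(v, str)
    and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}

def is_valid(data_type: str, value: object) -> bool:
    """Check a candidate value against its declared data type."""
    check = VALIDATORS.get(data_type)
    return check(value) if check else False
```

Range, length, and choice constraints layer on top of these basic type checks via the column configuration described below.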
## Column Configuration

Each metadata column can be configured with constraints and options through the `validation_config` field:

### Common Configuration Options

```json
{
  "required": true,                             // Whether the field is mandatory
  "help_text": "Enter the author's full name",  // Help text shown to users
  "default_value": "Unknown"                    // Default value if none is provided
}
```
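As a rough sketch of how these common options might behave at entry time (the `apply_common_options` helper is illustrative, not an OpenContracts API):

```python
def apply_common_options(value, config):
    """Resolve a submitted value against 'required' and 'default_value'.

    Returns (resolved_value, errors). Illustrative only.
    """
    if value is None:
        # A configured default fills in for a missing value
        if "default_value" in config:
            return config["default_value"], []
        # Otherwise a required field with no value is an error
        if config.get("required"):
            return None, ["This field is required"]
    return value, []

print(apply_common_options(None, {"required": True, "default_value": "Unknown"}))
# → ('Unknown', [])
```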
### Type-Specific Configuration

#### String/Text Fields

```json
{
  "min_length": 3,                      // Minimum character length
  "max_length": 100,                    // Maximum character length
  "regex_pattern": "^[A-Z]{2}-\\d{4}$"  // Regex validation pattern
}
```

#### Numeric Fields (Integer/Float)

```json
{
  "min_value": 0,    // Minimum allowed value
  "max_value": 100   // Maximum allowed value
}
```

#### Choice Fields

```json
{
  "choices": ["Draft", "Review", "Final"]  // Available options (required)
}
```

#### Date/DateTime Fields

```json
{
  "min_date": "2024-01-01",  // Earliest allowed date
  "max_date": "2024-12-31"   // Latest allowed date
}
```
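The string constraints above can be applied with a few lines of Python. Here `check_string` is a hypothetical helper, shown only to make the semantics of the config concrete:

```python
import re

def check_string(value: str, config: dict) -> list[str]:
    """Return a list of human-readable constraint violations (empty if valid)."""
    errors = []
    if len(value) < config.get("min_length", 0):
        errors.append(f"must be at least {config['min_length']} characters")
    if "max_length" in config and len(value) > config["max_length"]:
        errors.append(f"must be at most {config['max_length']} characters")
    pattern = config.get("regex_pattern")
    if pattern is not None and re.fullmatch(pattern, value) is None:
        errors.append("does not match the required pattern")
    return errors

config = {"min_length": 3, "max_length": 100, "regex_pattern": r"^[A-Z]{2}-\d{4}$"}
print(check_string("AB-1234", config))  # → []
```

Numeric, choice, and date configs would be applied analogously against their own keys.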
## Database Schema

### Column Model Fields

```python
class Column(models.Model):
    name = models.CharField(max_length=256)  # Field name
    data_type = models.CharField(  # Data type for metadata columns
        choices=METADATA_DATA_TYPES, max_length=32, null=True, blank=True
    )
    validation_config = NullableJSONField()  # Validation rules
    is_manual_entry = models.BooleanField(default=False)  # True for metadata, False for extraction
    default_value = NullableJSONField()  # Default value
    help_text = models.TextField(null=True, blank=True)  # User guidance
    display_order = models.IntegerField()  # Display ordering
    fieldset = models.ForeignKey(Fieldset)  # Parent fieldset

    # Legacy extraction fields (null for metadata columns)
    query = models.TextField(null=True, blank=True)
    match_text = models.TextField(null=True, blank=True)
    output_type = models.TextField()  # Required for all columns
    task_name = models.CharField(max_length=1024)  # Task used for extraction
```
### Datacell Model Fields

```python
class Datacell(models.Model):
    data = NullableJSONField()  # The actual value, structured as {"value": actual_value}
    document = models.ForeignKey(Document)  # Associated document
    column = models.ForeignKey(Column)  # Field definition
    extract = models.ForeignKey(Extract, null=True, blank=True)  # Null for metadata, set for extracted data
    creator = models.ForeignKey(User)  # Who created it

    # Additional fields
    data_definition = models.TextField()  # Description of the data
    started = models.DateTimeField(null=True, blank=True)  # Processing timestamps
    completed = models.DateTimeField(null=True, blank=True)
    failed = models.DateTimeField(null=True, blank=True)
    approved_by = models.ForeignKey(User, null=True, blank=True, related_name="approved_cells")
    rejected_by = models.ForeignKey(User, null=True, blank=True, related_name="rejected_cells")
    corrected_data = NullableJSONField(null=True, blank=True)
```
### Constraints

- Unique Metadata: One metadata value per document/column combination (enforced by the `unique_manual_metadata_per_doc_column` constraint)
- Validation on Save: Automatic validation based on column configuration, via the `Datacell.clean()` method
- Corpus Isolation: Metadata schemas are corpus-specific through the fieldset-corpus relationship
- Extract Nullability: Metadata datacells have `extract=NULL`; extracted datacells have `extract` set
- Manual Entry Flag: Only columns with `is_manual_entry=True` can be used for manual metadata entry
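Put together, the uniqueness constraint likely resembles the following model `Meta` fragment. This is a sketch reconstructed from the documented constraint name and the `Q(extract__isnull=True)` condition; the exact field list is an assumption, not copied from the source:

```python
from django.db import models
from django.db.models import Q

class Meta:
    constraints = [
        # One manual metadata value per document/column pair; the condition
        # scopes the constraint to metadata datacells (where extract is NULL).
        models.UniqueConstraint(
            fields=["document", "column"],
            condition=Q(extract__isnull=True),
            name="unique_manual_metadata_per_doc_column",
        )
    ]
```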
## Validation System

The system performs comprehensive validation at multiple levels:

### Model-Level Validation

Validation occurs in the `Datacell.clean()` method and includes:
- Type Checking: Ensures values match the column's data type
- Constraint Validation: Applies min/max values, string lengths, regex patterns
- Required Field Validation: Enforces required fields
- Choice Validation: Ensures selected values are from allowed options
- Format Validation: Validates URLs, emails, dates, and JSON structure
### Validation Methods

```python
def _validate_manual_entry(self):
    """Comprehensive validation for manual metadata entries."""
    # Dispatch to type-specific validation
    if self.column.data_type == 'BOOLEAN':
        self._validate_boolean_value()
    elif self.column.data_type in ['INTEGER', 'FLOAT']:
        self._validate_numeric_range()
    elif self.column.data_type in ['STRING', 'TEXT']:
        self._validate_string_constraints()
    # ... and so on for the remaining data types
```
### Example Validation Errors

```python
# String too short
ValidationError: "Product Code must be at least 3 characters"

# Invalid choice
ValidationError: "Document Type values must be from: ['Contract', 'Invoice', 'Report']"

# Invalid date format
ValidationError: "Due Date must be in YYYY-MM-DD format"

# Required field missing
ValidationError: "Author is required"

# Type mismatch
ValidationError: "Status must be a boolean value"
```
## GraphQL API Integration

The metadata system exposes a complete GraphQL API for managing schemas and values. See Metadata API Examples for detailed usage examples.

### Key Mutations

- `createMetadataColumn`: Define new metadata fields
- `updateMetadataColumn`: Modify existing metadata fields
- `setMetadataValue`: Set metadata values for documents
- `deleteMetadataValue`: Remove metadata values
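To make the request shape concrete, here is a sketch of building a `setMetadataValue` request body from Python. The argument names (`documentId`, `columnId`, `value`), the scalar types, and the selected `ok` field are assumptions about the schema, not confirmed signatures; only the mutation name comes from the list above.

```python
import json

# Hypothetical mutation document; argument names and types are assumed.
SET_METADATA_VALUE = """
mutation SetMetadataValue($documentId: ID!, $columnId: ID!, $value: GenericScalar!) {
  setMetadataValue(documentId: $documentId, columnId: $columnId, value: $value) {
    ok
  }
}
"""

def build_payload(document_id: str, column_id: str, value) -> str:
    """Serialize a GraphQL request body for the setMetadataValue mutation."""
    return json.dumps({
        "query": SET_METADATA_VALUE,
        "variables": {"documentId": document_id, "columnId": column_id, "value": value},
    })

# Placeholder IDs for illustration; use real global IDs from your instance.
payload = build_payload("RG9jdW1lbnQ6MQ==", "Q29sdW1uOjE=", "Final")
```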
### Key Queries

- `corpusMetadataColumns`: Get the metadata schema for a corpus
- `documentMetadataDatacells`: Get metadata values for a document
- `metadataCompletionStatusV2`: Check completion status
## Best Practices

### Schema Design

- Plan Your Schema: Design your metadata schema before adding documents
- Use Specific Types: Choose the most appropriate data type (e.g., EMAIL instead of STRING)
- Provide Guidance: Add clear `help_text` and sensible default values
- Order Fields: Use `display_order` for a logical field arrangement

### Data Entry

- Validate Early: Set appropriate constraints to catch errors at entry time
- Use Defaults: Provide sensible default values for optional fields
- Required vs Optional: Carefully consider which fields should be required

### Performance

- Bulk Operations: Use bulk operations when setting metadata on multiple documents
- Indexing: Metadata fields are automatically indexed for efficient querying
- Lazy Loading: Validation occurs on save, not on read
## Migration from Legacy Systems

When migrating from annotation-based metadata systems:

1. Extract Existing Data: Export current metadata values
2. Define New Schema: Create columns matching the existing metadata fields
3. Migrate Values: Transform annotation data to the datacell format
4. Cleanup: Remove the old annotation-based metadata
## Integration Points

### Document Import/Export

- Metadata schemas are included in corpus export/import
- Values are preserved during document transfer
- Validation ensures data integrity during import

### Permissions

- Metadata access inherits from corpus permissions
- Fine-grained control over who can modify schemas vs values
- Audit trail for all metadata changes

### Search and Filtering

- Metadata values are searchable and filterable
- Type-aware search (e.g., date range queries)
- Integration with corpus-wide search functionality
## Technical Notes

### Database Constraints

- `UniqueConstraint` ensures one metadata value per document/column
- `Q(extract__isnull=True)` distinguishes metadata from extracted data
- Foreign key relationships maintain data integrity
### Performance Considerations

- Metadata queries are optimized with proper indexing
- Bulk operations are supported for large-scale metadata entry
- Lazy validation reduces read-time overhead

### Extensibility

- New data types can be added by extending `METADATA_DATA_TYPES`
- Custom validation can be added to the validation chain
- The GraphQL schema automatically reflects model changes
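As an example of the custom-validation extensibility point, a validator might be hooked into a validation chain like this. The registry, the decorator, and the PHONE type are all illustrative assumptions, not OpenContracts APIs:

```python
import re

# Hypothetical registry for custom, per-data-type validators.
CUSTOM_VALIDATORS = {}

def register_validator(data_type: str):
    """Decorator that registers a validator for a (possibly new) data type."""
    def wrap(fn):
        CUSTOM_VALIDATORS[data_type] = fn
        return fn
    return wrap

@register_validator("PHONE")  # PHONE is a made-up type for illustration
def validate_phone(value) -> bool:
    """Accept digits, spaces, and dashes, with an optional leading '+'."""
    return isinstance(value, str) and re.fullmatch(r"\+?[0-9\- ]{7,15}", value) is not None

print(validate_phone("+1 555-123-4567"))  # → True
```

A validation chain would then consult `CUSTOM_VALIDATORS` for any data type it does not handle natively.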