Metadata Fields¶
Overview¶
OpenContracts provides a unified data model for both extracted data and manual metadata entry at the corpus level. This feature allows you to define structured metadata schemas for documents within a corpus, ensuring consistent data collection and validation across your document sets.
Architecture¶
The metadata system is built on the same foundation as the data extraction system, providing a DRY (Don't Repeat Yourself) approach to structured data:
Core Components¶
- Fieldset: Container for columns, with different uses for metadata vs extraction
 - Metadata Fieldsets: Linked to corpus via 
OneToOneField(corpus), automatically created per corpus - Extraction Fieldsets: Independent fieldsets for data extraction tasks, used in Extract operations
 - Columns: Define individual fields with data types, validation rules, and display configuration
 - Metadata Columns: Have 
is_manual_entry=Trueanddata_typeset - Extraction Columns: Have 
is_manual_entry=Falseand usequery,match_text,output_type - Datacells: Store the actual values for both metadata and extracted data
 - Metadata Datacells: Have 
extract=NULLand are manually created/edited - Extraction Datacells: Have 
extractset and are created by extraction tasks - Unified Validation: Consistent validation logic shared between extracted data and manual metadata
 
Key Benefits¶
- Unified Architecture: Same models handle both extracted data and manual metadata
 - Type Safety: Strong typing with comprehensive validation
 - Corpus-Level Schema: Consistent metadata structure across all documents in a corpus
 - Automatic Management: Metadata fieldsets are automatically created when first metadata column is added to a corpus
 
Fieldset Relationships: Metadata vs Extraction¶
OpenContracts uses the same Fieldset model for both metadata schemas and extraction schemas, but they serve different purposes:
Metadata Fieldsets¶
- Relationship: 
OneToOneFieldbetween Corpus and Fieldset viacorpus.metadata_schema - Creation: Automatically created when first metadata column is added to a corpus
 - Naming: Typically named "{Corpus Title} Metadata"
 - Purpose: Define manual metadata entry schema for all documents in the corpus
 - Columns: All columns have 
is_manual_entry=Trueanddata_typevalues - Lifecycle: Tied to the corpus lifecycle, deleted when corpus is deleted
 
Extraction Fieldsets¶
- Relationship: No direct corpus relationship, used via Extract objects
 - Creation: Created manually or via extraction workflows
 - Naming: User-defined or auto-generated based on extraction name
 - Purpose: Define what data to extract from documents using AI/NLP
 - Columns: All columns have 
is_manual_entry=Falseand extraction configuration (query,match_text, etc.) - Lifecycle: Independent, can be reused across multiple extractions
 
Key Differences¶
| Aspect | Metadata Fieldsets | Extraction Fieldsets | 
|---|---|---|
| Corpus Link | corpus.metadata_schema |  None (via Extract) | 
| Column Type | is_manual_entry=True |  is_manual_entry=False |  
| Data Source | Manual user input | AI/NLP extraction | 
| Schema Fields | data_type, validation_config |  query, match_text, output_type |  
| Datacell Link | extract=NULL |  extract references Extract |  
| Use Case | Document metadata | Document data extraction | 
Supported Data Types¶
The system supports 12 different data types for metadata fields:
| Data Type | Description | Example Values | Validation | 
|---|---|---|---|
STRING |  Single-line text | "John Doe", "Contract-123" | Length, regex patterns | 
TEXT |  Multi-line text | Long descriptions, notes | Length constraints | 
BOOLEAN |  True/False values | true, false | Type validation | 
INTEGER |  Whole numbers | 42, -10, 0 | Range validation | 
FLOAT |  Decimal numbers | 3.14, -0.5, 100.0 | Range validation | 
DATE |  Date values (YYYY-MM-DD) | "2024-01-15" | Date format, range | 
DATETIME |  Date and time values (ISO 8601) | "2024-01-15T10:30:00Z" | DateTime format, range | 
URL |  Web addresses | "https://example.com" | URL format validation | 
EMAIL |  Email addresses | "user@example.com" | Email format validation | 
CHOICE |  Single selection from predefined options | "Draft", "Final" | Choice validation | 
MULTI_CHOICE |  Multiple selections from predefined options | ["Legal", "Financial"] | Multi-choice validation | 
JSON |  Arbitrary JSON data | {"key": "value"} | JSON format validation | 
Column Configuration¶
Each metadata column can be configured with various constraints and options through the validation_config field:
Common Configuration Options¶
{
  "required": true,              // Whether the field is mandatory
  "help_text": "Enter the author's full name",  // Help text for users
  "default_value": "Unknown"    // Default value if not provided
}
Type-Specific Configuration¶
String/Text Fields¶
{
  "min_length": 3,              // Minimum character length
  "max_length": 100,            // Maximum character length
  "regex_pattern": "^[A-Z]{2}-\\d{4}$"  // Regex validation pattern
}
Numeric Fields (Integer/Float)¶
{
  "min_value": 0,               // Minimum allowed value
  "max_value": 100              // Maximum allowed value
}
Choice Fields¶
{
  "choices": ["Draft", "Review", "Final"]  // Available options (required)
}
Date/DateTime Fields¶
{
  "min_date": "2024-01-01",     // Earliest allowed date
  "max_date": "2024-12-31"      // Latest allowed date
}
Database Schema¶
Column Model Fields¶
class Column(models.Model):
    name = models.CharField(max_length=256)          # Field name
    data_type = models.CharField(choices=METADATA_DATA_TYPES, max_length=32, null=True, blank=True)  # Data type for metadata
    validation_config = NullableJSONField()         # Validation rules
    is_manual_entry = models.BooleanField(default=False)  # True for metadata, False for extraction
    default_value = NullableJSONField()             # Default value
    help_text = models.TextField(null=True, blank=True)  # User guidance
    display_order = models.IntegerField()           # Display ordering
    fieldset = models.ForeignKey(Fieldset)          # Parent fieldset
    # Legacy extraction fields (null for metadata columns)
    query = models.TextField(null=True, blank=True)
    match_text = models.TextField(null=True, blank=True)
    output_type = models.TextField()                # Required for all columns
    task_name = models.CharField(max_length=1024)   # Task for extraction
Datacell Model Fields¶
class Datacell(models.Model):
    data = NullableJSONField()                      # The actual value (structured as {"value": actual_value})
    document = models.ForeignKey(Document)          # Associated document
    column = models.ForeignKey(Column)              # Field definition
    extract = models.ForeignKey(Extract, null=True, blank=True) # Null for metadata, set for extracted data
    creator = models.ForeignKey(User)               # Who created it
    # Additional fields
    data_definition = models.TextField()            # Description of the data
    started = models.DateTimeField(null=True, blank=True)     # Processing timestamps
    completed = models.DateTimeField(null=True, blank=True)
    failed = models.DateTimeField(null=True, blank=True)
    approved_by = models.ForeignKey(User, null=True, blank=True, related_name="approved_cells")
    rejected_by = models.ForeignKey(User, null=True, blank=True, related_name="rejected_cells")
    corrected_data = NullableJSONField(null=True, blank=True)
Constraints¶
- Unique Metadata: One metadata value per document/column combination (enforced by 
unique_manual_metadata_per_doc_columnconstraint) - Validation on Save: Automatic validation based on column configuration through 
Datacell.clean()method - Corpus Isolation: Metadata schemas are corpus-specific through fieldset-corpus relationship
 - Extract Nullability: Metadata datacells have 
extract=NULL, extracted datacells haveextractset - Manual Entry Flag: Only columns with 
is_manual_entry=Truecan be used for manual metadata entry 
Validation System¶
The system performs comprehensive validation at multiple levels:
Model-Level Validation¶
Validation occurs in the Datacell.clean() method and includes:
- Type Checking: Ensures values match the column's data type
 - Constraint Validation: Applies min/max values, string lengths, regex patterns
 - Required Field Validation: Enforces required fields
 - Choice Validation: Ensures selected values are from allowed options
 - Format Validation: Validates URLs, emails, dates, and JSON structure
 
Validation Methods¶
def _validate_manual_entry(self):
    """Comprehensive validation for manual metadata entries."""
    # Type-specific validation
    if self.column.data_type == 'BOOLEAN':
        self._validate_boolean_value()
    elif self.column.data_type in ['INTEGER', 'FLOAT']:
        self._validate_numeric_range()
    elif self.column.data_type in ['STRING', 'TEXT']:
        self._validate_string_constraints()
    # ... and so on
Example Validation Errors¶
# String too short
ValidationError: "Product Code must be at least 3 characters"
# Invalid choice
ValidationError: "Document Type values must be from: ['Contract', 'Invoice', 'Report']"
# Invalid date format
ValidationError: "Due Date must be in YYYY-MM-DD format"
# Required field missing
ValidationError: "Author is required"
# Type mismatch
ValidationError: "Status must be a boolean value"
GraphQL API Integration¶
The metadata system exposes a complete GraphQL API for managing schemas and values. See Metadata API Examples for detailed usage examples.
Key Mutations¶
createMetadataColumn: Define new metadata fieldsupdateMetadataColumn: Modify existing metadata fieldssetMetadataValue: Set metadata values for documentsdeleteMetadataValue: Remove metadata values
Key Queries¶
corpusMetadataColumns: Get metadata schema for a corpusdocumentMetadataDatacells: Get metadata values for a documentmetadataCompletionStatusV2: Check completion status
Best Practices¶
Schema Design¶
- Plan Your Schema: Design your metadata schema before adding documents
 - Use Specific Types: Choose the most appropriate data type (e.g., EMAIL instead of STRING)
 - Provide Guidance: Add clear help_text and default values
 - Order Fields: Use display_order for logical field arrangement
 
Data Entry¶
- Validate Early: Set appropriate constraints to catch errors at entry time
 - Use Defaults: Provide sensible default values for optional fields
 - Required vs Optional: Carefully consider which fields should be required
 
Performance¶
- Bulk Operations: Use bulk operations for setting metadata on multiple documents
 - Indexing: Metadata fields are automatically indexed for efficient querying
 - Lazy Loading: Validation occurs on save, not on read
 
Migration from Legacy Systems¶
When migrating from annotation-based metadata systems:
- Extract Existing Data: Export current metadata values
 - Define New Schema: Create columns matching existing metadata fields
 - Migrate Values: Transform annotation data to datacell format
 - Cleanup: Remove old annotation-based metadata
 
Integration Points¶
Document Import/Export¶
- Metadata schemas are included in corpus export/import
 - Values are preserved during document transfer
 - Validation ensures data integrity during import
 
Permissions¶
- Metadata access inherits from corpus permissions
 - Fine-grained control over who can modify schemas vs values
 - Audit trail for all metadata changes
 
Search and Filtering¶
- Metadata values are searchable and filterable
 - Type-aware search (e.g., date range queries)
 - Integration with corpus-wide search functionality
 
Technical Notes¶
Database Constraints¶
UniqueConstraintensures one metadata value per document/columnQ(extract__isnull=True)identifies metadata vs extracted data- Foreign key relationships maintain data integrity
 
Performance Considerations¶
- Metadata queries are optimized with proper indexing
 - Bulk operations supported for large-scale metadata entry
 - Lazy validation reduces read-time overhead
 
Extensibility¶
- New data types can be added by extending 
METADATA_DATA_TYPES - Custom validation can be added to the validation chain
 - GraphQL schema automatically reflects model changes