Data Extraction Tutorial¶
This tutorial walks through extracting structured data from documents using OpenContracts' extraction system.
Prerequisites¶
- OpenContracts instance running with Celery workers
- Documents uploaded to a corpus
- Admin or appropriate permissions for creating fieldsets
Step 1: Create a Fieldset¶
A fieldset defines what data you want to extract from documents.
Via Django Admin¶
- Navigate to Admin → Extracts → Fieldsets
- Click "Add Fieldset"
- Enter:
- Name: "Contract Terms"
- Description: "Extract key terms from contracts"
- Save
Via GraphQL¶
mutation CreateContractFieldset {
createFieldset(
name: "Contract Terms"
description: "Extract key terms from contracts"
) {
ok
objId
message
}
}
Step 2: Define Columns¶
Columns specify individual fields to extract.
Example: Extract Multiple Contract Fields¶
Let's create columns for common contract data points:
Party Names Column¶
mutation CreatePartiesColumn {
createColumn(
fieldsetId: "your-fieldset-id"
name: "Parties"
query: "Who are the contracting parties in this agreement?"
outputType: "list[str]"
mustContainText: "PARTIES"
) {
ok
objId
}
}
Effective Date Column¶
mutation CreateDateColumn {
createColumn(
fieldsetId: "your-fieldset-id"
name: "Effective Date"
query: "What is the effective date of this contract?"
outputType: "str"
instructions: "Format as YYYY-MM-DD"
limitToLabel: "date-clause"
) {
ok
objId
}
}
Payment Terms Column¶
mutation CreatePaymentColumn {
createColumn(
fieldsetId: "your-fieldset-id"
name: "Payment Terms"
query: "What are the payment terms and amounts?"
outputType: "str"
mustContainText: "PAYMENT"
) {
ok
objId
}
}
Termination Conditions Column¶
mutation CreateTerminationColumn {
createColumn(
fieldsetId: "your-fieldset-id"
name: "Termination Conditions"
query: "Under what conditions can this contract be terminated?"
outputType: "list[str]"
extractIsList: true
) {
ok
objId
}
}
Step 3: Create an Extract¶
An extract links your fieldset to specific documents.
Via Django Admin¶
- Navigate to Admin → Extracts → Extracts
- Click "Add Extract"
- Configure:
- Name: "Q4 Contract Analysis"
- Fieldset: Select "Contract Terms"
- Corpus: Select your corpus
- Documents: Select documents to process
- Save
Via GraphQL¶
mutation CreateExtract {
createExtract(
name: "Q4 Contract Analysis"
fieldsetId: "your-fieldset-id"
corpusId: "your-corpus-id"
documentIds: ["doc1", "doc2", "doc3"]
) {
ok
objId
}
}
Step 4: Run the Extraction¶
Start the extraction process to populate datacells.
Via GraphQL¶
mutation RunExtraction {
startExtract(extractId: "your-extract-id") {
ok
message
}
}
Via Python Script¶
from opencontractserver.tasks.extract_orchestrator_tasks import run_extract
from opencontractserver.extracts.models import Extract
# Get the extract
extract = Extract.objects.get(id="your-extract-id")
# Start extraction
run_extract.delay(extract.id, user.id)
Step 5: Monitor Progress¶
Check Extract Status¶
query ExtractStatus {
extract(id: "your-extract-id") {
name
started
finished
error
datacells {
edges {
node {
id
column {
name
}
document {
title
}
started
completed
failed
}
}
}
}
}
Monitor via Django Admin¶
- Navigate to Admin → Extracts → Extracts
- Click on your extract
- View the status fields and related datacells
Step 6: Access Results¶
Once extraction completes, retrieve the structured data.
Query All Results¶
query GetExtractResults {
extract(id: "your-extract-id") {
datacells {
edges {
node {
document {
title
}
column {
name
}
data
completed
}
}
}
}
}
Export to CSV¶
import csv
from opencontractserver.extracts.models import Extract
extract = Extract.objects.get(id="your-extract-id")
with open('extract_results.csv', 'w', newline='') as csvfile:
writer = csv.writer(csvfile)
# Write header
columns = extract.fieldset.columns.all()
header = ['Document'] + [col.name for col in columns]
writer.writerow(header)
# Write data rows
for document in extract.documents.all():
row = [document.title]
for column in columns:
datacell = extract.extracted_datacells.filter(
document=document,
column=column
).first()
row.append(datacell.data if datacell else '')
writer.writerow(row)
Advanced Examples¶
Using Custom Output Types¶
Extract structured data with specific types:
# Boolean extraction
column = Column.objects.create(
fieldset=fieldset,
name="Is Confidential",
query="Does this contract contain confidentiality clauses?",
output_type="bool"
)
# Integer extraction
column = Column.objects.create(
fieldset=fieldset,
name="Contract Value",
query="What is the total contract value in dollars?",
output_type="int",
instructions="Extract numeric value only, no currency symbols"
)
# List extraction
column = Column.objects.create(
fieldset=fieldset,
name="Deliverables",
query="List all deliverables mentioned in the contract",
output_type="str",
extract_is_list=True
)
Filtering with Constraints¶
Use constraints to improve extraction accuracy:
# Only extract from specific sections
column = Column.objects.create(
fieldset=fieldset,
name="Warranty Period",
query="How long is the warranty period?",
output_type="str",
must_contain_text="WARRANTY",
limit_to_label="warranty-clause"
)
# Multiple constraints
column = Column.objects.create(
fieldset=fieldset,
name="Arbitration Location",
query="Where will arbitration take place?",
output_type="str",
must_contain_text="ARBITRATION",
instructions="Extract city and state/country",
limit_to_label="dispute-resolution"
)
Batch Processing¶
Process multiple extracts efficiently:
from celery import group
from opencontractserver.tasks.extract_orchestrator_tasks import run_extract
# Create multiple extracts
extracts = [
Extract.objects.create(
name=f"Batch {i}",
fieldset=fieldset,
corpus=corpus
) for i in range(5)
]
# Add documents to each extract
for i, extract in enumerate(extracts):
docs = documents[i*10:(i+1)*10] # 10 docs per extract
extract.documents.set(docs)
# Run all extracts in parallel
job = group(
run_extract.si(extract.id, user.id)
for extract in extracts
)
result = job.apply_async()
# Wait for completion
result.get()
Using WebSocket for Corpus Queries¶
For interactive exploration, use the WebSocket API:
// Connect to corpus WebSocket
const ws = new WebSocket(`wss://your-server/ws/corpus/${corpusId}/`);
ws.onopen = () => {
// Query about extracted data
ws.send(JSON.stringify({
query: "Summarize the payment terms across all contracts"
}));
};
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === 'ASYNC_CONTENT') {
console.log('Response:', data.delta);
} else if (data.type === 'ASYNC_SOURCES') {
console.log('Sources:', data.sources);
}
};
Troubleshooting¶
Common Issues¶
Extraction Fails Immediately¶
Check Celery workers are running:
celery -A config worker -l info -Q celery,extract,ml
No Data Extracted¶
- Verify documents contain expected text
- Check
must_contain_text
constraints aren't too restrictive - Review
limit_to_label
- ensure annotations exist with that label
Incorrect Data Types¶
Ensure output_type
is a valid Python type: - Primitives: str
, int
, float
, bool
- Lists: list[str]
, list[int]
, etc. - Or use extract_is_list=True
with base type
Slow Extraction¶
- Increase Celery worker count
- Reduce
max_token_length
if context is too large - Use more specific queries to reduce search scope
Debugging¶
Enable detailed logging:
# settings.py
LOGGING = {
'loggers': {
'opencontractserver.tasks': {
'level': 'DEBUG',
},
},
}
Check datacell errors:
failed_cells = Datacell.objects.filter(
extract=extract,
failed__isnull=False
)
for cell in failed_cells:
print(f"Column: {cell.column.name}")
print(f"Document: {cell.document.title}")
print(f"Error: {cell.stacktrace}")
Best Practices¶
- Start Small: Test with a few documents before processing entire corpus
- Iterate on Queries: Refine column queries based on initial results
- Use Constraints: Apply
must_contain_text
andlimit_to_label
for accuracy - Monitor Progress: Check extraction status regularly
- Handle Failures: Implement retry logic for failed datacells
- Validate Results: Spot-check extracted data for accuracy
- Export Regularly: Save results to avoid data loss
Next Steps¶
- Learn about Vector Store Architecture
- Explore Corpus Queries
- Review API Reference