Open Contracts

The Free and Open Source Document Analytics Platform



What Does it Do?

OpenContracts is an Apache-2 licensed enterprise document analytics tool. It provides several key features:

Manage Documents - Manage document collections (Corpuses)
Custom Metadata Schemas - Define structured metadata fields with validation for consistent data collection
Layout Parser - Automatically extracts layout features from PDFs
Automatic Vector Embeddings - generated for uploaded PDFs and extracted layout blocks
Pluggable microservice analyzer architecture - to let you analyze documents and automatically annotate them
Human Annotation Interface - to manually annotate documents, including multi-page annotations.
Data Extract - ask multiple questions across hundreds of documents using complex LLM-powered querying behavior. Our sample implementation uses our battle-tested agent framework for precise data extraction and natural language querying.
Custom Data Extract - Custom data extract pipelines can be used on the frontend to query documents in bulk.
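
To make the Custom Metadata Schemas feature above more concrete, here is a minimal sketch of what "structured metadata fields with validation" can look like. This is not OpenContracts' actual API: `MetadataField`, `validate`, and the `CONTRACT_SCHEMA` field names are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class MetadataField:
    """One field in a hypothetical metadata schema (illustrative only)."""
    name: str
    type_: type
    required: bool = True
    check: Optional[Callable[[Any], bool]] = None  # optional extra constraint


def validate(schema: list[MetadataField], record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in record; an empty list means valid."""
    errors: list[str] = []
    for f in schema:
        if f.name not in record:
            if f.required:
                errors.append(f"{f.name}: missing required field")
            continue
        value = record[f.name]
        if not isinstance(value, f.type_):
            errors.append(
                f"{f.name}: expected {f.type_.__name__}, got {type(value).__name__}"
            )
        elif f.check is not None and not f.check(value):
            errors.append(f"{f.name}: failed constraint")
    # Reject keys the schema does not know about, in a stable order.
    unknown = set(record) - {f.name for f in schema}
    errors.extend(f"{name}: not defined in schema" for name in sorted(unknown))
    return errors


# Hypothetical schema for a contract corpus.
CONTRACT_SCHEMA = [
    MetadataField("counterparty", str),
    MetadataField("term_months", int, check=lambda v: v > 0),
    MetadataField("governing_law", str, required=False),
]
```

Calling `validate(CONTRACT_SCHEMA, record)` before saving a record is one way to keep data collection consistent: every document in a collection either satisfies the same field definitions or is rejected with an explicit list of reasons.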

Key Docs

Quickstart Guide - You'll probably want to get started quickly. Setting up locally should be pretty painless if you're already running Docker.
Basic Walkthrough - Check out the walkthrough to step through basic usage of the application for document and annotation management.
Metadata System - Learn how to define custom metadata schemas for your documents with comprehensive validation and type safety.
PDF Annotation Data Format Overview - You may be interested in how we map text to PDFs visually and the underlying data format we're using.
Vector Store Architecture - We've used the latest open source tooling for vector storage in Postgres to make it almost trivially easy to combine structured metadata and vector embeddings with an API-powered application.
Write Custom Data Extractors - Custom data extract tasks are automatically loaded and displayed on the frontend to let users select how to ask questions and extract data from documents.
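
The Vector Store Architecture entry above is about combining structured metadata with vector embeddings. The sketch below illustrates that idea with a plain in-memory stand-in: filter rows by metadata first, then rank the survivors by cosine similarity. It does not use Postgres and is not the project's implementation; `Row`, `cosine_similarity`, and `search` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical stored row: an id, structured metadata, and an embedding."""
    doc_id: int
    metadata: dict
    embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(rows: list[Row], query: list[float], filters: dict, top_k: int = 3) -> list[Row]:
    """Keep rows whose metadata matches every filter, then rank by similarity."""
    candidates = [
        r for r in rows
        if all(r.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r.embedding, query),
        reverse=True,
    )
    return ranked[:top_k]
```

In a database-backed version, the metadata filter and the similarity ranking would typically be expressed in a single query so that both run next to the data.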

Architecture and Data Flows at a Glance

Core Data Standard

The core idea here, besides providing a platform to analyze contracts, is an open and standardized architecture that makes data extremely portable. Powering this is a set of data standards that describe the text and layout blocks on a PDF page.
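
As a rough illustration of what a per-page text and layout standard can look like, the sketch below models a page as a list of tokens, each with a bounding box and its text. The field names are assumptions modeled loosely on token-per-page layouts; they are not the actual OpenContracts schema, which is described in the PDF Annotation Data Format Overview. `Token`, `Page`, `page_text`, and `tokens_in_box` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A word-level token with its bounding box in page coordinates (assumed layout)."""
    x: float
    y: float
    width: float
    height: float
    text: str


@dataclass(frozen=True)
class Page:
    """One PDF page: its size plus the tokens found on it, in reading order."""
    index: int
    width: float
    height: float
    tokens: tuple[Token, ...]


def page_text(page: Page) -> str:
    """Rebuild a plain-text layer by joining tokens in reading order."""
    return " ".join(t.text for t in page.tokens)


def tokens_in_box(page: Page, left: float, top: float, right: float, bottom: float) -> list[int]:
    """Indices of tokens that lie entirely inside the given rectangle."""
    return [
        i for i, t in enumerate(page.tokens)
        if t.x >= left and t.y >= top
        and t.x + t.width <= right and t.y + t.height <= bottom
    ]
```

With a structure like this, an annotation can be stored as a page index plus a list of token indices, which keeps it portable: any tool that understands the token layer can redraw the highlight and recover the annotated text.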

Robust PDF Processing Pipeline

We have a robust, horizontally scalable PDF processing pipeline that generates our standardized data consistently for PDF inputs (we're working on adding additional formats soon).

Special thanks to Nlmatics and nlm-ingestor for powering the layout parsing and extraction.

Limitations

At the moment, OpenContracts only works with PDFs. In the future, it will be able to convert other document types to PDF for storage and labeling. PDF is an excellent target for this because it provides a consistent, repeatable format from which we can generate a text and x-y coordinate layer from scratch.

Adding OCR and ingestion for other enterprise documents is a priority.

Acknowledgements

Special thanks to AllenAI's PAWLS project and Nlmatics nlm-ingestor. They've pioneered a number of features and flows, and we are using their code in some parts of the application.