
Intelligent Document Processor

Automation · Advanced · Python · GPT-4 · OCR

The Problem

A commercial real estate firm was processing over 800 lease agreements, invoices, and vendor contracts every single month. The operation ran entirely on human effort — four staff members spent their days opening PDFs, hunting for critical fields, and copy-pasting values into spreadsheets. At an average of 35 minutes per document, the team was buried.

The error rate sat at 8%. Dates were transposed. Amounts were misread from scanned pages. Renewal clauses were missed. When a $2M lease renewal slips through unnoticed because a data entry clerk misread a field on a low-resolution scan, the cost far exceeds any technology investment.

The backlog was constant. New documents piled up faster than the team could process them. Critical data — rent escalation dates, payment terms, break clauses, counterparty names — was buried inside scanned PDFs and emailed attachments that no system could read automatically. The company needed a way to extract structured data from unstructured documents at scale, without sacrificing accuracy.

The Solution

We built a multi-stage AI document processing pipeline that handles the full lifecycle: from raw document ingestion through structured data extraction, confidence scoring, human review, and automatic export to their ERP system.

The pipeline accepts any format the firm receives — native PDFs, scanned images, TIFF files, Word documents, and email attachments. From there, documents are automatically classified by type, routed to the appropriate extraction engine, and processed in minutes rather than half an hour.

The five stages of the pipeline work in sequence:

  • Ingest: Accept any format via web upload, REST API, or monitored email inbox — documents land automatically without manual routing
  • OCR: AWS Textract handles structured, machine-readable PDFs with high layout fidelity; Tesseract handles scanned and photographed documents with preprocessing to improve quality
  • GPT-4 reasoning: The extracted text is passed to GPT-4 with a field schema and few-shot examples — handling ambiguous clauses, relative date references, and non-standard document layouts
  • Confidence scoring: Every extracted field gets a confidence score; documents below the threshold are flagged for human review rather than auto-approved
  • Export: High-confidence results flow automatically to the ERP; flagged documents enter a lightweight React review UI where a human confirms or corrects in seconds
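The classify-and-route step at the front of this sequence can be sketched in a few lines. This is an illustrative sketch, not the production code: the function names, keyword heuristics, and the scanned-vs-native flag are assumptions standing in for the real classifier.

```python
# Hypothetical sketch of the routing stage: guess the document type from
# a text sample, then pick an OCR engine. Heuristics are illustrative.

def classify_document(filename: str, text_sample: str) -> str:
    """Guess document type (invoice / lease / contract) from keywords."""
    sample = text_sample.lower()
    if "lease" in sample or "landlord" in sample:
        return "lease"
    if "invoice" in sample or "amount due" in sample:
        return "invoice"
    return "contract"

def choose_ocr_engine(filename: str, is_scanned: bool) -> str:
    """Native PDFs go to Textract; scans and photos go to Tesseract."""
    if filename.lower().endswith(".pdf") and not is_scanned:
        return "textract"
    return "tesseract"
```

In production the type classifier is richer than a keyword match, but the shape is the same: classify first, then dispatch to the engine suited to the document's physical quality.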

Technical Architecture

// document_processor — system architecture
[Document Upload — Web UI / API / Email ingest]
        ↓
[FastAPI Ingestion Service]
        ↓
[Document Router — detect type: invoice / lease / contract]
        ↓                            ↓
[AWS Textract               [Tesseract OCR
 — Structured PDFs]          — Scanned images]
        ↓
[Python NLP + GPT-4 — Field extraction & reasoning]
        ↓
[Confidence Scorer — flag < 95% for human review]
        ↓                            ↓
[Auto-export                  [Human Review UI
 to ERP/CRM]                   — React Dashboard]
        ↓
[PostgreSQL — Document store + audit trail]

Celery with Redis handles background job queuing, which enables batch processing — the firm can drop 500 documents into the system at 2am and have results ready before the team arrives at 6am. FastAPI exposes both the ingestion endpoints and the internal APIs consumed by the React review dashboard.
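The fan-out pattern is the essential idea: each document becomes an independent job that workers pull from a queue. As a stdlib-only sketch of that pattern (Celery and Redis do the real distribution across worker machines), a thread pool can stand in; `process_document` is a placeholder for the full OCR → GPT-4 → scoring chain.

```python
# Stdlib sketch of the batch fan-out pattern. In production each
# document is a Celery task pulled from a Redis-backed queue.
from concurrent.futures import ThreadPoolExecutor

def process_document(doc_id: int) -> dict:
    # Placeholder for the real pipeline: OCR -> GPT-4 -> confidence scoring.
    return {"doc_id": doc_id, "status": "extracted"}

def run_batch(doc_ids) -> list[dict]:
    """Process a batch of documents concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(process_document, doc_ids))
```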

The AI Reasoning Layer

OCR alone is not enough for documents with complex language. Raw text extraction can tell you what words appear on the page — it cannot tell you what they mean in context. This is where GPT-4 earns its place in the pipeline.

The reasoning layer handles cases that rule-based extraction systems routinely fail on:

  • Date formats across jurisdictions: "the 15th day of March, 2024", "15/03/24", "March 15th", and "15-Mar-24" all represent the same date — the model normalises them to ISO 8601 regardless of format
  • Relative date references: "60 days from the date of execution" requires knowing the execution date found elsewhere in the same document and computing the resulting calendar date
  • Conditional clauses: "unless extended by mutual agreement of both parties, the initial term shall expire…" — the model extracts the base date and flags the conditional as a structured attribute
  • Multi-party contracts: Non-standard layouts where the landlord, tenant, guarantor, and managing agent appear in different orders and formats across different document templates
  • Ambiguous financial figures: When a document contains multiple dollar amounts — base rent, additional rent, security deposit, operating cost estimates — the model assigns each figure to the correct field based on surrounding context

The prompt includes the full OCR text, the target extraction schema (field names, types, and descriptions), and three to five few-shot examples drawn from the client's own historical documents. GPT-4 returns a structured JSON object. Any field it cannot extract with sufficient certainty is returned as null with a reason string, which feeds directly into the confidence scoring system.
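The prompt assembly described above might be sketched as follows. The instruction wording, schema shape, and example structure here are illustrative assumptions — the real prompts are built from the client's own documents and tuned during implementation.

```python
# Hypothetical sketch of extraction-prompt assembly: schema, few-shot
# examples, then the OCR text. The exact wording is an assumption.
import json

def build_prompt(ocr_text: str, schema: dict, examples: list[dict]) -> str:
    parts = [
        "Extract the following fields as JSON. For any field you cannot "
        "extract with certainty, return null and a 'reason' string.",
        "Schema: " + json.dumps(schema),
    ]
    for ex in examples:  # three to five few-shot examples in practice
        parts.append("Example input: " + ex["text"])
        parts.append("Example output: " + json.dumps(ex["fields"]))
    parts.append("Document:\n" + ocr_text)
    return "\n\n".join(parts)
```

The null-plus-reason convention is what connects this layer to the next one: a field the model declines to extract automatically lowers the document's confidence score.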

Human-in-the-Loop Design

Full automation is the wrong goal. The right goal is eliminating unnecessary human effort while keeping humans in control of decisions that matter. The confidence scoring system achieves this by routing documents to the appropriate level of review — not forcing everything through a single queue.

Three tiers govern how every extracted document is handled:

  • Above 95% confidence: All fields auto-approved and exported to the ERP immediately, with no human touchpoint required. These documents are logged and available for audit but need no active review.
  • 80–95% confidence: Flagged for quick review. A reviewer sees the document side-by-side with the extracted fields, with low-confidence fields highlighted. A single click confirms a correct extraction; one keystroke corrects a wrong one. Average review time: under 90 seconds.
  • Below 80% confidence: Sent to full manual review. The reviewer sees the raw document alongside the extraction attempt and corrects all fields. These cases are automatically added to the training dataset to improve future extractions.
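The three tiers reduce to a small routing function. This sketch uses the thresholds stated above; the function and tier names are illustrative, and routing on the *lowest* field confidence (so one weak field pulls the whole document into review) is an assumption about the implementation.

```python
# The three-tier routing, as a minimal sketch. Thresholds match the
# tiers described; routing on the minimum field confidence is assumed.

def route_document(field_confidences: dict[str, float]) -> str:
    lowest = min(field_confidences.values())
    if lowest >= 0.95:
        return "auto_export"    # straight to the ERP, logged for audit
    if lowest >= 0.80:
        return "quick_review"   # side-by-side confirm/correct UI
    return "full_review"        # full manual correction + training data
```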

In practice, only 12% of documents require any human review at all — compared to 100% before the system was built. Of that 12%, the majority fall into the quick-review tier and are resolved in under two minutes. Full manual review accounts for fewer than 3% of total volume.

Key Features

  • Universal format support: Native PDF, scanned PDF, DOCX, PNG, TIFF, JPEG, and email attachments — any document the firm receives is handled without manual preprocessing
  • 40+ field types extracted: Parties (landlord, tenant, guarantor), execution dates, commencement dates, expiry dates, renewal options, base rent, escalation clauses, payment terms, break clauses, permitted use, and more
  • Batch processing: Upload 500 documents at 2am via the web UI or API; Celery distributes the work across worker nodes and results are ready by 6am
  • Full audit trail: Every extraction is logged with a timestamp, the confidence score for each field, the OCR engine used, and — where applicable — which reviewer confirmed or corrected the result
  • Exception dashboard: A real-time view of all documents currently awaiting human review, sorted by priority and age, so nothing sits in a queue unnoticed
  • ERP & CRM integration: Pre-built connectors for Salesforce and QuickBooks, plus a generic REST API for custom ERP systems — extracted data flows out automatically without manual export steps

Outcomes

Six months after the system went live:

  • 90% reduction in processing time — from 35 minutes to 3 minutes average per document
  • 50k+ documents processed in the first six months of production operation
  • 10× improvement in error rate — from 8% down to 0.8% extraction errors
  • 4-month payback period — implementation cost recovered entirely through labour savings

The four staff members previously dedicated to data entry have been redeployed to contract analysis and lease abstraction work — tasks that actually require human judgment and generate more value for the business. The backlog is gone. Documents are processed the same day they arrive. And the 0.8% error rate is now primarily caught by the confidence scoring system before it reaches the ERP, not discovered weeks later during a reconciliation.

Timeline & Investment

Timeline: 10 weeks from kickoff to production deployment

Investment: $45,000 – $75,000 depending on document complexity, number of field types, and ERP integration requirements

What's included: document audit and extraction schema design, OCR pipeline build, GPT-4 integration and prompt engineering, confidence scoring system, React review UI, ERP integration, staff training, and 60 days of post-launch monitoring and tuning.

Ready to Eliminate Your Document Backlog?

Send us three of your most common document types and we'll show you exactly what the extraction output would look like — before you commit to anything.

Book Free Discovery Session →